finnish_media_scrapers package
Subpackages
- finnish_media_scrapers.scripts package
- Submodules
- finnish_media_scrapers.scripts.fetch_hs module
- finnish_media_scrapers.scripts.fetch_open module
- finnish_media_scrapers.scripts.htmltotext_hs module
- finnish_media_scrapers.scripts.htmltotext_il module
- finnish_media_scrapers.scripts.htmltotext_is module
- finnish_media_scrapers.scripts.htmltotext_svyle module
- finnish_media_scrapers.scripts.htmltotext_yle module
- finnish_media_scrapers.scripts.post_filter module
- finnish_media_scrapers.scripts.query_hs module
- finnish_media_scrapers.scripts.query_il module
- finnish_media_scrapers.scripts.query_is module
- finnish_media_scrapers.scripts.query_yle module
- Module contents
Submodules
finnish_media_scrapers.fetch module
Utilities for fetching articles. Currently these are needed only for Helsingin Sanomat (HS). The other sources can be scraped with plain requests, but HS requires a logged-in user and renders its articles with dynamic JavaScript, so fetching them requires a pyppeteer (headless browser) session.
- async finnish_media_scrapers.fetch.fetch_article_hs(session: pyppeteer.page.Page, url: str, max_web_driver_wait: int = 30) -> str
Fetch the HTML of a single article using a pyppeteer session on which prepare_session_hs has already been called.
- Parameters
session (Page) – the pyppeteer session to use
url (str) – the HS article URL to fetch article content from
max_web_driver_wait (int) – the maximum number of seconds to wait for the webdriver to render a page before failing (default: 30)
- Raises
ValueError – if parsing the article fails, probably because the article uses a previously unknown layout
- Returns
the HTML of the article
- Return type
str
- async finnish_media_scrapers.fetch.prepare_session_hs(session: pyppeteer.page.Page, username: str, password: str, max_web_driver_wait: int = 30)
Prepare a pyppeteer session for scraping articles from Helsingin Sanomat by logging in with the provided username and password.
- Raises
TimeoutError – if the web driver is unable to find the elements it is looking for within max_web_driver_wait seconds (30 by default). May indicate changes to the login page structure.
- Parameters
session (Page) – the pyppeteer session to use
username (str) – the username to log in as
password (str) – the password to use for logging in
max_web_driver_wait (int) – the maximum number of seconds to wait for the webdriver to render a page before failing (default: 30)
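The two functions above are meant to be used together: log in once, then fetch any number of articles with the same page. The following is a minimal sketch of that flow, not the package's own script. The helper name `fetch_hs_articles` is hypothetical, and the browser is started with pyppeteer's `launch()` and `newPage()`; the choice to skip articles that raise ValueError is an assumption about what a caller might want.

```python
from typing import Dict, List


async def fetch_hs_articles(urls: List[str], username: str, password: str) -> Dict[str, str]:
    """Log in to HS once, then fetch the HTML of each URL (hypothetical helper)."""
    # Third-party imports are kept local so that defining this function
    # does not require pyppeteer or a browser to be installed.
    from pyppeteer import launch
    from finnish_media_scrapers.fetch import fetch_article_hs, prepare_session_hs

    browser = await launch()
    try:
        page = await browser.newPage()
        # The login must happen before any article is fetched with this page.
        await prepare_session_hs(page, username, password)
        html_by_url: Dict[str, str] = {}
        for url in urls:
            try:
                html_by_url[url] = await fetch_article_hs(page, url)
            except ValueError as err:
                # Documented failure mode: the article layout was not recognized.
                print(f"Skipping {url}: {err}")
        return html_by_url
    finally:
        # Always release the browser, even if the login or a fetch fails.
        await browser.close()
```

From synchronous code this could be driven with `asyncio.run(fetch_hs_articles(urls, username, password))`.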
finnish_media_scrapers.htmltotext module
Functions for extracting the plain text of articles from YLE, Svenska YLE, HS, IL and IS article HTML
- finnish_media_scrapers.htmltotext.extract_text_from_hs_html(html: Union[str, TextIO]) -> str
Extract article text from Helsingin Sanomat article HTML
- Parameters
html (Union[str,TextIO]) – a string or a file-like object containing the article HTML
- Raises
ValueError – The layout of the article was not recognized, or the article parsed as empty
- Returns
article text
- Return type
str
- finnish_media_scrapers.htmltotext.extract_text_from_il_html(html: Union[str, TextIO]) -> str
Extract article text from Iltalehti article HTML
- Parameters
html (Union[str,TextIO]) – a string or a file-like object containing the article HTML
- Raises
ValueError – The layout of the article was not recognized, or the article parsed as empty
- Returns
article text
- Return type
str
- finnish_media_scrapers.htmltotext.extract_text_from_is_html(html: Union[str, TextIO]) -> str
Extract article text from Ilta-Sanomat article HTML
- Parameters
html (Union[str,TextIO]) – a string or a file-like object containing the article HTML
- Raises
ValueError – The layout of the article was not recognized, or the article parsed as empty
- Returns
article text
- Return type
str
- finnish_media_scrapers.htmltotext.extract_text_from_svyle_html(html: Union[str, TextIO]) -> str
Extract article text from Svenska YLE article HTML
- Parameters
html (Union[str,TextIO]) – a string or a file-like object containing the article HTML
- Raises
ValueError – The layout of the article was not recognized, or the article parsed as empty
- Returns
article text
- Return type
str
- finnish_media_scrapers.htmltotext.extract_text_from_yle_html(html: Union[str, TextIO]) -> str
Extract article text from YLE article HTML
- Parameters
html (Union[str,TextIO]) – a string or a file-like object containing the article HTML
- Raises
ValueError – The layout of the article was not recognized, or the article parsed as empty
- Returns
article text
- Return type
str
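All five extractors share the same shape: they accept a string or a file-like object and raise ValueError when the layout is not recognized or the text comes out empty. The sketch below shows one way a caller might wrap any of them when converting saved HTML files. The helper name `html_file_to_text` and the decision to return None on failure are assumptions for illustration, not part of the package.

```python
from pathlib import Path
from typing import Callable, Optional, TextIO, Union

# Any of the extract_text_from_*_html functions fits this signature.
Extractor = Callable[[Union[str, TextIO]], str]


def html_file_to_text(html_path: Path, extract: Extractor) -> Optional[str]:
    """Run an extractor on a saved HTML file; return None if extraction fails."""
    with html_path.open(encoding="utf-8") as html_file:
        try:
            # The extractors accept file-like objects, so the open file is passed directly.
            return extract(html_file)
        except ValueError as err:
            # Documented failure mode: unknown layout or empty article.
            print(f"Could not extract {html_path}: {err}")
            return None
```

For a YLE article this might be called as `html_file_to_text(Path("article.html"), extract_text_from_yle_html)` after importing the extractor from finnish_media_scrapers.htmltotext.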
finnish_media_scrapers.query module
Functions for querying articles from the APIs of YLE, Helsingin Sanomat (HS), Ilta-Sanomat (IS) and Iltalehti (IL)
- class finnish_media_scrapers.query.Article(id: str, url: str, title: str, date_modified: str)
Bases: object
An article
- id
the unique id for the article
- Type
str
- url
the url from which the article may be found
- Type
str
- title
the title or headline of the article
- Type
str
- date_modified
the date of last modification for the article
- Type
str
- class finnish_media_scrapers.query.Result(articles: list[Article], url: str, total: int = -1)
Bases: object
A result from a single API call
- url
the URL of the API query
- Type
str
- total
the total number of articles for the query. -1 if not available.
- Type
int
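Because total uses -1 as a sentinel, code that reports progress needs to check for it before dividing. A small sketch of such a check follows; the function name `fraction_fetched` is hypothetical, and treating an empty query as complete is a design choice made here.

```python
from typing import Optional


def fraction_fetched(fetched: int, total: int) -> Optional[float]:
    """Return the fetched share of a query, or None when the total is unknown."""
    if total < 0:
        # Result.total is documented as -1 when the API gives no total.
        return None
    if total == 0:
        # Nothing to fetch, so the query is trivially complete.
        return 1.0
    # Clamp in case more articles arrive than the reported total.
    return min(fetched / total, 1.0)
```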
- async finnish_media_scrapers.query.query_hs(session: aiohttp.client.ClientSession, query: str, from_date: str, to_date: str, batch_size: int = 100) -> AsyncIterable[finnish_media_scrapers.query.Result]
Query the HS API for articles matching a query
- Parameters
session (ClientSession) – the aiohttp session to use
query (str) – the query string to search for
from_date (str) – date to search from (inclusive, YYYY-MM-DD)
to_date (str) – date to search to (inclusive, YYYY-MM-DD)
batch_size (int, optional) – How many entries to query for in a single API call. Values supported by the HS API are 50 and 100 (which is the default).
- Raises
ValueError – when something goes wrong in the API call
- Yields
AsyncIterable[Result] – each Result contains the results from a single API call
- async finnish_media_scrapers.query.query_il(session: aiohttp.client.ClientSession, query: str, from_date: str, to_date: str, batch_size: int = 200) -> AsyncIterable[finnish_media_scrapers.query.Result]
Query the IL API for articles matching a query
- Parameters
session (ClientSession) – the aiohttp session to use
query (str) – the query string to search for
from_date (str) – date to search from (inclusive, YYYY-MM-DD)
to_date (str) – date to search to (inclusive, YYYY-MM-DD)
batch_size (int, optional) – How many entries to query for in a single API call. Maximum and default for the IL API is 200.
- Raises
ValueError – when something goes wrong in the API call
- Yields
AsyncIterable[Result] – each Result contains the results from a single API call
- async finnish_media_scrapers.query.query_is(session: aiohttp.client.ClientSession, query: str, from_date: str, to_date: str, batch_size: int = 100) -> AsyncIterable[finnish_media_scrapers.query.Result]
Query the IS API for articles matching a query
- Parameters
session (ClientSession) – the aiohttp session to use
query (str) – the query string to search for
from_date (str) – date to search from (inclusive, YYYY-MM-DD)
to_date (str) – date to search to (inclusive, YYYY-MM-DD)
batch_size (int, optional) – How many entries to query for in a single API call. Values supported by the IS API are 50 and 100 (which is the default).
- Raises
ValueError – when something goes wrong in the API call
- Yields
AsyncIterable[Result] – each Result contains the results from a single API call
- async finnish_media_scrapers.query.query_yle(session: aiohttp.client.ClientSession, query: str, language: str, from_date: str, to_date: str, batch_size: int = 10000) -> AsyncIterable[finnish_media_scrapers.query.Result]
Query the YLE API for articles matching a query
- Parameters
session (ClientSession) – the aiohttp session to use
query (str) – the query string to search for
language (str) – language to search (either ‘fi’ or ‘sv’)
from_date (str) – date to search from (inclusive, YYYY-MM-DD)
to_date (str) – date to search to (inclusive, YYYY-MM-DD)
batch_size (int, optional) – How many entries to query for in a single API call. Maximum and default for the YLE API is 10000.
- Raises
ValueError – when something goes wrong in the API call
- Yields
AsyncIterable[Result] – each Result contains the results from a single API call
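The query functions all yield one Result per API call, so a caller typically drains them with `async for`. The sketch below assumes they are async generators, as the Yields sections suggest, and that each Result exposes the `articles` list from its constructor as an attribute of the same name. The helper `collect_article_urls`, the `main` wrapper, and the search term and dates are hypothetical.

```python
from typing import AsyncIterable, Dict


async def collect_article_urls(results: AsyncIterable) -> Dict[str, str]:
    """Drain an async iterable of Result objects into an {article id: url} map."""
    urls_by_id: Dict[str, str] = {}
    async for result in results:
        # `result.articles` is assumed from the Result constructor signature.
        for article in result.articles:
            # Keep the first URL seen for each id, so repeated ids are ignored.
            urls_by_id.setdefault(article.id, article.url)
    return urls_by_id


async def main() -> None:
    # Imported here so the helper above can be used without network libraries.
    import aiohttp
    from finnish_media_scrapers.query import query_yle

    async with aiohttp.ClientSession() as session:
        # Example values only: Finnish-language YLE articles from January 2020.
        results = query_yle(session, "ilmastonmuutos", "fi", "2020-01-01", "2020-01-31")
        for article_id, url in (await collect_article_urls(results)).items():
            print(article_id, url)
```

Running `asyncio.run(main())` would perform the query. The same helper should work with query_hs, query_il and query_is, which take the same arguments apart from `language`.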