finnish_media_scrapers package

Submodules

finnish_media_scrapers.fetch module

Utilities for fetching articles. Currently these are needed only for Helsingin Sanomat (HS). The other sources can be scraped with plain HTTP requests, but HS requires a logged-in user and renders its articles with dynamic JavaScript, so fetching them requires a pyppeteer (headless browser) session.

async finnish_media_scrapers.fetch.fetch_article_hs(session: pyppeteer.page.Page, url: str, max_web_driver_wait: int = 30) -> str

Fetch the HTML of a single article using a pyppeteer session where prepare_session_hs has been called before.

Parameters
  • session (Page) – the pyppeteer session to use

  • url (str) – the HS article URL to fetch article content from

  • max_web_driver_wait (int) – the maximum number of seconds to wait for the webdriver to render a page before failing (default: 30)

Raises

ValueError – if parsing the article fails, probably because a previously unknown layout was encountered

Returns

the HTML of the article

Return type

str

async finnish_media_scrapers.fetch.prepare_session_hs(session: pyppeteer.page.Page, username: str, password: str, max_web_driver_wait: int = 30)

Prepare a pyppeteer session for scraping articles from Helsingin Sanomat by logging in with the provided username and password.

Raises

TimeoutError – if the web driver cannot find the elements it is looking for within the wait limit (30 seconds by default). May indicate changes to the login page structure.

Parameters
  • session (Page) – the pyppeteer session to use

  • username (str) – the username to log in as

  • password (str) – the password to use for logging in

  • max_web_driver_wait (int) – the maximum number of seconds to wait for the webdriver to render a page before failing (default: 30)
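The two coroutines are meant to be used together: log in once with prepare_session_hs, then reuse the same page for every fetch_article_hs call. The following is a minimal sketch of that flow, assuming the page is created with pyppeteer's launch() and Browser.newPage(). The helper names fetch_all and run, and the choice to record None for articles that raise ValueError, are illustrative and not part of the package.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional


async def fetch_all(
    session,
    urls: Iterable[str],
    fetch_article: Callable[..., Awaitable[str]],
    max_web_driver_wait: int = 30,
) -> Dict[str, Optional[str]]:
    """Fetch each URL in turn with an already logged-in session.

    Hypothetical helper: maps each URL to its HTML, or to None when
    fetch_article raises ValueError (for example, an unknown layout).
    """
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        try:
            results[url] = await fetch_article(session, url, max_web_driver_wait)
        except ValueError:
            results[url] = None
    return results


async def run(urls, username: str, password: str):
    """Log in to HS once, then fetch all URLs with the same page."""
    # Imported here so that fetch_all stays usable without a browser installed.
    from pyppeteer import launch
    from finnish_media_scrapers.fetch import fetch_article_hs, prepare_session_hs

    browser = await launch()
    try:
        page = await browser.newPage()
        await prepare_session_hs(page, username, password)
        return await fetch_all(page, urls, fetch_article_hs)
    finally:
        await browser.close()
```

Passing the fetch coroutine in as an argument keeps the loop independent of the browser, so it can be exercised with a stub before running against the live site.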

finnish_media_scrapers.htmltotext module

Functions for extracting plain article text from YLE, Svenska YLE, HS, IL, and IS article HTML

finnish_media_scrapers.htmltotext.extract_text_from_hs_html(html: Union[str, TextIO]) -> str

Extract article text from Helsingin Sanomat article HTML

Parameters

html (Union[str,TextIO]) – a string or a file-like object containing the article HTML

Raises

ValueError – The layout of the article was not recognized, or the article parsed as empty

Returns

article text

Return type

str

finnish_media_scrapers.htmltotext.extract_text_from_il_html(html: Union[str, TextIO]) -> str

Extract article text from Iltalehti article HTML

Parameters

html (Union[str,TextIO]) – a string or a file-like object containing the article HTML

Raises

ValueError – The layout of the article was not recognized, or the article parsed as empty

Returns

article text

Return type

str

finnish_media_scrapers.htmltotext.extract_text_from_is_html(html: Union[str, TextIO]) -> str

Extract article text from Ilta-Sanomat article HTML

Parameters

html (Union[str,TextIO]) – a string or a file-like object containing the article HTML

Raises

ValueError – The layout of the article was not recognized, or the article parsed as empty

Returns

article text

Return type

str

finnish_media_scrapers.htmltotext.extract_text_from_svyle_html(html: Union[str, TextIO]) -> str

Extract article text from Svenska YLE article HTML

Parameters

html (Union[str,TextIO]) – a string or a file-like object containing the article HTML

Raises

ValueError – The layout of the article was not recognized, or the article parsed as empty

Returns

article text

Return type

str

finnish_media_scrapers.htmltotext.extract_text_from_yle_html(html: Union[str, TextIO]) -> str

Extract article text from YLE article HTML

Parameters

html (Union[str,TextIO]) – a string or a file-like object containing the article HTML

Raises

ValueError – The layout of the article was not recognized, or the article parsed as empty

Returns

article text

Return type

str
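All five extractors share one signature and one failure mode, so they can be selected by name and wrapped uniformly. The following is a minimal sketch under that assumption. get_extractor and extract_or_none are hypothetical helpers, not part of the package.

```python
import importlib
from typing import Callable, Optional, TextIO, Union

Html = Union[str, TextIO]

# Source codes taken from the function names documented above.
SOURCES = ("hs", "il", "is", "svyle", "yle")


def get_extractor(source: str) -> Callable[[Html], str]:
    """Look up extract_text_from_<source>_html by its source code."""
    if source not in SOURCES:
        raise KeyError(f"unknown source: {source!r}")
    # Imported lazily so the helpers load even if the package is missing.
    module = importlib.import_module("finnish_media_scrapers.htmltotext")
    return getattr(module, f"extract_text_from_{source}_html")


def extract_or_none(extractor: Callable[[Html], str], html: Html) -> Optional[str]:
    """Return the article text, or None when the extractor raises ValueError
    (layout not recognized, or the article parsed as empty)."""
    try:
        return extractor(html)
    except ValueError:
        return None
```

A typical call would be extract_or_none(get_extractor("yle"), html), where html is either a string or an open file object.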

finnish_media_scrapers.query module

Functions for querying articles from the APIs of YLE, Helsingin Sanomat (HS), Ilta-Sanomat (IS) and Iltalehti (IL)

class finnish_media_scrapers.query.Article(id: str, url: str, title: str, date_modified: str)

Bases: object

An article

id

the unique id for the article

Type

str

url

the url from which the article may be found

Type

str

title

the title or headline of the article

Type

str

date_modified

the date of last modification for the article

Type

str

class finnish_media_scrapers.query.Result(articles: list[Article], url: str, total: int = -1)

Bases: object

A result from a single API call

articles

a list of the article objects returned

Type

list[Article]

url

the URL of the API query

Type

str

total

the total number of articles for the query. -1 if not available.

Type

int
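Because total is -1 when the API does not report a count, callers should check it before computing progress. The following is a small sketch using stand-in dataclasses that mirror the documented attributes. These stand-ins and the describe_progress helper are assumptions for illustration, not the package's own definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Article:  # stand-in mirroring the documented attributes
    id: str
    url: str
    title: str
    date_modified: str


@dataclass
class Result:  # stand-in mirroring the documented attributes
    articles: List[Article]
    url: str
    total: int = -1


def describe_progress(fetched: int, result: Result) -> str:
    """Format a progress message, coping with an unknown total (-1)."""
    if result.total < 0:
        return f"{fetched} articles fetched (total unknown)"
    return f"{fetched}/{result.total} articles fetched"
```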

async finnish_media_scrapers.query.query_hs(session: aiohttp.client.ClientSession, query: str, from_date: str, to_date: str, batch_size: int = 100) -> AsyncIterable[finnish_media_scrapers.query.Result]

Query the HS API for articles matching a query

Parameters
  • session (ClientSession) – the aiohttp session to use

  • query (str) – the query string to search for

  • from_date (str) – date to search from (inclusive, YYYY-MM-DD)

  • to_date (str) – date to search to (inclusive, YYYY-MM-DD)

  • batch_size (int, optional) – How many entries to query for in a single API call. Values supported by the HS API are 50 and 100 (which is the default).

Raises

ValueError – when something goes wrong in the API call

Yields

AsyncIterable[Result] – each Result contains the results from a single API call
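Since each yielded Result covers one API call, a caller usually drains the iterator with async for and merges the batches. The following is a minimal sketch, assuming an aiohttp.ClientSession created by the caller. collect_articles and run are hypothetical helpers, and de-duplicating by id is a precaution rather than documented behavior of the API.

```python
import asyncio
from typing import AsyncIterable, Dict


async def collect_articles(results: AsyncIterable) -> Dict[str, object]:
    """Drain an async iterable of Result objects into {article id: Article}.

    Keying by id keeps the first copy if the same article appears
    in more than one batch.
    """
    articles: Dict[str, object] = {}
    async for result in results:
        for article in result.articles:
            articles.setdefault(article.id, article)
    return articles


async def run(query: str, from_date: str, to_date: str):
    """Query HS and return all matching articles keyed by id."""
    # Imported lazily so collect_articles can be reused without these packages.
    import aiohttp
    from finnish_media_scrapers.query import query_hs

    async with aiohttp.ClientSession() as session:
        return await collect_articles(query_hs(session, query, from_date, to_date))
```

The same collect_articles helper should work unchanged with query_il, query_is and query_yle, since all of them yield Result objects.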

async finnish_media_scrapers.query.query_il(session: aiohttp.client.ClientSession, query: str, from_date: str, to_date: str, batch_size: int = 200) -> AsyncIterable[finnish_media_scrapers.query.Result]

Query the IL API for articles matching a query

Parameters
  • session (ClientSession) – the aiohttp session to use

  • query (str) – the query string to search for

  • from_date (str) – date to search from (inclusive, YYYY-MM-DD)

  • to_date (str) – date to search to (inclusive, YYYY-MM-DD)

  • batch_size (int, optional) – How many entries to query for in a single API call. Maximum and default for the IL API is 200.

Raises

ValueError – when something goes wrong in the API call

Yields

AsyncIterable[Result] – each Result contains the results from a single API call

async finnish_media_scrapers.query.query_is(session: aiohttp.client.ClientSession, query: str, from_date: str, to_date: str, batch_size: int = 100) -> AsyncIterable[finnish_media_scrapers.query.Result]

Query the IS API for articles matching a query

Parameters
  • session (ClientSession) – the aiohttp session to use

  • query (str) – the query string to search for

  • from_date (str) – date to search from (inclusive, YYYY-MM-DD)

  • to_date (str) – date to search to (inclusive, YYYY-MM-DD)

  • batch_size (int, optional) – How many entries to query for in a single API call. Values supported by the IS API are 50 and 100 (which is the default).

Raises

ValueError – when something goes wrong in the API call

Yields

AsyncIterable[Result] – each Result contains the results from a single API call

async finnish_media_scrapers.query.query_yle(session: aiohttp.client.ClientSession, query: str, language: str, from_date: str, to_date: str, batch_size: int = 10000) -> AsyncIterable[finnish_media_scrapers.query.Result]

Query the YLE API for articles matching a query

Parameters
  • session (ClientSession) – the aiohttp session to use

  • query (str) – the query string to search for

  • language (str) – language to search (either ‘fi’ or ‘sv’)

  • from_date (str) – date to search from (inclusive, YYYY-MM-DD)

  • to_date (str) – date to search to (inclusive, YYYY-MM-DD)

  • batch_size (int, optional) – How many entries to query for in a single API call. Maximum and default for the YLE API is 10000.

Raises

ValueError – when something goes wrong in the API call

Yields

AsyncIterable[Result] – each Result contains the results from a single API call
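All four query functions take from_date and to_date as inclusive YYYY-MM-DD strings. Validating them up front avoids sending a malformed or reversed range to an API. The helper below is a hypothetical sketch, not part of the package. Whether the query functions perform their own validation is not stated in this reference.

```python
import re
from datetime import date
from typing import Tuple

_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date_range(from_date: str, to_date: str) -> Tuple[date, date]:
    """Parse and check an inclusive YYYY-MM-DD date range.

    Raises ValueError if either string is malformed, names a
    non-existent day, or the range is reversed.
    """
    parsed = []
    for value in (from_date, to_date):
        if not _DATE_RE.fullmatch(value):
            raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
        # fromisoformat also rejects impossible dates such as 2021-02-30.
        parsed.append(date.fromisoformat(value))
    start, end = parsed
    if start > end:
        raise ValueError(f"from_date {from_date} is after to_date {to_date}")
    return start, end
```

Calling parse_date_range before any query function lets a script fail early with a clear message instead of partway through a run.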

Module contents