European Multilingual News Articles Dataset with Topic Annotation

Morini, Virginia; Bellomo, Lorenzo; Rossetti, Giulio; Pedreschi, Dino; Ferragina, Paolo

doi:10.5281/zenodo.10397400

Published December 17, 2023 | Version v1

Dataset Restricted

1. University of Pisa
2. National Research Council
3. Scuola Normale Superiore

The European Multilingual News Articles Dataset comprises over 18 million news articles from 205 media outlets in 27 European countries (i.e., all member states of the European Union), with the addition of the United Kingdom. The articles were published between 2017 and 2021 and are kept in their original languages, 23 languages in total.

After selecting reliable, nationwide European media outlets, we extracted each article (i.e., title, textual content, URL, and date and time of publication) from the Common Crawl News Corpus, which contains petabytes of raw web page data collected since 2016. The dataset is released without any text pre-processing other than a cleanup of XML tags. Further, we enriched it with metadata about each media outlet (e.g., frequency of publication, distribution area, language, type of media).

Moreover, we enhanced the dataset by adding, whenever possible, article-level topic annotations, using each article's URL as a proxy for the topic discussed. In the end, we were able to assign a topic to over 4 million articles (33 unique topics, e.g., politics, sport, entertainment), i.e., 23.2% of the entire dataset. From the URLs, we also extracted the type of over 4 million articles (15 unique article types, e.g., news, international, multimedia).
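
As an illustration of how a URL can serve as a topic proxy, the sketch below looks for a known keyword among the path segments of an article URL. It is only a minimal example under stated assumptions: the `TOPIC_KEYWORDS` table, the function name `topic_from_url`, and the example URLs are hypothetical, and the record does not describe the actual mapping or matching rules used to build the dataset.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical keyword-to-topic table (an assumption for illustration only);
# the dataset's real 33-topic mapping is not given in this record.
TOPIC_KEYWORDS = {
    "politics": "politics",
    "politica": "politics",
    "sport": "sport",
    "sports": "sport",
    "entertainment": "entertainment",
    "spettacoli": "entertainment",
}


def topic_from_url(url: str) -> Optional[str]:
    """Return the topic of the first path segment matching a known keyword, else None."""
    # Lowercase the path so that "/Sport/" and "/sport/" are treated alike.
    path = urlparse(url).path.lower()
    for segment in path.split("/"):
        topic = TOPIC_KEYWORDS.get(segment)
        if topic is not None:
            return topic
    # No segment matched: the article stays without a topic annotation.
    return None
```

Articles whose URLs carry no recognizable section keyword get no topic under this scheme, which is consistent with only part of the dataset being annotated.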

Files

Restricted

The record is publicly accessible, but files are restricted.

Additional details

Submitted: 2023-12-17

	All versions	This version
Views	296	296
Downloads	22	21
Data volume	22.1 GB	22.1 GB
