Lemmas and Named Entities analysis in major media outlets regarding Switzerland and Covid-19

doi:10.5281/zenodo.4036071

Published September 18, 2020 | Version v1

Dataset Open

Giovanni Spitale¹

1. University of Zurich - Institute of Biomedical Ethics and History of Medicine

Project leader:

Nikola Biller-Andorno¹

Project member:

Sonja Merten²

1. University of Zurich - Institute of Biomedical Ethics and History of Medicine
2. Swiss Tropical and Public Health Institute

The COVID-19 pandemic generated (and keeps generating) a huge corpus of news articles, easily retrievable in Factiva with very targeted queries.

This dataset, generated with an ad-hoc parser and NLP pipeline, reports the frequency of lemmas and named entities in news articles (in German, French, Italian and English) regarding Switzerland and COVID-19.

The analysis of large bodies of grey literature via text mining and computational linguistics is an increasingly frequent approach to understand the large-scale trends of specific topics. We used Factiva, a news monitoring and search engine developed and owned by Dow Jones, to gather and download all the news articles published between January and July 2020 on Covid-19 and Switzerland.

Due to Factiva's copyright policy, it is not possible to share the original dataset with the exports of the articles' text; however, we can share the results of our work on the corpus. All the information needed to reproduce the results is provided.

Factiva allows a very granular definition of queries and has access to full-text articles published by the world's major media outlets. The query was defined as follows (each clause of the query is followed by its explanation):

((coronavirus or Wuhan virus or corvid19 or corvid 19 or covid19 or covid 19 or ncov or novel coronavirus or sars) and (atleast3 coronavirus or atleast3 wuhan or atleast3 corvid* or atleast3 covid* or atleast3 ncov or atleast3 novel or atleast3 corona*))

Keywords for covid19; must appear at least 3 times in the text

and ns=(gsars or gout)

Subject is “novel coronaviruses” or “outbreaks and epidemics” and “general news”

and la=X

Language is X (DE, FR, IT, EN)

and rst=tmnb

Restrict to TMNB (major news and business publications)

and wc>300

At least 300 words

and date from 20191001 to 20200801

Date interval

and re=SWITZ

Region is Switzerland
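
Since the same query is run once per language, the clauses above can be assembled programmatically. The following is a minimal sketch, not the authors' tooling: the clause strings are copied from the query above, while the function name `build_query` and the choice to join clauses with `and` in a single string are assumptions made for illustration.

```python
# Clause values are copied from the query described above.
# build_query and the joining strategy are illustrative assumptions.
KEYWORDS = (
    "((coronavirus or Wuhan virus or corvid19 or corvid 19 or covid19 "
    "or covid 19 or ncov or novel coronavirus or sars) and "
    "(atleast3 coronavirus or atleast3 wuhan or atleast3 corvid* "
    "or atleast3 covid* or atleast3 ncov or atleast3 novel "
    "or atleast3 corona*))"
)

# The four languages covered by the dataset.
LANGUAGES = ("DE", "FR", "IT", "EN")


def build_query(language: str) -> str:
    """Return the full query string for one language code."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    clauses = [
        KEYWORDS,                          # keywords, at least 3 occurrences
        "ns=(gsars or gout)",              # subject descriptors
        f"la={language}",                  # language
        "rst=tmnb",                        # major news and business publications
        "wc>300",                          # word count filter
        "date from 20191001 to 20200801",  # date interval
        "re=SWITZ",                        # region is Switzerland
    ]
    return " and ".join(clauses)


# One query per language.
queries = {lang: build_query(lang) for lang in LANGUAGES}
```

Keeping the language as the only parameter makes it explicit that the four corpora differ in nothing but the `la` clause.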

It is important to specify some details that characterize the query.
The query is not restricted to articles published by Swiss media; it covers articles regarding Switzerland, whatever their country of publication. The reason is simple: a Swiss user googling for “Schweiz Coronavirus” or for “Coronavirus Ticino” can easily find and read articles published by foreign media outlets (namely, German or Italian) on that topic. If the objective is capturing and describing the information trends to which people are exposed, this approach makes much more sense than limiting the analysis to articles published by Swiss media.
Factiva’s field “NS” is a descriptor for the content of the article. “gsars” is defined in Factiva’s documentation as “All news on Severe Acute Respiratory Syndrome”, and “gout” as “The widespread occurrence of an infectious disease affecting many people or animals in a given population at the same time”; however, the way these descriptors are assigned to articles is not specified in the documentation.

Finally, the query has been restricted to major news and business publications of at least 300 words. The duplicate check is performed by Factiva. Given the very large number of articles published on COVID-19, this (absolutely arbitrary) restriction allows retrieving a corpus that is both meaningful and manageable.

metadata.xlsx contains information about the articles retrieved (strategy and number of articles).

The PDF files document the execution of the Jupyter notebooks.

The zip file contains the lemma and NE frequency data, divided by language. The "Lemmas" folder contains a CSV file per month and a general timeseries; the "Entities" folder contains a CSV file per month, a general timeseries, plus category-specific subsets. For a comprehensive explanation of the categories, see the PDF files.
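
To get an overview of the archive before loading any data, the CSV files can be listed by folder. This is a small sketch using only the Python standard library; the exact internal paths of exports_NLP.zip are not documented here, so the function makes no assumption about them and simply groups whatever CSV members it finds by their parent folder. The function name `csv_members_by_folder` is hypothetical.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def csv_members_by_folder(archive):
    """Map each folder inside a zip archive to the CSV files it contains.

    `archive` may be a path or a binary file-like object.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            member = PurePosixPath(name)  # zip member names use "/" separators
            if member.suffix.lower() == ".csv":
                groups[str(member.parent)].append(member.name)
    # Sort file names so the listing is stable across runs.
    return {folder: sorted(files) for folder, files in groups.items()}


# Example usage (assumes the archive was downloaded to the working directory):
# for folder, files in csv_members_by_folder("exports_NLP.zip").items():
#     print(folder, len(files))
```

Individual files can then be read with `zipfile.ZipFile.open` and any CSV reader, once their column layout has been checked against the PDF files.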

This work is part of the PubliCo research project.

Notes

This work is part of the PubliCo research project, supported by the Swiss National Science Foundation (SNF). Project no. 31CA30_195905

Files

Files (8.9 MB)

Name	MD5	Size
de-NLP.pdf	fbeea33673a0bdf5bfff6677f462647c	177.7 kB
en-NLP.pdf	3b4a70c5b6aa73c77f973c05f9e0e781	179.3 kB
exports_NLP.zip	50f44883664f66b9fcada187d61f3aaa	8.2 MB
fr-NLP.pdf	182582bf5ab234f335f1f8864c9d9340	177.1 kB
it-NLP.pdf	906cacc7bac7238025373f45af35c030	175.9 kB
metadata.xlsx	a7312c47c652b6f69e1ddeaea4cf0097	11.6 kB
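
The MD5 checksums listed above can be used to confirm that the files were downloaded intact. The sketch below uses only the Python standard library; the checksums are copied from the file list, while the function names and the idea of keeping all downloads in one folder are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file list above.
EXPECTED_MD5 = {
    "de-NLP.pdf": "fbeea33673a0bdf5bfff6677f462647c",
    "en-NLP.pdf": "3b4a70c5b6aa73c77f973c05f9e0e781",
    "exports_NLP.zip": "50f44883664f66b9fcada187d61f3aaa",
    "fr-NLP.pdf": "182582bf5ab234f335f1f8864c9d9340",
    "it-NLP.pdf": "906cacc7bac7238025373f45af35c030",
    "metadata.xlsx": "a7312c47c652b6f69e1ddeaea4cf0097",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(folder):
    """Return {filename: True / False / None}; None means the file is missing."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(folder) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


# Example usage (assumes the files were saved to ./downloads):
# print(verify_downloads("downloads"))
```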

Additional details

Is compiled by: Software: 10.5281/zenodo.3991821 (DOI)
