There is a newer version of this record available.

Dataset Open Access

Lemmas and Named Entities analysis in major media outlets regarding Switzerland and Covid-19

Giovanni Spitale

Project leader(s)
Nikola Biller-Andorno
Project member(s)
Sonja Merten

The COVID-19 pandemic generated (and keeps generating) a huge corpus of news articles, easily retrievable in Factiva with very targeted queries.

This dataset, generated with an ad-hoc parser and NLP pipeline, analyzes the frequency of lemmas and named entities in news articles (in German, French, Italian and English) regarding Switzerland and COVID-19.

The analysis of large bodies of grey literature via text mining and computational linguistics is an increasingly common approach to understanding the large-scale trends of specific topics. We used Factiva, a news monitoring and search engine developed and owned by Dow Jones, to gather and download all the news articles published between January and July 2020 on COVID-19 and Switzerland.

Due to Factiva's copyright policy, it is not possible to share the original dataset with the exports of the articles' text; however, we can share the results of our work on the corpus. All the information needed to reproduce the results is provided.

Factiva allows a very granular definition of queries and has access to the full text of articles published by the world's major media outlets. The query was defined as follows (each clause of the query syntax is followed by its explanation):

 

((coronavirus or Wuhan virus or corvid19 or corvid 19 or covid19 or covid 19 or ncov or novel coronavirus or sars) and (atleast3 coronavirus or atleast3 wuhan or atleast3 corvid* or atleast3 covid* or atleast3 ncov or atleast3 novel or atleast3 corona*))

Keywords for covid19; must appear at least 3 times in the text

and ns=(gsars or gout)

Subject is “novel coronaviruses” or “outbreaks and epidemics” and “general news”

and la=X

Language is X (DE, FR, IT, EN)

and rst=tmnb

Restrict to TMNB (major news and business publications)

and wc>300

At least 300 words

and date from 20191001 to 20200801

Date interval

and re=SWITZ

Region is Switzerland
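
Since the query is run once per language, it can help to assemble it programmatically. The following is a minimal sketch, not the authors' code: the function name `build_query` and the constants are hypothetical, and the clauses are copied verbatim from the query above, with only the language code substituted.

```python
# Sketch: assemble the Factiva query described above for one language.
# The helper name and constants are illustrative assumptions; the clause
# text itself is copied from the record.

KEYWORDS = (
    "(coronavirus or Wuhan virus or corvid19 or corvid 19 or covid19 "
    "or covid 19 or ncov or novel coronavirus or sars)"
)
FREQUENCY = (
    "(atleast3 coronavirus or atleast3 wuhan or atleast3 corvid* "
    "or atleast3 covid* or atleast3 ncov or atleast3 novel "
    "or atleast3 corona*)"
)
LANGUAGES = ("DE", "FR", "IT", "EN")


def build_query(language: str) -> str:
    """Return the full query string for one of the four languages."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    clauses = [
        f"({KEYWORDS} and {FREQUENCY})",   # keywords, at least 3 mentions
        "ns=(gsars or gout)",              # subject descriptors
        f"la={language}",                  # language
        "rst=tmnb",                        # major news and business publications
        "wc>300",                          # at least 300 words
        "date from 20191001 to 20200801",  # date interval
        "re=SWITZ",                        # region is Switzerland
    ]
    return " and ".join(clauses)


if __name__ == "__main__":
    for code in LANGUAGES:
        print(build_query(code))
```

Keeping the clauses in a list makes it easy to see which restriction each part of the query contributes, and to vary only the language between runs.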

 

Some details of the query deserve explanation.
The query is not limited to articles published by Swiss media; it covers articles about Switzerland, regardless of where they were published. The reason is simple: a Swiss user googling for “Schweiz Coronavirus” or for “Coronavirus Ticino” can easily find and read articles published by foreign media outlets (namely, German or Italian) on that topic. If the objective is to capture and describe the information trends to which people are exposed, this approach is more appropriate than limiting the analysis to articles published by Swiss media.
Factiva’s field “NS” is a descriptor for the content of the article. “gsars” is defined in Factiva’s documentation as “All news on Severe Acute Respiratory Syndrome”, and “gout” as “The widespread occurrence of an infectious disease affecting many people or animals in a given population at the same time”; however, the documentation does not specify how these descriptors are assigned to articles.

Finally, the query has been restricted to articles of at least 300 words from major news and business publications. Duplicate checking is performed by Factiva. Given the very large number of articles published on COVID-19, this (admittedly arbitrary) restriction yields a corpus that is both meaningful and manageable.

metadata.xlsx contains information about the articles retrieved (strategy, amount).

The PDF files document the execution of the Jupyter notebooks. 

The zip file contains the lemma and named entity (NE) frequency data, divided by language. The "Lemmas" folder contains one CSV file per month and a general timeseries; the "Entities" folder contains one CSV file per month, a general timeseries, and category-specific subsets. For a comprehensive explanation of the categories, see the PDF files.
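
To get an overview of the archive before loading any data, the CSV files can be listed by folder. This is a minimal sketch that makes no assumption about the actual file names or the exact nesting inside exports_NLP.zip; the function name `csv_members_by_folder` is hypothetical, and it simply groups every `.csv` member by its parent folder.

```python
# Sketch: list the CSV files in the archive, grouped by the folder they are in.
# No file names or column layouts are assumed; only the .csv extension is used.
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def csv_members_by_folder(zip_source):
    """Map each folder in the archive to the sorted CSV file names it holds.

    `zip_source` may be a path to the zip file or a binary file-like object.
    """
    grouped = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)  # zip member names use forward slashes
            if path.suffix.lower() == ".csv":
                grouped[str(path.parent)].append(path.name)
    return {folder: sorted(files) for folder, files in sorted(grouped.items())}


if __name__ == "__main__":
    # Assumes the archive has been downloaded to the working directory.
    for folder, files in csv_members_by_folder("exports_NLP.zip").items():
        print(f"{folder}: {len(files)} CSV file(s)")
```

Each listed file can then be opened with the reader of choice; the column layout is documented in the PDF files rather than assumed here.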

 

This work is part of the PubliCo research project, supported by the Swiss National Science Foundation (SNF), project no. 31CA30_195905.
Files (8.9 MB)

Name             Size      Checksum
de-NLP.pdf       177.7 kB  md5:fbeea33673a0bdf5bfff6677f462647c
en-NLP.pdf       179.3 kB  md5:3b4a70c5b6aa73c77f973c05f9e0e781
exports_NLP.zip  8.2 MB    md5:50f44883664f66b9fcada187d61f3aaa
fr-NLP.pdf       177.1 kB  md5:182582bf5ab234f335f1f8864c9d9340
it-NLP.pdf       175.9 kB  md5:906cacc7bac7238025373f45af35c030
metadata.xlsx    11.6 kB   md5:a7312c47c652b6f69e1ddeaea4cf0097
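
The MD5 checksums listed for the files can be used to confirm that a download is intact. The following is a minimal sketch: the checksums are copied from the file list, while the function names `md5_of` and `verify_downloads` and the assumption that all files sit in one local folder are illustrative.

```python
# Sketch: compare downloaded files against the MD5 checksums in the file list.
# Function names and the single-folder layout are illustrative assumptions.
import hashlib
from pathlib import Path

EXPECTED_MD5 = {
    "de-NLP.pdf": "fbeea33673a0bdf5bfff6677f462647c",
    "en-NLP.pdf": "3b4a70c5b6aa73c77f973c05f9e0e781",
    "exports_NLP.zip": "50f44883664f66b9fcada187d61f3aaa",
    "fr-NLP.pdf": "182582bf5ab234f335f1f8864c9d9340",
    "it-NLP.pdf": "906cacc7bac7238025373f45af35c030",
    "metadata.xlsx": "a7312c47c652b6f69e1ddeaea4cf0097",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(folder):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(folder) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A mismatch may simply mean that a different version of the record was downloaded, since each version has its own files.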