Dataset Open Access
The COVID-19 pandemic generated (and keeps generating) a huge corpus of news articles, easily retrievable in Factiva with very targeted queries.
This dataset, generated with an ad-hoc parser and NLP pipeline, analyzes the frequency of lemmas and named entities in news articles (in German, French, Italian and English ) regarding Switzerland and COVID-19.
The analysis of large bodies of grey literature via text mining and computational linguistics is an increasingly frequent approach to understand the large-scale trends of specific topics. We used Factiva, a news monitoring and search engine developed and owned by Dow Jones, to gather and download all the news articles published between January and July 2020 on Covid-19 and Switzerland.
Due to Factiva's copyright policy, it is not possible to share the original dataset with the exports of the articles' text; however, we can share the results of our work on the corpus. All the information relevant to reproduce the results is provided.
Factiva allows a very granular definition of the queries, and moreover has access to full text articles published by the major media outlet of the world. The query has been defined as follows (syntax in bold, explanation in italics):
((coronavirus or Wuhan virus or corvid19 or corvid 19 or covid19 or covid 19 or ncov or novel coronavirus or sars) and (atleast3 coronavirus or atleast3 wuhan or atleast3 corvid* or atleast3 covid* or atleast3 ncov or atleast3 novel or atleast3 corona*))
Keywords for covid19; must appear at least 3 times in the text
and ns=(gsars or gout)
Subject is “novel coronaviruses” or “outbreaks and epidemics” and “general news”
Language is X (DE, FR, IT, EN)
Restrict to TMNB (major news and business publications)
At least 300 words
and date from 20191001 to 20200801
Region is Switzerland
It is important to specify some details that characterize the query.
The query is not limited to articles published by Swiss media, but to articles regarding Switzerland. The reason is simple: a Swiss user googling for “Schweiz Coronavirus” or for “Coronavirus Ticino” can easily find and read articles published by foreign media outlets (namely, German or Italian) on that topic. If the objective is capturing and describing the information trends to which people are exposed, this approach makes much more sense than limiting the analysis to articles published by Swiss media.
Factiva’s field “NS” is a descriptor for the content of the article. “gsars” is defined in Factiva’s documentation as “All news on Severe Acute Respiratory Syndrome”, and “gout” as “The widespread occurrence of an infectious disease affecting many people or animals in a given population at the same time”; however, the way these descriptors are assigned to articles is not specified in the documentation.
Finally, the query has been restricted to major news and business publications of at least 300 words. Duplicate check is performed by Factiva. Given the incredibly large amount of articles published on COVID-19, this (absolutely arbitrary) restriction allows retrieving a corpus that is both meaningful and manageable.
metadata.xlsx contains information about the articles retrieved (strategy, amount)
The PDF files document the execution of the Jupyter notebooks.
The zip file contains the lemma and NE frequencies data, divided by language. The "Lemmas" folder contains a CSV file per month and a general timeseries; the "Entities" folder contains a CSV file per month, a general timeseries, plus subsets that are category-specific. For a comprehensive explanation about categories, you can check the PDF files.
This work is part of the PubliCo research project.