Published July 16, 2021 | Version v1
Dataset Open

Data for manuscript "Prevalence in News Media of two Competing Hypotheses about COVID-19 Origins"

Creators

Description

The Covid-19 pandemic has been one of the most disruptive and painful phenomena of the last few decades. As of July 2021, the origins of the SARS-CoV-2 virus that caused the outbreak remain a mystery. This work analyzes the prevalence in news media articles of two popular hypotheses about SARS-CoV-2 virus origins: the natural emergence and the lab-leak hypotheses. 

This data set contains frequency counts of target words in news and opinion articles from 12 popular news media outlets. The target words are listed in the associated manuscript and are mostly words associated with the Covid-19 pandemic. 

The compressed files in this data set are listed next:

targetWordsInArticlesCounts.rar contains counts of target words in each outlet's articles, as well as total word counts per article

targetWordsFrequencies.rar contains daily, weekly, and monthly word frequencies

wordEmbeddingModels.rar contains monthly word embedding models of news outlet content

analysisScripts.rar contains the analysis notebooks

The textual content of news and opinion articles from the outlets is available in the outlets' online domains and/or public cache repositories such as Google cache, The Internet Wayback Machine, and Common Crawl. We derived word frequency counts from these sources. The textual content included in our analysis is limited to article headlines and the main body text of the articles, and does not include other article elements such as figure captions.

Targeted textual content was located in the raw HTML data using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts.
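The extraction step described above can be illustrated with a minimal sketch. It is not the code from the analysis notebooks: the sample markup, the XPath expressions, the tokenization rule, and the function names are all assumptions made for illustration. It uses Python's standard-library ElementTree, which supports only a subset of XPath and requires well-formed markup, whereas real outlet HTML would likely need a more tolerant parser.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, well-formed article markup (real outlet pages are messier).
SAMPLE_PAGE = """
<html><body>
  <h1 class="headline">Lab Leak Theory Debated</h1>
  <div class="article-body">
    <p>Scientists discuss the Wuhan lab hypothesis.</p>
    <p>Others favor a natural origin of the virus.</p>
  </div>
  <div class="related"><p>More lab stories</p></div>
</body></html>
"""

# Hypothetical outlet-specific expressions: headline plus main body paragraphs.
OUTLET_XPATHS = [".//h1[@class='headline']", ".//div[@class='article-body']/p"]


def extract_tokens(page, xpaths):
    """Return lowercased tokens from the elements matched by the expressions."""
    root = ET.fromstring(page.strip())
    tokens = []
    for xpath in xpaths:
        for node in root.findall(xpath):
            text = "".join(node.itertext())
            # Assumed tokenization rule: runs of letters, digits, ' and -
            tokens.extend(re.findall(r"[a-z0-9'-]+", text.lower()))
    return tokens


def count_targets(tokens, targets):
    """Return (per-target counts, total number of words in the article)."""
    counts = Counter(token for token in tokens if token in targets)
    return {word: counts[word] for word in targets}, len(tokens)


tokens = extract_tokens(SAMPLE_PAGE, OUTLET_XPATHS)
target_counts, total_words = count_targets(tokens, ["lab", "natural"])
```

In this sketch the "related" block is not matched by either expression, so its text does not contribute to the counts. An expression that is too broad would include such text, which is the kind of minor deviation discussed further below.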

The usage frequency of a target word in an outlet in any given temporal interval (daily, weekly, or monthly) was estimated by dividing the total number of occurrences of the target word in all articles of that interval by the total number of words in all articles of that interval. This method of estimating frequency accounts for variation in the total volume of article output over time.
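As a rough illustration of this estimate, the sketch below pools per-article counts by interval and divides target-word occurrences by total words. The input layout (date, target count, total word count per article), the interval labels, and the function names are assumptions for illustration and may differ from the files in this data set.

```python
from collections import defaultdict
from datetime import date


def interval_key(day, granularity):
    """Map a publication date to a daily, weekly (ISO week), or monthly label."""
    if granularity == "daily":
        return day.isoformat()
    if granularity == "weekly":
        iso = day.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    if granularity == "monthly":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


def relative_frequencies(articles, granularity):
    """articles: iterable of (date, target_word_count, total_word_count)."""
    target_totals = defaultdict(int)
    word_totals = defaultdict(int)
    for day, target_count, total_words in articles:
        key = interval_key(day, granularity)
        target_totals[key] += target_count
        word_totals[key] += total_words
    # Pool counts over the interval first, then divide once per interval.
    return {
        key: target_totals[key] / word_totals[key]
        for key in word_totals
        if word_totals[key] > 0
    }


# Made-up counts for one target word in one outlet.
articles = [
    (date(2021, 5, 3), 2, 500),
    (date(2021, 5, 20), 1, 1500),
    (date(2021, 6, 1), 3, 1000),
]
monthly = relative_frequencies(articles, "monthly")
```

Pooling the counts before dividing means that each interval's value is weighted by article length and is not inflated simply because an outlet published more articles in that interval.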

In a small percentage of articles, outlet-specific XPath expressions might fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in the outlets' online domains. As a result, the total and target word counts for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimates of target word counts matched the automatically derived counts for over 90% of the articles. Most of the incorrect frequency counts are minor deviations from the actual counts, for instance counting a word in an article footnote that encourages readers to find related articles and that the XPath expression mistakenly includes as part of the article's main text.

Some additional outlet-specific inaccuracies that we could identify occurred in the WSJ, where XPath expressions failed to capture the article's main text content in less than 5% of the articles. Article sample sizes for the other outlets might not be comprehensive but, to the best of our knowledge, they are representative and include tens of thousands of articles per outlet/year.

To conclude, in a data analysis of over 1.5 million articles, we cannot manually check the correctness of the frequency counts for every single article, and 100% accuracy at capturing articles' content is elusive due to a small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 of the main manuscript for supporting evidence).

 

 

Files

Files (2.2 GB)

Checksum  Size
md5:a72a134fa1b6e90e13b54073ca3985f1  8.9 kB
md5:47adabae8d8fe8532d55d4020f6d616b  1.5 MB
md5:86817a19522f1f136d93e3b854327a5a  59.2 MB
md5:3dd21a735ec5ab4d6258d2e7035589f7  2.1 GB