Published November 18, 2021 | Version v1
Dataset Open

Data for manuscript "Reciprocal Radicalization: The Rise of Culture War Terminology in British and American News Coverage"

Creators

  • 1. Otago Polytechnic

Description

This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom (The Guardian, The Times, The Independent, The Daily Mirror, BBC, Financial Times, Metro, Telegraph, The and The Daily Mail), plus a few additional American-based outlets used as a reference for comparison. The target words are listed in the associated manuscript; most of them denote some type of prejudice, are social-justice-related terms, or express a counterreaction to such terms. A few additional words are also included because the manuscript uses them for illustration purposes.

The textual content of news and opinion articles from the outlets listed in Figure 3 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived relative frequency counts from these sources. The textual content included in our analysis is limited to article headlines and the main body text of the articles; it does not include other article elements such as figure captions.

Targeted textual content was located in the raw HTML data using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include an outlet's frequency counts for years in which there are at least 1 million words of article content from that outlet.
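The lowercasing and minimum-volume steps described above can be sketched as follows. This is a minimal illustration, not the code in analysisScripts.rar: the tokenizer pattern and the names `tokenize`, `count_article`, and `years_to_keep` are assumptions made for this example, and only the 1 million word threshold comes from the description.

```python
import re
from collections import Counter

# Threshold stated in the description: at least 1 million words per outlet-year.
MIN_WORDS_PER_YEAR = 1_000_000

# Assumed tokenizer: the exact tokenization used for the data set is not
# specified in this description, so this pattern is only illustrative.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text, then split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def count_article(text, target_words):
    """Return (per-target-word counts, total token count) for one article."""
    tokens = tokenize(text)
    counts = Counter(t for t in tokens if t in target_words)
    return counts, len(tokens)


def years_to_keep(total_words_by_year, min_words=MIN_WORDS_PER_YEAR):
    """Return the years in which an outlet has at least min_words words."""
    return {year for year, n in total_words_by_year.items() if n >= min_words}
```

Note that this sketch matches single tokens only; multi-word target terms would need a phrase-matching step that is not shown here.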

The yearly usage frequency of a target word in an outlet was estimated by dividing the total number of occurrences of the target word in all of the outlet's articles from a given year by the total number of words in all of the outlet's articles from that year. This method of estimating frequency accounts for the variable volume of total article output over time.
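As a worked illustration of this ratio, the sketch below aggregates per-article counts for one outlet and one target word into yearly relative frequencies. The input layout, tuples of (year, target word count, total word count), and the function name `yearly_frequencies` are assumptions for the example and may not match the layout of the files in this data set.

```python
def yearly_frequencies(rows):
    """Compute the relative frequency of one target word per year.

    rows: iterable of (year, target_count, total_words) tuples, one per
    article of a single outlet (hypothetical layout for this example).
    Returns {year: occurrences of the target word / all words that year}.
    """
    sums = {}
    for year, target_count, total_words in rows:
        targets, words = sums.get(year, (0, 0))
        sums[year] = (targets + target_count, words + total_words)
    # Years with no words at all are skipped to avoid division by zero.
    return {y: t / w for y, (t, w) in sums.items() if w > 0}


# Three articles: two from 2015 and one from 2016.
example = [(2015, 2, 1000), (2015, 1, 500), (2016, 0, 2000)]
freqs = yearly_frequencies(example)
# 2015: (2 + 1) occurrences / (1000 + 500) words = 0.002
```

Summing counts before dividing, rather than averaging per-article ratios, is what makes the estimate robust to articles of very different lengths.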

This data set contains the following compressed files:

-analysisScripts.rar contains the analysis scripts used in the main manuscript 

-targetWordsInArticlesCounts.rar contains counts of target words in the outlets' articles as well as total counts of words in the articles

-targetWordsInArticlesCountsGuardianExampleWords contains counts of target words in the outlets' articles as well as total counts of words in the articles for the illustrative Figure 1 in the main manuscript

Usage Notes

In a small percentage of articles, outlet-specific XPath expressions can fail to properly capture the content of the article because of the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in the outlets' online domains. As a result, the total and target word counts for a small subset of articles are not precise. In a random sample of articles and outlets, manually estimated target word counts matched the automatically derived counts for over 90% of the articles.

Most of the incorrect frequency counts were minor deviations from the actual counts. One example is counting the word "Facebook" in an article footnote that encourages readers to follow the journalist’s Facebook profile, which the XPath expression mistakenly included as part of the article’s main text. In an analysis of 16 million articles, we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles’ content is elusive because of a small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 of main manuscript for supporting evidence).

Files

Files (625.8 MB)

Checksum Size
md5:c7215494a49fb4838dc566c03fac5f4d 4.5 MB
md5:1c7e8959f1918d1f94bee8eba471e8a1 588.0 MB
md5:553546f5ffced61ab4219d4ba4824a89 33.1 MB
md5:330b893f7e965b0a1812fc141bc0cbbf 176.3 kB
md5:11772f363618ca1f062efb7532c09295 3.9 kB
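The md5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names are chosen for this example, and the file path in the usage comment is a placeholder, since this listing does not pair file names with checksums.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's md5 with a listed value such as 'md5:c721...'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex


# Usage (placeholder path; substitute the file you actually downloaded):
# matches_checksum("downloaded_file.rar", "md5:c7215494a49fb4838dc566c03fac5f4d")
```

Reading in chunks keeps memory use low for the largest file in the listing (588.0 MB).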