NewsWords Data (Word Counts)
Description
NewsWords
Word Counts from the British Library's Digitised Newspaper Collections
The NewsWords dataset contains word count data derived from newspapers published in Britain during the "long" nineteenth century (1780-1920) and digitised as of 2025. These frequencies are computed from the British Library's collection.
The tar file contains 269,179 JSON files. Each file captures the word counts for one month for one newspaper title. The filenames are structured as follows: "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json".
Each file consists of a dictionary mapping words to their frequencies, e.g. {"newspaper": 19, "transmission": 11}.
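The filename pattern and JSON layout described above can be read with the Python standard library alone. The sketch below is an illustration under stated assumptions, not part of the NewsWords code: the helper names `parse_filename` and `iter_counts` are hypothetical, and it assumes that newspaper ids are digit strings and that the JSON files may sit at any depth inside the tar archive.

```python
import json
import re
import tarfile
from collections import Counter
from pathlib import PurePosixPath

# Pattern for "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json".
# Assumption: ids are digit strings, years have four digits, months have two.
FILENAME_RE = re.compile(
    r"^(?P<newspaper_id>\d+)_(?P<year>\d{4})_(?P<month>\d{2})\.json$"
)


def parse_filename(name):
    """Split a NewsWords filename into (newspaper_id, year, month)."""
    match = FILENAME_RE.match(PurePosixPath(name).name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    # Keep the id as a string so that leading zeros survive.
    return match["newspaper_id"], int(match["year"]), int(match["month"])


def iter_counts(tar_path):
    """Yield ((newspaper_id, year, month), Counter) for each JSON member."""
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            # Read the member directly from the archive, without unpacking it.
            with archive.extractfile(member) as handle:
                yield parse_filename(member.name), Counter(json.load(handle))
```

Streaming members in this way avoids unpacking all 269,179 files to disk. Because each file is loaded into a `Counter`, the counts for several months of one title can be combined with `+`.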
Together, the word counts represent a corpus of 120 billion tokens, with a vocabulary of 200k unique words that each appear more than five times. Please follow this link to view a bar chart that breaks down the word counts by decade.
The newspaper_id corresponds to the NLP IDs documented in the British Library newspaper catalogue:
> Ryan, Yann, and Luke McKernan. 2021. “Converting the British Library’s Catalogue of British and Irish Newspapers into a Public Domain Dataset: Processes and Applications”. *Journal of Open Humanities Data* 7 (0): 1. https://doi.org/10.5334/johd.23.
Complete metadata for this newspaper collection is available in another open dataset:
> Westerling, Kalle, Timothy Hobson, Kaspar Beelen, Nilo Pedrazzini, Daniel Wilson, and Katherine McDonough. “Lwmdb Data”. *Zenodo*, December 11, 2024. https://doi.org/10.5281/zenodo.14389180.
Code
The NewsWords Code GitHub repository provides code that converts the "raw" word counts to a more manageable sparse matrix format and contextualises these counts with additional newspaper metadata, e.g. information about price and politics. Further information about how to use the code and query the NewsWords data is available in the GitHub README.
To recreate these sparse matrices, please follow the instructions in "Create_sparse_matrices.ipynb".
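As a rough illustration of what a sparse representation of these counts involves, the sketch below turns a few word-count dictionaries into coordinate-format (COO) triplets. It is not the code from the NewsWords repository or the notebook, whose details are not reproduced here; the function name `to_coo` and the toy counts are hypothetical.

```python
def to_coo(documents):
    """Convert {doc_key: {word: count}} into COO triplets plus row/column labels."""
    doc_keys = sorted(documents)
    vocab = sorted({word for counts in documents.values() for word in counts})
    col_index = {word: j for j, word in enumerate(vocab)}

    rows, cols, data = [], [], []
    for i, key in enumerate(doc_keys):
        for word, count in documents[key].items():
            rows.append(i)                 # row = newspaper-month
            cols.append(col_index[word])   # column = vocabulary entry
            data.append(count)             # value = word frequency
    return rows, cols, data, doc_keys, vocab


# Toy input with made-up counts, keyed by filename stem.
docs = {
    "0003281_1896_07": {"newspaper": 19, "transmission": 11},
    "0003281_1896_08": {"newspaper": 7},
}
rows, cols, data, doc_keys, vocab = to_coo(docs)
# Only the non-zero cells are stored; words absent from a month take no space.
```

If SciPy is installed, triplets of this kind can be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(doc_keys), len(vocab)))` to obtain a matrix object.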
Limitations
These word counts are derived from the digitised press, which contains billions of words and spans multiple decades. However large, these data constitute around ##% of the number of newspaper titles that circulated in Great Britain. In our paper "Whose News? Critical methods for assessing bias in large historical datasets" (under review), we address the issue of representativeness and point out that these data exhibit some partisan bias, in the sense that they overrepresent conservative and liberal newspaper titles, and that this bias varies over the 19th century.
> "Whose News? Critical methods for assessing bias in large historical datasets" (under review)
For more information about the method and data see also:
> Beelen, Kaspar, Jon Lawrence, Daniel C Wilson, and David Beavan, 2023. 'Bias and representativeness in digitized newspaper collections: Introducing the environmental scan.' *Digital Scholarship in the Humanities*, 38(1), pp.1-22.
Files
(76.0 GB)
| Name | Size |
|---|---|
| md5:468eca81cb68d6322598c88f737e9315 | 76.0 GB |
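Given the size of the archive, it may be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the standard library; the function names `md5_of_file` and `verify_download` are hypothetical, and the path of the downloaded file is left to the caller because the archive's filename is not shown here.

```python
import hashlib

# Checksum listed for the archive in the Files table.
EXPECTED_MD5 = "468eca81cb68d6322598c88f737e9315"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

Reading in chunks of 1 MiB keeps memory use flat, which matters for a 76.0 GB file.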
Additional details
Funding
- Arts and Humanities Research Council
- Living with Machines AH/S01179X/1
Dates
- Created: 2019-01-01