NewsWords Data (Word Counts)

Beelen, Kaspar

doi:10.5281/zenodo.14826348

Published March 1, 2025 | Version v1

Dataset Open

Beelen, Kaspar (Researcher)¹,²

1. School of Advanced Study
2. The Alan Turing Institute

Contributors

Researcher:

Wilson, Daniel¹

1. The Alan Turing Institute

NewsWords

Word Counts from the British Library's Digitised Newspaper Collections

Description

The NewsWords dataset contains word count data derived from newspapers published in Britain during the "long" nineteenth century (1780-1920) and digitised as of 2025. The counts are computed from the British Library's digitised newspaper collections.

The tar file contains 269,179 JSON files. Each file captures the word counts for one month for one newspaper title. The filenames are structured as follows: "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json".
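The filename pattern above can be split back into its parts with a short helper. This is a minimal sketch, not code from the NewsWords repository: the names `FileKey` and `parse_filename` are hypothetical, and the regular expression assumes ids are digit strings, years have four digits, and months have two digits.

```python
import re
from typing import NamedTuple

# Pattern for "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json".
# Assumption: ids are digits only, years are four digits, months are two digits.
FILENAME_RE = re.compile(
    r"^(?P<newspaper_id>\d+)_(?P<year>\d{4})_(?P<month>\d{2})\.json$"
)


class FileKey(NamedTuple):
    newspaper_id: str  # kept as a string so leading zeros survive
    year: int
    month: int


def parse_filename(name: str) -> FileKey:
    """Split a NewsWords filename into newspaper id, year and month."""
    # Use only the last path component, in case the archive adds a folder prefix.
    base = name.rsplit("/", 1)[-1]
    match = FILENAME_RE.match(base)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    month = int(match["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in filename: {name!r}")
    return FileKey(match["newspaper_id"], int(match["year"]), month)
```

Keeping `newspaper_id` as a string matters because ids such as "0003281" carry leading zeros that an integer conversion would drop.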

Each file consists of a dictionary mapping words to their frequencies, e.g. {"newspaper": 19, "transmission": 11}.
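Because each file is a plain JSON dictionary, the counts can be read straight from the archive without unpacking all files to disk. The sketch below uses only the Python standard library; the function names `iter_counts` and `total_counts` are hypothetical, and it assumes every `.json` member holds a flat word-to-integer mapping as in the example above.

```python
import json
import tarfile
from collections import Counter
from typing import IO, Iterator, Tuple


def iter_counts(fileobj: IO[bytes]) -> Iterator[Tuple[str, Counter]]:
    """Yield (member name, word counts) for each JSON file in the archive."""
    # Mode "r|*" reads the archive as a forward-only stream and detects the
    # compression automatically, so the tar file is never fully extracted.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Assumption: each file is a flat {"word": count} dictionary.
            yield member.name, Counter(json.load(handle))


def total_counts(fileobj: IO[bytes]) -> Counter:
    """Sum the word counts of every file in the archive."""
    total: Counter = Counter()
    for _, counts in iter_counts(fileobj):
        total.update(counts)
    return total
```

To use it on the downloaded archive, open the file in binary mode and pass the handle, for example `with open("ngrams.tar.gz", "rb") as f: total = total_counts(f)`. Summing the whole corpus this way will take a long time given the archive size, so filtering on member names first is likely to be more practical.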

Together, the word counts represent a corpus of 120 billion tokens, with a vocabulary of 200k unique words that each appear more than five times. A bar chart linked from the dataset record breaks down the word counts by decade.
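Since the monthly files differ greatly in size, raw counts are often easier to compare once they are normalised, and a frequency threshold like the one mentioned above can be applied to any aggregated counts. The following is a minimal sketch under assumptions: `filter_vocabulary` and `per_million` are hypothetical helpers, and the record does not document exactly how the published threshold was applied.

```python
from typing import Dict, Mapping


def filter_vocabulary(counts: Mapping[str, int], threshold: int = 5) -> Dict[str, int]:
    """Keep only words that appear more than `threshold` times."""
    return {word: n for word, n in counts.items() if n > threshold}


def per_million(counts: Mapping[str, int], word: str) -> float:
    """Relative frequency of `word`, expressed per million tokens."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.get(word, 0) * 1_000_000 / total
```

For example, `per_million(counts, "newspaper")` on one monthly file gives a value that can be compared across titles and months regardless of how many tokens each file contains.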

The newspaper_id corresponds to the NLP id of each title; these ids are documented in the British Library newspaper catalogue:

> Ryan, Yann, and Luke McKernan. 2021. “Converting the British Library’s Catalogue of British and Irish Newspapers into a Public Domain Dataset: Processes and Applications”. *Journal of Open Humanities Data* 7 (0): 1. https://doi.org/10.5334/johd.23.

Complete metadata for this newspaper collection is available in another open dataset:

> Westerling, Kalle, Timothy Hobson, Kaspar Beelen, Nilo Pedrazzini, Daniel Wilson, and Katherine McDonough. “Lwmdb Data”. *Zenodo*, December 11, 2024. https://doi.org/10.5281/zenodo.14389180.

Code

The NewsWords Code GitHub repository provides code for converting the "raw" word counts to a more manageable sparse matrix format and for contextualising these counts with additional newspaper metadata, e.g. information about price and politics. Further information about how to use the code and query the NewsWords data is available in the GitHub README.

To recreate these sparse matrices, please follow the instructions in "Create_sparse_matrices.ipynb".
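To illustrate what a sparse matrix representation of these counts looks like, the sketch below builds a compressed sparse row (CSR) layout from a list of word-count dictionaries using only the standard library. It is not the notebook's implementation, which may differ; `build_csr` and `lookup` are hypothetical names, and rows stand for newspaper-month files while columns stand for vocabulary words.

```python
from typing import Dict, List, Mapping, Sequence, Tuple


def build_csr(
    docs: Sequence[Mapping[str, int]],
) -> Tuple[List[int], List[int], List[int], Dict[str, int]]:
    """Return (data, indices, indptr, vocab) for a row-per-file count matrix."""
    vocab: Dict[str, int] = {}  # word -> column index, in order of first appearance
    data: List[int] = []        # non-zero counts, row after row
    indices: List[int] = []     # column index of each entry in `data`
    indptr: List[int] = [0]     # row i occupies data[indptr[i]:indptr[i + 1]]
    for counts in docs:
        # Assign column ids on first sight, then sort the row by column id.
        row = sorted((vocab.setdefault(word, len(vocab)), n) for word, n in counts.items())
        for column, n in row:
            indices.append(column)
            data.append(n)
        indptr.append(len(indices))
    return data, indices, indptr, vocab


def lookup(data: List[int], indices: List[int], indptr: List[int], row: int, column: int) -> int:
    """Return the count stored at (row, column), or 0 if it is not stored."""
    for position in range(indptr[row], indptr[row + 1]):
        if indices[position] == column:
            return data[position]
    return 0
```

Only the non-zero counts are stored, which is what makes the format manageable for a vocabulary of this size. The three arrays follow the same `(data, indices, indptr)` layout that `scipy.sparse.csr_matrix` accepts, so they can be handed to SciPy if it is installed.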

Limitations

These word counts are derived from the digitised press, which contains billions of words and spans multiple decades. However large, these data cover around ##% of the newspaper titles that circulated in Great Britain. In our paper "Whose News? Critical methods for assessing bias in large historical datasets" (under review), we tackle the issue of representativeness and point out that the data exhibit some partisan bias, in the sense that they overrepresent conservative and liberal newspaper titles, and that this bias varies over the 19th century.

> "Whose News? Critical methods for assessing bias in large historical datasets" (under review)

For more information about the method and data see also:

> Beelen, Kaspar, Jon Lawrence, Daniel C Wilson, and David Beavan, 2023. 'Bias and representativeness in digitized newspaper collections: Introducing the environmental scan.' *Digital Scholarship in the Humanities*, 38(1), pp.1-22.

Files

Files (76.0 GB)

Name	Size	Checksum (MD5)
ngrams.tar.gz	76.0 GB	468eca81cb68d6322598c88f737e9315
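Given the size of the archive, it is worth checking the download against the MD5 checksum listed above before unpacking it. The sketch below is one way to do this with the standard library; `md5_of_file` and `verify` are hypothetical helper names, and the local path to the archive is an assumption.

```python
import hashlib

# MD5 checksum published in the file listing for ngrams.tar.gz.
EXPECTED_MD5 = "468eca81cb68d6322598c88f737e9315"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use constant, even for a 76 GB file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify("ngrams.tar.gz")` after the download finishes returns `False` if the file is truncated or corrupted, in which case the download should be repeated.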

Additional details

Funding

Arts and Humanities Research Council
Living with Machines AH/S01179X/1

Dates

Created: 2019-01-01

