Published March 9, 2025 | Version v1
Dataset Open

NewsWords Data (Contextualized Word Counts)

School of Advanced Study

Description

NewsWords

Contextualized Word Counts from the British Library's Digitised Newspaper Collections (Sparse Matrix Format)


The NewsWords dataset contains word count data derived from newspapers and newspaper press directories. The frequencies are computed from the digitised newspapers in the British Library's collection (as digitised by 2025), mainly titles published in Britain during the long nineteenth century (1780-1920).

The counts are "contextualized" by associating newspaper content with rich metadata obtained from Mitchell's Newspaper Press Directories (1846-1920). These reference works provide an almost exhaustive list of the newspapers that circulated in the UK, recording crucial metadata for each newspaper title, such as political leaning, price, and circulation.

More information on the directories and original data can be found on Zenodo or in the British Library research repository.

Together, the word counts represent a corpus of 117,222,226,919 tokens based on a vocabulary of 196,719 unique words. Please follow this link to view a bar chart that breaks down the word counts by decade.

Data Format

This dataset contains a processed version of the original word counts in JSON format (available here). To enhance the exploration of the newspaper content, especially for macro-level analysis, we converted the original frequencies to a sparse matrix format.

The Zenodo record contains two ".zip" files: sparse_matrices.zip and sparse-matrix.zip. Both contain the same information, but the latter concatenates all data into one large matrix. Because loading all the data at once requires more than 128 GB of RAM, we primarily focus on the "distributed" version of the corpus, i.e. the word counts are distributed over different files, one per newspaper title. To work with both versions of the data, please consult the NewsWords code library on GitHub (see also the further information below).
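As a rough sketch of working with the distributed version, one title's matrix and its row-level metadata can be loaded together. The directory name `sparse_matrices` and the helper function below are illustrative assumptions, not part of the dataset's documented API; the per-title file naming follows the `{NLP}_sparse_matrix.npz` / `{NLP}_metadata.csv` convention described below.

```python
import pandas as pd
import scipy.sparse


def load_title(nlp, root="sparse_matrices"):
    """Load the monthly word-count matrix and its row-level metadata
    for one newspaper title, identified by its NLP.

    Assumed layout: {root}/{NLP}_sparse_matrix.npz (CSR matrix, rows = months,
    columns = vocabulary) and {root}/{NLP}_metadata.csv (one row per matrix row).
    """
    counts = scipy.sparse.load_npz(f"{root}/{nlp}_sparse_matrix.npz")
    meta = pd.read_csv(f"{root}/{nlp}_metadata.csv")
    # Each metadata row describes the matching row of the sparse matrix.
    assert counts.shape[0] == len(meta)
    return counts, meta
```

Keeping the counts in CSR form means a single title's decades of monthly frequencies fit comfortably in memory, even though the full corpus does not.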

For each newspaper title (identified by the NLP identifier recorded in the British Library Catalogue) we produced the following files:

{NLP}_sparse_matrix.npz: a Compressed Sparse Row (CSR) matrix of dtype 'float64'. The columns correspond to the vocabulary; the rows capture the monthly word counts.

{NLP}_metadata.csv: each row provides contextual metadata for the corresponding row of the sparse matrix. For example, the first row of 0000031_metadata.csv records context for the first row of 0000031_sparse_matrix.npz.


mapping.json: maps each vocabulary item to its column index, e.g. {"!": 0, "a": 1} indicates that the first column of the sparse matrix counts the number of exclamation marks.

metadata.csv: records contextual information for each newspaper title; it contains specific attributes such as politics, price, and place of publication on a monthly basis. For more information about how the data was created and structured, see also:

The NLPs are identifiers for digitised newspapers and are documented in the British Library newspaper catalogue:

> Ryan, Yann, and Luke McKernan. 2021. “Converting the British Library’s Catalogue of British and Irish Newspapers into a Public Domain Dataset: Processes and Applications”. *Journal of Open Humanities Data* 7 (0): 1. https://doi.org/10.5334/johd.23.

Complete metadata (including NLPs) for this newspaper collection is available in another open dataset:

> Westerling, Kalle, Timothy Hobson, Kaspar Beelen, Nilo Pedrazzini, Daniel Wilson, and Katherine McDonough. “Lwmdb Data”. *Zenodo*, December 11, 2024. https://doi.org/10.5281/zenodo.14389180.
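Putting the files above together, a minimal query for one word's monthly counts might look like the sketch below. The function name, the directory layout, and the example NLP are assumptions for illustration; the mapping.json lookup and the row alignment between matrix and metadata follow the conventions described above.

```python
import json

import pandas as pd
import scipy.sparse


def word_counts_over_time(word, nlp, root="sparse_matrices"):
    """Return the per-title metadata with a `count` column added,
    giving the monthly counts of `word` for one newspaper title.

    Uses mapping.json (vocabulary -> column index) to locate the word's
    column, then extracts that column from the title's CSR matrix.
    """
    with open(f"{root}/mapping.json") as f:
        mapping = json.load(f)
    if word not in mapping:
        raise KeyError(f"{word!r} not in the vocabulary")
    counts = scipy.sparse.load_npz(f"{root}/{nlp}_sparse_matrix.npz")
    meta = pd.read_csv(f"{root}/{nlp}_metadata.csv")
    # Column slice -> dense 1-D array, one value per month (matrix row).
    meta["count"] = counts[:, mapping[word]].toarray().ravel()
    return meta
```

Because each metadata row carries attributes such as politics and price, the resulting frame can be grouped or filtered by those attributes directly.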

Code

The NewsWords GitHub repository provides code for converting "raw" word counts to a more manageable sparse matrix format and for contextualising these counts with additional newspaper metadata, e.g. information about price and politics. Further information about how to use the code and query the NewsWords data is available in the GitHub README.

To recreate these sparse matrices, please follow the instructions in "Create_sparse_matrices.ipynb".

The notebook "Explore_Distributed_Corpus.ipynb" allows you to analyse the distributed corpus. For the "merged" or "unified" matrix, you will find example code in "Explore_Merged_Corpus.ipynb".
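Conceptually, the merged matrix is just the per-title matrices stacked row-wise, since every matrix shares the same vocabulary columns via mapping.json. The sketch below mirrors that idea (it is not the notebook's actual code); the directory name is again an assumption. Note the record's caveat that merging the full corpus needs more than 128 GB of RAM, so this is only practical for subsets.

```python
import glob

import scipy.sparse


def merge_matrices(root="sparse_matrices"):
    """Stack per-title CSR matrices into one matrix, row-wise.

    All matrices share the same columns (the common vocabulary in
    mapping.json), so vertical stacking is well-defined. Intended for
    subsets of titles; the full corpus exceeds 128 GB of RAM.
    """
    paths = sorted(glob.glob(f"{root}/*_sparse_matrix.npz"))
    blocks = [scipy.sparse.load_npz(p) for p in paths]
    return scipy.sparse.vstack(blocks, format="csr")
```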

Limitations

These word counts are derived from the digitised press: billions of words spanning multiple decades. However large, the data covers only around 15% of the newspaper titles that circulated in Great Britain. In our paper "Whose News? Critical methods for assessing bias in large historical datasets" (under review) we tackle the issue of representativeness and point out that the digitised titles exhibit some partisan bias (they overrepresent conservative and liberal newspaper titles), which varies over the nineteenth century.

For more information about the method and data see also:

> Beelen, Kaspar, Jon Lawrence, Daniel C. Wilson, and David Beavan. 2023. "Bias and representativeness in digitized newspaper collections: Introducing the environmental scan." *Digital Scholarship in the Humanities* 38 (1): 1-22.

> "Whose News? Critical methods for assessing bias in large historical datasets" (under review)

Files

sparse_matrices.zip: 17.1 GB (md5:7fa328a0edd29de08c424fd5f9aecf37)
sparse-matrix.zip: 16.1 GB (md5:db70f5a8436428b69951650452b1db83)

Total size: 33.2 GB

Additional details

Funding

Arts and Humanities Research Council
Living with Machines AH/S01179X/1

Software

Programming language
Python
Development Status
Active