Published October 10, 2022 | Version v1
Dataset Open

Diachronic word embeddings from 19th-century newspapers digitised by the British Library (1800-1919)

  • 1. The Alan Turing Institute
  • 2. King's College London

Description

Word vectors accompanying the paper "Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers" by Nilo Pedrazzini and Barbara McGillivray (2022).

The embeddings were trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and the following parameters:

sg = True
min_count = 1
window = 3
vector_size = 200
epochs = 5
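As a rough illustration, the parameters above could be passed to a Word2Vec implementation as follows. This is a minimal sketch, assuming gensim 4.x (whose `Word2Vec` class uses these parameter names) and a hypothetical iterable `sentences` of tokenised sentences for one decade. The `workers` value and the function name `train_decade_model` are illustrative assumptions and are not taken from the record. The linked GitHub repository remains the authoritative reference for how the models were actually trained.

```python
# Parameters as listed in the description; sg=1 selects skip-gram (sg = True).
W2V_PARAMS = {
    "sg": 1,
    "min_count": 1,
    "window": 3,
    "vector_size": 200,
    "epochs": 5,
}


def train_decade_model(sentences, workers=4):
    """Train one Word2Vec model on an iterable of token lists.

    `sentences` and `workers` are assumptions made for this sketch.
    gensim is imported lazily so that the parameter dictionary can be
    inspected even where gensim is not installed.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, workers=workers, **W2V_PARAMS)
```

Calling `train_decade_model(sentences)` once per decade would yield one model per period, each with 200-dimensional vectors.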

The embeddings are split into ten-year periods. The vectors of each decade are aligned to those of the most recent decade (the 1910s) using Orthogonal Procrustes, so that vectors from different decades can be compared in the same space.
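The alignment step can be sketched with NumPy as below. This is an illustrative version of the standard Orthogonal Procrustes solution, not necessarily the exact procedure used for this dataset: it assumes that both matrices contain only the shared vocabulary, with rows in the same word order, and it omits any preprocessing (such as length normalisation or mean centring) that the original pipeline may apply. The function name `procrustes_align` is hypothetical.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` onto `target` with an orthogonal matrix.

    Both arrays have shape (n_shared_words, dim) and row i refers to the
    same word in both. Returns the rotated source and the rotation matrix
    R that minimises the Frobenius norm of (source @ R - target).
    """
    # The SVD of the cross-covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation, rotation


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With `source` holding the vectors of an earlier decade and `target` those of the 1910s, the cosine similarity between a word's aligned earlier vector and its 1910s vector gives one simple indication of how much its usage has shifted.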

See related GitHub repository for the full documentation: https://github.com/Living-with-machines/DiachronicEmb-BigHistData

Project webpage (Living with Machines): https://livingwithmachines.ac.uk/

Files

1800s-vectors.txt

Files (1.5 GB)

MD5 checksum                          Size
md5:1204d4a412a180418c0b410ca4ca4eee  160.5 MB
md5:dd577a14b1117fb23ef9c1dd0c1b112e  155.0 MB
md5:efbfde0120c138826337a2003f80c58a  152.4 MB
md5:61a8b593770ca8034bf1e92bfaf7f6be  111.8 MB
md5:0782d76c820d194f7a85152f800a2d9e  110.3 MB
md5:55140907dae77c956b3f25a1ad9d617f  110.1 MB
md5:49c9dead811376551c9b11c51ea9898e  110.2 MB
md5:828b099ea3701a14cddd92e5227ea038  110.1 MB
md5:029a6fc43ed58ca66cbab25e0ee97023  110.1 MB
md5:39b5d0e4c87b3ba0c1e6a53ae0ff6c8a  110.1 MB
md5:12233553e0832fef20e04fd95b17ccc7  110.3 MB
md5:45267014edb7c22358fe484cedce7bb6  185.3 MB
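A downloaded file can be checked against its MD5 checksum with the Python standard library. The sketch below is a generic helper, not part of the dataset. The function names are hypothetical, and because this listing does not show which file name belongs to which checksum, the expected value has to be copied from the record page for the specific file being verified.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given with or without 'md5:'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Example call (the path and the checksum are placeholders to fill in):
# matches_checksum("1800s-vectors.txt", "md5:<checksum from the record page>")
```

Reading in chunks keeps memory use low, which matters for files of 100 MB and more.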

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.7887305 (DOI)

Funding

Living with Machines AH/S01179X/1
UK Research and Innovation

References

  • Pedrazzini, Nilo and Barbara McGillivray. 2022. Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 85–95, Taipei, Taiwan. Association for Computational Linguistics.
  • Řehůřek, Radim and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
  • Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.