Diachronic word embeddings from 19th-century newspapers digitised by the British Library (1800-1919)
Creators
- 1. The Alan Turing Institute
- 2. King's College London
Description
Word vectors accompanying the paper "Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers" by Nilo Pedrazzini and Barbara McGillivray (2022).
The embeddings were trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and the following parameters:
sg = True
min_count = 1
window = 3
vector_size = 200
epochs = 5
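The parameter names above match keyword arguments of the `Word2Vec` class in gensim 4.x, the library cited in the references. The following is a minimal sketch of how a per-decade model might be trained with these settings, assuming gensim is installed and `sentences` is an iterable of tokenised sentences. `W2V_PARAMS` and `train_decade_model` are hypothetical names used for illustration; the authors' actual pipeline is documented in the linked repository.

```python
# Settings as listed in the dataset description.
W2V_PARAMS = {
    "sg": 1,             # skip-gram (listed as sg = True)
    "min_count": 1,      # keep every token, however rare
    "window": 3,         # context window of 3 tokens either side
    "vector_size": 200,  # dimensionality of the word vectors
    "epochs": 5,         # passes over the corpus
}


def train_decade_model(sentences):
    """Train one Word2Vec model on the sentences of a single decade.

    `sentences` is assumed to be a restartable iterable of token lists,
    e.g. [["the", "engine", "was", "started"], ...].
    """
    # Imported here so the parameter dictionary can be reused without gensim.
    from gensim.models import Word2Vec  # assumes gensim >= 4.0

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

With `min_count = 1` every token in the corpus receives a vector, so rare spellings and OCR errors are kept in the vocabulary.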
The embeddings are divided into periods of ten years each, with the vectors from each decade aligned to the ones from the most recent decade (1910s) using Orthogonal Procrustes.
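Orthogonal Procrustes finds the rotation that best maps one embedding space onto another, so that vectors from different decades become directly comparable. The sketch below illustrates the general technique with NumPy. It assumes each decade is held as a dictionary from word to vector and that the rotation is fitted on the shared vocabulary. The function names are hypothetical, and any preprocessing the authors applied (such as normalisation) is not shown.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Return the orthogonal matrix R minimising ||source @ R - target||_F.

    Both inputs are (n_words, dim) arrays whose rows refer to the same
    words in the same order.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def align_to_reference(source_vecs, reference_vecs):
    """Rotate every vector in `source_vecs` into the reference space.

    Both arguments are dicts mapping a word to a 1-D NumPy array. The
    rotation is estimated on the words the two decades have in common.
    """
    shared = sorted(set(source_vecs) & set(reference_vecs))
    src = np.stack([source_vecs[w] for w in shared])
    ref = np.stack([reference_vecs[w] for w in shared])
    rotation = procrustes_rotation(src, ref)
    return {word: vec @ rotation for word, vec in source_vecs.items()}
```

Because the fitted matrix is orthogonal, distances and cosine similarities within each decade are unchanged by the alignment; only the orientation of the space changes.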
See related GitHub repository for the full documentation: https://github.com/Living-with-machines/DiachronicEmb-BigHistData
Project webpage (Living with Machines): https://livingwithmachines.ac.uk/
Files (1.5 GB)
Previewed file: 1800s-vectors.txt
MD5 checksum | Size |
---|---|
md5:1204d4a412a180418c0b410ca4ca4eee | 160.5 MB |
md5:dd577a14b1117fb23ef9c1dd0c1b112e | 155.0 MB |
md5:efbfde0120c138826337a2003f80c58a | 152.4 MB |
md5:61a8b593770ca8034bf1e92bfaf7f6be | 111.8 MB |
md5:0782d76c820d194f7a85152f800a2d9e | 110.3 MB |
md5:55140907dae77c956b3f25a1ad9d617f | 110.1 MB |
md5:49c9dead811376551c9b11c51ea9898e | 110.2 MB |
md5:828b099ea3701a14cddd92e5227ea038 | 110.1 MB |
md5:029a6fc43ed58ca66cbab25e0ee97023 | 110.1 MB |
md5:39b5d0e4c87b3ba0c1e6a53ae0ff6c8a | 110.1 MB |
md5:12233553e0832fef20e04fd95b17ccc7 | 110.3 MB |
md5:45267014edb7c22358fe484cedce7bb6 | 185.3 MB |
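Each file is listed with an MD5 checksum, which can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the path and checksum passed in are whichever file you downloaded and the value listed for it.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in 1 MiB chunks so that large vector files do not
    have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing(downloaded_path, "md5:<value from the listing>")` returns `True` only when the local file is byte-for-byte identical to the published one.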
Additional details
Related works
- Continues
- Dataset: 10.5281/zenodo.7887305 (DOI)
Funding
- Living with Machines AH/S01179X/1
- UK Research and Innovation
References
- Pedrazzini, Nilo and Barbara McGillivray. 2022. Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 85–95, Taipei, Taiwan. Association for Computational Linguistics.
- Řehůřek, Radim and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.