Published May 2, 2023 | Version v1
Dataset Open

Decade-level Word2Vec models from automatically transcribed 19th-century newspapers digitised by the British Library (1800-1919)

  • 1. The Alan Turing Institute

Description

Word embeddings trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and the following parameters:

sg = True
min_count = 5
window = 5
vector_size = 100
epochs = 5
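
The parameter names above match the keyword arguments of Gensim's `Word2Vec` class (version 4 or later), so a training call would plausibly look like the sketch below. This is an illustration under that assumption, not the project's actual training script; the `train_decade_model` helper is hypothetical, and `sentences` stands for any iterable of tokenised sentences.

```python
# Hyperparameters as listed in the description above.
W2V_PARAMS = {
    "sg": 1,             # skip-gram (the record lists sg = True)
    "min_count": 5,      # ignore words seen fewer than 5 times
    "window": 5,         # context window size
    "vector_size": 100,  # dimensionality of the embeddings
    "epochs": 5,         # passes over the corpus
}


def train_decade_model(sentences, params=W2V_PARAMS):
    """Train one model on an iterable of token lists (hypothetical helper)."""
    # Imported lazily so the parameter table can be inspected without Gensim.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **params)
```

With `min_count = 5`, a corpus needs at least some words occurring five or more times, otherwise the vocabulary is empty and training fails.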

The embeddings are divided into ten-year periods. Unlike those in this repository, these models were not aligned, and OCR errors were not skimmed from the vocabulary.
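
Given the 1800-1919 range in the title, ten-year periods imply twelve models. The sketch below enumerates the decade start years and shows how one model might be loaded with Gensim. The `<decade>s.model` file naming and the `models` directory are assumptions made for illustration, as is the premise that the files are native Gensim saves; check the GitHub documentation for the actual layout.

```python
from pathlib import Path


def decade_starts(first_year=1800, last_year=1919):
    """Return the first year of each ten-year period in the range."""
    return list(range(first_year, last_year + 1, 10))


def load_decade_model(decade, model_dir="models"):
    """Load one decade's model; the '<decade>s.model' naming is hypothetical."""
    from gensim.models import Word2Vec  # lazy import; requires Gensim
    return Word2Vec.load(str(Path(model_dir) / f"{decade}s.model"))


decades = decade_starts()
print(len(decades), decades[0], decades[-1])
```

Vectors from models that have not been aligned live in separate spaces and are not directly comparable across decades. Nearest neighbours can still be inspected within a single decade, for example with `model.wv.most_similar("machine", topn=10)`.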

See related GitHub repository for the full documentation: https://github.com/Living-with-machines/DiachronicEmb-BigHistData

Project website (Living with Machines): https://livingwithmachines.ac.uk/

Files

Files (5.7 GB)

Checksum                                Size
md5:0fc895404e13292bd1dba30d9cd0aaf3    6.4 MB
md5:a7b1a62988155fc5c28ecad27b0683ce    166.9 MB
md5:b2aef907377983b6c409626f97a7474b    166.9 MB
md5:e2e41500173f8607042a1f91b29d3e31    7.2 MB
md5:5bbdf6d5af76b45fb75dfdec03283757    189.0 MB
md5:ac24ab1a34c3a0414b88c1949767bd01    189.0 MB
md5:c1b6b04a57184bef4d98ac85d16da1e7    6.2 MB
md5:deef39fd6bdab1949ed3c23a5a5ecaee    162.4 MB
md5:8365e8b3a755dcba95837f3b21acaf58    162.4 MB
md5:e81376f291644abe2aadbdc05b032df5    4.2 MB
md5:c47395e417191a0a20b31ea1388669f3    110.3 MB
md5:2e71ee0b5d2d8546cc367ea87f3addd3    110.3 MB
md5:aced72109372714b7ed443d3330b797b    9.7 MB
md5:039cddb139b92b954fcfd07f15e3690b    248.7 MB
md5:d01c3d629869169ab75024eb56a5131e    248.7 MB
md5:dd71627cb0ec89a0f35ad4f8e6ca0b5a    12.5 MB
md5:6a8d3ee5c54523559ce7477dc3d92157    320.5 MB
md5:b850161f2d2363be9b6dba5984b6fc47    320.5 MB
md5:fc8a7eea789e89ec54af995dbb36150d    10.6 MB
md5:7f13288e4a63331dc794057f97182bda    272.2 MB
md5:dd91231efbd21554d0bc13dcdeaf837a    272.2 MB
md5:03c6b6b9ae7bdfb09f0f5c677cdb4b0e    11.1 MB
md5:c404cdd7165de3ec307ead1db6875adb    285.1 MB
md5:282cee1c8b7dfe5b8b32773af1f89ff2    285.1 MB
md5:1556e84602c22625a6cdbf640ef0e700    13.4 MB
md5:b303b5c112cd68ee8740d83bc78af3ad    343.1 MB
md5:2dc05d0a307fc8e5a6818e37d7258660    343.1 MB
md5:9cd2bf30af8a670a97bda832f0a577af    12.0 MB
md5:cd3b39bda2f7e0d3fc51a7705e250986    309.1 MB
md5:9820abb128da08a3ffebcae7dd9efb90    309.1 MB
md5:e7c3724b3ed3d1ee249aeadceffc78dd    8.7 MB
md5:dab8117568667aca3935e32efd82dfc6    225.3 MB
md5:eb7f997f875f2eafe27adc36be64f3f8    225.3 MB
md5:75b9f26bbc274b76b88142e76e8bc170    6.5 MB
md5:93b97187c09ab6344772541d5f400671    170.3 MB
md5:96c2eb19165029f1843d7fc334f00a89    170.3 MB
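
Each file above is listed with an MD5 checksum, which can be used to confirm that a download is intact. A minimal sketch using only the Python standard library follows; the helper names are illustrative, and the path passed in would be wherever the file was saved locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' in the listing."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing(local_path, "md5:0fc895404e13292bd1dba30d9cd0aaf3")` returns `True` only if the local file's digest equals the listed value.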

Additional details

Related works

Is continued by
Dataset: 10.5281/zenodo.7181682 (DOI)

Funding

Living with Machines AH/S01179X/1
UK Research and Innovation

References

  • Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.
  • Řehůřek, Radim and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
  • Pedrazzini, Nilo and Barbara McGillivray. 2022. Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 85–95, Taipei, Taiwan. Association for Computational Linguistics.