Published May 2, 2023
| Version v1
Dataset
Open
Decade-level Word2Vec models from automatically transcribed 19th-century newspapers digitised by the British Library (1800-1919)
Description
Word embeddings trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and the following parameters:
sg = True
min_count = 5
window = 5
vector_size = 100
epochs = 5
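As a rough illustration of how these settings fit together, the sketch below passes them to gensim's `Word2Vec` class. It assumes gensim 4.x, whose argument names match the ones listed above. The helper name `train_decade_model` and the shape of the input (an iterable of tokenised sentences) are hypothetical; the actual training pipeline is documented in the related GitHub repository.

```python
# Hyperparameters as listed in the description (gensim >= 4.0 argument names).
W2V_PARAMS = {
    "sg": 1,            # skip-gram; the description lists sg = True
    "min_count": 5,     # ignore words seen fewer than 5 times
    "window": 5,        # context window of 5 tokens on each side
    "vector_size": 100, # dimensionality of the word vectors
    "epochs": 5,        # passes over the corpus
}


def train_decade_model(sentences):
    """Train one model on an iterable of tokenised sentences (lists of str).

    Hypothetical helper: gensim is imported lazily so that the parameter
    dictionary can be inspected without the library installed.
    """
    from gensim.models import Word2Vec  # assumed dependency

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Calling `train_decade_model` once per ten-year slice of the corpus would yield one independent model per decade.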
The embeddings are divided into periods of ten years each. Unlike those in this repository, these embeddings have not been aligned, and OCR errors have not been skimmed from the vocabulary.
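The ten-year slicing can be sketched as a simple mapping from a publication year to the first year of its decade. This is only an illustration: the function name `decade_start` is hypothetical, and the assumption that each slice begins on a year ending in 0 is inferred from the 1800-1919 range in the title, not stated in the record.

```python
FIRST_YEAR, LAST_YEAR = 1800, 1919  # range given in the dataset title

# Assumed slice boundaries: 1800-1809, 1810-1819, ..., 1910-1919.
DECADES = list(range(FIRST_YEAR, LAST_YEAR + 1, 10))


def decade_start(year):
    """Return the first year of the decade slice that contains `year`."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"{year} is outside {FIRST_YEAR}-{LAST_YEAR}")
    return year - year % 10
```

Because the models are not aligned, each decade has its own vector space: nearest-neighbour queries are meaningful within one decade's model, but raw vectors from two different decades should not be compared directly.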
See related GitHub repository for the full documentation: https://github.com/Living-with-machines/DiachronicEmb-BigHistData
Project website (Living with Machines): https://livingwithmachines.ac.uk/
Files
(5.7 GB)
Checksum | Size |
---|---|
md5:0fc895404e13292bd1dba30d9cd0aaf3 | 6.4 MB |
md5:a7b1a62988155fc5c28ecad27b0683ce | 166.9 MB |
md5:b2aef907377983b6c409626f97a7474b | 166.9 MB |
md5:e2e41500173f8607042a1f91b29d3e31 | 7.2 MB |
md5:5bbdf6d5af76b45fb75dfdec03283757 | 189.0 MB |
md5:ac24ab1a34c3a0414b88c1949767bd01 | 189.0 MB |
md5:c1b6b04a57184bef4d98ac85d16da1e7 | 6.2 MB |
md5:deef39fd6bdab1949ed3c23a5a5ecaee | 162.4 MB |
md5:8365e8b3a755dcba95837f3b21acaf58 | 162.4 MB |
md5:e81376f291644abe2aadbdc05b032df5 | 4.2 MB |
md5:c47395e417191a0a20b31ea1388669f3 | 110.3 MB |
md5:2e71ee0b5d2d8546cc367ea87f3addd3 | 110.3 MB |
md5:aced72109372714b7ed443d3330b797b | 9.7 MB |
md5:039cddb139b92b954fcfd07f15e3690b | 248.7 MB |
md5:d01c3d629869169ab75024eb56a5131e | 248.7 MB |
md5:dd71627cb0ec89a0f35ad4f8e6ca0b5a | 12.5 MB |
md5:6a8d3ee5c54523559ce7477dc3d92157 | 320.5 MB |
md5:b850161f2d2363be9b6dba5984b6fc47 | 320.5 MB |
md5:fc8a7eea789e89ec54af995dbb36150d | 10.6 MB |
md5:7f13288e4a63331dc794057f97182bda | 272.2 MB |
md5:dd91231efbd21554d0bc13dcdeaf837a | 272.2 MB |
md5:03c6b6b9ae7bdfb09f0f5c677cdb4b0e | 11.1 MB |
md5:c404cdd7165de3ec307ead1db6875adb | 285.1 MB |
md5:282cee1c8b7dfe5b8b32773af1f89ff2 | 285.1 MB |
md5:1556e84602c22625a6cdbf640ef0e700 | 13.4 MB |
md5:b303b5c112cd68ee8740d83bc78af3ad | 343.1 MB |
md5:2dc05d0a307fc8e5a6818e37d7258660 | 343.1 MB |
md5:9cd2bf30af8a670a97bda832f0a577af | 12.0 MB |
md5:cd3b39bda2f7e0d3fc51a7705e250986 | 309.1 MB |
md5:9820abb128da08a3ffebcae7dd9efb90 | 309.1 MB |
md5:e7c3724b3ed3d1ee249aeadceffc78dd | 8.7 MB |
md5:dab8117568667aca3935e32efd82dfc6 | 225.3 MB |
md5:eb7f997f875f2eafe27adc36be64f3f8 | 225.3 MB |
md5:75b9f26bbc274b76b88142e76e8bc170 | 6.5 MB |
md5:93b97187c09ab6344772541d5f400671 | 170.3 MB |
md5:96c2eb19165029f1843d7fc334f00a89 | 170.3 MB |
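The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only Python's standard library; the function names `md5_of_file` and `matches_checksum` are illustrative and not part of the dataset's tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:0fc8...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for the larger files in the list (up to 343.1 MB each). A call would look like `matches_checksum(local_path, "md5:...")`, with `local_path` pointing at the downloaded file and the second argument copied from the corresponding row.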
Additional details
Related works
- Is continued by
- Dataset: 10.5281/zenodo.7181682 (DOI)
Funding
- Living with Machines AH/S01179X/1
- UK Research and Innovation
References
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.
- Řehůřek, Radim and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
- Pedrazzini, Nilo and Barbara McGillivray. 2022. Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 85–95, Taipei, Taiwan. Association for Computational Linguistics.