Published May 3, 2023 | Version v1
Dataset Open

Diachronic and diatopic word embeddings from newspapers digitised by the British Library (1830-1889): North and South England

  • 1. The Alan Turing Institute/University of Oxford
  • 2. King's College London/The Alan Turing Institute

Description

Decade-level diachronic word embeddings trained with Word2Vec (via Gensim) on geographic subcorpora of the Heritage Made Digital and Living with Machines historical newspaper collections, both digitised by the British Library:

- North England (north.zip)

- South England (south.zip)

At the moment, for each subcorpus, Word2Vec models are available for each decade in the period 1830-1889. More models are on the way for the following:

- each decade in the periods 1780-1829 and 1890-1920 for both North and South England.

- diachronic models for the following regions: Scotland, Wales, and the Midlands.

The models were trained using the following parameters:

sg = True
min_count = 1
window = 5
vector_size = 200
epochs = 5
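As a minimal sketch, the parameters above could be passed to Gensim as follows. The argument names match those of the Gensim 4.x `Word2Vec` class; the dictionary name `W2V_PARAMS`, the helper `train_decade_model`, and the assumption that the input is an iterable of tokenised sentences are illustrative and not taken from the project's code. All other Gensim settings are left at their defaults here, which may differ from the original training setup.

```python
# Hyperparameters as listed above, using Gensim 4.x argument names.
W2V_PARAMS = {
    "sg": 1,             # skip-gram (listed above as sg = True)
    "min_count": 1,      # keep every word, including hapax legomena
    "window": 5,         # context window of 5 tokens on each side
    "vector_size": 200,  # dimensionality of the word vectors
    "epochs": 5,         # passes over the corpus
}


def train_decade_model(sentences):
    """Train one Word2Vec model on an iterable of tokenised sentences.

    `sentences` is assumed to be a list (or restartable iterable) of
    lists of strings, one inner list per sentence. Hypothetical helper.
    """
    # Imported lazily so the parameter dictionary can be inspected
    # without Gensim installed; training requires gensim >= 4.0.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Calling `train_decade_model` once per decade subcorpus would yield one model per decade, whose vectors are then available through the model's `wv` attribute.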

Like the embeddings in this repository, the model for each decade was aligned to the model for the most recent decade using Orthogonal Procrustes.
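The alignment step can be illustrated with a short NumPy sketch of the Orthogonal Procrustes solution. It assumes two embedding matrices whose rows refer to the same words in the same order (that is, already restricted to a shared vocabulary); the function name `procrustes_align` and the toy data are hypothetical, and the project's own alignment code, documented in the GitHub repository, may differ in preprocessing details such as normalisation.

```python
import numpy as np


def procrustes_align(base, other):
    """Rotate `other` into the vector space of `base`.

    Both arrays have shape (n_words, dim) and row i of each must be the
    vector of the same word. Returns the rotated copy of `other` and the
    orthogonal matrix R minimising ||other @ R - base|| (Frobenius norm).
    """
    # Closed-form solution: R = U @ Vt, where U, S, Vt = svd(other.T @ base).
    u, _, vt = np.linalg.svd(other.T @ base)
    rotation = u @ vt
    return other @ rotation, rotation


# Toy check: `other` is `base` under a random rotation/reflection, so the
# alignment should recover `base` up to floating-point error.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))                 # stand-in for the latest decade
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # random orthogonal matrix
other = base @ q.T                              # stand-in for an earlier decade
aligned, rotation = procrustes_align(base, other)
```

Because the transformation is orthogonal, distances and cosine similarities within each decade's model are preserved; only the orientation of the space changes, which is what makes vectors comparable across decades.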

See related GitHub repository for the full documentation: https://github.com/Living-with-machines/DiachronicEmb-BigHistData.

Project website (Living with Machines): https://livingwithmachines.ac.uk/

Data related to: Nilo Pedrazzini & Barbara McGillivray, Diachronic and diatopic word embeddings from British historical newspapers, presented at AIUCD (Convegno dell’Associazione per l’Informatica Umanistica e la Cultura Digitale) in Siena (Italy), June 2023.

Files

north.zip

Files (4.8 GB)

Checksum                              Size
md5:9f101d95740711806e2e76af217c175c  1.6 GB
md5:e6800bfa4303fc505e085607fd183163  3.2 GB
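A downloaded archive can be checked against the MD5 checksums listed above. The sketch below uses only the Python standard library; the helper names `md5_of` and `matches_listed_checksum` are hypothetical. Since the listing does not show which checksum belongs to which file, the check only tests whether a file matches either of the two values.

```python
import hashlib

# MD5 checksums as listed in the file table (file names not shown there).
LISTED_MD5 = {
    "9f101d95740711806e2e76af217c175c",
    "e6800bfa4303fc505e085607fd183163",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 digest equals one of the listed checksums."""
    return md5_of(path) in LISTED_MD5
```

For example, `matches_listed_checksum("north.zip")` would be expected to return `True` for an intact download, assuming the archive was saved under that name in the working directory.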

Additional details

Related works

Is continued by
Dataset: 10.5281/zenodo.7181682 (DOI)

Funding

Living with Machines AH/S01179X/1
UK Research and Innovation

References

  • Pedrazzini, Nilo and Barbara McGillivray. 2022. Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 85–95, Taipei, Taiwan. Association for Computational Linguistics.
  • Řehůřek, Radim and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
  • Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.