Neural Language Models for Nineteenth-Century English (dataset; language model zoo)
- 1. The Alan Turing Institute, London, UK
- 2. Institute for Logic, Language and Computation, University of Amsterdam, NL
Description
This dataset contains four types of neural language models trained on a large historical collection of English-language books published between 1760 and 1900, comprising ~5.1 billion tokens. The architectures include two static models (word2vec and fastText) and two contextualized models (BERT and Flair). For each architecture, we trained one model instance on the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four BERT instances on different time slices.
GitHub repository: https://github.com/Living-with-machines/histLM
Files
(13.2 GB in total)

Name | MD5 checksum | Size
---|---|---
bert.zip | `fea637f1dd685fef5301490ee9cffbb0` | 2.0 GB
(name not preserved) | `f60c2b92ea99e6e2245bbbaca82b427f` | 8.5 GB
(name not preserved) | `0f29ad54b98a841fe57e7e5b003b180c` | 71.0 MB
(name not preserved) | `f074d7a054c8af29393e58f78649904a` | 3.3 kB
(name not preserved) | `47f7ff9d77bf61ff2a20d7c641ca38af` | 2.6 GB
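Downloaded archives can be verified against the MD5 checksums listed above. A minimal sketch in Python using only the standard library (the local file path and the mapping of names to checksums beyond `bert.zip` are assumptions; fill in the remaining entries from the table):

```python
import hashlib

def md5sum(path, chunk_size=8192):
    """Compute the MD5 hex digest of a file, reading in chunks
    so multi-gigabyte archives do not need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected checksums from the file table above (only bert.zip's
# name is known with certainty; add the others as appropriate).
EXPECTED = {
    "bert.zip": "fea637f1dd685fef5301490ee9cffbb0",
}

def verify(path, name):
    """Return True if the file at `path` matches the recorded checksum."""
    ok = md5sum(path) == EXPECTED[name]
    print(f"{name}: {'OK' if ok else 'CHECKSUM MISMATCH'}")
    return ok
```

For example, `verify("downloads/bert.zip", "bert.zip")` prints `bert.zip: OK` when the archive downloaded intact.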
Additional details
Funding
- Living with Machines (AH/S01179X/1), UK Research and Innovation
- The Alan Turing Institute (EP/N510129/1), UK Research and Innovation