Dutch Historical Word2Vec models

doi:10.5281/zenodo.4892800

Published June 4, 2021 | Version 1

Dataset Open

1. Alan Turing Institute
2. Koninklijke Bibliotheek

Introduction

The repository contains Word2Vec models trained on Dutch historical newspaper data covering the period from 1840 to 1890. The models were created as part of a Researcher-in-Residence project at the Dutch National Library (Koninklijke Bibliotheek, KB). During my residency, I trained language models on specific subsections of the newspaper corpus to explore bias over time and by place or political leaning.

For more about this project, see the introductory blog post.

Code

The code used for training the models is available on GitHub. Please see the README for further instructions.

Warning: the raw text data used was provided by Mirjam Cuper of the KB and is available only on request.

Some code for loading and exploring the models is also available on GitHub.

For more information on interactive lexicon creation using these models, go to this README.

For more information on exploring bias in these models, go to this README.

Models

Models are available in zip files, one for each decade. We trained the models using a sliding window of twenty years and a step size of five years. The file names are structured as follows:

{year_start}-{year_end}-{attribute}.w2v.model

For example, 1840-1860-Protestant.w2v.model is trained on all articles published in Protestant newspapers between 1840 and 1860.
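The example above spans 1840 to 1860, which matches the twenty-year window. As a rough sketch of how a twenty-year window with a five-year step divides a period, the snippet below enumerates the windows. It assumes the first window starts in 1840 and the last one ends in 1890; the function name `sliding_windows` is made up for illustration, and the windows actually present in the zip files may differ.

```python
def sliding_windows(first_year, last_year, window=20, step=5):
    """Return (year_start, year_end) pairs for every window that fits in the period."""
    windows = []
    start = first_year
    # Stop as soon as a window would extend past the end of the period.
    while start + window <= last_year:
        windows.append((start, start + window))
        start += step
    return windows


# Assumption: the covered period is 1840-1890, as stated in the introduction.
windows = sliding_windows(1840, 1890)
for year_start, year_end in windows:
    print(f"{year_start}-{year_end}")
```

Under these assumptions, consecutive windows overlap by fifteen years, so neighbouring models share most of their training data.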

The attribute value is chosen from either the Politiek (political leaning) or Provincie (province) column in this metadata file.
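When working with many model files, it can help to split a file name back into its parts. The following is a minimal sketch based only on the naming pattern given above; `ModelName` and `parse_model_name` are hypothetical helpers, not part of the project's code. It assumes four-digit years and allows hyphens inside the attribute, since province names may contain them.

```python
import re
from typing import NamedTuple


class ModelName(NamedTuple):
    year_start: int
    year_end: int
    attribute: str


# Pattern for {year_start}-{year_end}-{attribute}.w2v.model
# Assumption: years are always four digits.
MODEL_NAME_PATTERN = re.compile(
    r"^(?P<year_start>\d{4})-(?P<year_end>\d{4})-(?P<attribute>.+)\.w2v\.model$"
)


def parse_model_name(file_name):
    """Split a model file name into start year, end year and attribute."""
    match = MODEL_NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected model file name: {file_name!r}")
    return ModelName(
        year_start=int(match["year_start"]),
        year_end=int(match["year_end"]),
        attribute=match["attribute"],
    )


parsed = parse_model_name("1840-1860-Protestant.w2v.model")
print(parsed.year_start, parsed.year_end, parsed.attribute)
```

The parsed attribute can then be matched against the Politiek or Provincie values in the metadata file.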

Files (36.7 GB)

Name	MD5	Size
1840.zip	9e89f31134cf59cf7174ddbf04a2281a	5.7 GB
1850.zip	6a51435b685a7cbabb1cb6c890dd06da	6.1 GB
1860.zip	fe7ca03400ce3279764d019504da6815	6.4 GB
1870.zip	21f067db59974ab5d6bc946eb8adc94f	7.2 GB
1880.zip	e11f3a902111d8a81c06697e1b99677f	11.2 GB
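Because the archives are several gigabytes each, it may be worth checking a download against the MD5 checksums listed above. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so the whole archive never has to fit in memory. The helper names `md5_of_file` and `verify_download`, and the local path in the usage comment, are assumptions for illustration.

```python
import hashlib

# MD5 checksums copied from the file listing above.
EXPECTED_MD5 = {
    "1840.zip": "9e89f31134cf59cf7174ddbf04a2281a",
    "1850.zip": "6a51435b685a7cbabb1cb6c890dd06da",
    "1860.zip": "fe7ca03400ce3279764d019504da6815",
    "1870.zip": "21f067db59974ab5d6bc946eb8adc94f",
    "1880.zip": "e11f3a902111d8a81c06697e1b99677f",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, file_name):
    """Return True if the file at `path` matches the listed checksum for `file_name`."""
    return md5_of_file(path) == EXPECTED_MD5[file_name]


# Usage (hypothetical local path):
# verify_download("downloads/1840.zip", "1840.zip")
```

A result of False suggests an incomplete or corrupted download, in which case fetching the archive again is usually the simplest fix.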

