Journal article Open Access

A collection of Swedish diachronic word embedding models trained on historical newspaper data

Hengchen, Simon; Tahmasebi, Nina


NOTE: this README.md is a summary. For all details, see the paper.

NOTE: this data release is available on Zenodo at https://zenodo.org/record/4301658

Description

This is the data release accompanying the Journal of Open Humanities Data paper "A collection of Swedish diachronic word embedding models trained on historical newspaper data." This paper describes the creation of several word embedding models based on a large collection of diachronic Swedish newspaper material available through Språkbanken Text, the Swedish language bank. This data was produced in the context of Språkbanken Text's continued mission to collaborate with humanities and natural language processing researchers and to provide freely available language resources, for the development of state-of-the-art NLP methods and tools.

Bibtex

If you use the models or the code provided in this paper, please cite the following:

@article{hengchen-tahmasebi-2021-collection, 
title = "A collection of {S}wedish diachronic word embedding models trained on historical newspaper data", 
author = "Hengchen, Simon and Tahmasebi, Nina", 
journal = "Journal of Open Humanities Data", 
year = "2021"
} 

Overview

We release diachronic word2vec and fastText models in their skip-gram with negative sampling (SGNS) architecture. The models are trained on 20-year time bins, with two temporal alignment strategies: independently-trained models for post-hoc alignment, and incremental training. The independently-trained models are NOT aligned, leaving the choice of alignment to the end user. We provide code to reproduce our pipeline, and code examples to load and use the models.
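Because the independently trained models are not aligned, vectors from two time bins cannot be compared directly. One common post-hoc option is orthogonal Procrustes alignment over the shared vocabulary. The sketch below illustrates that technique with NumPy only; it is not the method prescribed by this release. The function names and the plain word-to-vector dictionaries used as input are assumptions made for illustration.

```python
import numpy as np


def shared_matrices(vectors_a, vectors_b):
    """Stack the vectors of words present in both bins, in the same row order.

    `vectors_a` and `vectors_b` are hypothetical dicts mapping word -> 1-D array.
    """
    words = sorted(set(vectors_a) & set(vectors_b))
    mat_a = np.vstack([vectors_a[w] for w in words])
    mat_b = np.vstack([vectors_b[w] for w in words])
    return words, mat_a, mat_b


def procrustes_align(source, target):
    """Find the orthogonal matrix R minimising ||source @ R - target||_F.

    Rows of `source` and `target` must correspond to the same words.
    Returns the rotated source matrix and R itself.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation, rotation
```

After alignment, the cosine similarity between a word's rotated vector in one bin and its vector in the other bin can be compared across words. Other alignment strategies exist, and the choice remains with the end user.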

Data

The entirety of the Kungliga bibliotekets historiska tidningar (Kubhist 2) corpus was used. The original material, Swedish newspapers from all parts of Sweden, was scanned and OCRed by the National Library of Sweden. It has since been run through the Sparv annotation pipeline by Martin Hammarstedt at Språkbanken Text.

Preprocessing

The text was retrieved from the original XML. The processing steps prior to training the models are:

  • lowercasing
  • removal of digits
  • removal of all characters not belonging to the Swedish alphabet (a-zäåö)
  • removal of tokens of two characters or fewer
  • merging of all texts pertaining to the same double decade (1740-1759; 1760-1779; ...)
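The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in utils.py: the function names are hypothetical, and it assumes whitespace tokenisation, that disallowed characters are deleted from within tokens, and that the first bin starts in 1740.

```python
import re

# Anything outside the lowercase Swedish alphabet; this also covers digits.
NON_SWEDISH = re.compile(r"[^a-zäåö]")


def clean_tokens(line):
    """Lowercase a line, strip disallowed characters, drop tokens of length <= 2."""
    tokens = []
    for raw in line.lower().split():
        token = NON_SWEDISH.sub("", raw)
        if len(token) > 2:
            tokens.append(token)
    return tokens


def double_decade(year, first_bin=1740):
    """Return the (start, end) years of the 20-year bin containing `year`."""
    start = first_bin + ((year - first_bin) // 20) * 20
    return start, start + 19
```

Texts whose year maps to the same `double_decade` value would then be merged into one training corpus per bin.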

Quality control

All models have been queried for some control analogies by a native speaker of Swedish. A reviewer (not a native speaker of Swedish), whom we thank, also checked the local neighbourhoods of selected terms, performed vector arithmetic, and confirmed that the models behaved as expected.

Structure

ROOT/ 
    README.md 
    code/ 
        *.py files 
        requirements.txt 
    fasttext/ 
        incremental/ 
            *.ft files 
            *.npy files 
        indep/ 
            *.ft files 
            *.npy files 
    word2vec/ 
        incremental/ 
            *.w2v files 
            *.npy files 
        indep/ 
            *.w2v files 
            *.npy files 

Regarding the code:

  • kubhist_XML_to_gensim.py will transform the XML into "LineSentence", "clean" corpora
  • train_w2v-ft.py will train models
  • load_run_models.py will print some examples of what can be done with embeddings
  • utils.py contains the functions called by the scripts above
  • requirements.txt contains the output of pip freeze > requirements.txt, i.e. the Python libraries needed to run the scripts above
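As a rough sketch of how the directory layout above might be used, the snippet below lists the model files for one family and alignment strategy and loads one of them. It assumes the models were saved in gensim's native format (suggested by the accompanying .npy files, but not confirmed here) and that the gensim version pinned in requirements.txt is installed. The helper names are hypothetical; load_run_models.py remains the reference for actual usage.

```python
from pathlib import Path

# File extension used by each model family in the release layout.
EXTENSIONS = {"word2vec": ".w2v", "fasttext": ".ft"}


def list_models(root, family="word2vec", strategy="indep"):
    """Return the sorted model files under ROOT/<family>/<strategy>/."""
    folder = Path(root) / family / strategy
    return sorted(folder.glob("*" + EXTENSIONS[family]))


def load_model(path):
    """Load one model, assuming it was saved with gensim's own save()."""
    # Imported here so that listing files works without gensim installed.
    from gensim.models import FastText, Word2Vec

    path = Path(path)
    model_class = FastText if path.suffix == ".ft" else Word2Vec
    return model_class.load(str(path))
```

A loaded model could then be queried with, for example, `model.wv.most_similar("kvinna", topn=10)`, where the query word is only an illustration.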

Funding

This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection, supported by the Swedish Research Council (2019-2022; dnr 2018-01184), and by Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018-2024; dnr 2017-00626) and its 10 partner institutions, to NT.

Files (16.2 GB)
Name: HENGCHEN-TAHMASEBI_-_2020_-_Kubhist2_diachronic_embeddings.zip
Size: 16.2 GB
md5: 26c174380d0c9c4bdf552bbe0e5ef325