Dataset Open Access

# Models for "A data-driven approach to studying changing vocabularies in historical newspaper collections"

Hengchen, Simon; Ros, Ruben; Marjanen, Jani; Tolonen, Mikko

NOTE: This is a badly rendered version of the README within the archive.

A data-driven approach to studying changing vocabularies in historical newspaper collections

Simon Hengchen,* Ruben Ros,** Jani Marjanen,* Mikko Tolonen*

*COMHIS, University of Helsinki: firstname.lastname@helsinki.fi; **Utrecht University: firstname@firstnamelastname.nl

These are the supplementary materials for the DH2019 paper A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950, as well as for an upcoming publication. If you use this resource in whole or in part, please use the following citation(s):

• Hengchen, S., Ros, R., and Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands

and/or:

• Hengchen, S., Ros, R., Marjanen, J., and Tolonen M. (To appear). A data-driven approach to studying changing vocabularies in historical newspaper collections. Digital Scholarship in the Humanities.

or alternatively use one of the following bibs:

@inproceedings{hengchen2019nation,
title="A data-driven approach to the changing vocabulary of the 'nation' in {E}nglish, {D}utch, {S}wedish and {F}innish newspapers, 1750-1950.",
author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani},
year={2019},
booktitle={Proceedings of the Digital Humanities (DH) conference 2019}
}
@article{hengchen2020vocab,
title="A data-driven approach to studying changing vocabularies in historical newspaper collections",
author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani and Tolonen, Mikko},
journal={Digital Scholarship in the Humanities},
year={to appear},
publisher={Oxford University Press}
}

A preprint of the article is available on request; do email Simon.

Files

This archive contains two folders -- one per diachronic representation method -- as well as this README. Each folder contains four subfolders, one per language, holding the models for that language. As the small file sizes suggest, most of the earlier models are not reliable and should not be used, but they are made available nonetheless. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Source material

Finnish:

The models were created with data from the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (National Library of Finland, 2011). We used everything in the corpus.

Filesizes:

[simon@taito-login3 SGNS]$ du -h fi*
12M     fi_1820_SGNS_corpus_file.gensim
89M     fi_1840_SGNS_corpus_file.gensim
797M    fi_1860_SGNS_corpus_file.gensim
7.0G    fi_1880_SGNS_corpus_file.gensim
22G     fi_1900_SGNS_corpus_file.gensim

Swedish:

The models were created with data from the Kubhist 2 corpus (Språkbanken) -- more precisely, the data dumps available at https://spraakbanken.gu.se. A manual evaluation of Swedish embeddings trained without pre-processing showed that they were of low quality, so we retrained the models, keeping only sentences that were at least 10 tokens long and consisted of at least 50% lemmas as per the KORP processing pipeline (Borin et al., 2012).

Filesizes:

[simon@taito-login3 SGNS]$ du -h sv*
1.6M    sv_1740_SGNS_corpus_file.gensim
44M sv_1760_SGNS_corpus_file.gensim
124M    sv_1780_SGNS_corpus_file.gensim
228M    sv_1800_SGNS_corpus_file.gensim
678M    sv_1820_SGNS_corpus_file.gensim
1.6G    sv_1840_SGNS_corpus_file.gensim
4.5G    sv_1860_SGNS_corpus_file.gensim
6.5G    sv_1880_SGNS_corpus_file.gensim
113M    sv_1900_SGNS_corpus_file.gensim

Dutch:

The models were created with data from the Delpher newspaper archive (Royal Dutch Library, 2017): through data dumps for newspapers up to and including 1876, and through API queries for articles from 1877 to 1899 (inclusive).

• For anything pre-1877, we discarded full texts whose metadata contained any language tag other than exclusively nl or NL.
• For the full texts between 1877 and 1899: we queried the API for all items in the “artikel” category that contained the determiner de.

Our assumption was that most articles should contain de at least once, and that those that did not were too short to be of interest. A subsequent study showed this was not entirely the case, but we were reassured that the left-out articles were probably "shipping or financial reports" (thanks go to Melvin Wevers). We also excluded the colonial newspapers from our embeddings, motivated by our research questions. A list of removed newspapers is available on request.
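The two selection rules above can be sketched as plain predicates. This is a minimal illustration, not code from our pipeline: the function names, and the use of a naive whitespace tokenisation for spotting de, are our assumptions.

```python
def keep_pre1877(language_tags):
    ## Pre-1877 dumps: keep a full text only if its metadata language
    ## tags are exclusively 'nl' / 'NL' (and at least one tag is present).
    return bool(language_tags) and all(t in ("nl", "NL") for t in language_tags)

def keep_post1877(article_text):
    ## 1877-1899 API items of the "artikel" category: keep only those
    ## containing the determiner 'de' as a standalone token.
    return "de" in article_text.lower().split()
```
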

Filesizes:

[simon@taito-login3 SGNS]$ du -h nl*
6.8M    nl_1620_SGNS_corpus_file.gensim
7.9M    nl_1640_SGNS_corpus_file.gensim
43M     nl_1660_SGNS_corpus_file.gensim
78M     nl_1680_SGNS_corpus_file.gensim
138M    nl_1700_SGNS_corpus_file.gensim
243M    nl_1720_SGNS_corpus_file.gensim
287M    nl_1740_SGNS_corpus_file.gensim
431M    nl_1760_SGNS_corpus_file.gensim
825M    nl_1780_SGNS_corpus_file.gensim
1.2G    nl_1800_SGNS_corpus_file.gensim
1.8G    nl_1820_SGNS_corpus_file.gensim
3.1G    nl_1840_SGNS_corpus_file.gensim
5.2G    nl_1860_SGNS_corpus_file.gensim
13G     nl_1880_SGNS_corpus_file.gensim

English:

The models were created with data from the British Library Newspapers collection (link), the Nichols collection (link), and the Burney collection (link). We used everything in the corpora. For English, only SGNS_ALIGN models are available. We thank Gale Cengage for their help with this project.

Filesizes:

[simon@taito-login3 SGNS]$ du -h en*
4.3M    en_1620_SGNS_corpus_file.gensim
11M en_1640_SGNS_corpus_file.gensim
11M en_1660_SGNS_corpus_file.gensim
106M    en_1680_SGNS_corpus_file.gensim
409M    en_1700_SGNS_corpus_file.gensim
1.7G    en_1720_SGNS_corpus_file.gensim
834M    en_1740_SGNS_corpus_file.gensim
2.4G    en_1760_SGNS_corpus_file.gensim
5.3G    en_1780_SGNS_corpus_file.gensim
5.5G    en_1800_SGNS_corpus_file.gensim
15G en_1820_SGNS_corpus_file.gensim
42G en_1840_SGNS_corpus_file.gensim
65G en_1860_SGNS_corpus_file.gensim
88G en_1880_SGNS_corpus_file.gensim
26G en_1900_SGNS_corpus_file.gensim
21G en_1920_SGNS_corpus_file.gensim
6.3G    en_1940_SGNS_corpus_file.gensim

Word embeddings

For every language, we train diachronic embeddings as follows. We divide the data into 20-year time bins and train SGNS_UPDATE and SGNS_ALIGN models. Current research on German (Schlechtweg et al., 2019) and English (Shoemark et al., 2019) indicates that the SGNS_ALIGN models should be preferred. For EN, FI and NL, no tokens (including punctuation) were removed or altered, aside from lowercasing; for SV, see above. Parameters are as follows: SGNS architecture (Mikolov et al., 2013), window size of 5, frequency threshold of 100, 5 epochs, 300 dimensions (or 100 for EN).
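The 20-year binning can be sketched as follows; bin edges fall on even double decades (1620, 1640, ...), matching the file names above. The helper name is ours, for illustration only.

```python
def time_bin(year, width=20):
    ## Map a publication year to the start year of its double-decade
    ## bin, e.g. 1873 -> 1860 (matching fi_1860_SGNS_corpus_file.gensim).
    return (year // width) * width
```
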

• For SGNS_UPDATE: We first train a model for the first time bin t. To train the model for t+1, we use the t model to initialise the vectors for t+1, set the learning rate to correspond to the end learning rate of t, and continue training. This approach, closely following Kim et al (2014), has the advantage of avoiding the need for post-training vector space alignment.

The Python snippet below, which makes use of gensim (Rehurek and Sojka, 2010), illustrates the approach. Special thanks go to Sara Budts.

```python
import os
import gensim

## dict_files is a dictionary with double decades as keys and, as values, paths to the
## corresponding LineSentence-format corpus files:
## https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence

count = 0
for key in sorted(dict_files.keys()):
    if count == 0:  ## First time bin: train a model from scratch.
        model = gensim.models.Word2Vec(corpus_file=dict_files[key], min_count=100,
                                       sg=1, size=300, workers=64, seed=1830, iter=5)
    else:  ## Subsequent time bins: initialise from the previous model and keep training.
        print("model for double decade starting in", str(key))
        model.build_vocab(corpus_file=dict_files[key], update=True)
        model.train(corpus_file=dict_files[key],
                    total_examples=model.corpus_count,
                    total_words=model.corpus_total_words,
                    start_alpha=model.min_alpha,  ## i.e. the end learning rate of the previous bin
                    end_alpha=model.min_alpha,
                    epochs=model.epochs)
    model.save(os.path.join(data_path_final, "KIM", lang + "_" + str(key) + ".w2v"))
    print("Model saved, on to the next\n")
    count += 1
```

• For SGNS_ALIGN: We independently train models for all time bins. The models in this repository are NOT aligned, leaving you the choice of how to align them -- Ryan Heuser, for example, has published code (link) to do just that. Models were trained with the count == 0 scenario in the snippet above.
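One common choice is orthogonal Procrustes alignment, in the spirit of Hamilton et al.'s HistWords. A minimal NumPy sketch follows; the function name is ours, and it assumes you have already restricted both matrices to a shared vocabulary with rows in the same word order (which the models in this archive do not guarantee out of the box).

```python
import numpy as np

def procrustes_align(base, other):
    ## Hypothetical helper, not part of the archive. `base` and `other` are
    ## (n_words, dim) embedding matrices over the SAME, identically ordered
    ## shared vocabulary. Returns `other` rotated onto `base`'s space.
    m = other.T @ base               ## dim x dim cross-covariance matrix
    u, _, vt = np.linalg.svd(m)      ## orthogonal Procrustes solution u @ vt
    return other @ (u @ vt)
```

After alignment, cosine similarities between a word's vectors in consecutive time bins become directly comparable.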

Acknowledgments

This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 NewsEye. Special thanks go to the data providers and collection-holding institutions: the Finnish Language Bank, the Swedish Language Bank, the Royal Dutch Library, and Gale Cengage.

The authors would like to thank the following persons and group, listed alphabetically: Antoine Doucet, Antti Kanner, Axel-Jean Caurant, Dominik Schlechtweg, Eetu Mäkelä, Elaine Zosa, Estelle Bunout, Haim Dubossarsky, Joris van Eijnatten, Krister Lindén, Lars Borin, Lidia Pivovarova, Melvin Wevers, Nina Tahmasebi, Sara Budts, Senka Drobac, Tanja Säily, the COMHIS group, and Steven Claeyssens. Computational resources were provided by CSC – IT Center for Science Ltd.

References

Borin, L., Forsberg, M. and Roxendal, J. (2012). Korp -- the corpus infrastructure of Språkbanken. In: Proceedings of LREC 2012, pp. 474–478.

Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S. (2014). Temporal Analysis of Language through Neural Language Models. ACL 2014, p.61.

Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

National Library of Finland (2011). The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2016050302.

Rehurek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

Royal Dutch Library (2017). Delpher open krantenarchief (1.0). Den Haag, 2017.

Schlechtweg D., Hätty A, del Tredici M., and Schulte im Walde S. (2019). A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy. ACL.

Shoemark, P., Liza, F.F., Nguyen, D., Hale, S. and McGillivray, B. (2019). Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 66-76), Hong Kong.

Språkbanken. The Kubhist Corpus. Department of Swedish, University of Gothenburg. https://spraakbanken.gu.se/korp/?mode=kubhist.

Files (17.2 GB)

Name: diachronic_embeddings_hengchen-ros-marjanen-tolonen-2019-12-19.tar
Size: 17.2 GB
md5: 1fb81b6fdab2617b385429cc7923ec27