Models for "A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950"
Creators
- University of Helsinki
- Utrecht University
Description
A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950
Simon Hengchen*, Ruben Ros**, Jani Marjanen*
*University of Helsinki: firstname.lastname@helsinki.fi; ** Utrecht University: r.s.lastname@students.uu.nl
These are the supplementary materials for the DH2019 paper A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950. If you use this resource, in whole or in part, please use the following citation:
- Hengchen, S., Ros, R. and Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands,
or alternatively use the following BibTeX entry:
@inproceedings{hengchen2019nation,
  title={A data-driven approach to the changing vocabulary of the 'nation' in {E}nglish, {D}utch, {S}wedish and {F}innish newspapers, 1750-1950},
  author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani},
  year={2019},
  booktitle={Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands}
}
Files
This archive contains three folders – one per language – as well as this README. The folders contain the models for their respective languages. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
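To use a model, load it with gensim and query the vector space. A minimal sketch, assuming the Finnish folder is named fi and that the files follow the model_<lang>_<start year>.w2v naming used in the training snippet further below; kansakunta (‘nation’) is just an example query term:

import gensim

## Load the Finnish model for the double decade starting in 1820.
model = gensim.models.Word2Vec.load("fi/model_fi_1820.w2v")

## Ten nearest neighbours of an example query term in that time bin.
print(model.wv.most_similar("kansakunta", topn=10))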
Source material
Finnish:
The models were created with data from the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (National Library of Finland, 2011). We used everything in the corpus.
Swedish:
The models were created with data from the Kubhist corpus (Språkbanken) – more precisely, the data dumps available at https://spraakbanken.gu.se/swe/resurser/. We used everything in the corpus.
Dutch:
The models were created with data from the Delpher newspaper archive (Royal Dutch Library, 2017), through data dumps for newspapers up to and including 1876, and through API hits for articles from 1877 to 1899 (inclusive).
- For anything pre-1877, we discarded full texts that had any language tag in the metadata other than exclusively “nl” or “NL”.
- For the full texts between 1877 and 1899, we queried the API for all items in the “artikel” category that contained the determiner “de”.
Our assumption was that most articles should contain “de” at least once, and that those that did not were too short to be deemed interesting. A subsequent study showed that this was not exactly the case, but we were reassured by the fact that the left-out articles were probably “shipping or financial reports” (thanks go to Melvin Wevers). A sketch of both filters is given below.
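A minimal sketch of the two filters, assuming a hypothetical record structure in which the metadata language tags, the item category and the full text are available as Python fields (the field names are ours, not Delpher's):

def keep_pre_1877(record):
    ## Keep only items whose metadata lists "nl"/"NL" and nothing else as language tags.
    tags = record["language_tags"]  ## hypothetical field name
    return len(tags) > 0 and all(tag in ("nl", "NL") for tag in tags)

def keep_1877_1899(record):
    ## Keep only "artikel" items whose full text contains the determiner "de" at least once.
    return record["category"] == "artikel" and "de" in record["text"].lower().split()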
We also did not include the colonial newspapers in our embeddings. This decision is motivated by two reasons that are important in the context of our research question: first, only the Dutch dataset has extensive coverage of colonial newspapers, and including them would have weakened our comparisons with the other countries in our study. Second, Dutch colonial newspapers “showed a great uniformity” because “their news supply was unique and controlled by the official news agency, ANETA” (our translation and paraphrasing of Witte 1998:18). A list of removed newspapers is available on request.
English:
The embeddings are not yet available. Do get in touch with Simon if you want to receive an email when they become available.
Word embeddings
For every language, we train diachronic embeddings as follows. We divide the data into 20-year time bins. We first train a model for the first time bin t. To train the model for t+1, we use the t model to initialise the vectors for t+1, set the learning rate to correspond to the end learning rate of t, and continue training. This approach, closely following Kim et al. (2014), has the advantage of avoiding the need for post-training vector space alignment.
Parameters are as follows: CBOW architecture (Mikolov et al. 2013), a window size of 5, a frequency threshold of 100, and 5 epochs. No tokens (including punctuation) were removed or altered, aside from lowercasing.
The Python snippet below, which makes use of gensim (Rehurek and Sojka, 2010), illustrates the approach. Special thanks go to Sara Budts.
lang = "fi" ## change this
class MySentences(object): ## This class has been directly copied from gensim.
def __init__(self, liste):
self.liste = liste
def __iter__(self):
for file in self.liste:
with open(file, "r") as fname: ## Our files are line-separated, with tab-separated, lowercased tokens
for line in fname:
yield line.split("\t")
count = 0
for key in sorted(list(dict_files.keys())): # dict_files is a dict with time bins as keys, and lists of filepaths as values
number_of_files = len(dict_files[key])
print("Number of files for double decade starting in",str(key),"is",str(number_of_files))
if os.path.exists(dir_out+"/model_"+lang+"_"+str(key)+".w2v") == False:
print("model_"+lang+"_"+str(key)+" does not exist, running")
if number_of_files > 0:
if count == 0: ## This is the first model.
count += 1
sentences = MySentences(dict_files[key])
model = gensim.models.Word2Vec(sentences, min_count=100, workers=14, seed=1830, epochs=5)
model.save(dir_out+"/model_"+lang+"_"+str(key)+".w2v")
print("Model saved, on to the next\n")
if count > 0: ## this is for the subsequent models.
print("model for double decade starting in",str(key))
model = gensim.models.Word2Vec.load(dir_out+"/model_"+lang+"_"+str(key-20)+".w2v") ## If the script crashes, we make sure to have the latest model.
sentences = MySentences(dict_files[key])
model.build_vocab(sentences, update=True)
model.train(sentences, total_examples = model.corpus_count, start_alpha = model.alpha, end_alpha = model.min_alpha, epochs = model.iter)
model.save(dir_out+"/model_"+lang+"_"+str(key)+".w2v")
print("Model saved, on to the next\n")
Acknowledgments
Part of this work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 770299 (NewsEye).
The authors would like to thank the following persons and institutions, listed in random order:
Steven Claeyssens, Mikko Tolonen, Nina Tahmasebi, Melvin Wevers, Eetu Mäkelä, the COMHIS group, Lidia Pivovarova, Antti Kanner, Senka Drobac, Krister Lindén, Lars Borin, Dominik Schlechtweg, Sara Budts, Haim Dubossarsky, Estelle Bunout, Axel-Jean Caurant, Tanja Säily, Elaine Zosa, and Joris van Eijnatten.
References
Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S. (2014). Temporal Analysis of Language through Neural Language Models. ACL 2014, p. 61.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
National Library of Finland (2011). The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2016050302.
Rehurek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
Royal Dutch Library (2017). Delpher open krantenarchief [Delpher open newspaper archive] (1.0). The Hague.
Språkbanken. The Kubhist Corpus. Department of Swedish, University of Gothenburg. https://spraakbanken.gu.se/korp/?mode=kubhist.
Witte, R. (1998). De Indische radio-omroep: overheidsbeleid en ontwikkeling, 1923-1942 [Radio broadcasting in the Dutch East Indies: government policy and development, 1923-1942]. Uitgeverij Verloren.
Files
The full archive is 3.2 GB (MD5 checksum: ec70a9adfe84f3909b6b1d3e8a46b3a6).
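To verify the download, the MD5 checksum can be recomputed, for example with Python's hashlib, reading the archive in chunks since it is 3.2 GB (the file name below is a placeholder for the downloaded archive):

import hashlib

md5 = hashlib.md5()
with open("models.zip", "rb") as f: ## placeholder file name
    for chunk in iter(lambda: f.read(2**20), b""):
        md5.update(chunk)
print(md5.hexdigest() == "ec70a9adfe84f3909b6b1d3e8a46b3a6")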