There is a newer version of this record available.

Dataset Open Access

Models for "A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950"

Hengchen, Simon; Ros, Ruben; Marjanen, Jani

Citation Style Language JSON Export

  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3270648", 
  "language": "eng", 
  "title": "Models for \"A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950\"", 
  "issued": {
    "date-parts": [
  "abstract": "<p>NOTE: This is a badly rendered version of the README within the archive.</p>\n\n<p><strong>A data-driven approach to the changing vocabulary of the &lsquo;nation&rsquo; in English, Dutch, Swedish and Finnish newspapers, 1750-1950</strong></p>\n\n<p>Simon Hengchen*, Ruben Ros**, Jani Marjanen*</p>\n\n<p>*University of Helsinki:&nbsp;<a href=\"\"></a>; ** Utrecht University:&nbsp;<a href=\"\"></a></p>\n\n<p>These are the supplementary materials for the DH2019 paper&nbsp;<em>A data-driven approach to the changing vocabulary of the &lsquo;nation&rsquo; in English, Dutch, Swedish and Finnish newspapers, 1750-1950</em>. If you end up using whole or parts of this resource, please use the following citation:</p>\n\n<ul>\n\t<li>Hengchen, S., Ros, R., and Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the &lsquo;nation&rsquo; in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In&nbsp;<em>Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands</em>,<br>\n\tor alternatively use the following&nbsp;<code>bib</code>:</li>\n</ul>\n\n<pre><code>@inproceedings{hengchen2019nation,\n title=\"A data-driven approach to the changing vocabulary of the 'nation' in {E}nglish, {D}utch, {S}wedish and {F}innish newspapers, 1750-1950.\",\n author={Hengchen, Simon, and Ros, Ruben, and Marjanen, Jani},\n year={2019},\n booktitle={Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands}\n }\n</code></pre>\n\n<p><strong>Files</strong></p>\n\n<p>This archive contains three folders &ndash; one per language &ndash; as well as this README. The folders contain the models for their respective languages. 
This work is licensed under a&nbsp;<a href=\"\">Creative Commons Attribution-ShareAlike 4.0 International License</a>.</p>\n\n<p><strong>Source material</strong></p>\n\n<p><strong>Finnish:</strong></p>\n\n<p>The models were created with data from the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (National Library of Finland, 2011). We used everything in the corpus.</p>\n\n<p><strong>Swedish:</strong></p>\n\n<p>The models were created with data from the Kubhist corpus (Spr&aring;kbanken) &ndash; more precisely, the data dumps available at&nbsp;<a href=\"\"></a>. We used everything in the corpus.</p>\n\n<p><strong>Dutch:</strong></p>\n\n<p>The models were created with data from the Delpher newspaper archive (Royal Dutch Library, 2017), through data dumps for newspapers up to and including 1876, and through API hits for articles from 1877 to 1899 (inclusive).</p>\n\n<ul>\n\t<li>For anything pre-1877, we discarded full texts whose metadata language tag was anything other than exactly&nbsp;<code>nl</code>&nbsp;or&nbsp;<code>NL</code>.</li>\n\t<li>For the full texts between 1877 and 1899, we queried the API for all items in the &ldquo;artikel&rdquo; category that contained the determiner&nbsp;<code>de</code>.</li>\n</ul>\n\n<p>Our assumption was that most articles should contain&nbsp;<code>de</code>&nbsp;at least once, and that those that didn&rsquo;t were too short to be deemed interesting. A subsequent study showed that this was not exactly the case, but we were reassured by the fact that the left-out articles were probably &ldquo;shipping or financial reports&rdquo; (thanks go to Melvin Wevers).<br>\nWe also excluded the colonial newspapers from our embeddings. 
This is motivated by two reasons, important in the context of our research question: first, only the Dutch dataset has an extensive coverage of colonial newspapers &ndash; including them would have weakened our comparisons with the other countries in our study. Second, Dutch colonial newspapers &ldquo;showed a great uniformity&rdquo; because &ldquo;their news supply was unique and controlled by the official news agency, ANETA&rdquo; (our translation and paraphrase of Witte 1998:18). A list of removed newspapers is available on request.</p>\n\n<p><strong>English:</strong></p>\n\n<p>The embeddings are not available at this moment. Do get in touch with Simon if you want to receive an email when they become available.</p>\n\n<p><strong>Word embeddings</strong></p>\n\n<p>For every language, we train diachronic embeddings as follows. We divide the data into 20-year time bins. We first train a model for the first time bin&nbsp;<code>t</code>. To train the model for&nbsp;<code>t+1</code>, we use the&nbsp;<code>t</code>&nbsp;model to initialise the vectors for&nbsp;<code>t+1</code>, set the learning rate to correspond to the end learning rate of&nbsp;<code>t</code>, and continue training. This approach, closely following Kim et al. (2014), has the advantage of avoiding the need for post-training vector space alignment.<br>\nParameters are as follows: CBOW architecture (Mikolov et al., 2013), window size of 5, frequency threshold of 100, 5 epochs.&nbsp;<strong>No tokens (including punctuation) were removed or altered, aside from lowercasing</strong>.</p>\n\n<p>The Python snippet below, which makes use of gensim (Rehurek and Sojka, 2010), illustrates the approach. 
Special thanks go to Sara Budts.</p>\n\n<pre><code>import os\nimport gensim\n\nlang = \"fi\" ## change this\nclass MySentences(object): ## This class has been directly copied from gensim.\n    def __init__(self, liste):\n        self.liste = liste\n    def __iter__(self):\n        for file in self.liste:\n            with open(file, \"r\") as fname: ## Our files are line-separated, with tab-separated, lowercased tokens\n                for line in fname:\n                    yield line.split(\"\\t\")\n\ncount = 0\nfor key in sorted(list(dict_files.keys())): # dict_files is a dict with time bins as keys, and lists of filepaths as values\n    number_of_files = len(dict_files[key])\n    print(\"Number of files for double decade starting in\",str(key),\"is\",str(number_of_files))\n    if not os.path.exists(dir_out+\"/model_\"+lang+\"_\"+str(key)+\".w2v\"): # dir_out is the output directory for the models\n        print(\"model_\"+lang+\"_\"+str(key)+\" does not exist, running\")\n        if number_of_files &gt; 0:\n            if count == 0: ## This is the first model.\n                count += 1\n                sentences = MySentences(dict_files[key])\n                model = gensim.models.Word2Vec(sentences, min_count=100, workers=14, seed=1830, epochs=5) ## the epochs keyword requires gensim 4.0 or later (iter in older versions)\n                model.save(dir_out+\"/model_\"+lang+\"_\"+str(key)+\".w2v\")\n                print(\"Model saved, on to the next\\n\")\n            elif count &gt; 0: ## this is for the subsequent models.\n                print(\"model for double decade starting in\",str(key))\n                model = gensim.models.Word2Vec.load(dir_out+\"/model_\"+lang+\"_\"+str(key-20)+\".w2v\") ## If the script crashes, we make sure to have the latest model.\n                sentences = MySentences(dict_files[key])\n                model.build_vocab(sentences, update=True)\n                model.train(sentences, total_examples = model.corpus_count, start_alpha = model.alpha, end_alpha = model.min_alpha, epochs = model.epochs)\n                model.save(dir_out+\"/model_\"+lang+\"_\"+str(key)+\".w2v\")\n                
print(\"Model saved, on to the next\\n\")\n\n</code></pre>\n\n<p><strong>Acknowledgments</strong></p>\n\n<p>Part of this work has been supported by the European Union&rsquo;s Horizon 2020 research and innovation programme under grant 770299&nbsp;<a href=\"\">NewsEye</a>.<br>\nThe authors would like to thank the following persons and institutions, listed in random order:<br>\nSteven Claeyssens, Mikko Tolonen, Nina Tahmasebi, Melvin Wevers, Eetu M&auml;kel&auml;, the COMHIS group, Lidia Pivovarova, Antti Kanner, Senka Drobac, Krister Lind&eacute;n, Lars Borin, Dominik Schlechtweg, Sara Budts, Haim Dubossarsky, Estelle Bunout, Axel-Jean Caurant, Tanja S&auml;ily, Elaine Zosa, and Joris van Eijnatten.</p>\n\n<p><strong>References</strong></p>\n\n<p>Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S. (2014). Temporal Analysis of Language through Neural Language Models.&nbsp;<em>ACL 2014</em>, p.61.<br>\nMikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space.&nbsp;<em>arXiv preprint arXiv:1301.3781</em>.<br>\nNational Library of Finland (2011).&nbsp;<em>The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version</em>&nbsp;[text corpus]. Kielipankki. Retrieved from&nbsp;<a href=\"\"></a>.<br>\nRehurek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In&nbsp;<em>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</em>.<br>\nRoyal Dutch Library (2017).&nbsp;<em>Delpher open krantenarchief (1.0)</em>. Den Haag, 2017.<br>\nSpr&aring;kbanken.&nbsp;<em>The Kubhist Corpus</em>. Department of Swedish, University of Gothenburg.&nbsp;<a href=\"\"></a>.<br>\nWitte, R. (1998).&nbsp;<em>De Indische radio-omroep: overheidsbeleid en ontwikkeling, 1923-1942</em>. Uitgeverij Verloren.</p>", 
  "author": [
      "family": "Hengchen, Simon"
      "family": "Ros, Ruben"
      "family": "Marjanen, Jani"
  "version": "1.0.0", 
  "type": "dataset", 
  "id": "3270648"
Statistics (all versions / this version)
Views: 2,064 / 679
Downloads: 579 / 46
Data volume: 9.3 TB / 146.7 GB
Unique views: 1,740 / 624
Unique downloads: 533 / 32
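
The training snippet in the README above relies on a `dict_files` mapping (time bins as keys, lists of file paths as values) that is not shown. The following is a minimal sketch of how such a mapping could be built with 20-year bins. It is not the authors' code: the helper names `bin_start` and `build_dict_files`, the starting year of 1750, and the assumption that each file name contains its four-digit publication year are all hypothetical.

```python
import os
import re
from collections import defaultdict


def bin_start(year, first_year=1750, width=20):
    """Return the first year of the `width`-year bin that contains `year`.

    `first_year` (assumed here to be 1750) anchors the bin boundaries.
    """
    return first_year + ((year - first_year) // width) * width


def build_dict_files(paths, first_year=1750, width=20):
    """Group file paths by time bin, keyed by the bin's starting year.

    Assumption (hypothetical): the first standalone four-digit number in
    each file name is the publication year. Files without one are skipped.
    """
    bins = defaultdict(list)
    for path in paths:
        match = re.search(r"(?<!\d)(\d{4})(?!\d)", os.path.basename(path))
        if match is None:
            continue  # no recognisable year in the file name
        year = int(match.group(1))
        bins[bin_start(year, first_year, width)].append(path)
    return dict(bins)


if __name__ == "__main__":
    example = ["corpus/fi_1823_01.txt", "corpus/fi_1829.txt", "corpus/fi_1851.txt"]
    for key, files in sorted(build_dict_files(example).items()):
        print(key, files)
```

Because the keys are integers spaced 20 years apart, they are compatible with the `str(key-20)` lookup that the training loop uses to load the model of the previous bin.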


Cite as