Dataset Open Access

Models for "A data-driven approach to studying changing vocabularies in historical newspaper collections"

Hengchen, Simon; Ros, Ruben; Marjanen, Jani; Tolonen, Mikko


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>NOTE: This is a badly rendered version of the README within the archive.</p>\n\n<p><strong>A data-driven approach to studying changing vocabularies in historical newspaper collections</strong></p>\n\n<p>Simon Hengchen,* Ruben Ros,** Jani Marjanen,* Mikko Tolonen*</p>\n\n<p>*<a href=\"https://www.helsinki.fi/en/researchgroups/computational-history\">COMHIS</a>, University of Helsinki:&nbsp;<a href=\"mailto:firstname.lastname@helsinki.fi\">firstname.lastname@helsinki.fi</a>; **Utrecht University:&nbsp;<a href=\"mailto:firstname@firstnamelastname.nl\">firstname@firstnamelastname.nl</a></p>\n\n<p>These are the supplementary materials for the DH2019 paper&nbsp;<em>A data-driven approach to the changing vocabulary of the &lsquo;nation&rsquo; in English, Dutch, Swedish and Finnish newspapers, 1750-1950</em>, as well as an upcoming publication. If you end up using whole or parts of this resource, please use the following citation(s):</p>\n\n<ul>\n\t<li>Hengchen, S., Ros, R., and Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the &#39;nation&#39; in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In&nbsp;<em>Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands</em></li>\n</ul>\n\n<p>and/or:</p>\n\n<ul>\n\t<li>Hengchen, S., Ros, R., Marjanen, J., and Tolonen M. (To appear). 
A data-driven approach to studying changing vocabularies in historical newspaper collections.&nbsp;<em>Digital Scholarship in the Humanities</em>.</li>\n</ul>\n\n<p>or alternatively use one of the following&nbsp;<code>bib</code>s:</p>\n\n<pre><code>@inproceedings{hengchen2019nation,\n title=\"A data-driven approach to the changing vocabulary of the 'nation' in {E}nglish, {D}utch, {S}wedish and {F}innish newspapers, 1750-1950.\",\n author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani},\n year={2019},\n address = \"Utrecht, The Netherlands\",\n booktitle={Proceedings of the Digital Humanities (DH) conference 2019}\n }</code></pre>\n\n<pre><code>@article{hengchen2020vocab,\n title=\"A data-driven approach to studying changing vocabularies in historical newspaper collections\",\n author={Hengchen, Simon and Ros, Ruben and Marjanen, Jani and Tolonen, Mikko},\n journal={Digital Scholarship in the Humanities},\n year={to appear},\n publisher={Oxford University Press}\n }</code></pre>\n\n<p>A preprint of the article is available on request; please email Simon.</p>\n\n<p><strong>Files</strong></p>\n\n<p>This archive contains two folders -- one per diachronic representation method -- as well as this README. Each folder contains four folders, which contain the models for their respective languages. As can be inferred from the small data sizes, most of the earlier models are not reliable and should not be used, but they are still made available. This work is licensed under a&nbsp;<a href=\"http://creativecommons.org/licenses/by-sa/4.0/\">Creative Commons Attribution-ShareAlike 4.0 International License</a>.</p>\n\n<p><strong>Source material</strong></p>\n\n<p>Finnish:</p>\n\n<p>The models were created with data from the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (National Library of Finland, 2011). 
We used everything in the corpus.</p>\n\n<p>Filesizes:</p>\n\n<pre><code>[simon@taito-login3 SGNS]$ du -h fi*\n12M fi_1820_SGNS_corpus_file.gensim\n89M fi_1840_SGNS_corpus_file.gensim\n797M    fi_1860_SGNS_corpus_file.gensim\n7.0G    fi_1880_SGNS_corpus_file.gensim\n22G fi_1900_SGNS_corpus_file.gensim</code></pre>\n\n<p>Swedish:</p>\n\n<p>The models were created with data from the Kubhist 2 corpus (Spr&aring;kbanken) -- more precisely, the data dumps available at&nbsp;<a href=\"https://spraakbanken.gu.se/lb/resurser/meningsmangder/\">https://spraakbanken.gu.se</a>. After a manual evaluation suggested that Swedish embeddings trained without pre-processing were of low quality, we retrained the models, keeping only sentences that were at least 10 tokens long and consisted of at least 50% lemmas, as per the KORP processing pipeline (Borin et al., 2012).</p>\n\n<p>Filesizes:</p>\n\n<pre><code>[simon@taito-login3 SGNS]$ du -h sv*\n1.6M    sv_1740_SGNS_corpus_file.gensim\n44M sv_1760_SGNS_corpus_file.gensim\n124M    sv_1780_SGNS_corpus_file.gensim\n228M    sv_1800_SGNS_corpus_file.gensim\n678M    sv_1820_SGNS_corpus_file.gensim\n1.6G    sv_1840_SGNS_corpus_file.gensim\n4.5G    sv_1860_SGNS_corpus_file.gensim\n6.5G    sv_1880_SGNS_corpus_file.gensim\n113M    sv_1900_SGNS_corpus_file.gensim</code></pre>\n\n<p>Dutch:</p>\n\n<p>The models were created with data from the Delpher newspaper archive (Royal Dutch Library, 2017), through data dumps for newspapers up to and including 1876, and through API calls for articles from 1877 to 1899 (inclusive).</p>\n\n<ul>\n\t<li>For anything pre-1877, we discarded full texts that had, in the metadata, anything other than exclusively&nbsp;<code>nl</code>&nbsp;or&nbsp;<code>NL</code>&nbsp;as a language tag.</li>\n\t<li>For the full texts between 1877 and 1899, we queried the API for all items in the &ldquo;artikel&rdquo; category that contained the determiner&nbsp;<code>de</code>.</li>\n</ul>\n\n<p>Our assumption was 
that most articles should contain&nbsp;<code>de</code>&nbsp;at least once, and that those that didn&#39;t were too short to be deemed interesting. A subsequent study showed that this was not exactly the case, but we were reassured by the fact that the left-out articles were probably &quot;shipping or financial reports&quot; (thanks go to Melvin Wevers). We also excluded the colonial newspapers from our embeddings, a choice motivated by our research questions. A list of removed newspapers is available on request.</p>\n\n<p>Filesizes:</p>\n\n<pre><code>[simon@taito-login3 SGNS]$ du -h nl*\n6.8M    nl_1620_SGNS_corpus_file.gensim\n7.9M    nl_1640_SGNS_corpus_file.gensim\n43M nl_1660_SGNS_corpus_file.gensim\n78M nl_1680_SGNS_corpus_file.gensim\n138M    nl_1700_SGNS_corpus_file.gensim\n243M    nl_1720_SGNS_corpus_file.gensim\n287M    nl_1740_SGNS_corpus_file.gensim\n431M    nl_1760_SGNS_corpus_file.gensim\n825M    nl_1780_SGNS_corpus_file.gensim\n1.2G    nl_1800_SGNS_corpus_file.gensim\n1.8G    nl_1820_SGNS_corpus_file.gensim\n3.1G    nl_1840_SGNS_corpus_file.gensim\n5.2G    nl_1860_SGNS_corpus_file.gensim\n13G nl_1880_SGNS_corpus_file.gensim</code></pre>\n\n<p>English:</p>\n\n<p>The models were created with data from the British Library Newspapers collection (<a href=\"https://www.gale.com/intl/primary-sources/british-library-newspapers\">link</a>), the Nichols collection (<a href=\"https://www.gale.com/intl/c/17th-and-18th-century-nichols-newspapers-collection\">link</a>), and the Burney collection (<a href=\"https://www.gale.com/intl/c/17th-and-18th-century-burney-newspapers-collection\">link</a>). We used everything in the corpora. For English, only SGNS_ALIGN models are available. 
We thank Gale Cengage for their help with this project.</p>\n\n<p>Filesizes:</p>\n\n<pre><code>[simon@taito-login3 SGNS]$ du -h en*\n4.3M    en_1620_SGNS_corpus_file.gensim\n11M en_1640_SGNS_corpus_file.gensim\n11M en_1660_SGNS_corpus_file.gensim\n106M    en_1680_SGNS_corpus_file.gensim\n409M    en_1700_SGNS_corpus_file.gensim\n1.7G    en_1720_SGNS_corpus_file.gensim\n834M    en_1740_SGNS_corpus_file.gensim\n2.4G    en_1760_SGNS_corpus_file.gensim\n5.3G    en_1780_SGNS_corpus_file.gensim\n5.5G    en_1800_SGNS_corpus_file.gensim\n15G en_1820_SGNS_corpus_file.gensim\n42G en_1840_SGNS_corpus_file.gensim\n65G en_1860_SGNS_corpus_file.gensim\n88G en_1880_SGNS_corpus_file.gensim\n26G en_1900_SGNS_corpus_file.gensim\n21G en_1920_SGNS_corpus_file.gensim\n6.3G    en_1940_SGNS_corpus_file.gensim</code></pre>\n\n<p><strong>Word embeddings</strong></p>\n\n<p>For every language, we train diachronic embeddings as follows. We divide the data into 20-year time bins. We train SGNS_UPDATE and SGNS_ALIGN models. Current research on German (Schlechtweg et al., 2019) and English (Shoemark et al., 2019) indicates that you should use the SGNS_ALIGN models.&nbsp;<strong>For EN, FI, NL, no tokens (including punctuation) were removed or altered, aside from lowercasing</strong>. For SV, see above. Parameters are as follows: SGNS architecture (Mikolov et al., 2013), window size of 5, frequency threshold of 100, 5 epochs, 300 dimensions (or 100 for EN).</p>\n\n<ul>\n\t<li>For SGNS_UPDATE: We first train a model for the first time bin&nbsp;<code>t</code>. To train the model for&nbsp;<code>t+1</code>, we use the&nbsp;<code>t</code>&nbsp;model to initialise the vectors for&nbsp;<code>t+1</code>, set the learning rate to correspond to the end learning rate of&nbsp;<code>t</code>, and continue training. 
This approach, closely following Kim et al. (2014), has the advantage of avoiding the need for post-training vector space alignment.</li>\n</ul>\n\n<p>The Python snippet below, which makes use of gensim (Rehurek and Sojka, 2010), illustrates the approach. Special thanks go to Sara Budts.</p>\n\n<pre><code>import os\n\nimport gensim\n\n## dict_files is a dictionary with double decades as keys and, as values, the path to the\n## corresponding corpus file (one sentence per line, the LineSentence format):\n## https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence\n\ncount = 0\nfor key in sorted(list(dict_files.keys())):\n\n    if count == 0: ## This is the first model: train it from scratch.\n\n        model = gensim.models.Word2Vec(corpus_file=dict_files[key], min_count=100, sg=1, size=300, workers=64, seed=1830, iter=5)\n        model.save(os.path.join(data_path_final, \"KIM\", lang + \"_\" + str(key) + \".w2v\"))\n        print(\"Model saved, on to the next\\n\")\n\n    else: ## Subsequent models: initialise from the previous time bin and continue training.\n        print(\"model for double decade starting in\", str(key))\n        model = gensim.models.Word2Vec.load(os.path.join(data_path_final, \"KIM\", lang + \"_\" + str(key - 20) + \".w2v\"))\n        print(\"previous model loaded\")\n\n        model.build_vocab(corpus_file=dict_files[key], update=True)\n        model.train(corpus_file=dict_files[key], total_words=model.corpus_total_words, total_examples=model.corpus_count, start_alpha=model.alpha, end_alpha=model.min_alpha, epochs=model.epochs)\n        model.save(os.path.join(data_path_final, \"KIM\", lang + \"_\" + str(key) + \".w2v\"))\n\n    count += 1\n</code></pre>\n\n<ul>\n\t<li>For SGNS_ALIGN: We independently train models for all time bins. The models in this repository are&nbsp;<em>NOT</em>&nbsp;aligned, leaving you the choice of how to align them. For example,&nbsp;<a href=\"https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf\">here</a>&nbsp;is a link to code by Ryan Heuser to do just that. 
Models were trained with the&nbsp;<code>count == 0</code>&nbsp;scenario in the snippet above.</li>\n</ul>\n\n<p><strong>Acknowledgments</strong></p>\n\n<p>This work has been supported by the European Union&#39;s Horizon 2020 research and innovation programme under grant 770299&nbsp;<a href=\"https://www.newseye.eu/\">NewsEye</a>. Special thanks go to the data providers/collection-holding institutions: the Finnish Language Bank, the Swedish Language Bank, the Royal Dutch Library, and Gale Cengage.</p>\n\n<p>The authors would like to thank the following persons and group, listed alphabetically: Antoine Doucet, Antti Kanner, Axel-Jean Caurant, Dominik Schlechtweg, Eetu M&auml;kel&auml;, Elaine Zosa, Estelle Bunout, Haim Dubossarsky, Joris van Eijnatten, Krister Lind&eacute;n, Lars Borin, Lidia Pivovarova, Melvin Wevers, Nina Tahmasebi, Sara Budts, Senka Drobac, Tanja S&auml;ily, the COMHIS group, and Steven Claeyssens. Computational resources were provided by CSC &ndash; IT Center for Science Ltd.</p>\n\n<p><strong>References</strong></p>\n\n<p>Borin, L., Forsberg, M. and Roxendal, J. (2012). Korp &ndash; the corpus infrastructure of Spr&aring;kbanken. In&nbsp;<em>LREC</em>, pp. 474&ndash;478.</p>\n\n<p>Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S. (2014). Temporal Analysis of Language through Neural Language Models.&nbsp;<em>ACL 2014</em>, p. 61.</p>\n\n<p>Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space.&nbsp;<em>arXiv preprint arXiv:1301.3781</em>.</p>\n\n<p>National Library of Finland (2011).&nbsp;<em>The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version</em>&nbsp;[text corpus]. Kielipankki. Retrieved from&nbsp;<a href=\"http://urn.fi/urn:nbn:fi:lb-2016050302\">http://urn.fi/urn:nbn:fi:lb-2016050302</a>.</p>\n\n<p>Rehurek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. 
In&nbsp;<em>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</em>.</p>\n\n<p>Royal Dutch Library (2017).&nbsp;<em>Delpher open krantenarchief (1.0)</em>. Den Haag, 2017.</p>\n\n<p>Schlechtweg, D., H&auml;tty, A., Del Tredici, M. and Schulte im Walde, S. (2019). A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In&nbsp;<em>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, Florence, Italy. ACL.</p>\n\n<p>Shoemark, P., Liza, F.F., Nguyen, D., Hale, S. and McGillivray, B. (2019). Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings. In&nbsp;<em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em>, pp. 66&ndash;76, Hong Kong.</p>\n\n<p>Spr&aring;kbanken.&nbsp;<em>The Kubhist Corpus</em>. Department of Swedish, University of Gothenburg.&nbsp;<a href=\"https://spraakbanken.gu.se/korp/?mode=kubhist\">https://spraakbanken.gu.se/korp/?mode=kubhist</a>.</p>", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "affiliation": "University of Helsinki", 
      "@id": "https://orcid.org/0000-0002-8453-7221", 
      "@type": "Person", 
      "name": "Hengchen, Simon"
    }, 
    {
      "affiliation": "Utrecht University", 
      "@id": "https://orcid.org/0000-0002-5303-2861", 
      "@type": "Person", 
      "name": "Ros, Ruben"
    }, 
    {
      "affiliation": "University of Helsinki", 
      "@id": "https://orcid.org/0000-0002-3085-4862", 
      "@type": "Person", 
      "name": "Marjanen, Jani"
    }, 
    {
      "affiliation": "University of Helsinki", 
      "@id": "https://orcid.org/0000-0003-2892-8911", 
      "@type": "Person", 
      "name": "Tolonen, Mikko"
    }
  ], 
  "url": "https://zenodo.org/record/3585027", 
  "datePublished": "2019-12-19", 
  "version": "1.0.0", 
  "keywords": [
    "word embeddings", 
    "newspapers"
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/7682a4bf-2e26-4fa1-a78b-cafb6ea0f4dc/diachronic_embeddings_hengchen-ros-marjanen-tolonen-2019-12-19.tar", 
      "encodingFormat": "tar", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.3585027", 
  "@id": "https://doi.org/10.5281/zenodo.3585027", 
  "@type": "Dataset", 
  "name": "Models for \"A data-driven approach to studying changing vocabularies in historical newspaper collections\""
}
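
The README above notes that the SGNS_ALIGN models are released unaligned and points to alignment code by Ryan Heuser. The usual alignment step is an orthogonal Procrustes rotation over a shared vocabulary; below is a minimal, self-contained sketch of just that step, not the authors' or Heuser's code. The `base` and `other` matrices here are synthetic stand-ins for the row-aligned vectors of two time-bin models; only numpy is assumed.

```python
import numpy as np

def procrustes_align(base, other):
    """Rotate `other` onto `base` with the orthogonal matrix R that
    minimises the Frobenius norm ||other @ R - base||.
    Both matrices must cover the same vocabulary, row for row."""
    # SVD of the cross-covariance matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(other.T @ base)
    return other @ (u @ vt)

# Toy check: a rotated copy of a random "embedding matrix" aligns back.
rng = np.random.default_rng(1830)
base = rng.normal(size=(1000, 100))        # 1000 "words", 100 dimensions
q, _ = np.linalg.qr(rng.normal(size=(100, 100)))  # random orthogonal matrix
other = base @ q                           # a rotated copy of base
aligned = procrustes_align(base, other)
print(np.allclose(aligned, base))          # → True
```

In practice, one would first intersect the vocabularies of the two loaded models and stack the corresponding `model.wv` vectors in the same row order before calling the function; cosine distances between a word's vector in `base` and in the aligned `other` can then be compared directly.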