{
  "DOI": "10.5281/zenodo.3270648",
  "abstract": "NOTE: This is a badly rendered version of the README within the archive.\n\n\nA data-driven approach to the changing vocabulary of the \u2018nation\u2019 in English, Dutch, Swedish and Finnish newspapers, 1750-1950\n\n\nSimon Hengchen*, Ruben Ros**, Jani Marjanen*\n\n\n*University of Helsinki:\u00a0firstname.lastname@helsinki.fi; ** Utrecht University:\u00a0r.s.lastname@students.uu.nl\n\n\nThese are the supplementary materials for the DH2019 paper\u00a0A data-driven approach to the changing vocabulary of the \u2018nation\u2019 in English, Dutch, Swedish and Finnish newspapers, 1750-1950. If you end up using whole or parts of this resource, please use the following citation:\n\n\n\n\t\nHengchen, S., Ros, R., and Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the \u2018nation\u2019 in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In\u00a0Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands,\n\tor alternatively use the following\u00a0bib:\n\n\n\n@inproceedings{hengchen2019nation,\n title=\"A data-driven approach to the changing vocabulary of the 'nation' in {E}nglish, {D}utch, {S}wedish and {F}innish newspapers, 1750-1950.\",\n author={Hengchen, Simon, and Ros, Ruben, and Marjanen, Jani},\n year={2019},\n booktitle={Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands}\n }\n\n\n\nFiles\n\n\nThis archive contains three folders \u2013 one per language \u2013 as well as this README. The folders contain the models for their respective languages. This work is licensed under a\u00a0Creative Commons Attribution-ShareAlike 4.0 International License.\n\n\nSource material\n\n\nFinnish:\n\n\nThe models were created with data from the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (National Library of Finland, 2011). 
We used everything in the corpus.\n\n\nSwedish:\n\n\nThe models were created with data from the Kubhist corpus (Spr\u00e5kbanken) \u2013 more precisely, the data dumps available at\u00a0https://spraakbanken.gu.se/swe/resurser/. We used everything in the corpus.\n\n\nDutch:\n\n\nThe models were created with data from the Delpher newspaper archive (Royal Dutch Library, 2017), through data dumps for newspapers until and including 1876, and through API hits for articles from 1877 to 1899 (included).\n\n\n\n\t\nFor anything pre-1877 we discarded full texts that had, in the metadata, anything else than exclusively\u00a0nl\u00a0or\u00a0NL\u00a0as a language tag.\n\t\nFor the full texts between 1877 and 1899: we queried the API for all items in the \u201cartikel\u201d category that contained the determiner\u00a0de.\n\n\n\nOur assumption was that most articles should contain\u00a0de\u00a0at least once, and those that didn\u2019t were too short to be deemed interesting. A subsequent study showed that was not exactly the case, but we were reassured by the fact that left-out articles were probably \u201cshipping or financial reports\u201d (thanks go to Melvin Wevers).\nWe also did not include the colonial newspapers for our embeddings. This is motivated by two reasons, important in the context of our research question: first, only the Dutch dataset has an extensive coverage of colonial newspapers \u2013 including them would have weakened our comparisons with the other countries in our studies. Second, Dutch colonial newspapers \u201cshowed a great uniformity\u201d because \u201ctheir news supply was unique and controlled by the official news agency, ANETA\u201d. (Our translation and paraphrasing of Witte 1998:18). A list of removed newspapers is available on request.\n\n\nEnglish:\n\n\nThe embeddings are not available at this moment. 
Do get in touch with Simon if you want to receive an email when they become available.\n\n\nWord embeddings\n\n\nFor every language, we train diachronic embeddings as follows. We divide the data into 20-year time bins. We first train a model for the first time bin\u00a0t. To train the model for\u00a0t+1, we use the\u00a0t\u00a0model to initialise the vectors for\u00a0t+1, set the learning rate to correspond to the end learning rate of\u00a0t, and continue training. This approach, closely following Kim et al. (2014), has the advantage of avoiding the need for post-training vector space alignment.\nParameters are as follows: CBOW architecture (Mikolov et al., 2013), window size of 5, frequency threshold of 100, 5 epochs.\u00a0No tokens (including punctuation) were removed or altered, aside from lowercasing.\n\n\nThe Python snippet below, which makes use of gensim (Rehurek and Sojka, 2010), illustrates the approach. Special thanks go to Sara Budts.\n\n\nimport os\nimport gensim\n\nlang = \"fi\" ## change this\n## dir_out (the output directory) and dict_files must be defined before running the loop below\nclass MySentences(object): ## This class has been directly copied from gensim.\n    def __init__(self, liste):\n        self.liste = liste\n    def __iter__(self):\n        for file in self.liste:\n            with open(file, \"r\") as fname: ## Our files are line-separated, with tab-separated, lowercased tokens\n                for line in fname:\n                    yield line.split(\"\\t\")\n        \ncount = 0                    \nfor key in sorted(list(dict_files.keys())): # dict_files is a dict with time bins as keys, and lists of filepaths as values\n    number_of_files = len(dict_files[key])\n    print(\"Number of files for double decade starting in\",str(key),\"is\",str(number_of_files))\n    if os.path.exists(dir_out+\"/model_\"+lang+\"_\"+str(key)+\".w2v\") == False:\n        print(\"model_\"+lang+\"_\"+str(key)+\" does not exist, running\")\n        if number_of_files > 0:\n            if count == 0: ## This is the first model.\n                count += 1\n                
sentences = MySentences(dict_files[key])\n                model = gensim.models.Word2Vec(sentences, min_count=100, workers=14, seed=1830, epochs=5)\n                model.save(dir_out+\"/model_\"+lang+\"_\"+str(key)+\".w2v\")\n                print(\"Model saved, on to the next\\n\")\n            elif count > 0: ## this is for the subsequent models.\n                print(\"model for double decade starting in\",str(key))\n                model = gensim.models.Word2Vec.load(dir_out+\"/model_\"+lang+\"_\"+str(key-20)+\".w2v\") ## If the script crashes, we make sure to have the latest model.\n                sentences = MySentences(dict_files[key])\n                model.build_vocab(sentences, update=True)\n                model.train(sentences, total_examples = model.corpus_count, start_alpha = model.alpha, end_alpha = model.min_alpha, epochs = model.epochs)\n                model.save(dir_out+\"/model_\"+lang+\"_\"+str(key)+\".w2v\")\n                print(\"Model saved, on to the next\\n\")\n\n\n\n\nAcknowledgments\n\n\nPart of this work has been supported by the European Union\u2019s Horizon 2020 research and innovation programme under grant 770299\u00a0NewsEye.\nThe authors would like to thank the following persons and institutions, listed in random order:\nSteven Claeyssens, Mikko Tolonen, Nina Tahmasebi, Melvin Wevers, Eetu M\u00e4kel\u00e4, the COMHIS group, Lidia Pivovarova, Antti Kanner, Senka Drobac, Krister Lind\u00e9n, Lars Borin, Dominik Schlechtweg, Sara Budts, Haim Dubossarsky, Estelle Bunout, Axel-Jean Caurant, Tanja S\u00e4ily, Elaine Zosa, and Joris van Eijnatten.\n\n\nReferences\n\n\nKim, Y., Chiu, Y.I., Hanaki, K., Hegde, D. and Petrov, S. (2014). Temporal Analysis of Language through Neural Language Models.\u00a0ACL 2014, p.61.\nMikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). 
Efficient estimation of word representations in vector space.\u00a0arXiv preprint arXiv:1301.3781.\nNational Library of Finland (2011).\u00a0The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version\u00a0[text corpus]. Kielipankki. Retrieved from\u00a0http://urn.fi/urn:nbn:fi:lb-2016050302.\nRehurek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In\u00a0Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.\nRoyal Dutch Library (2017).\u00a0Delpher open krantenarchief (1.0). Den Haag, 2017.\nSpr\u00e5kbanken.\u00a0The Kubhist Corpus. Department of Swedish, University of Gothenburg.\u00a0https://spraakbanken.gu.se/korp/?mode=kubhist.\nWitte, R. (1998).\u00a0De Indische radio-omroep: overheidsbeleid en ontwikkeling, 1923-1942. Uitgeverij Verloren.",
  "author": [
    {
      "family": "Hengchen",
      "given": "Simon"
    },
    {
      "family": "Ros",
      "given": "Ruben"
    },
    {
      "family": "Marjanen",
      "given": "Jani"
    }
  ],
  "event": "Digital Humanities (DH2019)",
  "event_place": "Utrecht",
  "id": "3270648",
  "issued": {
    "date-parts": [
      [
        "2019",
        "07",
        "06"
      ]
    ]
  },
  "language": "eng",
  "publisher": "Zenodo",
  "title": "Models for \"A data-driven approach to the changing vocabulary of the 'nation' in English, Dutch, Swedish and Finnish newspapers, 1750-1950\"",
  "type": "dataset",
  "version": "1.0.0"
}