Journal article Open Access
Hengchen, Simon;
Tahmasebi, Nina
<?xml version='1.0' encoding='UTF-8'?> <record xmlns="http://www.loc.gov/MARC21/slim"> <leader>00000nam##2200000uu#4500</leader> <datafield tag="041" ind1=" " ind2=" "> <subfield code="a">swe</subfield> </datafield> <controlfield tag="005">20210127154053.0</controlfield> <controlfield tag="001">4301658</controlfield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">University of Gothenburg</subfield> <subfield code="0">(orcid)0000-0003-1688-1845</subfield> <subfield code="a">Tahmasebi, Nina</subfield> </datafield> <datafield tag="856" ind1="4" ind2=" "> <subfield code="s">16249535451</subfield> <subfield code="z">md5:26c174380d0c9c4bdf552bbe0e5ef325</subfield> <subfield code="u">https://zenodo.org/record/4301658/files/HENGCHEN-TAHMASEBI_-_2020_-_Kubhist2_diachronic_embeddings.zip</subfield> </datafield> <datafield tag="542" ind1=" " ind2=" "> <subfield code="l">open</subfield> </datafield> <datafield tag="260" ind1=" " ind2=" "> <subfield code="c">2020-12-01</subfield> </datafield> <datafield tag="909" ind1="C" ind2="O"> <subfield code="p">openaire</subfield> <subfield code="o">oai:zenodo.org:4301658</subfield> </datafield> <datafield tag="909" ind1="C" ind2="4"> <subfield code="p">Journal of Open Humanities Data</subfield> </datafield> <datafield tag="100" ind1=" " ind2=" "> <subfield code="u">University of Gothenburg</subfield> <subfield code="0">(orcid)0000-0002-8453-7221</subfield> <subfield code="a">Hengchen, Simon</subfield> </datafield> <datafield tag="245" ind1=" " ind2=" "> <subfield code="a">A collection of Swedish diachronic word embedding models trained on historical newspaper data</subfield> </datafield> <datafield tag="540" ind1=" " ind2=" "> <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield> <subfield code="a">Creative Commons Attribution 4.0 International</subfield> </datafield> <datafield tag="650" ind1="1" ind2="7"> <subfield code="a">cc-by</subfield> <subfield code="2">opendefinition.org</subfield> 
</datafield> <datafield tag="520" ind1=" " ind2=" "> <subfield code="a"><p><em><strong>A collection of Swedish diachronic word embedding models trained on historical newspaper data</strong></em></p> <p>Simon Hengchen, Nina Tahmasebi</p> <p><em>NOTE: this README.md&nbsp;is a summary. For all details, see the paper at&nbsp;<a href="https://doi.org/10.5334/johd.22">https://doi.org/10.5334/johd.22</a></em></p> <p><em>NOTE: this data release is available on Zenodo at&nbsp;<a href="https://zenodo.org/record/4301658">https://zenodo.org/record/4301658</a></em></p> <p><strong>Description</strong></p> <p>This is the data release accompanying the Journal of Open Humanities Data paper &quot;A collection of Swedish diachronic word embedding models trained on historical newspaper data.&quot; This paper describes the creation of several word embedding models based on a large collection of diachronic Swedish newspaper material available through Spr&aring;kbanken Text, the Swedish language bank. This data was produced in the context of Spr&aring;kbanken Text&#39;s continued mission to collaborate with humanities and natural language processing researchers and to provide freely available language resources, for the development of state-of-the-art NLP methods and tools.</p> <p><strong>Bibtex</strong></p> <p>If you use the models or the code provided in this paper, please cite the following:</p> <pre><code>@article{hengchen-tahmasebi-2021-collection,
    title = "A collection of {S}wedish diachronic word embedding models trained on historical newspaper data",
    author = "Hengchen, Simon and Tahmasebi, Nina",
    journal = "Journal of Open Humanities Data",
    year = "2021",
    pages = {1--7},
    volume = {7},
    number = {2},
    doi = {10.5334/johd.22}
}
</code></pre> <p><strong>Overview</strong></p> <p>We release diachronic word2vec and fastText models in their skip-gram with negative sampling (SGNS) architecture.
The models are trained on 20-year time bins, with two temporal alignment strategies: independently-trained models for post-hoc alignment, and incremental training. The independently-trained models are NOT aligned, leaving the&nbsp;<a href="https://github.com/Garrafao/LSCDetection/tree/master/alignment">choice of alignment</a>&nbsp;to the end user. We provide code to reproduce our pipeline, and code examples to load and use the models.</p> <p><strong>Data</strong></p> <p>The entirety of the Kungliga bibliotekets historiska tidningar (Kubhist 2) corpus was used. The original data was scanned and OCRed by the National Library of Sweden. It consists of Swedish newspapers from all parts of Sweden. It has since been run through the Sparv annotation pipeline by Martin Hammarstedt at Spr&aring;kbanken Text.</p> <p><strong>Preprocessing</strong></p> <p>The text was retrieved from the original XML. The processing steps prior to training the models are:</p> <ul> <li>lowercasing</li> <li>removal of digits</li> <li>removal of all characters not belonging to the Swedish alphabet (a-z&auml;&aring;&ouml;)</li> <li>removal of tokens of two characters or fewer</li> <li>merging of all texts pertaining to the same double decade (1740-1759; 1760-1779; ...)</li> </ul> <p><strong>Quality control</strong></p> <p>All models have been queried for some control analogies by a native speaker of Swedish.
A reviewer (a non-native speaker of Swedish), whom we thank, also checked the local neighbourhoods of selected terms, performed vector arithmetic, and confirmed that the models behaved as expected.</p> <p><strong>Structure</strong></p> <pre><code>ROOT/
    README.md
    code/
        *.py files
        requirements.txt
    fasttext/
        incremental/
            *.ft files
            *.npy files
        indep/
            *.ft files
            *.npy files
    word2vec/
        incremental/
            *.w2v files
            *.npy files
        indep/
            *.w2v files
            *.npy files
</code></pre> <p>Regarding the code:</p> <ul> <li><code>kubhist_XML_to_gensim.py</code>&nbsp;will transform the XML into &quot;LineSentence&quot;, &quot;clean&quot; corpora</li> <li><code>train_w2v-ft.py</code>&nbsp;will train models</li> <li><code>load_run_models.py</code>&nbsp;will print some examples of what can be done with embeddings</li> <li><code>utils.py</code>&nbsp;contains the functions called by the scripts above</li> <li><code>requirements.txt</code>&nbsp;contains the output of&nbsp;<code>pip freeze &gt; requirements.txt</code>, i.e.
the Python libraries needed to run the scripts above</li> </ul> <p><strong>Funding</strong></p> <p>This work has been funded in part by the project&nbsp;<a href="https://languagechange.org/"><em>Towards Computational Lexical Semantic Change Detection</em></a>&nbsp;supported by the Swedish Research Council (2019--2022; dnr 2018-01184), and&nbsp;<em>Nationella Spr&aring;kbanken</em>&nbsp;(the Swedish National Language Bank) -- jointly funded by the Swedish Research Council (2018--2024; dnr 2017-00626) and its 10 partner institutions, to NT.</p></subfield> </datafield> <datafield tag="773" ind1=" " ind2=" "> <subfield code="n">doi</subfield> <subfield code="i">isVersionOf</subfield> <subfield code="a">10.5281/zenodo.4274481</subfield> </datafield> <datafield tag="024" ind1=" " ind2=" "> <subfield code="a">10.5281/zenodo.4301658</subfield> <subfield code="2">doi</subfield> </datafield> <datafield tag="980" ind1=" " ind2=" "> <subfield code="a">publication</subfield> <subfield code="b">article</subfield> </datafield> </record>
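
The preprocessing steps listed in the record's description (lowercasing, removal of digits and of characters outside a-zäåö, dropping tokens of two characters or fewer, merging texts by double decade) could be sketched roughly as below. This is an illustration only, not the released `kubhist_XML_to_gensim.py` or `utils.py`: the function names `clean_line`, `double_decade`, and `merge_by_bin` are hypothetical, and the sketch assumes whitespace tokenization, deletion of disallowed characters inside each token, and 20-year bins anchored at 1740.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: anything outside a-z plus ä, å, ö is deleted (this covers digits too).
NON_SWEDISH = re.compile(r"[^a-zäåö]")


def clean_line(line: str) -> List[str]:
    """Lowercase, delete disallowed characters, drop tokens of length <= 2."""
    # NFC normalization keeps å/ä/ö as single code points before filtering.
    text = unicodedata.normalize("NFC", line).lower()
    tokens = []
    for raw in text.split():  # assumption: whitespace tokenization
        token = NON_SWEDISH.sub("", raw)
        if len(token) > 2:
            tokens.append(token)
    return tokens


def double_decade(year: int, start: int = 1740) -> str:
    """Label of the 20-year bin containing `year` (bins assumed to start at 1740)."""
    lo = start + ((year - start) // 20) * 20
    return f"{lo}-{lo + 19}"


def merge_by_bin(docs: Iterable[Tuple[int, str]]) -> Dict[str, List[List[str]]]:
    """Group cleaned sentences by double decade; `docs` yields (year, text) pairs."""
    bins = defaultdict(list)
    for year, text in docs:
        tokens = clean_line(text)
        if tokens:  # skip lines left empty after cleaning
            bins[double_decade(year)].append(tokens)
    return dict(bins)
```

Each bin's list of token lists could then be written out one sentence per line, which is the "LineSentence" layout the description mentions.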
| | All versions | This version |
|---|---|---|
| Views | 7,949 | 7,898 |
| Downloads | 3,350 | 3,350 |
| Data volume | 54.4 TB | 54.4 TB |
| Unique views | 7,713 | 7,688 |
| Unique downloads | 3,331 | 3,331 |