Data for "SuperSim: a test set for word similarity and relatedness in Swedish"
Description
This repository contains the data described in SuperSim: a test set for word similarity and relatedness in Swedish (Hengchen and Tahmasebi, 2021), available at https://aclanthology.org/2021.nodalida-main.27/. If you use all or part of this resource, please cite the following work, or alternatively use the BibTeX entry:
Hengchen, Simon and Tahmasebi, Nina, 2021. SuperSim: a test set for word similarity and relatedness in Swedish. In The 23rd Nordic Conference on Computational Linguistics (NoDaLiDa’21).
@inproceedings{hengchen-tahmasebi-2021-supersim,
title = "{SuperSim:} a test set for word similarity and relatedness in {Swedish}",
author = "Hengchen, Simon and
Tahmasebi, Nina",
booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics",
month = may # "{--}" # jun,
year = "2021",
address = "Reykjavik, Iceland, and Online",
publisher = {Link{\"o}ping University Electronic Press},
}
The data contained in this repository is as follows:
The code folder contains:
- main.py
- utils.py
- train_base_models.py
- perl-clean.pl
- requirements.txt
The data folder contains:
- gold_relatedness.tsv: all relatedness judgments from all annotators, as well as the mean
- gold_similarity.tsv: all similarity judgments from all annotators, as well as the mean
- models contains baseline models:
  - Trained on the Swedish Gigaword:
    - FastText: gigaword_sv.ft (and gigaword_sv.ft.trainables.syn1neg.npy, gigaword_sv.ft.trainables.vectors_ngrams_lockf.npy, gigaword_sv.ft.trainables.vectors_vocab_lockf.npy, gigaword_sv.ft.wv.vectors_ngrams.npy, gigaword_sv.ft.wv.vectors_vocab.npy, gigaword_sv.ft.wv.vectors.npy)
    - Word2Vec: gigaword_sv.w2v (and gigaword_sv.w2v.trainables.syn1neg.npy, gigaword_sv.w2v.wv.vectors.npy)
    - GloVe: glove_vectors_giga.txt and glove_vocab_giga.txt
  - Trained on Swedish Wikipedia:
    - FastText: wiki_sv.ft (and wiki_sv.ft.trainables.syn1neg.npy, wiki_sv.ft.trainables.vectors_ngrams_lockf.npy, wiki_sv.ft.trainables.vectors_vocab_lockf.npy, wiki_sv.ft.wv.vectors_ngrams.npy, wiki_sv.ft.wv.vectors.npy, wiki_sv.ft.wv.vectors_vocab.npy)
    - Word2Vec: wiki_sv.w2v (and wiki_sv.w2v.trainables.syn1neg.npy, wiki_sv.w2v.wv.vectors.npy)
    - GloVe: glove_vectors_WIKI.txt and glove_vocab_WIKI.txt
- corpora:
  - The Swedish Gigaword corpus can be downloaded, along with code, from https://spraakbanken.gu.se/en/resources/gigaword. We created our corpus with python extract_bw.py --mode plain outfile.txt.
  - sv_wiki.gensim is a cleaned Swedish Wikipedia dump from 2020/10/20 (originally svwiki-20201020-pages-articles.xml) and one of our baseline corpora.
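The GloVe baselines (glove_vectors_giga.txt, glove_vectors_WIKI.txt) are plain-text vector files. As a rough, dependency-free illustration, assuming the standard GloVe text format (one word per line, followed by its vector components), such a file can be loaded and queried for cosine similarity as follows. The toy vectors below are stand-ins for illustration, not values from the released models:

```python
import math

def load_glove(path):
    """Parse a GloVe-style text file: one word per line, followed by floats."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-in for load_glove("models/glove_vectors_giga.txt"):
vecs = {"hund": [1.0, 0.0], "katt": [0.9, 0.1], "bil": [0.0, 1.0]}
print(cosine(vecs["hund"], vecs["katt"]))  # high: related words
print(cosine(vecs["hund"], vecs["bil"]))   # low: unrelated words
```

The .ft and .w2v files, with their .trainables/.wv sidecar .npy arrays, follow gensim's on-disk save format, so those models would presumably be loaded with gensim's FastText.load and Word2Vec.load instead, keeping the sidecar files in the same directory.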
Details on annotation procedures are available in the paper.
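For readers who want to score a model against one of the gold files, a minimal sketch: read the TSV, pair each gold mean judgment with a model similarity score, and compute the Spearman rank correlation between the two lists. The column names (word1, word2, mean) are assumptions for illustration, not the file's documented header, and the inline TSV is a toy stand-in for data/gold_relatedness.tsv:

```python
import csv
import io

def spearman(xs, ys):
    """Spearman rank correlation, with average ranks for ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for a tied group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Toy stand-in for data/gold_relatedness.tsv (column names are assumptions):
tsv = "word1\tword2\tmean\nhund\tkatt\t8.5\nhund\tbil\t2.0\ntr\u00e4d\tskog\t7.0\n"
gold, model = [], []
for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    gold.append(float(row["mean"]))
    model.append(float(row["mean"]))  # plug in model similarities here
print(spearman(gold, model))  # 1.0 when model scores equal the gold means
```

In practice one would fill the model list with cosine similarities between the embedding vectors of word1 and word2, or use scipy.stats.spearmanr instead of the hand-rolled function.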
Acknowledgments:
This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection supported by the Swedish Research Council (2019--2022; dnr 2018-01184), and Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018--2024; dnr 2017-00626) and its ten partner institutions.
Files
SuperSim-Hengchen-Tahmasebi.zip (2.9 GB)
md5:0e53797cf28fdced9a04505d446d7d41