Published April 2, 2021 | Version v1
Dataset (Open Access)

Data for "SuperSim: a test set for word similarity and relatedness in Swedish"

  • Simon Hengchen, Nina Tahmasebi (University of Gothenburg)

Description

This repository contains the data described in SuperSim: a test set for word similarity and relatedness in Swedish (Hengchen and Tahmasebi, 2021), available at https://aclanthology.org/2021.nodalida-main.27/. If you use part or all of this resource, please cite the following work, or use the BibTeX entry below:

Hengchen, Simon and Tahmasebi, Nina, 2021. SuperSim: a test set for word similarity and relatedness in Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa’21).

@inproceedings{hengchen-tahmasebi-2021-supersim,
    title = "{SuperSim:} a test set for word similarity and relatedness in {Swedish}",
    author = "Hengchen, Simon and
      Tahmasebi, Nina",
    booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics",
    month = may # "{--}" # jun,
    year = "2021",
    address = "Reykjavik, Iceland, and Online",
    publisher = {Link{\"o}ping University Electronic Press},
}


This repository contains the following:

The code folder contains:

  • main.py

  • utils.py

  • train_base_models.py

  • perl-clean.pl

  • requirements.txt

The data folder contains:

  1. gold_relatedness.tsv: all relatedness judgments from all annotators, as well as their mean

  2. gold_similarity.tsv: all similarity judgments from all annotators, as well as their mean (see the evaluation sketch below)

  3. models contains the baseline models (see the loading sketch after this list):

    1. Trained on the Swedish Gigaword:
      1. FastText: gigaword_sv.ft (and the sidecar files gigaword_sv.ft.trainables.syn1neg.npy, gigaword_sv.ft.trainables.vectors_ngrams_lockf.npy, gigaword_sv.ft.trainables.vectors_vocab_lockf.npy, gigaword_sv.ft.wv.vectors_ngrams.npy, gigaword_sv.ft.wv.vectors_vocab.npy, and gigaword_sv.ft.wv.vectors.npy)
      2. Word2Vec: gigaword_sv.w2v (and the sidecar files gigaword_sv.w2v.trainables.syn1neg.npy and gigaword_sv.w2v.wv.vectors.npy)
      3. GloVe: glove_vectors_giga.txt and glove_vocab_giga.txt
    2. Trained on Swedish Wikipedia:
      1. FastText: wiki_sv.ft (and the sidecar files wiki_sv.ft.trainables.syn1neg.npy, wiki_sv.ft.trainables.vectors_ngrams_lockf.npy, wiki_sv.ft.trainables.vectors_vocab_lockf.npy, wiki_sv.ft.wv.vectors_ngrams.npy, wiki_sv.ft.wv.vectors.npy, and wiki_sv.ft.wv.vectors_vocab.npy)
      2. Word2Vec: wiki_sv.w2v (and the sidecar files wiki_sv.w2v.trainables.syn1neg.npy and wiki_sv.w2v.wv.vectors.npy)
      3. GloVe: glove_vectors_WIKI.txt and glove_vocab_WIKI.txt
  4. corpora:
    1. The Swedish Gigaword corpus can be downloaded, along with the accompanying code, from https://spraakbanken.gu.se/en/resources/gigaword. We created our corpus with the command python extract_bw.py --mode plain outfile.txt.
    2. sv_wiki.gensim is a cleaned Swedish Wikipedia dump from 2020/10/20 (originally svwiki-20201020-pages-articles.xml) and one of our baseline corpora.
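
For illustration, the gensim models can be loaded directly; the sketch below is ours, not part of the repository. The models/ path and the example word pair are assumptions, and the *.trainables.* sidecar files suggest the models were saved with gensim 3.x, so installing the gensim version pinned in requirements.txt is the safest route. The GloVe files are plain text; the parser below assumes the common "word v1 ... vd" layout.

    # Minimal loading sketch (ours, not the repository's code). Assumes the
    # models sit in a local models/ folder and that the installed gensim
    # matches the version pinned in requirements.txt.
    from gensim.models import FastText, Word2Vec

    w2v = Word2Vec.load("models/gigaword_sv.w2v")  # the .npy sidecars are picked up automatically
    ft = FastText.load("models/gigaword_sv.ft")

    # Cosine similarity for an example word pair (our choice, not from the test set).
    print(w2v.wv.similarity("hund", "katt"))
    print(ft.wv.similarity("hund", "katt"))  # FastText also covers out-of-vocabulary words

    def load_glove(path):
        """Parse GloVe text vectors, assuming the common "word v1 ... vd" layout.
        If glove_vectors_giga.txt holds vectors only, pair its lines with the
        entries in glove_vocab_giga.txt instead."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                vectors[parts[0]] = [float(x) for x in parts[1:]]
        return vectors

    glove = load_glove("models/glove_vectors_giga.txt")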

Details on annotation procedures are available in the paper. 
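
To score a baseline against the gold standard, the usual procedure is to correlate model cosine similarities with the mean human judgments using Spearman's rho. The sketch below is our illustration, not the repository's main.py; the column names word1, word2, and mean are assumptions, so adjust them to the actual TSV header.

    # Evaluation sketch (ours; see main.py in the code folder for the authors'
    # version). The column names "word1", "word2", and "mean" are assumptions.
    import pandas as pd
    from gensim.models import Word2Vec
    from scipy.stats import spearmanr

    gold = pd.read_csv("data/gold_similarity.tsv", sep="\t")
    model = Word2Vec.load("models/gigaword_sv.w2v")

    # Keep only pairs where both words are in the model's vocabulary.
    pairs = [(a, b, score)
             for a, b, score in zip(gold["word1"], gold["word2"], gold["mean"])
             if a in model.wv and b in model.wv]
    predicted = [model.wv.similarity(a, b) for a, b, _ in pairs]
    human = [score for _, _, score in pairs]

    rho, p = spearmanr(predicted, human)
    print(f"Spearman's rho = {rho:.3f} (p = {p:.2g}, n = {len(pairs)})")

Running the same loop with gold_relatedness.tsv gives the corresponding relatedness score.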


Acknowledgments:

This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection, supported by the Swedish Research Council (2019–2022; dnr 2018-01184), and by Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018–2024; dnr 2017-00626) and its ten partner institutions.

Files

SuperSim-Hengchen-Tahmasebi.zip (2.9 GB)
md5:0e53797cf28fdced9a04505d446d7d41