Dataset Open Access
Alexander Panchenko; Nikolay Arefyev; Dmitry Ustalov; Natalia Loukachevitch; Denis Paperno; Chris Biemann; Natalia Konstantinova
This resource is a part of the Russian Distributional Thesaurus (RDT): see http://russe.nlpub.ru/downloads and http://nlpub.ru/RDT.
This dataset contains a large-scale word embedding model for Russian, trained with the skip-gram with negative sampling (SGNS) model (Mikolov et al., 2013) on a 12.9 billion word collection of Russian books. In the shared task on Russian semantic similarity (Panchenko et al., 2015), this approach ranked in the top 5 among 105 submissions (Arefyev et al., 2015). Following our prior experiments (Arefyev et al., 2015), we selected the following parameters for the model: a minimal word frequency of 5; 500 dimensions per word vector; three or five iterations of the learning algorithm over the input corpus; and a context window size of 1, 2, 3, 5, 7, or 10 words. Parameters of the model are listed below:
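The name of the file released with this record, `all.norm-sz500-w10-cb0-it3-min5.w2v`, appears to encode these parameters. The sketch below decodes it under the assumption that `sz` is the vector size, `w` the window size, `cb` a CBOW flag (0 meaning skip-gram), `it` the number of iterations, and `min` the minimal word frequency. This mapping is inferred from the description above and is not documented by the authors; `TAG_NAMES` and `parse_model_name` are hypothetical names introduced for illustration.

```python
import re

# Assumed meaning of each tag in the file name (inferred, not documented).
TAG_NAMES = {
    "sz": "vector_size",
    "w": "window",
    "cb": "cbow",
    "it": "iterations",
    "min": "min_word_frequency",
}


def parse_model_name(filename: str) -> dict:
    """Extract integer settings from tags such as 'sz500' or 'w10'."""
    stem = filename.rsplit(".", 1)[0]  # drop the '.w2v' extension
    params = {}
    for token in stem.split("-"):
        # A tag is a run of lowercase letters followed by digits.
        match = re.fullmatch(r"([a-z]+)(\d+)", token)
        if match and match.group(1) in TAG_NAMES:
            params[TAG_NAMES[match.group(1)]] = int(match.group(2))
    return params


params = parse_model_name("all.norm-sz500-w10-cb0-it3-min5.w2v")
print(params)
```

If the assumed mapping holds, the decoded values agree with the description: 500 dimensions, a window of 10 words, three iterations, and a minimal word frequency of 5.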
Name | MD5 | Size |
---|---|---|
all.norm-sz500-w10-cb0-it3-min5.w2v | df74dbbbf003ade410ef6d7cd5369ecc | 14.9 GB |
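Because the file is 14.9 GB, it may be worth checking the download against the MD5 checksum listed above before loading it. The following is a minimal sketch using only the Python standard library; the function names and the local path in the usage note are hypothetical.

```python
import hashlib

# Checksum published for all.norm-sz500-w10-cb0-it3-min5.w2v on this page.
EXPECTED_MD5 = "df74dbbbf003ade410ef6d7cd5369ecc"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("all.norm-sz500-w10-cb0-it3-min5.w2v")` on a local copy (the path is an assumption about where the file was saved) returns `True` only if the download is intact.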