Dataset Open Access

Russian Distributional Thesaurus (RDT): Word Embeddings

Alexander Panchenko; Nikolay Arefyev; Dmitry Ustalov; Natalia Loukachevitch; Denis Paperno; Chris Biemann; Natalia Konstantinova


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Alexander Panchenko</dc:creator>
  <dc:creator>Nikolay Arefyev</dc:creator>
  <dc:creator>Dmitry Ustalov</dc:creator>
  <dc:creator>Natalia Loukachevitch</dc:creator>
  <dc:creator>Denis Paperno</dc:creator>
  <dc:creator>Chris Biemann</dc:creator>
  <dc:creator>Natalia Konstantinova</dc:creator>
  <dc:date>2017-03-18</dc:date>
  <dc:description>This resource is a part of the Russian Distributional Thesaurus (RDT): see http://russe.nlpub.ru/downloads and http://nlpub.ru/RDT. 

This dataset contains a large-scale word embedding model for Russian, trained with the skip-gram with negative sampling (SGNS) model (Mikolov et al., 2013) on a 12.9-billion-word collection of Russian books. According to the results of our participation in the shared task on Russian semantic similarity (Panchenko et al., 2015), this approach scored in the top 5 among 105 submissions (Arefyev et al., 2015). Following our prior experiments (Arefyev et al., 2015), we selected the following parameters for the model: a minimal word frequency of 5; 500 dimensions per word vector; three or five iterations of the learning algorithm over the input corpus; and a context window size of 1, 2, 3, 5, 7 or 10 words. The parameters of the model in this dataset are listed below:


	Model: skip-gram
	Corpus: a 150 GB sample of the lib.rus.ec book collection
	Context window size: 10 words
	Number of dimensions: 500
	Number of iterations: 3
	Minimal word frequency: 5


References:


	Panchenko A., Ustalov D., Arefyev N., Paperno D., Konstantinova N., Loukachevitch N. and Biemann C. (2016): Human and Machine Judgements about Russian Semantic Relatedness. In Proceedings of the 5th Conference on Analysis of Images, Social Networks, and Texts (AIST'2016). Communications in Computer and Information Science (CCIS). Springer-Verlag Berlin Heidelberg.

	Panchenko A., Loukachevitch N. V., Ustalov D., Paperno D., Meyer C. M. and Konstantinova N. (2015): RUSSE: The First International Workshop on Russian Semantic Similarity. In Proceedings of the 21st International Conference on Computational Linguistics and Intellectual Technologies (Dialogue'2015). Moscow, Russia. RGGU.

	Arefyev N., Panchenko A., Lukanin A., Lesota O. and Romanov P. (2015): Evaluating Three Corpus-Based Semantic Similarity Systems for Russian. In Proceedings of the 21st International Conference on Computational Linguistics and Intellectual Technologies (Dialogue'2015). Moscow, Russia. RGGU.
</dc:description>
  <dc:identifier>https://zenodo.org/record/400631</dc:identifier>
  <dc:identifier>10.5281/zenodo.400631</dc:identifier>
  <dc:identifier>oai:zenodo.org:400631</dc:identifier>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>http://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>word embeddings</dc:subject>
  <dc:subject>distributional semantics</dc:subject>
  <dc:subject>Russian</dc:subject>
  <dc:subject>Russian language</dc:subject>
  <dc:subject>word vectors</dc:subject>
  <dc:subject>word2vec</dc:subject>
  <dc:subject>SGNS</dc:subject>
  <dc:title>Russian Distributional Thesaurus (RDT): Word Embeddings</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>
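
The export above can be read programmatically. The following is a minimal sketch, assuming the record is available as a string and using only Python's standard library; the names `RECORD`, `NS`, and `summarize` are illustrative and not part of the dataset or of Zenodo's tooling. The embedded record is an abridged copy of the export (description, subjects, and types omitted).

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the root element of the export.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
}

# Abridged copy of the record; the description, subjects and types are omitted.
RECORD = """<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
  <dc:creator>Alexander Panchenko</dc:creator>
  <dc:creator>Nikolay Arefyev</dc:creator>
  <dc:creator>Dmitry Ustalov</dc:creator>
  <dc:creator>Natalia Loukachevitch</dc:creator>
  <dc:creator>Denis Paperno</dc:creator>
  <dc:creator>Chris Biemann</dc:creator>
  <dc:creator>Natalia Konstantinova</dc:creator>
  <dc:date>2017-03-18</dc:date>
  <dc:identifier>https://zenodo.org/record/400631</dc:identifier>
  <dc:identifier>10.5281/zenodo.400631</dc:identifier>
  <dc:identifier>oai:zenodo.org:400631</dc:identifier>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>http://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:title>Russian Distributional Thesaurus (RDT): Word Embeddings</dc:title>
</oai_dc:dc>"""


def summarize(xml_bytes):
    """Collect a few Dublin Core fields from an oai_dc record."""
    root = ET.fromstring(xml_bytes)

    def texts(tag):
        # Dublin Core elements are repeatable, so always return a list.
        return [el.text.strip() for el in root.findall(f"dc:{tag}", NS) if el.text]

    identifiers = texts("identifier")
    # Heuristic (an assumption of this sketch): the DOI is the identifier
    # that starts with the "10." prefix.
    doi = next((i for i in identifiers if i.startswith("10.")), None)
    return {
        "title": texts("title")[0],
        "creators": texts("creator"),
        "date": texts("date")[0],
        "doi": doi,
        "rights": texts("rights"),
    }


# Bytes are passed so the parser honours the encoding declaration itself.
summary = summarize(RECORD.encode("utf-8"))
print(summary["title"])
print(len(summary["creators"]), "creators; DOI:", summary["doi"])
```

Because every Dublin Core element may repeat, the helper returns lists and leaves it to the caller to pick a single value where only one is expected.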
Statistics          All versions    This version
Views               232             232
Downloads           43              43
Data volume         640.5 GB        640.5 GB
Unique views        220             220
Unique downloads    37              37
