- Record ID: 4421380
- DOI: 10.5281/zenodo.4421380
- OAI identifier: oai:zenodo.org:4421380
- Communities: user-nexuslinguarum, user-eu, user-pret-a-llod
- Authors/contributors: Tomas Mikolov et al.; Christian Chiarcos (ORCID 0000-0002-4428-029X), Goethe University Frankfurt, Germany
- Title: Lemmatized English Word2Vec data
- Access rights: info:eu-repo/semantics/openAccess (open)
- License: Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0 (SPDX: apache-2.0)
- Keywords: word embeddings, word2vec, English
- Language: eng
- Publisher: Zenodo
- Publication date: 2021-01-06
- Resource type: info:eu-repo/semantics/other
- Funding: 825182, Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors
- Record timestamp: 20210107122721.0
- Related identifier: isVersionOf doi 10.5281/zenodo.4421379

# Lemmatized English Word2Vec data

This is a version of the original GoogleNews-vectors-negative300 Word2Vec embeddings for English. In addition, we provide the following modified files:

- converted to conventional CSV format (and gzipped)
- subclassified:
  - for the most frequent 1,000,000 words:
    - subclassified according to WordNet parts of speech: ADJ, ADV, NOUN, VERB, OTHER
    - note that one embedding can be associated with multiple parts of speech
  - for the remaining words:
    - RARE: top 1,000,001 - 2,000,000 words
    - VERY_RARE: top 2,000,001 - 3,000,000 words
- WordNet lemmatization (via NLTK) in separate files (first lemma only)

Note that this is not a product of original research, but a derived work, deposited here as a point of permanent reference and as a building block for subsequent research. For such applications, a publication independent of Google is necessary to guarantee stability against changes in Google's data releases.

The original Word2vec code and data were published via https://code.google.com/archive/p/word2vec/ under an Apache License 2.0. We obtained the Word2vec data from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing on Jun 3, 2020.

The Word2vec documentation included the following references:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

The derived data is made available under the same license (Apache License 2.0). However, note that the content derived from WordNet (lemmas) is subject to the Princeton WordNet license as stated in LICENSE.wordnet.

Data provided by the Applied Computational Linguistics Lab of the Goethe University Frankfurt, Germany. Original data developed by Mikolov et al. Partially funded by the German Federal Ministry of Education and Research (BMBF), project "Linked Open Dictionaries".

Files:

| Size (bytes) | Checksum | URL |
|---|---|---|
| 443820238 | md5:39d0f28cc010daf0350f8d54c8d5486c | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_OTHER.csv.gz |
| 1647046227 | md5:1c892c4707a8a1a508b01a01735c0339 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.bin.gz |
| 21933 | md5:e3c02ad8f9a0fdbdfe5c464f1b266453 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_OTHER.lemmas |
| 2000 | md5:9126c091a774036afc13c4453276c55b | https://zenodo.org/records/4421380/files/README.md |
| 10792573 | md5:999cb5e1b5f5e4a455a7546aec6d5f7b | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_ADJ.csv.gz |
| 1590 | md5:bc01d92eb6af5635c4038bec7ec833e1 | https://zenodo.org/records/4421380/files/LICENSE.wordnet |
| 108811 | md5:22b97495119c703a144cbe154dd72c31 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_ADV.lemmas |
| 242847 | md5:7e3e99e45f8f4ae366576e31b9fc49c4 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_ADJ.lemmas |
| 38571098 | md5:df34194305a427876235a342592aa899 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_VERB.csv.gz |
| 4317304 | md5:65f90c77d7146c6d4a543228ade30ffa | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_ADV.csv.gz |
| 309467790 | md5:e07c9befabc0bbc2dc63027d8a3f0303 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_NOUN.csv.gz |
| 775339 | md5:1fc2e14f0f59791da5e1760fb0e0a888 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_VERB.lemmas |
| 824637315 | md5:41333adfcdbe13138ab03b20cdbc4c64 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_VERY_RARE.csv.gz |
| 11358 | md5:3b83ef96387f14655fc854ddc3c6bd57 | https://zenodo.org/records/4421380/files/LICENSE |
| 6327664 | md5:2bcb8f51b665986c36251face380807c | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_NOUN.lemmas |
| 811570257 | md5:3861e08f3ac52d410fb7698d09eea0b8 | https://zenodo.org/records/4421380/files/GoogleNews-vectors-negative300.txt_RARE.csv.gz |
| 1209 | md5:6a79261a4c4572d7ef150173b5c381e8 | https://zenodo.org/records/4421380/files/README.word2vec |
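
The file list above gives an MD5 checksum for each file, so a download can be checked against the record before use. The following is a minimal sketch of such a check, assuming the file has already been saved locally. The helper names `md5_of_file` and `matches_checksum` and the local path `README.md` are illustrative assumptions, not part of the deposit.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    # Hypothetical local copy of README.md; the checksum is the one listed for it above.
    local_copy = Path("README.md")
    listed = "md5:9126c091a774036afc13c4453276c55b"
    if local_copy.exists():
        print(local_copy, "OK" if matches_checksum(local_copy, listed) else "MISMATCH")
```

Reading in chunks matters here because several of the listed archives are hundreds of megabytes in size, and the largest exceeds 1.6 GB.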