Published May 12, 2022
| Version v1
Word2vec models trained on English Wikipedia
Description
This repository contains Word2Vec models trained on the full text of the English Wikipedia as downloaded in December 2021.
Preprocessing:
- lowercasing
- n-grams up to 4-grams were computed with the NPMI measure of Bouma 2009 (https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmi-pfd.pdf), using a minimum frequency threshold of 10
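As a rough illustration of the phrase-scoring step, here is a minimal sketch of the NPMI measure from Bouma (2009) applied to adjacent word pairs; the function name and the toy corpus are hypothetical, and the actual preprocessing pipeline may differ (e.g. it likely used Gensim's `Phrases` with `scoring='npmi'`, applied repeatedly to reach 4-grams):

```python
import math
from collections import Counter

def npmi_bigrams(tokens, min_freq=10):
    """Score adjacent word pairs by normalized PMI (Bouma 2009).

    NPMI(a, b) = PMI(a, b) / -log p(a, b), so scores fall roughly
    in [-1, 1]. Pairs occurring fewer than min_freq times are
    discarded, mirroring the threshold of 10 used for this dataset.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_freq:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
    return scores

# Toy corpus: "new york" co-occurs often, so it scores high;
# a pair seen only once falls below the frequency threshold.
tokens = ["new", "york", "is", "big"] * 12 + ["rare", "pair"]
scores = npmi_bigrams(tokens, min_freq=10)
```

High-scoring pairs would then be merged into single tokens (e.g. `new_york`), and re-running the procedure over the merged corpus yields 3- and 4-grams.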
Two models, trained with Gensim:
- wiki_300_5_word2vec --> dimensionality 300, minimum frequency threshold 5
- wiki_300_50_word2vec --> dimensionality 300, minimum frequency threshold 50
Other hyperparameters were set as follows: window=5, epochs=5, seed=1830, sg=1 (skip-gram).
Note:
Machine learning models trained on uncurated data inevitably learn biases, both hidden and obvious. As a result, the models shared here may exhibit sexism, racism, antisemitism, homophobia, and other unacceptable biases. I encourage anyone using these models to check for and mitigate such biases before deploying them in production settings (see e.g. https://aclanthology.org/N19-1061/).
Files (10.2 GB)

Size | MD5 checksum
---|---
29.4 MB | md5:dcc604f03011c796fd9a4c635eb76c5d
995.4 MB | md5:164f40e3aad907dfbcc70cfaa02ff75f
995.4 MB | md5:e25978360fff3cd8c5b9f59b7fec4921
117.4 MB | md5:0b150a59d4c96daadb240709c379cfb8
4.0 GB | md5:8b5c00a18dd8ebb950d1e8232134cc94
4.0 GB | md5:eb7db742a3095681117d0890b16f932e