Published May 12, 2022 | Version v1
Dataset | Open Access

Word2vec models trained on English Wikipedia

Description

This repository contains Word2Vec models trained on the full text of the English Wikipedia as downloaded in December 2021.

Preprocessing:

  • lowercasing
  • n-grams up to 4-grams were computed using the NPMI measure of Bouma 2009 (https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmi-pfd.pdf), with a minimum frequency threshold of 10 (see the sketch after this list)
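
As a rough sketch, such n-grams could be computed with Gensim's Phrases class, whose "npmi" scorer implements Bouma (2009). The toy corpus and the NPMI threshold value below are illustrative assumptions, not the exact settings used for this dataset:

    from gensim.models.phrases import Phrases

    # Hypothetical corpus: an iterable of lowercased, tokenized sentences.
    sentences = [
        ["the", "new", "york", "times", "is", "a", "newspaper"],
        ["new", "york", "city", "is", "in", "new", "york", "state"],
    ]

    # Three successive passes merge bigrams, then trigrams, then 4-grams.
    # min_count=10 matches the stated frequency threshold; the NPMI
    # threshold of 0.5 is an assumed value (NPMI scores lie in [-1, 1]).
    corpus = sentences
    for _ in range(3):
        phrases = Phrases(corpus, min_count=10, scoring="npmi", threshold=0.5)
        corpus = [phrases[sentence] for sentence in corpus]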

Two models, trained with Gensim:

  • wiki_300_5_word2vec --> dimension 300, minimum frequency threshold 5
  • wiki_300_50_word2vec --> dimension 300, minimum frequency threshold 50

Other hyperparameters were set as follows: window=5, epochs=5, seed=1830, sg=1 (skip-gram)
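
For reference, a minimal Gensim training call with these settings might look as follows; the corpus variable and the workers count are illustrative assumptions (and note that with workers > 1, a fixed seed does not make training fully deterministic):

    from gensim.models import Word2Vec

    # corpus: the lowercased, phrase-merged sentences described above.
    model = Word2Vec(
        sentences=corpus,
        vector_size=300,  # dim 300
        min_count=5,      # 5 for wiki_300_5_word2vec, 50 for wiki_300_50_word2vec
        window=5,
        epochs=5,
        seed=1830,
        sg=1,             # skip-gram
        workers=4,        # assumed; not stated in the description
    )
    model.save("wiki_300_5_word2vec")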

Note:
Machine learning models trained on uncurated data inevitably learn biases, both hidden and obvious. As a result, the models shared here may exhibit sexism, racism, antisemitism, homophobia, and other unacceptable biases. I encourage anyone using these models to check for and mitigate such biases before deploying them in production settings (see e.g. https://aclanthology.org/N19-1061/).
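
As a first diagnostic (not a debiasing step), a loaded model can be probed for such associations; the sketch below assumes the saved files carry the model names listed above:

    from gensim.models import Word2Vec

    # File name assumed to match the model name given above.
    model = Word2Vec.load("wiki_300_5_word2vec")

    # Inspect nearest neighbours of identity terms for stereotyped associations.
    for term in ["woman", "man", "doctor", "nurse"]:
        print(term, model.wv.most_similar(term, topn=10))

    # Analogy-style probe: "man is to doctor as woman is to ...?"
    print(model.wv.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))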

Files (10.2 GB)

  • md5:dcc604f03011c796fd9a4c635eb76c5d (29.4 MB)
  • md5:164f40e3aad907dfbcc70cfaa02ff75f (995.4 MB)
  • md5:e25978360fff3cd8c5b9f59b7fec4921 (995.4 MB)
  • md5:0b150a59d4c96daadb240709c379cfb8 (117.4 MB)
  • md5:8b5c00a18dd8ebb950d1e8232134cc94 (4.0 GB)
  • md5:eb7db742a3095681117d0890b16f932e (4.0 GB)