Published May 12, 2022 | Version v1
Dataset | Open Access

Word2vec models trained on English Wikipedia

Description

This repository contains Word2Vec models trained on the full text of the English Wikipedia as downloaded in December 2021.

Preprocessing:

  • lowercasing
  • n-grams up to 4-grams were computed using the NPMI measure of Bouma 2009 (https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmi-pfd.pdf), with a minimum frequency threshold of 10 (see the sketch after this list)
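
As a rough sketch, such n-grams could be computed with Gensim's Phrases class, whose "npmi" scorer implements Bouma (2009). The toy corpus and the NPMI threshold value below are illustrative assumptions, not the exact settings used for this dataset:

    from gensim.models.phrases import Phrases

    # Hypothetical corpus: an iterable of lowercased, tokenized sentences.
    sentences = [
        ["the", "new", "york", "times", "is", "a", "newspaper"],
        ["new", "york", "city", "is", "in", "new", "york", "state"],
    ]

    # Three successive passes merge bigrams, then trigrams, then 4-grams.
    # min_count=10 matches the stated frequency threshold; the NPMI
    # threshold of 0.5 is an assumed value (NPMI scores lie in [-1, 1]).
    corpus = sentences
    for _ in range(3):
        phrases = Phrases(corpus, min_count=10, scoring="npmi", threshold=0.5)
        corpus = [phrases[sentence] for sentence in corpus]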

Two models, trained with Gensim:

  • wiki_300_5_word2vec --> dimension 300, minimum frequency threshold 5
  • wiki_300_50_word2vec --> dimension 300, minimum frequency threshold 50

Other hyperparameters were set as follows: window=5, epochs=5, seed=1830, sg=1 (skip-gram)
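
For reference, a minimal Gensim training call with these settings might look as follows; the corpus variable and the workers count are illustrative assumptions (and note that with workers > 1, a fixed seed does not make training fully deterministic):

    from gensim.models import Word2Vec

    # corpus: the lowercased, phrase-merged sentences described above.
    model = Word2Vec(
        sentences=corpus,
        vector_size=300,  # dim 300
        min_count=5,      # 5 for wiki_300_5_word2vec, 50 for wiki_300_50_word2vec
        window=5,
        epochs=5,
        seed=1830,
        sg=1,             # skip-gram
        workers=4,        # assumed; not stated in the description
    )
    model.save("wiki_300_5_word2vec")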

Note:
Machine learning models trained on uncurated data inevitably learn biases, both hidden and obvious. As a result, the models shared here may exhibit sexism, racism, antisemitism, homophobia, and other unacceptable biases. I encourage anyone using these models to check for and mitigate such biases before deploying them in production settings (see e.g. https://aclanthology.org/N19-1061/).
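
As a first diagnostic (not a debiasing step), a loaded model can be probed for such associations; the sketch below assumes the saved files carry the model names listed above:

    from gensim.models import Word2Vec

    # File name assumed to match the model name given above.
    model = Word2Vec.load("wiki_300_5_word2vec")

    # Inspect nearest neighbours of identity terms for stereotyped associations.
    for term in ["woman", "man", "doctor", "nurse"]:
        print(term, model.wv.most_similar(term, topn=10))

    # Analogy-style probe: "man is to doctor as woman is to ...?"
    print(model.wv.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))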

Files (10.2 GB)

  • md5:dcc604f03011c796fd9a4c635eb76c5d (29.4 MB)
  • md5:164f40e3aad907dfbcc70cfaa02ff75f (995.4 MB)
  • md5:e25978360fff3cd8c5b9f59b7fec4921 (995.4 MB)
  • md5:0b150a59d4c96daadb240709c379cfb8 (117.4 MB)
  • md5:8b5c00a18dd8ebb950d1e8232134cc94 (4.0 GB)
  • md5:eb7db742a3095681117d0890b16f932e (4.0 GB)