Published June 3, 2019 | Version v1
Dataset Open

Twitter pre-trained word vectors

  • 1. Computer Science Department

Description

Cleaned-up repackaging of glove.twitter.27B.zip [ODC Public Domain Dedication and Licence (PDDL) 1.0].

"2B tweets, 27B tokens, 1.2M vocab, uncased"

Changes from original

  • Headers added so that gensim can load the files directly [added via gensim's scripts.glove2word2vec].
  • Recompressed as individual gzip files [instead of one combined zip].

With the header in place, each file can be read by tools that expect the word2vec text format, and each file can be downloaded and decompressed on its own.
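
As a rough illustration of the header change, the sketch below prepends a "<entries> <dimensions>" line to a headerless GloVe-style .txt.gz file using only the standard library. It is not the gensim script itself; the function name add_word2vec_header and both path arguments are hypothetical, and it assumes one entry per line with space-separated fields.

```python
import gzip
import shutil


def add_word2vec_header(src_path, dst_path):
    """Copy a headerless GloVe-style .txt.gz to dst_path, adding a header line.

    Returns (entries, dimensions). Assumes one entry per line, with the token
    first and the vector values after it, all separated by single spaces.
    """
    count = 0
    dims = None
    # First pass: count the entries and read the dimensionality off line one.
    with gzip.open(src_path, "rb") as src:
        for line in src:
            if dims is None:
                dims = len(line.rstrip(b"\n").split(b" ")) - 1
            count += 1
    dims = dims or 0
    # Second pass: write the header, then stream the original bytes unchanged.
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        dst.write(f"{count} {dims}\n".encode("utf-8"))
        shutil.copyfileobj(src, dst)
    return count, dims
```

Reading in binary mode keeps the token bytes untouched, so unusual characters inside tokens cannot change how lines are split.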

Headers

Example of added header line

1193513 200

The header gives the number of entries (tokens) followed by the number of dimensions per vector.
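
A minimal sketch of reading that header back, assuming the file is a gzip-compressed text file whose first line has exactly two integer fields. The names parse_header and read_header are illustrative, not part of any library.

```python
import gzip


def parse_header(line):
    """Split a header line such as '1193513 200' into (entries, dimensions)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"not a header line: expected 2 fields, got {len(fields)}")
    # int() raises ValueError as well if a field is not an integer.
    return int(fields[0]), int(fields[1])


def read_header(path):
    """Read only the first line of a .txt.gz file and parse it as a header."""
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as f:
        return parse_header(f.readline())
```

For the example line above, parse_header("1193513 200") gives (1193513, 200). A headerless file fails the two-field check, which is a quick way to tell the two variants apart.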

Statistics

  • Entries: 1,193,513
  • Token length (characters). Min:1, Max:140, Avg:6.73
  • Number of words per token. Min:0, Max:17, Avg:1.00669200921984
  • Tokens with more than one word: 4874 (0.41%)
  • Twitter data collection date: Unknown.
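
The token statistics above could be recomputed along the following lines. How "words" were counted is not stated here; this sketch assumes a word is a non-empty piece between underscores, which is consistent with the 17-word example and with a minimum of 0, but remains a guess. The function name token_stats is hypothetical.

```python
def token_stats(tokens):
    """Summarise token lengths and (assumed) underscore-separated word counts."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("no tokens given")
    lengths = [len(t) for t in tokens]
    # Assumption: words are the non-empty pieces between underscores,
    # so a token such as "_" counts as zero words.
    words = [len([w for w in t.split("_") if w]) for t in tokens]
    multi = sum(1 for n in words if n > 1)
    return {
        "entries": len(tokens),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "min_words": min(words),
        "max_words": max(words),
        "avg_words": sum(words) / len(words),
        "multi_word": multi,
        "multi_word_share": multi / len(tokens),
    }
```

The tokens themselves are the first space-separated field of every line after the header.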

History:

  • ?? Aug 2014: GloVe v.1.0 released
  • 16 Aug 2014: Files first appear as headerless .txt.gz files; some files have mislabeled links (per the Wayback Machine)
  • ?? Oct 2015: GloVe v.1.2 released
  • ?? ??? ????: Files replaced with a .zip file
  • 03 June 2019: (These files) Repackaged as .txt.gz like the originals, with headers added for compatibility

Example of 17-word token:

 سكس_طيز_قحبه_عنيف_اغتصاب_سكسيه_فحل_زب_نيك_بنات_مكوه_شهوه_لحس_عنف_تومبوي_ليدي_سبورت

200d file:

  • Normalized: no
  • Values:  Min:-6.7986, Max:4.609, Avg:0.009065093
  • Zero values (exactly zero): none
  • Zero values (approx zero) per entry: Min:0 (0.00%), Max:2 of 200 (1.0%), Avg:0.00375865197949247 (0.00%)
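
Figures like these could be gathered in one pass over the vector lines, as sketched below. Two things are assumptions rather than facts from this page: "approx zero" is taken to mean an absolute value below a threshold eps (the value 1e-4 is invented for illustration), and "normalized" is taken to mean every vector has unit Euclidean length. The function name vector_stats is hypothetical.

```python
import math


def vector_stats(lines, eps=1e-4, norm_tol=1e-3):
    """Summarise vector values from 'token v1 v2 ...' lines (header excluded)."""
    lo, hi = math.inf, -math.inf
    total, n_values = 0.0, 0
    exact_zeros = 0
    near_zero_counts = []
    all_unit_norm = True
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        values = [float(x) for x in parts[1:]]
        if not values:
            continue  # skip blank or token-only lines
        lo = min(lo, min(values))
        hi = max(hi, max(values))
        total += sum(values)
        n_values += len(values)
        exact_zeros += sum(1 for v in values if v == 0.0)
        # Assumed meaning of "approx zero": |v| < eps.
        near_zero_counts.append(sum(1 for v in values if abs(v) < eps))
        # Assumed meaning of "normalized": unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in values))
        if abs(norm - 1.0) > norm_tol:
            all_unit_norm = False
    if n_values == 0:
        raise ValueError("no vectors found")
    return {
        "min": lo,
        "max": hi,
        "avg": total / n_values,
        "exact_zeros": exact_zeros,
        "max_near_zero_per_entry": max(near_zero_counts),
        "avg_near_zero_per_entry": sum(near_zero_counts) / len(near_zero_counts),
        "normalized": all_unit_norm,
    }
```

When reading from one of the files, skip the header line first (for example with next(f) on the opened file) so it is not mistaken for a one-dimensional vector.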

Files (1.5 GB)

  • md5:5b55d65862fdb30d98b5f40f266fbbaa, 405.9 MB
  • md5:9f2b9b3a31dc89437dcfef8d0787b3ff, 795.4 MB
  • md5:cdeb1fd4b7c17c33bfa79b8593c7fafa, 109.9 MB
  • md5:6d736870ef6be05fa74195efd23e104a, 209.2 MB
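
A downloaded file can be checked against its listed checksum with a short script such as the one below. This is a sketch using Python's standard hashlib module; the function names and the local path passed in are hypothetical, and the checksum string is whichever md5 value is listed for the file you downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed checksum, with or without 'md5:' prefix."""
    expected = listed.strip().split(":", 1)[-1].lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for files of several hundred megabytes.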