Published June 3, 2019 | Version v1
Dataset
Open
Twitter pre-trained word vectors
Description
Cleaned-up repackaging of glove.twitter.27B.zip, released under the ODC Public Domain Dedication and Licence (PDDL) 1.0.
"2B tweets, 27B tokens, 1.2M vocab, uncased"
Changes from original
- Header lines added so that gensim can load the files (added via gensim's scripts.glove2word2vec).
- Recompressed as individual gzip files instead of one combined zip.
Together, these changes let tools that expect the word2vec text format read each file directly, without first unpacking a combined archive.
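The two changes above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the script used to build these files (the headers here were added with gensim's scripts.glove2word2vec); the function names `add_header` and `write_gzip` and the sample rows are hypothetical.

```python
import gzip
import io


def add_header(glove_lines):
    """Return word2vec-style lines: a 'count dims' header, then the vectors."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each GloVe row is 'token v1 v2 ... vN', separated by single spaces.
    dims = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dims}"] + rows


def write_gzip(lines, fileobj):
    """Write the lines as UTF-8 text into a gzip stream."""
    with gzip.open(fileobj, "wt", encoding="utf-8", newline="\n") as out:
        for line in lines:
            out.write(line + "\n")


# Made-up sample rows standing in for a headerless GloVe file.
sample = ["hello 0.1 0.2 0.3", "world 0.4 0.5 0.6"]
buffer = io.BytesIO()
write_gzip(add_header(sample), buffer)
```

For the real files, the input would be streamed line by line rather than held in a list, since each file contains more than a million rows.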
Headers
Example of an added header line:
1193513 200
The header gives the number of tokens, followed by the number of dimensions per vector.
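As a rough sketch, the header can be read back with the standard library, and the same file can then be handed to gensim. The helper names `read_header` and `load_with_gensim` are hypothetical; `KeyedVectors.load_word2vec_format` is gensim's loader for this text format, and the sketch assumes it opens .gz paths transparently (decompress first if your version does not).

```python
import gzip


def read_header(path):
    """Return (token_count, dimensions) from the first line of a .txt.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        count, dims = handle.readline().split()
    return int(count), int(dims)


def load_with_gensim(path, limit=None):
    """Load the vectors with gensim (third-party, imported only when called)."""
    from gensim.models import KeyedVectors

    # binary=False because the files are plain text; limit caps the rows read,
    # which is useful for a quick look without loading the whole vocabulary.
    return KeyedVectors.load_word2vec_format(path, binary=False, limit=limit)
```

Calling `read_header` on a downloaded file before loading it is a cheap way to confirm which dimensionality the file holds; the path you pass is whatever name the file was saved under.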
Statistics
- Entries: 1,193,513
- Token length (characters). Min:1, Max:140, Avg:6.73
- Number of words per token. Min:0, Max:17, Avg:1.00669200921984
- Tokens with more than one word: 4874 (0.41%)
- Twitter data collection date: Unknown.
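Statistics of this kind could be computed along the following lines. This is only a sketch: how "words" were counted for the figures above is not stated, so the rule used here (a word is a run of letters or digits, with underscores and punctuation acting as separators) is an assumption, and `token_stats` is a hypothetical name.

```python
import re
from statistics import mean

# Assumption: a "word" is a maximal run of letters or digits, so underscores
# and punctuation separate words and a punctuation-only token has 0 words.
WORD = re.compile(r"[^\W_]+")


def token_stats(tokens):
    """Summarise token lengths and words per token for a list of tokens."""
    lengths = [len(t) for t in tokens]
    words = [len(WORD.findall(t)) for t in tokens]
    multi = sum(1 for w in words if w > 1)
    return {
        "entries": len(tokens),
        "length": (min(lengths), max(lengths), mean(lengths)),
        "words": (min(words), max(words), mean(words)),
        "multi_word": multi,
        "multi_word_pct": 100.0 * multi / len(tokens),
    }


# Made-up tokens, not taken from the vocabulary.
stats = token_stats(["hello", ":)", "good_morning", "<user>"])
```

Under this rule a token such as ":)" counts as zero words, which would be consistent with the reported minimum of 0, though the original counting method may differ.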
History:
- ?? Aug 2014: GloVe v.1.0 released
- 16 Aug 2014: Files first appear as headerless .txt.gz files; some files have mislabeled links (per the Wayback Machine)
- ?? Oct 2015: GloVe v.1.2 released
- ?? ??? ????: Files replaced with a .zip file
- 03 June 2019: (These files) Repackaged as .txt.gz like the original, with headers added for increased compatibility
Example of 17-word token:
سكس_طيز_قحبه_عنيف_اغتصاب_سكسيه_فحل_زب_نيك_بنات_مكوه_شهوه_لحس_عنف_تومبوي_ليدي_سبورت
200d file:
- Normalized: no
- Values: Min:-6.7986, Max:4.609, Avg:0.009065093
- Zero values (exactly zero): none
- Zero values (approx zero) per entry: Min:0 (0.00%), Max:2 of 200 (1.0%), Avg:0.00375865197949247 (0.00%)
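The value statistics above could be reproduced on a sample of rows with a sketch like the one below. The near-zero threshold `eps` and the unit-norm tolerance `norm_tol` are assumptions, since the description does not say which cut-offs were used; `parse_row` and `value_stats` are hypothetical names.

```python
import math


def parse_row(line):
    """Split one 'token v1 v2 ... vN' line into (token, list of floats)."""
    token, *values = line.rstrip("\n").split(" ")
    return token, [float(v) for v in values]


def value_stats(vectors, eps=1e-4, norm_tol=1e-3):
    """Summarise a list of vectors; eps and norm_tol are assumed thresholds."""
    values = [v for vec in vectors for v in vec]
    near_zero = [sum(1 for v in vec if abs(v) < eps) for vec in vectors]
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in vectors]
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "exact_zeros": sum(1 for v in values if v == 0.0),
        "near_zero_per_entry": (
            min(near_zero),
            max(near_zero),
            sum(near_zero) / len(near_zero),
        ),
        # "Normalized" is taken here to mean every vector has unit L2 norm.
        "normalized": all(abs(n - 1.0) <= norm_tol for n in norms),
    }
```

This version keeps every value in memory, so it suits a sample of a few thousand rows; covering the full file would need running totals updated row by row.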
Files
(1.5 GB)
Checksum | Size |
---|---|
md5:5b55d65862fdb30d98b5f40f266fbbaa | 405.9 MB |
md5:9f2b9b3a31dc89437dcfef8d0787b3ff | 795.4 MB |
md5:cdeb1fd4b7c17c33bfa79b8593c7fafa | 109.9 MB |
md5:6d736870ef6be05fa74195efd23e104a | 209.2 MB |
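A downloaded file can be checked against the md5 values listed with the files. The following is a minimal sketch using Python's hashlib; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the path is whatever name the download was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat even for the largest file, and a mismatch usually points to an interrupted or corrupted download.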