Dataset Open Access

Word embeddings learnt on MEDLINE abstracts

Major, Vincent; Surkis, Alisa; Aphinyanaphongs, Yindalon

Accompanying a preprint manuscript and code repository, this folder contains both raw text data and learnt word embeddings. The data source is the set of MEDLINE articles published on or after 2000. Preprocessing extracts each article's title and abstract and applies some minor text cleaning. The result is a corpus of 10.5 million documents in a single 14 GB file.
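The extraction step can be sketched as follows. This is an illustrative outline only, assuming the MEDLINE baseline XML schema (PubmedArticle, ArticleTitle, AbstractText, PubDate elements); the actual preprocessing, including the exact text cleaning, lives in the GitHub repository, and the lowercasing here is an assumption:

```python
import gzip
import xml.etree.ElementTree as ET

def extract_documents(xml_path):
    """Yield one 'title abstract' string per article published in or after 2000.

    Sketch only: assumes MEDLINE baseline XML; the repository contains the
    authors' actual preprocessing code.
    """
    with gzip.open(xml_path, "rb") as fh:
        for _, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            year = elem.findtext(".//PubDate/Year")
            title = elem.findtext(".//ArticleTitle") or ""
            abstract = " ".join(
                t.text or "" for t in elem.iter("AbstractText")
            )
            if year and int(year) >= 2000:
                # Lowercasing is an assumed cleaning step, not confirmed.
                yield f"{title} {abstract}".strip().lower()
            elem.clear()  # free memory while streaming large files
```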

word2vec and fastText were used to learn word embeddings on this corpus, and three sets of embeddings are shared here: 1) word2vec skip-gram, 2) word2vec CBOW, and 3) fastText skip-gram. All three use the software's default parameters (e.g. context window = 5), with the exceptions of hierarchical softmax optimization and dimension = 200.

Preprint manuscript: https://arxiv.org/abs/1705.06262
GitHub repository: https://github.com/vincentmajor/ctsa_prediction

Files (6.6 GB)
Name                          Size      MD5
all_medline_post2000.txt.gz   4.6 GB    a43315be7a8649eec79aa99e63b22dd8
fasttext_skip_hier.vec.gz     601.9 MB  d814de43882022ef3aff106d7fa8a76d
word2vec_cbow_hier.vec.gz     718.5 MB  e1ccf3c057965da37b0088ebf9fe11fb
word2vec_skip_hier.vec.gz     687.1 MB  940b9bbb71c6b92b9dd0f57b5d422f31
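After downloading, the files can be verified against the MD5 digests above. A minimal, standard-library sketch (file names are assumed to be in the current directory):

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Stream a (possibly multi-GB) file and return its hex MD5 digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Expected digests from the file list above.
EXPECTED = {
    "all_medline_post2000.txt.gz": "a43315be7a8649eec79aa99e63b22dd8",
    "fasttext_skip_hier.vec.gz": "d814de43882022ef3aff106d7fa8a76d",
    "word2vec_cbow_hier.vec.gz": "e1ccf3c057965da37b0088ebf9fe11fb",
    "word2vec_skip_hier.vec.gz": "940b9bbb71c6b92b9dd0f57b5d422f31",
}

for name, expected in EXPECTED.items():
    if Path(name).exists():
        print(name, "OK" if md5_of(name) == expected else "MISMATCH")
```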