Word embeddings learnt on MEDLINE abstracts
- 1. Department of Population Health, NYU School of Medicine, New York, USA
- 2. Health Sciences Library, NYU School of Medicine, New York, USA
Description
Accompanying a preprint manuscript and code repository, this folder contains both raw text data and learnt word embeddings. The data source is the set of MEDLINE articles published in or after 2000. Preprocessing consists of extracting each article's title and abstract, followed by minor text processing. The result is a corpus of 10.5 million documents in a single 14 GB file.
word2vec and fastText were used to learn word embeddings on this corpus, and three sets of word embeddings are shared here: 1) word2vec skip-gram, 2) word2vec CBOW, and 3) fastText skip-gram. All three sets use the default parameters of the software (e.g. context=5), with two exceptions: hierarchical softmax is used for optimization, and the embedding dimension is set to 200.
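As a rough illustration of how such embeddings might be consumed, the sketch below parses vectors and compares two words by cosine similarity. It assumes the vectors are available in the plain-text word2vec layout (a header line with vocabulary size and dimension, then one word and its values per line). That layout is an assumption, not something this record states: the shared files may be binary, in which case a library loader would be needed instead. The function names and the toy three-dimensional data are hypothetical.

```python
import io
import math
from typing import Dict, List, TextIO


def read_text_vectors(handle: TextIO, expected_dim: int = 200) -> Dict[str, List[float]]:
    """Parse the word2vec text layout (assumed): 'vocab_size dim', then 'word v1 ... vN'."""
    header = handle.readline().split()
    _declared_vocab, dim = int(header[0]), int(header[1])
    if dim != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, found {dim}")
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data standing in for a real embedding file (hypothetical words and values).
toy_file = io.StringIO("2 3\naspirin 1.0 0.0 0.0\nibuprofen 0.6 0.8 0.0\n")
toy_vectors = read_text_vectors(toy_file, expected_dim=3)
toy_score = cosine_similarity(toy_vectors["aspirin"], toy_vectors["ibuprofen"])
```

With a real file, the `io.StringIO` object would be replaced by an opened text file and `expected_dim` left at 200, the dimension stated above.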
Preprint manuscript: https://arxiv.org/abs/1705.06262
GitHub repository: https://github.com/vincentmajor/ctsa_prediction
Files
Total size: 6.6 GB
MD5 checksum | Size |
---|---|
a43315be7a8649eec79aa99e63b22dd8 | 4.6 GB |
d814de43882022ef3aff106d7fa8a76d | 601.9 MB |
e1ccf3c057965da37b0088ebf9fe11fb | 718.5 MB |
940b9bbb71c6b92b9dd0f57b5d422f31 | 687.1 MB |
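Because the files are large, it can be worth checking a download against the MD5 checksum listed for it. The following is a minimal sketch using only the Python standard library. The helper names and the local file name `downloaded_file` are hypothetical. The checksum in the usage lines is the one listed for the 4.6 GB file.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    local_file = Path("downloaded_file")  # hypothetical local path
    if local_file.exists():
        ok = verify_download(local_file, "a43315be7a8649eec79aa99e63b22dd8")
        print("checksum matches" if ok else "checksum mismatch")
```

Reading in chunks keeps memory use constant, which matters for multi-gigabyte files.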
Additional details
Related works
- Is supplement to: arXiv:1705.06262 (arXiv)