BERT-CRel: Improved Biomedical Word Embeddings in the Transformer Era
Description
BERT-CRel is a transformer-based method for producing biomedical word embeddings that are learned jointly with concept embeddings: the embeddings are first pre-trained with fastText and then fine-tuned in a BERT-style transformer setup. The goal is to provide high-quality pre-trained biomedical embeddings that the research community can use in any downstream task. The training corpus consists of biomedical citations from PubMed, and the concepts come from the Medical Subject Headings (MeSH) terminology used to index those citations.
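As a minimal sketch of downstream use, assuming the released vectors are distributed in the standard word2vec text format (an assumption; check the file headers of the downloads), they can be loaded with gensim. The file name below is hypothetical.

```python
# Minimal sketch: load BERT-CRel embeddings and query nearest neighbours.
# Assumes word2vec text format; "BERT-CRel-all.vec" is a hypothetical name.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("BERT-CRel-all.vec", binary=False)

# Nearest neighbours of a biomedical term by cosine similarity.
print(kv.most_similar("diabetes", topn=5))
```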
BERT-CRel-all
This variant contains embeddings for English words, all MeSH descriptors, and the subset of supplementary concepts that meet a frequency threshold. The vocabulary is divided into three sections: (1) BERT special tokens, (2) MeSH codes, (3) English words in descending frequency order. (Vocabulary size: 333,301.) An example of querying across the word and concept sections is sketched after this list.
BERT-CRel-MeSH
These files contain only the MeSH code embeddings. (Vocabulary size: 45,015.)
BERT-CRel-words
These files contain only the English word embeddings. (Vocabulary size: 288,281.)
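Because the BERT-CRel-all vocabulary mixes MeSH codes and English words in one space, concept-word similarities can be computed directly. The sketch below assumes MeSH descriptors appear in the vocabulary as their descriptor IDs (e.g., "D003920" for Diabetes Mellitus); verify the exact token convention against the released vocabulary before relying on it.

```python
# Sketch: concept-word queries in the joint embedding space.
# Assumes MeSH descriptors are stored under their IDs (e.g., "D003920").
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("BERT-CRel-all.vec", binary=False)

if "D003920" in kv.key_to_index:
    # Vocabulary items closest to the concept vector.
    print(kv.most_similar("D003920", topn=10))
    # Direct concept-to-word cosine similarity.
    print(kv.similarity("D003920", "diabetes"))
```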
More details can be found in our pre-print.
Files (3.6 GB)
| MD5 checksum | Size |
|---|---|
| 6f85d34c197368796f30a126b7176578 | 531.0 MB |
| 1bb7e584d4503592292a8c244a698321 | 1.3 GB |
| a5ac5b7a419726a3d5eb2d4e2a22dbef | 71.9 MB |
| 86c150da930987fd6746e67d1f9b8449 | 168.0 MB |
| f6bb738f5cf22323d0effde81eb995e9 | 459.1 MB |
| dbf234c788fe1714238d0ddb95e7483e | 1.1 GB |
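Downloads can be checked against the MD5 checksums above. A small sketch, using only the Python standard library; the local file name is hypothetical, and the expected value shown is the first checksum from the table.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "6f85d34c197368796f30a126b7176578"  # first row of the table above
actual = md5sum("downloaded_file")             # hypothetical local file name
print("OK" if actual == expected else "checksum mismatch")
```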