Published December 21, 2020 | Version v1

BERT-CRel: Improved Biomedical Word Embeddings in the Transformer Era

  • University of Kentucky

Description

BERT-CRel is a transformer-based method for improving biomedical word embeddings, which are jointly fine-tuned along with concept embeddings: the embeddings are first pre-trained with fastText and then fine-tuned in a transformer (BERT) setup. The goal is to provide high-quality pre-trained biomedical embeddings that the research community can use in any downstream task. The corpus used for BERT-CRel consists of biomedical citations from PubMed, and the concepts come from the Medical Subject Headings (MeSH) terminology used to index those citations.
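For downstream use, the released vectors can be loaded with standard tooling. Below is a minimal sketch using gensim, assuming the files are distributed in the plain word2vec text format; the file name "BERT-CRel-words.vec" is a placeholder for the actual download:

    from gensim.models import KeyedVectors

    # Placeholder file name; substitute the downloaded words-only file.
    vectors = KeyedVectors.load_word2vec_format("BERT-CRel-words.vec", binary=False)

    print(vectors.vector_size)                       # embedding dimensionality
    print(vectors.most_similar("diabetes", topn=5))  # nearest neighbors, assuming the word is in vocabulary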

BERT-CRel-all

These files contain word embeddings together with embeddings for all MeSH descriptors and for the subset of MeSH supplementary concepts that meet a frequency threshold. The vocabulary is divided into three sections: (1) BERT special tokens, (2) MeSH codes, and (3) English words in descending frequency order (vocabulary size: 333,301).

BERT-CRel-MeSH

These files contain only the MeSH code embeddings (vocabulary size: 45,015).

BERT-CRel-words

These files contain only the English word embeddings (vocabulary size: 288,281).
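Because BERT-CRel-all places words and MeSH codes in one shared vocabulary, cross-type similarity queries are possible. A minimal sketch follows, again assuming word2vec text format; the file name is a placeholder, and the exact token form of MeSH codes in the vocabulary (here the bare descriptor ID "D003920" for Diabetes Mellitus) is an assumption:

    from gensim.models import KeyedVectors

    # "BERT-CRel-all.vec" is a placeholder for the actual download.
    kv = KeyedVectors.load_word2vec_format("BERT-CRel-all.vec", binary=False)

    # Words and MeSH codes share one embedding space; the descriptor ID
    # "D003920" (Diabetes Mellitus) is an assumed token form, so guard
    # the lookup before querying.
    if "D003920" in kv.key_to_index:
        print(kv.similarity("diabetes", "D003920"))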

 

More details can be found in our pre-print.

Files (3.6 GB)

MD5                                   Size
md5:6f85d34c197368796f30a126b7176578  531.0 MB
md5:1bb7e584d4503592292a8c244a698321  1.3 GB
md5:a5ac5b7a419726a3d5eb2d4e2a22dbef  71.9 MB
md5:86c150da930987fd6746e67d1f9b8449  168.0 MB
md5:f6bb738f5cf22323d0effde81eb995e9  459.1 MB
md5:dbf234c788fe1714238d0ddb95e7483e  1.1 GB