Published May 24, 2020 | Version v1
Dataset Open

Medical Concept Embeddings for SNOMED-CT (Jan 2019 version)

  • 1. IIIT Hyderabad
  • 2. TCS Research

Description

This dataset contains SNOMED-CT medical concept embeddings trained using the following text and graph embedding methods:

  • Averaged Word Embedding (300)
  • ELMo (1024)
  • Universal Sentence Encoder (512)
  • BERT (768)
  • Deepwalk (128)
  • Node2Vec (128)
  • HARP (128)
  • LINE (128)

The tar file contains eight JSON files, one for each of the embedding techniques listed above. The number in parentheses beside each embedding method is the dimensionality of the embedding. Each JSON file contains a Python dictionary of the form

SNOMED concept ID (String): Embedding (List).
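As a minimal sketch of how one of these files might be used, the snippet below loads a single JSON file into a dictionary and compares two concept vectors by cosine similarity. The file name and the concept IDs in the usage comment are placeholders, not names taken from the archive, and loading a whole file with json.load assumes it fits in memory, which may not hold for the larger files.

```python
import json
import math


def load_embeddings(path):
    """Load one embedding file: a JSON object mapping concept ID (str) to a list of floats."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Usage (hypothetical file name and concept IDs; substitute the real ones):
#   embeddings = load_embeddings("deepwalk.json")
#   score = cosine_similarity(embeddings["<concept id A>"], embeddings["<concept id B>"])
```

Only vectors from the same file should be compared, since each method produces embeddings of a different dimensionality.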

If you find this resource useful in your research, please consider citing our paper:

"Pattisapu, N., Patil, S., Palshikar, G. and Varma, V., Medical Concept Normalization by Encoding Target Knowledge, Proceedings of Machine Learning Research 116:246–259, 2020 Machine Learning for Health (ML4H) at NeurIPS 2019"

Warning: The dataset size is large (~12 GB). Please ensure that you have sufficient network bandwidth and disk space before requesting a download.

Files

Files (12.2 GB)

md5:9d0a5e1d0a9261f345933cbb649487c5 (12.2 GB)
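Given the size of the download, it may be worth checking the archive against the MD5 checksum listed above before extracting it. The following is one possible way to do that, reading the file in chunks so that it never has to fit in memory. The archive path in the usage comment is a placeholder, since the file name is not given here.

```python
import hashlib

# Checksum as listed on this page.
EXPECTED_MD5 = "9d0a5e1d0a9261f345933cbb649487c5"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()


# Usage (hypothetical path; substitute the downloaded archive):
#   if not matches_expected("/path/to/downloaded_archive"):
#       print("Checksum mismatch: the download may be incomplete or corrupted.")
```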
