MeDAL

Wen, Zhi; Lu, Xing Han; Reddy, Siva

doi:10.5281/zenodo.4482900

Published November 9, 2020 | Version v3

Conference paper Open

1. McGill University

Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

📜 Paper
💻 Code
💾 Dataset (Kaggle)
💽 Dataset (Zenodo)
🤗 Pre-trained ELECTRA (Hugging Face)

Downloading the data

We recommend downloading from Zenodo if you do not want to authenticate through Kaggle. The downside of Zenodo is that the data is uncompressed, so the download takes longer. Links to both sources are at the top of this readme. To download the full dataset from Zenodo, run:

wget -nc -P data/ https://zenodo.org/record/4276178/files/full_data.csv

If you want to reproduce our pre-training results, you only need to download the pre-training splits:

wget -nc -P data/ https://zenodo.org/record/4276178/files/train.csv
wget -nc -P data/ https://zenodo.org/record/4276178/files/valid.csv
wget -nc -P data/ https://zenodo.org/record/4276178/files/test.csv
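Because the CSV files are several gigabytes each, it can help to inspect one without loading it fully into memory. The following is a minimal sketch using only the Python standard library; the helper name `peek_csv` is hypothetical, and no column names are assumed, since the header is simply read from the file.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) while reading the file lazily."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # The first row is treated as the header; an empty file yields [].
        header = next(reader, [])
        # islice stops after n_rows, so the rest of the file is never read.
        rows = list(islice(reader, n_rows))
    return header, rows
```

For example, `peek_csv("data/train.csv")` returns the column names and the first three rows of the training split, assuming the file was downloaded to `data/` as in the commands above.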

Model Quickstart

Using Torch Hub

You can directly load LSTM and LSTM-SA with torch.hub:

import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")

If you want to use the ELECTRA model, you first need to install transformers:

pip install transformers

Then, you can load it with torch.hub:

import torch
electra = torch.hub.load("BruceWen120/medal", "electra")

Using Hugging Face transformers

If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load them directly from the Hugging Face repository:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
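As a rough illustration of how the loaded weights might be used, the sketch below encodes one sentence and returns the final-layer hidden states. The helper name `embed_text` is hypothetical and not part of this project; the sketch assumes that `torch` and `transformers` are installed and that the model can be downloaded from the Hugging Face Hub.

```python
def embed_text(text, model_name="xhlu/electra-medal"):
    """Return final-layer hidden states for `text` (hypothetical helper)."""
    # Imported inside the function so that defining the helper does not
    # itself require torch and transformers to be installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # switch off dropout for inference

    # Tokenize into PyTorch tensors, truncating overly long inputs.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # no gradients are needed for feature extraction
        outputs = model(**inputs)

    # Tensor of shape (batch_size, sequence_length, hidden_size).
    return outputs.last_hidden_state
```

Calling `embed_text("The patient was admitted to the ICU.")` returns one vector per token. These are encoder features only; the disambiguation head is not included in these weights.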

Citation

Download the bibtex here, or copy the text below:

@inproceedings{wen-etal-2020-medal,
    title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
    author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
    booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
    pages = "130--135",
}

License, Terms and Conditions

The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.

The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

INTRODUCTION

Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

MEDLINE/PUBMED SPECIFIC TERMS

NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

GENERAL TERMS AND CONDITIONS

Users of the data agree to:

acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,

properly use registration and/or trademark symbols when referring to NLM products, and

not indicate or imply that NLM has endorsed its products/services/applications.

Users who republish or redistribute the data (services, products or raw data) agree to:

maintain the most current version of all distributed data, or

make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.

These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.

Files (23.1 GB)

Name	MD5	Size
full_data.csv	6aafd66617438cee6e4fb43a473166a7	15.2 GB
pretrain_subset.zip	72465461fd42561b7317d18ec2640d33	2.1 GB
test.csv	7ba9fd2e9fa5b4069370bafa42418a85	1.2 GB
train.csv	28d5c3e0b4d80f95b7e20cebfc783379	3.5 GB
valid.csv	28e450e8c5d5542c6b3ba98876adf621	1.2 GB
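The MD5 checksums listed for these files can be used to check that a large download completed without corruption. The following is a minimal sketch using the Python standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the checksums are copied from the file listing above.

```python
import hashlib
from pathlib import Path

# MD5 checksums as given in the file listing of this record.
EXPECTED_MD5 = {
    "full_data.csv": "6aafd66617438cee6e4fb43a473166a7",
    "pretrain_subset.zip": "72465461fd42561b7317d18ec2640d33",
    "test.csv": "7ba9fd2e9fa5b4069370bafa42418a85",
    "train.csv": "28d5c3e0b4d80f95b7e20cebfc783379",
    "valid.csv": "28e450e8c5d5542c6b3ba98876adf621",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use constant even for multi-GB files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the checksum listed for its name."""
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]
```

For example, `verify_download("data/train.csv")` returns `True` only if the local file matches the listed checksum. A file name that is not in the listing raises a `KeyError`.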
