Published February 1, 2018 | Version V1
Dataset Open

A Corpus of Online Drug Usage Guideline Documents Annotated with Type of Advice

  • 1. University of Virginia
  • 2. University of California, Los Angeles


Introduction: The goal of this dataset is to aid NLP research on recognizing safety-critical information in drug usage guideline (DUG) or patient handout data. The dataset contains annotated advice statements from 90 online DUG documents, which correspond to 90 drugs or medications used in the prescriptions of patients suffering from one or more chronic diseases. The advice statements are annotated in eight safety-critical categories: activity or lifestyle related, disease or symptom related, drug administration related, exercise related, food or beverage related, other drug related, pregnancy related, and temporal.

Data Collection: The data was collected from MedScape, one of the most widely used references for health care providers. First, 34 real anonymized prescriptions of patients suffering from one or more chronic diseases were collected. These prescriptions contain 165 drugs that are used to treat chronic diseases. MedScape was then crawled to collect the drug usage guideline (DUG) / patient handout for each of these 165 drugs. However, MedScape does not have a DUG document for every drug; DUG documents were found for 90 of the 165 drugs.

Data Annotation Tool: A data annotation tool was developed to ease the annotation process. It allows the user to select a DUG document and a position within that document, given as a line number. It keeps a log of the annotator's activity and loads the most recent position from that log when the application is launched. It supports annotating multiple files for the same drug, since a single drug often has multiple overlapping sources of usage guidelines. DUG documents often contain formatted text, and the tool supports annotating such text as well. The annotation tool is available upon request.
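The resume-from-log behaviour described above can be illustrated with a minimal sketch. This is not the tool's actual code: the JSON-lines log format and the names record_position and latest_position are assumptions made here for illustration only.

```python
import json
from pathlib import Path


def record_position(log_path, document, line_number):
    """Append the annotator's current position to the log (hypothetical format)."""
    entry = {"document": document, "line": line_number}
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def latest_position(log_path):
    """Return the most recently logged position, or None if nothing is logged."""
    path = Path(log_path)
    if not path.exists():
        return None
    last = None
    with path.open(encoding="utf-8") as fh:
        for raw in fh:
            if raw.strip():  # skip blank lines
                last = json.loads(raw)
    return last
```

An append-only log keeps every earlier position, so the same file could also serve as a record of annotation progress.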

Annotated Data Description: The annotated data contains the annotation tag(s) of each advice statement extracted from the 90 online DUG documents. It also contains the phrases or topics in the advice statement that trigger the annotation tag, such as activity, exercise, medication name, food or beverage name, disease name, and pregnancy condition (gestational, postpartum). Sometimes a disease is not named directly but is mentioned as a condition (e.g., stomach bleeding, alcohol abuse) or as the state of a parameter (e.g., low blood sugar, low blood pressure). The annotated data is formatted as follows:
drug name, drug number, line number of the first sentence of the advice in the DUG document, advice text, advice tag(s), and the medication, food, activity, exercise, and disease names mentioned in the advice.
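As a rough illustration of the record layout listed above, the sketch below reads one annotated row into a typed record. Several details are assumptions rather than facts about the released files: that rows are comma-separated with ten columns in the listed order, that multiple values in one cell are separated by semicolons, and all field and function names. The sample row is made up and is not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# The eight advice categories named in the introduction.
ADVICE_TAGS = {
    "activity or lifestyle related",
    "disease or symptom related",
    "drug administration related",
    "exercise related",
    "food or beverage related",
    "other drug related",
    "pregnancy related",
    "temporal",
}


@dataclass
class AdviceRecord:
    drug_name: str
    drug_number: int
    line_number: int  # line of the advice's first sentence in the DUG document
    advice_text: str
    tags: List[str]
    medications: List[str]
    foods: List[str]
    activities: List[str]
    exercises: List[str]
    diseases: List[str]


def split_multi(cell, sep=";"):
    """Split a multi-valued cell; the ';' separator is an assumption."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


def parse_row(row):
    """Turn one ten-column row into an AdviceRecord, validating its tags."""
    if len(row) != 10:
        raise ValueError(f"expected 10 columns, got {len(row)}")
    tags = split_multi(row[4])
    unknown = [tag for tag in tags if tag not in ADVICE_TAGS]
    if unknown:
        raise ValueError(f"unknown advice tag(s): {unknown}")
    return AdviceRecord(
        drug_name=row[0],
        drug_number=int(row[1]),
        line_number=int(row[2]),
        advice_text=row[3],
        tags=tags,
        medications=split_multi(row[5]),
        foods=split_multi(row[6]),
        activities=split_multi(row[7]),
        exercises=split_multi(row[8]),
        diseases=split_multi(row[9]),
    )


# Made-up row for illustration only; not a record from the dataset.
SAMPLE_ROW = (
    'exampledrug,1,12,"Avoid drinking alcohol while taking this medication.",'
    "food or beverage related,,alcohol,,,"
)
record = parse_row(next(csv.reader(io.StringIO(SAMPLE_ROW))))
```

Checking tags against the eight categories catches misspelled labels early, which matters if the records are used as classifier training data.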

Unannotated Data Description:
The unannotated data contains the raw DUG documents for the 90 drugs. It also contains drug interaction information for the 165 drugs, categorized into four classes: contraindicated, serious, monitor closely, and minor. This information can be used to automatically detect potential interactions, and the effects of those interactions, among multiple drugs.
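One way the interaction information might be used is sketched below: given the drugs on a prescription, look up every pair in an interaction table and report the hits, most severe first. The table structure (unordered drug pairs mapped to a class), the ordering of the four classes from minor to contraindicated, the function name, and the placeholder drug names are all assumptions for illustration; the released files may be organized differently.

```python
from enum import IntEnum
from itertools import combinations


class Severity(IntEnum):
    """The four interaction classes, ordered here from least to most severe."""
    MINOR = 1
    MONITOR_CLOSELY = 2
    SERIOUS = 3
    CONTRAINDICATED = 4


def find_interactions(prescription, table):
    """Return (drug, drug, severity) for each interacting pair, worst first.

    `table` maps frozenset({drug_1, drug_2}) to a Severity, so lookups do
    not depend on the order in which the two drugs are given.
    """
    hits = []
    for first, second in combinations(sorted(set(prescription)), 2):
        severity = table.get(frozenset((first, second)))
        if severity is not None:
            hits.append((first, second, severity))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits


# Placeholder table with made-up drug names; not data from the dataset.
EXAMPLE_TABLE = {
    frozenset(("drug_a", "drug_b")): Severity.SERIOUS,
    frozenset(("drug_b", "drug_c")): Severity.MINOR,
    frozenset(("drug_a", "drug_d")): Severity.CONTRAINDICATED,
}
example_hits = find_interactions(["drug_c", "drug_a", "drug_b"], EXAMPLE_TABLE)
```

Keying the table on unordered pairs avoids storing each interaction twice and avoids missed lookups when the drugs are listed in the opposite order.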

Citation: If you use this dataset in your work, please cite the following reference in any publication:

title={A Corpus of Drug Usage Guidelines Annotated with Type of Advice},
author={Sarah Masud Preum, Md. Rizwan Parvez, Kai-Wei Chang, and John A. Stankovic},
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
publisher={European Language Resources Association (ELRA)},


Files (1.4 MB)
