Published November 7, 2018 | Version 2.0
Dataset Open

eDiseases Dataset


The eDiseases dataset contains patient data from the MedHelp health site, where different communities share information and opinions about diseases. Each community consists of a number of conversations, a conversation being a sequence of comments posted by patients.

To build the dataset, we automatically extracted 10 conversations from each of the following three communities: allergies, crohn, and breast cancer. We selected a set of diseases that, according to medical expert opinion, show high heterogeneity in both the degree of medical understanding of the diseases and the profile of the users. The conversations were selected randomly, but we automatically filtered out conversations with fewer than 10 posts. In total, we extracted 146 posts for allergies, 191 posts for crohn, and 142 posts for breast cancer, which comprise 983 sentences for allergies, 1,780 sentences for crohn, and 1,029 sentences for breast cancer, covering a six-year time interval. Three frequent users of health forums annotated each sentence in the dataset as:


In case of doubt, the annotators labeled the sentence as NOT_LABELED. As a result, we collected 967 labeled sentences for allergies, 1,709 labeled sentences for crohn, and 959 labeled sentences for breast cancer.
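The arithmetic behind these figures can be checked with a short sketch. It uses only the sentence counts quoted in this description and assumes that every sentence missing from the labeled totals was marked NOT_LABELED; the variable and function names are illustrative and are not part of the dataset files.

```python
# Sentence counts quoted in the dataset description.
total_sentences = {"allergies": 983, "crohn": 1780, "breast cancer": 1029}
labeled_sentences = {"allergies": 967, "crohn": 1709, "breast cancer": 959}


def not_labeled_counts(total, labeled):
    """Return, per community, the number of sentences without a label.

    Assumption: the gap between total and labeled sentences corresponds
    to the sentences the annotators marked as NOT_LABELED.
    """
    return {name: total[name] - labeled[name] for name in total}


not_labeled = not_labeled_counts(total_sentences, labeled_sentences)

for name, count in not_labeled.items():
    share = 100 * count / total_sentences[name]
    print(f"{name}: {count} NOT_LABELED ({share:.1f}% of {total_sentences[name]})")

print("all sentences:", sum(total_sentences.values()))
print("labeled sentences:", sum(labeled_sentences.values()))
```

Under that assumption, the share of unlabeled sentences per community gives a rough indication of how often the annotators were in doubt.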


Files (105.5 kB)
