HealthE

Joseph Gatto; Parker Seegmiller; Garrett Johnston; Madhusudan Basak; Sarah Masud Preum

doi:10.5281/zenodo.7539392

Published January 15, 2023 | Version v1

Dataset Open


1. Dartmouth College

# HealthE Dataset

HealthE contains 3,400 pieces of health advice gathered (1) from public health websites (i.e., WebMD.com, MedlinePlus.gov, CDC.gov, and MayoClinic.org) and (2) from the publicly available [Preclude dataset](https://userpages.umbc.edu/~nroy/courses/shhasp18/papers/p286-preum.pdf). Each sample was hand-labeled for health entity recognition by a team of 14 annotators at the authors' institution. Automatic recognition of health entities will enable further research in large-scale modeling of texts from online health communities.

The data is provided in two parts. Both files are serialized in Python's `pickle` format and are loaded with the free, open-source `pandas` library.

`healthe.pkl` is a `pandas.DataFrame` object containing the 3,400 health-advice statements with hand-labeled health entities.

`non_advice.pkl` is a `pandas.DataFrame` object containing 2,256 non-advice statements.

To load the files in Python, use the following code block.
```
import pandas as pd

# Each pickle file deserializes directly into a pandas.DataFrame.
healthe_df = pd.read_pickle('healthe.pkl')
non_advice_df = pd.read_pickle('non_advice.pkl')
```

`healthe_df` has four columns:
* `text` contains the health advice statement text.
* `entities` contains a Python list of (entity, class) tuples.
* `tokenized_text` contains a list of tokens obtained by tokenizing the health advice statement text.
* `labels` contains a list of the same length as `tokenized_text`, in which each token is mapped to a class label.
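
As a rough illustration of how these columns fit together, the sketch below pairs each token with its label and tallies entity classes. The helper names (`pair_tokens_with_labels`, `count_entity_classes`) are hypothetical, and the toy row uses made-up values: the class name `FOOD` and the non-entity label `O` are placeholders, not taken from the dataset.

```python
from collections import Counter


def pair_tokens_with_labels(tokenized_text, labels):
    """Zip tokens with their per-token labels, checking that lengths match."""
    if len(tokenized_text) != len(labels):
        raise ValueError(
            f"{len(tokenized_text)} tokens but {len(labels)} labels"
        )
    return list(zip(tokenized_text, labels))


def count_entity_classes(entities_column):
    """Count class occurrences across rows of (entity, class) tuples."""
    counts = Counter()
    for entities in entities_column:
        for _entity, entity_class in entities:
            counts[entity_class] += 1
    return counts


# Toy row with invented values that mimics the four documented columns.
# "FOOD" and "O" are placeholder labels, not the dataset's real label set.
toy_row = {
    "text": "Drink water daily",
    "entities": [("water", "FOOD")],
    "tokenized_text": ["Drink", "water", "daily"],
    "labels": ["O", "FOOD", "O"],
}

pairs = pair_tokens_with_labels(toy_row["tokenized_text"], toy_row["labels"])
class_counts = count_entity_classes([toy_row["entities"]])
print(pairs)
print(class_counts)
```

With the real data loaded, the same helpers could be applied as `count_entity_classes(healthe_df["entities"])` or row by row via `healthe_df.apply(lambda row: pair_tokens_with_labels(row["tokenized_text"], row["labels"]), axis=1)`.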

`non_advice_df` has one column, `text`, containing the text of each non-advice statement.

Files (2.2 MB)

| Name | Size | MD5 |
| --- | --- | --- |
| healthe.pkl | 1.3 MB | 272b743593b9b4322f12a68c44c1f5e9 |
| non_advice.pkl | 915.2 kB | 11a922588b8b382aaf74b2a4f1ef7a2c |
| README.txt | 1.7 kB | 12a400271324c48800ab5c72a45435e0 |
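
The MD5 values listed above can be used to check that a download is intact before unpickling it. The following is a minimal sketch using only the standard library; the helper names (`md5_of_file`, `verify_download`) are hypothetical, and the file paths are assumed to point at local copies of the downloaded files.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "healthe.pkl": "272b743593b9b4322f12a68c44c1f5e9",
    "non_advice.pkl": "11a922588b8b382aaf74b2a4f1ef7a2c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify_download("healthe.pkl", EXPECTED_MD5["healthe.pkl"])` should return `True` for an uncorrupted copy stored in the current directory.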

