Extracting Biomedical Entities from Noisy Audio Transcripts--Dataset
Description
SUMMARY:
This repo contains the CADEC and Synthetic BTACT datasets that were used for the paper titled "Extracting Biomedical Entities from Noisy Audio Transcripts."
The dataset includes two sets: i) CADEC (Karimi et al., 2015) and ii) Synthetic BTACT. CADEC is a well-known NER dataset for identifying adverse drug reactions in what patients have written about their experiences. Synthetic BTACT is data that we generated ourselves, based on questions similar to those in the Brief Test of Adult Cognition by Telephone (BTACT) (Tun & Lachman, 2006).
CADEC includes two sets of audio files: one read aloud from the original CADEC scripts, and the other with additional audio noise. It also includes the original CADEC scripts, the annotations, and transcripts of the noisy audio, generated using Whisper. The annotations cover the named entities, their types, and the string indices of their occurrences in the text. The annotations also include "AnnotatorNotes" entries, which explain some of the annotations.
The Synthetic BTACT data include two types: i) animals and ii) fruits. As with CADEC, there are two sets of audio files: one read from the original scripts and another with additional audio noise. The text files include the original scripts, the annotations, and the Whisper-generated transcripts of the noisy audio files. Each annotation includes an index, the entity type, the start/end string indices, and the named entity.
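The annotation fields described above (index, type, string indices, named entity) can be read with a short parser. The sketch below assumes the `.ann` files use a tab-separated, brat-style standoff layout (`T1<TAB>ADR 9 19<TAB>bit drowsy`), which this description does not state explicitly. The `Entity` type, the `parse_ann_line` helper, and the sample line are hypothetical and written for illustration, not taken from the dataset.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    index: str                      # annotation id, e.g. "T1"
    type: str                       # entity type, e.g. "ADR"
    spans: list[tuple[int, int]]    # (start, end) string indices
    text: str                       # the named entity as written


def parse_ann_line(line: str) -> Optional[Entity]:
    """Parse one annotation line; return None for non-entity lines.

    Assumed layout (brat-style, not confirmed by the description):
    <index> TAB <type> <start> <end>[;<start> <end>...] TAB <text>
    """
    parts = line.rstrip("\n").split("\t")
    # Note lines (ids starting with "#", e.g. AnnotatorNotes) are skipped.
    if len(parts) != 3 or not parts[0].startswith("T"):
        return None
    index, middle, text = parts
    ent_type, _, offsets = middle.partition(" ")
    spans = []
    # A ";" separates the fragments of a discontinuous entity, if any.
    for chunk in offsets.split(";"):
        start, end = chunk.split()
        spans.append((int(start), int(end)))
    return Entity(index, ent_type, spans, text)


# Hypothetical example line, written for illustration only.
sample = "T1\tADR 9 19\tbit drowsy"
entity = parse_ann_line(sample)
print(entity.type, entity.spans, entity.text)
```

If the files turn out to use a different separator or field order, only the splitting logic inside `parse_ann_line` should need to change.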
REFERENCES:
Karimi, S., Metke-Jimenez, A., Kemp, M., & Wang, C. (2015). Cadec: A corpus of adverse drug event annotations. Journal of Biomedical Informatics, 55, 73-81.
Tun, P. A., & Lachman, M. E. (2006). Telephone assessment of cognitive function in adulthood: the Brief Test of Adult Cognition by Telephone. Age and Ageing, 35(6), 629-632.
DETAILS:
Data_1: CADEC (1250 TextFiles, 1000 Audio, types=5):
General Categories and Counts
ADR (Adverse Drug Reactions): 5316
DRUG: 1797
FINDING: 397
DISEASE: 280
SYMPTOM: 255
Specific Items (Drugs) and Counts
Arthrotec: 145
cambia: 4
cataflam: 10
diclofenac-potassium: 3
diclofenac-sodium: 7
flector: 1
Lipitor: 997
Pennsaid: 4
solarez: 3
voltaren: 46
voltaren-rx: 22
zipsor: 5
Data_2: Synthetic BTACT (500 Fruits, 500 Animals, types=2)
>> Audio files can be matched with their annotations, scripts, and transcripts by filename.
---
audio [original & noisy]:
1. cadec
  1.1 cadec original
  1.2 cadec noisy
2. synthetic btact
  2.1 btact original
    2.1.1 fruits
      fruit-script[0:500].mp3
    2.1.2 animals
      script-[0:500].mp3
  2.2 btact noisy
    2.2.1 fruits
      fruit-script[0:500].mp3
    2.2.2 animals
      script-[0:500].mp3
text [scripts, annotations, transcripts]:
1. cadec
  1.1 scripts [1,250]
  1.2 annotations [1,250] (index/AnnotatorNotes, type, indices, named entities)
  1.3 transcripts [1,000]
2. synthetic btact
  2.1 animals
    2.1.1 scripts (original scripts)
      script-[0:500].txt
    2.1.2 annotations
      script-[0:500].ann (index, type, start/end indices, named entity)
    2.1.3 transcripts
      script-[0:500].txt
  2.2 fruits
    2.2.1 scripts (original scripts)
      script-[0:500].txt
    2.2.2 annotations
      script-[0:500].ann (index, type, start/end indices, named entity)
    2.2.3 transcripts
      fruit-script-[0:500].txt
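Because audio files can be matched with annotations, scripts, and transcripts by filename, one way to pair them is to group files by their stem (the filename without its extension). The sketch below is an illustration under assumptions: the `group_by_stem` and `complete_groups` helpers are hypothetical, each source folder is assumed to hold at most one file per stem, and where prefixes differ between folders (for example `fruit-script` versus `script-`) the grouping key would need adjusting.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(*directories: Path) -> dict[str, list[Path]]:
    """Group files from several directories by filename stem."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for directory in directories:
        # Sort for a deterministic order within each directory.
        for path in sorted(directory.iterdir()):
            if path.is_file():
                groups[path.stem].append(path)
    return dict(groups)


def complete_groups(
    groups: dict[str, list[Path]], n_sources: int
) -> dict[str, list[Path]]:
    """Keep only stems with one file per source directory."""
    return {
        stem: paths
        for stem, paths in groups.items()
        if len(paths) == n_sources
    }
```

For instance, `group_by_stem(Path("audio/noisy"), Path("text/annotations"))` would map each stem to the paths sharing it; both directory names here are placeholders, not the archive's actual layout.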
CITATION:
Ebadi, N., Morgan, K., Tan, A., Linares, B., Osborn, S., Majors, E., Davis, J., & Rios, A. (2024). Extracting biomedical entities from noisy audio transcripts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
Files
bio-ner.zip (1.7 GB)
md5:61b0692e34813256cca5be4a49d48ee2
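The MD5 checksum listed for bio-ner.zip can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` and the local file path are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksum listed for bio-ner.zip on this record.
EXPECTED_MD5 = "61b0692e34813256cca5be4a49d48ee2"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Chunked reads avoid loading the 1.7 GB archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Check a downloaded file against the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download(Path("bio-ner.zip"))` returns `True` only if the local file's digest equals the listed checksum; the path assumes the archive sits in the current working directory.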
Additional details
Dates
- Other: 2023-10-05 (Created, Mixed, Transcribed)