Published March 23, 2024 | Version 1.0
Dataset Open

Extracting Biomedical Entities from Noisy Audio Transcripts--Dataset

Description

SUMMARY:

This repo contains the CADEC and Synthetic BTACT datasets that were used for the paper titled "Extracting Biomedical Entities from Noisy Audio Transcripts."

The dataset includes two sets: i) CADEC (Karimi et al., 2015) and ii) Synthetic BTACT. CADEC is a well-known NER dataset for identifying adverse drug reactions in what patients have written about their experiences. Synthetic BTACT is data that we created ourselves, based on questions similar to those in the Brief Test of Adult Cognition by Telephone (BTACT) (Tun & Lachman, 2006).

CADEC includes two sets of audio files: one read from the original CADEC scripts and the other with added audio noise. It also includes the original CADEC scripts, the annotations, and the transcripts of the noisy audio, which were generated using Whisper. The annotations cover the named entities, their types, and the string indices where they occur in the text. The annotations also include "AnnotatorNotes", which explain some of the annotations.

The Synthetic BTACT data include two types: i) animals and ii) fruits. As with CADEC, there are two sets of audio files: one read from the original scripts and another with added audio noise. The text files include the original scripts, the annotations, and the Whisper transcripts of the noisy audio files. The annotations include an index for each named entity, its string indices, and its type.
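As an illustration of how the annotation fields described above (index, type, string indices, named entity) might be read, here is a minimal Python sketch. It assumes the .ann files follow a brat-style standoff layout: three tab-separated fields per entity line, note lines beginning with "#", and ";" between the parts of a discontinuous span. That layout, the Entity class, and the parse_ann function are assumptions made for illustration and are not confirmed by this record, so check a few files before relying on it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    ann_id: str                   # annotation index, e.g. "T1"
    ent_type: str                 # entity type, e.g. "ADR" or "DRUG"
    spans: List[Tuple[int, int]]  # (start, end) character indices
    text: str                     # the named entity as it appears in the script


def parse_ann(content: str) -> List[Entity]:
    """Parse entity lines from annotation text (assumed brat-style layout)."""
    entities = []
    for line in content.splitlines():
        # Skip blank lines and note lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not an entity line under the assumed layout
        ann_id, type_and_offsets, text = parts
        ent_type, _, offsets = type_and_offsets.partition(" ")
        # Discontinuous entities are assumed to separate spans with ';'.
        spans = []
        for chunk in offsets.split(";"):
            start, end = chunk.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, ent_type, spans, text))
    return entities
```

Calling parse_ann on the contents of one annotation file would return one Entity per named entity, with note lines skipped. If the notes are needed, they could be collected in a second pass.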

REFERENCES:

Karimi, S., Metke-Jimenez, A., Kemp, M., & Wang, C. (2015). Cadec: A corpus of adverse drug event annotations. Journal of Biomedical Informatics, 55, 73-81.

Tun, P. A., & Lachman, M. E. (2006). Telephone assessment of cognitive function in adulthood: the Brief Test of Adult Cognition by Telephone. Age and Ageing, 35(6), 629-632.

DETAILS:

Data_1: CADEC (1250 TextFiles, 1000 Audio, types=5):

    General Categories and Counts:
        ADR (Adverse Drug Reactions): 5316
        DRUG: 1797
        FINDING: 397
        DISEASE: 280
        SYMPTOM: 255
    Specific Items (Drugs) and Counts:
        Arthrotec: 145
        cambia: 4
        cataflam: 10
        diclofenac-potassium: 3
        diclofenac-sodium: 7
        flector: 1
        Lipitor: 997
        Pennsaid: 4
        solarez: 3
        voltaren: 46
        voltaren-rx: 22
        zipsor: 5

Data_2: Synthetic BTACT (500 Fruits, 500 Animals, types=2)

>> Audio files can be matched with their annotations, scripts, and transcripts using their filenames.
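The filename matching can be sketched in Python as follows. This is a minimal example that assumes matching files share the same name apart from the extension. The match_by_stem function and the folder labels passed to it are hypothetical names chosen for illustration. Where the prefixes differ between folders (for example, the fruits audio and text files are listed with different prefixes), the key would need to be normalized first.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict


def match_by_stem(folders: Dict[str, str]) -> Dict[str, Dict[str, Path]]:
    """Group files from several folders by filename without extension.

    `folders` maps a label of your choosing (e.g. "audio") to a directory.
    The result maps each filename stem to {label: path}.
    """
    matched: Dict[str, Dict[str, Path]] = defaultdict(dict)
    for kind, folder in folders.items():
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                matched[path.stem][kind] = path
    return dict(matched)
```

A stem whose inner dictionary lacks one of the labels has no counterpart in that folder, which is a quick way to spot unmatched files.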

---
audio [original&noisy]:
    1. cadec
        1.1 cadec original
        1.2 cadec noisy
    2. synthetic btact
        2.1 btact original
            2.1.1 fruits
                fruit-script[0:500].mp3
            2.1.2 animals
                script-[0:500].mp3
        2.2 btact noisy
            2.2.1 fruits
                fruit-script[0:500].mp3
            2.2.2 animals
                script-[0:500].mp3
text [scripts, annotations, transcripts]:
    1. cadec
        1.1 scripts [1,250]
        1.2 annotations [1,250] (index/AnnotatorsNote, type, indices, named-entities)
        1.3 transcripts [1,000]
    2. synthetic btact
        2.1 animals
            2.1.1 scripts (original scripts)
                script-[0:500].txt
            2.1.2 annotations
                script-[0:500].ann (index, type, start/end indices, named entity)
            2.1.3 transcripts
                script-[0:500].txt
        2.2 fruits
            2.2.1 scripts (original scripts)
                script-[0:500].txt
            2.2.2 annotations
                script-[0:500].ann (index, type, start/end indices, named entity)
            2.2.3 transcripts
                fruit-script-[0:500].txt

 

CITATION:

Ebadi, N., Morgan, K., Tan, A., Linares, B., Osborn, S., Majors, E., Davis, J., & Rios, A. (2024). Extracting biomedical entities from noisy audio transcripts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

Files

bio-ner.zip (1.7 GB)
md5:61b0692e34813256cca5be4a49d48ee2
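
The md5 value listed for bio-ner.zip can be used to check that the download is complete. The following is a minimal Python sketch using the standard hashlib module. The function names md5_of_file and verify_archive, and the default local path, are assumptions chosen for illustration.

```python
import hashlib

# Checksum as listed on this record for bio-ner.zip.
EXPECTED_MD5 = "61b0692e34813256cca5be4a49d48ee2"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so the 1.7 GB archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str = "bio-ner.zip") -> bool:
    """Compare the local file's digest with the listed checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

If verify_archive returns False, the file is likely incomplete or corrupted and should be downloaded again.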

Additional details

Dates

Other: 2023-10-05 (Created, Mixed, Transcribed)