Published March 15, 2022 | Version v3
Dataset Open

MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish

Description

UPDATE 27/09/2022: A complete normalization of all mentions in the corpus to SNOMED CT has been added to the 'meddoprof-norm.tsv' file.

This repository contains the complete MEDDOPROF Gold Standard, a collection of 1,844 clinical cases in Spanish annotated for occupations, working statuses and activities. MEDDOPROF is a shared task held in 2021 that explores the application of natural language processing to occupational health. To learn more, please visit: https://temu.bsc.es/meddoprof.

Folder and File Structure

The corpus files are provided in the format used by the brat annotation tool: for each clinical case there is a .txt file with the text and a .ann file with the corresponding annotations.
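As an illustration of how a .txt/.ann pair fits together, here is a minimal sketch that parses text-bound annotations in brat's standoff layout (ID, then label and character offsets, then the mention text, separated by tabs). The sample sentence, its offsets and the names `Mention` and `parse_ann` are invented for this example and are not taken from the corpus.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mention:
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_ann(ann_content: str) -> List[Mention]:
    """Parse the text-bound ("T") lines of a brat .ann file."""
    mentions = []
    for line in ann_content.splitlines():
        # Only text-bound annotations are handled; notes, relations,
        # etc. are skipped in this sketch.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, label_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = label_and_offsets.partition(" ")
        # Discontinuous mentions list several fragments separated by ";".
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        mentions.append(Mention(ann_id, label, spans, text))
    return mentions


# Invented example, not an actual corpus document.
SAMPLE_TXT = "El paciente trabaja como conductor de autobuses."
SAMPLE_ANN = (
    "T1\tPROFESION 25 47\tconductor de autobuses\n"
    "#1\tAnnotatorNotes T1\tinvented note, ignored by the parser\n"
)

for mention in parse_ann(SAMPLE_ANN):
    start, end = mention.spans[0]
    # Offsets index into the matching .txt content.
    assert SAMPLE_TXT[start:end] == mention.text
```

For real files, the contents of each .ann file would be read as UTF-8 and passed to `parse_ann`, with the matching .txt file used to check the offsets.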

- meddoprof-ner/

Clinical cases annotated with these labels: PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).

- meddoprof-class/

Clinical cases with the same annotations as 'meddoprof-ner' but with these labels instead: PACIENTE (patient), FAMILIAR (family member), SANITARIO (health professional) or OTRO (other).

- ner_class_joint/

Clinical cases with both levels of annotation (ner and class) combined. That is, a mention labeled PROFESION in meddoprof-ner and PACIENTE in meddoprof-class is labeled PROFESION-PACIENTE here.

- meddoprof-norm.tsv

Tab-separated file (.tsv) with the mapping of each mention in the corpus to ESCO and SNOMED CT. The file has five columns: filename, mention text, span, ESCO code and SNOMED code.
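A minimal sketch of reading the five columns described above with Python's standard `csv` module. Whether the file has a header row and how the span column is encoded are not stated here, so the `has_header` flag, the column keys and the sample row (including its placeholder codes) are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Keys chosen for this sketch; they mirror the five columns listed above.
COLUMNS = ["filename", "mention_text", "span", "esco_code", "snomed_code"]


def read_norm_rows(handle: TextIO, has_header: bool = False) -> Iterator[Dict[str, str]]:
    """Yield one dict per well-formed row of a five-column TSV."""
    # QUOTE_NONE keeps any quote characters in mention text untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: skip a single header line
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        yield dict(zip(COLUMNS, row))


# Invented row with placeholder codes, not real ESCO or SNOMED CT values.
SAMPLE_TSV = "case_0001\tconductor de autobuses\t25 47\tESCO_PLACEHOLDER\tSNOMED_PLACEHOLDER\n"

rows = list(read_norm_rows(io.StringIO(SAMPLE_TSV)))
```

With the real file, the handle would come from `open("meddoprof-norm.tsv", encoding="utf-8", newline="")` instead of `io.StringIO`.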

Additionally, two files with the filenames of the train and test partitions are included.
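The joint labels in ner_class_joint/ can be thought of as the two annotation levels concatenated with a hyphen. The sketch below illustrates that idea for mentions that share the same character span; it is not necessarily how the folder was produced, and the function name `join_labels` and the sample offsets are hypothetical.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]


def join_labels(ner: Dict[Span, str], cls: Dict[Span, str]) -> Dict[Span, str]:
    """Combine NER and class labels for mentions with identical spans."""
    joint = {}
    for span, ner_label in ner.items():
        class_label = cls.get(span)
        if class_label is None:
            continue  # no class-level annotation found for this span
        joint[span] = f"{ner_label}-{class_label}"
    return joint


# Invented spans, mirroring the PROFESION-PACIENTE example above.
ner_labels = {(25, 47): "PROFESION"}
class_labels = {(25, 47): "PACIENTE"}
joint_labels = join_labels(ner_labels, class_labels)
```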

 

Please cite if you use this resource:

Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.

@article{meddoprof,
    title = {NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
    author = {Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
    journal = {Procesamiento del Lenguaje Natural},
    volume = {67},
    year = {2021},
    issn = {1989-7553},
    url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
    pages = {243--256}
}

Related Resources:

- Web

- Training Data

- Test set

- Codes Reference List (for MEDDOPROF-NORM)

- Annotation Guidelines

- Occupations Gazetteer

 

MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es

MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL) and the Spanish government's 2020 Proyectos de I+D+i RTI Tipo A (AI4PROFHEALTH - DESCIFRANDO EL PAPEL DE LAS PROFESIONES EN LA SALUD DE LOS PACIENTES A TRAVES DE LA MINERIA DE TEXTOS (PID2020-119266RA-I00)).

Files

MEDDOPROF_GS.zip (14.1 MB)
md5:58b641fe2bc31b934b7566a4506f3704