MEDDOCAN corpus: gold standard annotations for Medical Document Anonymization on Spanish clinical case reports

Published November 18, 2020 | Version 1.0

Dataset | Open access

Intro:

This record contains the training, development and test sets of the MEDDOCAN shared task, with Gold Standard annotations.

It also includes the documents of the MEDDOCAN background set, which are distributed without annotations.

Annotation quality:

Inter-annotator agreement: 98%

For more information, see the paper cited under Resources.

Format:

Annotations are distributed in Brat format. See the Brat webpage for more information.

The annotations are also distributed in an XML format based on the i2b2 XML format.

The MEDDOCAN webpage provides a script to convert between the MEDDOCAN-Brat, MEDDOCAN-XML, and i2b2 formats.
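As a rough illustration of the Brat layout, the sketch below reads the text-bound annotation lines (IDs starting with "T") from the contents of a Brat .ann file. It relies only on the general Brat standoff convention (ID, tab, label and character offsets, tab, covered text, with ";" separating the fragments of a discontinuous span). The names Entity, parse_brat_entities and SAMPLE are hypothetical, and the sample labels and offsets are made up for illustration rather than taken from the corpus.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    """One text-bound Brat annotation (hypothetical helper type)."""
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # (start, end) character offsets
    text: str


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ('T') lines from the contents of a Brat .ann file."""
    entities = []
    for line in ann_text.splitlines():
        # Skip notes ('#'), relations ('R'), attributes ('A'), blanks, etc.
        if not line.startswith("T"):
            continue
        ann_id, label_and_offsets, surface = line.split("\t", 2)
        label, offsets = label_and_offsets.split(" ", 1)
        # Discontinuous spans are written as "start end;start end".
        spans = [
            tuple(int(value) for value in fragment.split())
            for fragment in offsets.split(";")
        ]
        entities.append(Entity(ann_id, label, spans, surface))
    return entities


# Made-up sample, for illustration only (not taken from the corpus).
SAMPLE = (
    "T1\tTERRITORIO 10 16\tMadrid\n"
    "#1\tAnnotatorNotes T1\texample note\n"
    "T2\tFECHAS 0 4;6 8\t2019 05\n"
)

for entity in parse_brat_entities(SAMPLE):
    print(entity.ann_id, entity.label, entity.spans, entity.text)
```

For the format conversions themselves, the script on the MEDDOCAN webpage remains the reference; this sketch only shows how the standoff lines are laid out.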

Shared task goal:

In all three subtasks, the goal is to predict the annotations given only the plain text files.

Resources:

Web
Citation: Montserrat Marimon et al. “Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results.” In: IberLEF@SEPLN. 2019, pp. 618–638.
Silver Standard corpus
Annotation guidelines

For further information, please visit https://temu.bsc.es/meddocan/ or email us at encargo-pln-life@bsc.es

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Name	Size	MD5
meddocan.zip	11.7 MB	6a09eb975580fdf56bc7041eadc9c921