Published November 18, 2020 | Version 1.0
Dataset Open

MEDDOCAN corpus: gold standard annotations for Medical Document Anonymization on Spanish clinical case reports

  • 1. Barcelona Supercomputing Center
  • 2. Centro Nacional de Investigaciones Oncológicas
  • 3. Hospital 12 de Octubre

Description

Intro:

This record contains the training, development and test sets of the MEDDOCAN shared task, with Gold Standard annotations.

It also includes the documents of the MEDDOCAN background set, without annotations.

 

Annotation quality

Inter-annotator agreement: 98% 

For more information, see the paper

 

Format:

Annotations are distributed in Brat standoff format. See the Brat webpage for more information.

The annotations are also distributed in XML format (based on the i2b2 XML format).

The MEDDOCAN webpage provides a script to convert between the MEDDOCAN-Brat, MEDDOCAN-XML, and i2b2 formats.
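As a rough illustration of the Brat layout, the sketch below reads text-bound annotations (lines starting with "T") from the contents of a .ann file and checks them against the matching plain text. It is not the official MEDDOCAN conversion script. The function name parse_brat_entities, the Entity type, the sample sentence and the label names in the sample are assumptions made for this example only.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str   # Brat identifier, e.g. "T1"
    label: str    # entity type
    start: int    # character offset where the span starts
    end: int      # character offset where the span ends (exclusive)
    text: str     # surface string covered by the span


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") lines of a Brat .ann file (hypothetical helper)."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations are handled; notes ("#") and
        # other annotation kinds are skipped.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, type_and_span, surface = parts[0], parts[1], parts[2]
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans look like "0 5;10 15"; this sketch keeps
        # the first start offset and the last end offset.
        fragments = [frag.split() for frag in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, surface))
    return entities


# Made-up sample, not taken from the corpus; label names are illustrative.
sample_txt = "Nombre: Ana Ruiz. Edad: 45."
sample_ann = (
    "T1\tNOMBRE_SUJETO_ASISTENCIA 8 16\tAna Ruiz\n"
    "T2\tEDAD_SUJETO_ASISTENCIA 24 26\t45\n"
)

entities = parse_brat_entities(sample_ann)
for ent in entities:
    # Offsets index into the plain text file that accompanies the .ann file.
    assert sample_txt[ent.start:ent.end] == ent.text
```

In practice the two strings would be read from a document's .txt file and its .ann file. The offset check is a quick way to spot encoding or newline mismatches between them.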

 

Shared task goal:

In all three subtasks, the goal is to predict the annotations given only the plain text files.

 

Resources:

  • Web
  • Citation: Montserrat Marimon et al. “Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results.” In: IberLEF@SEPLN. 2019, pp. 618–638.
  • Silver Standard corpus
  • Annotation guidelines

 

For further information, please visit https://temu.bsc.es/meddocan/ or email us at encargo-pln-life@bsc.es

Copyright (c) 2019 Secretaría de Estado para el Avance Digital (SEAD)

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

meddocan.zip (11.7 MB)
md5:6a09eb975580fdf56bc7041eadc9c921
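
The MD5 checksum listed for meddocan.zip can be used to confirm that a download is complete. A minimal sketch, assuming the archive has been saved locally as meddocan.zip; the helper name md5_of_file and the chunk size are choices made for this example:

```python
import hashlib

# Checksum as listed on this record for meddocan.zip.
EXPECTED_MD5 = "6a09eb975580fdf56bc7041eadc9c921"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, `md5_of_file("meddocan.zip") == EXPECTED_MD5` should evaluate to True; a mismatch suggests a truncated or corrupted download.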