CodiEsp corpus training and development set: Spanish clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020

Antonio Miranda; Aitor Gonzalez-Agirre; Martin Krallinger

doi:10.5281/zenodo.3625747

Published January 23, 2020 | Version 1.0

Dataset Open

1. Barcelona Supercomputing Center

These are the train and development sets of the CodiEsp corpus, released as part of the CodiEsp track at CLEF eHealth 2020.

The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish, and the coding terminology is CIE10, the Spanish version of ICD10-CM and ICD10-PCS. The corpus has been randomly split into three subsets: train, development, and test. The train set contains 500 clinical cases, and the development and test sets contain 250 clinical cases each. The current version of the corpus does not include the test set.

Corpus format description: The CodiEsp corpus is distributed as plain text in UTF-8 encoding. Each clinical case is stored in a single file whose name is the clinical case identifier. Annotations are released in tab-separated files. Since the CodiEsp track has 3 sub-tracks, every set of documents (train and test) has 3 tab-separated files associated with it.
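As an illustration of the layout described above, the following minimal sketch loads a directory of clinical case files into a dictionary keyed by case identifier. The function name `load_cases`, the directory argument, and the `.txt` suffix are assumptions made for this example; the description only states that each case is a UTF-8 plain-text file named after its identifier.

```python
from pathlib import Path


def load_cases(directory, suffix=".txt"):
    """Map clinical case identifier -> document text.

    Assumption: case files end in `suffix`; the identifier is taken
    to be the file name without that suffix.
    """
    cases = {}
    for path in sorted(Path(directory).glob(f"*{suffix}")):
        # The corpus is distributed in UTF-8, so decode explicitly.
        cases[path.stem] = path.read_text(encoding="utf-8")
    return cases
```

Decoding explicitly as UTF-8 avoids depending on the platform's default encoding, which matters for Spanish text with accented characters.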

For sub-tracks 1 and 2, the file has the following fields:

articleID label ICD10-code text-reference

Tab-separated files for the third sub-track contain an extra field that provides the position in the text of the text-reference:

articleID label ICD10-code text-reference reference-position
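The two field layouts above can be read with a single parser. The sketch below is one possible approach, not the track's official tooling: the names `Annotation` and `parse_annotations` are hypothetical, the files are assumed to have no header row, and the reference position is kept as a raw string because its exact format is not specified here.

```python
import csv
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    article_id: str
    label: str
    code: str
    text_reference: str
    # Present only in sub-track 3 files; kept as the raw string.
    reference_position: Optional[str]


def parse_annotations(lines):
    """Parse tab-separated annotation lines with 4 or 5 fields."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    annotations = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) == 4:
            annotations.append(Annotation(*row, None))
        elif len(row) == 5:
            annotations.append(Annotation(*row))
        else:
            raise ValueError(f"expected 4 or 5 fields, got {len(row)}: {row!r}")
    return annotations
```

Any iterable of lines works as input, so an open file handle can be passed directly, for example `parse_annotations(open(path, encoding="utf-8", newline=""))`, where `path` is a placeholder for one of the annotation files.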

Corpus summary statistics: The final collection of 1,000 clinical cases that make up the corpus has a total of 16,504 sentences, with an average of 16.5 sentences per clinical case. It contains a total of 396,988 words, with an average of 396.2 words per clinical case.

For more information, visit the track webpage: http://temu.bsc.es/codiesp/

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

Name	Size	Checksum
train_dev.zip	1.5 MB	md5:9670dae8928376cfceb7573f82441ac5
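The MD5 checksum listed for train_dev.zip can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper name `md5_of_file` and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum published for train_dev.zip on this record.
EXPECTED_MD5 = "9670dae8928376cfceb7573f82441ac5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download saved locally as train_dev.zip (a hypothetical path) would then be checked with `md5_of_file("train_dev.zip") == EXPECTED_MD5`.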

Additional details

Villegas M, de la Peña S, Intxaurrondo A, Santamaria J, Krallinger M. Esfuerzos para fomentar la minería de textos en biomedicina más allá del inglés: el plan estratégico nacional español para las tecnologías del lenguaje. Procesamiento del Lenguaje Natural. 2017(59):141-4.
