Dataset Open Access
Antonio Miranda;
Aitor Gonzalez-Agirre;
Martin Krallinger
Introduction
These are the train, development, test and background sets of the CodiEsp corpus. Train and development have gold standard annotations. The unannotated background and test sets are distributed together. All documents are released in the context of the CodiEsp track for CLEF ehealth 2020 (http://temu.bsc.es/codiesp/).
The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish language and CIE10 is the coding terminology (it is the Spanish version of ICD10-CM and ICD10-PCS). The CodiEsp corpus has been randomly sampled into three subsets: the train, the development, and the test set. The train set contains 500 clinical cases, and the development and test set 250 clinical cases each. The test set contains 250 clinical cases and it is released together with the background set (with 2751 clinical cases). CodiEsp participants must submit predictions for the test and background set, but they will only be evaluated on the test set.
Zip structure
Three folders: train, dev and test. Each one of them contains the files for the train, development and test corpora, respectively.
Corpus format description
The CodiEsp corpus is distributed in plain text in UTF8 encoding, where each clinical case is stored as a single file whose name is the clinical case identifier. Annotations are released in a tab-separated file. Since the CodiEsp track has 3 sub-tracks, every set of documents (train and test) has 3 tab-separated files associated with it.
For the sub-tracks CodiEsp-D and CodiEsp-P, the file has the following fields:
articleID ICD10-code
Tab-separated files for the sub-track CodiEsp-X contain extra fields that provide the text-reference and its position:
articleID label ICD10-code text-reference reference-position
Corpus summary statistics
The final collection of 1000 clinical cases that make up the corpus had a total of 16504 sentences, with an average of 16.5 sentences per clinical case. It contains a total of 396,988 words, with an average of 396.2 words per clinical case.
For more information, visit the track webpage: http://temu.bsc.es/codiesp/
Name | Size | |
---|---|---|
codiesp.zip
md5:2fed02fdbf2116a7688041d17ecfdf1a |
11.1 MB | Download |
Villegas M, de la Peña S, Intxaurrondo A, Santamaria J, Krallinger M. Esfuerzos para fomentar la minería de textos en biomedicina más allá del inglés: el plan estratégico nacional español para las tecnologías del lenguaje. Procesamiento del Lenguaje Natural. 2017(59):141-4.
All versions | This version | |
---|---|---|
Views | 6,495 | 1,420 |
Downloads | 5,429 | 230 |
Data volume | 59.9 GB | 2.5 GB |
Unique views | 5,246 | 1,216 |
Unique downloads | 1,443 | 218 |