
Dataset Open Access

Cantemist corpus: gold standard of oncology clinical cases annotated with CIE-O 3 terminology

Miranda-Escalada, Antonio; Farré, Eulàlia; Krallinger, Martin

Intro:

Cantemist shared task dataset, divided into train, dev1, dev2 and test sets.

It contains the train, development and test sets of the three subtasks (cantemist-ner, cantemist-norm and cantemist-coding) with Gold Standard annotations.

In addition, it contains the documents of the Cantemist background set, without annotations.

 

Please cite if you use this dataset:

Miranda-Escalada, A., Farré, E., & Krallinger, M. (2020). Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings.

@inproceedings{miranda2020named,
  title={Named entity recognition, concept normalization and clinical coding: Overview of the {Cantemist} track for cancer text mining in {Spanish}, corpus, guidelines, methods and results},
  author={Miranda-Escalada, A and Farr{\'e}, E and Krallinger, M},
  booktitle={Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings},
  year={2020}
}

 

Format:

For subtasks cantemist-ner and cantemist-norm, annotations are distributed in Brat format (ANN files). See the Brat webpage for more information.
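As a rough illustration of reading such files, the sketch below parses the general Brat standoff layout: text-bound lines (`T…`) carrying a label, character offsets and the covered text, plus note lines (`#…`) attached to an annotation. That the cantemist-norm codes are stored in note lines is an assumption to check against the actual files; the function name `parse_brat` and the sample annotation are made up for this example.

```python
def parse_brat(ann_text):
    """Parse Brat standoff text into a dict keyed by annotation ID.

    Only text-bound annotations (T lines) and note lines (# lines) are
    handled; other Brat line types (relations, events, ...) are skipped.
    """
    entities = {}
    notes = {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        if line.startswith("T"):
            # Layout: ID <TAB> LABEL START END[;START END...] <TAB> TEXT
            ann_id, type_and_offsets, text = fields[0], fields[1], fields[2]
            label, offsets = type_and_offsets.split(" ", 1)
            # Discontinuous spans are separated by ';' in Brat.
            spans = [tuple(int(x) for x in frag.split())
                     for frag in offsets.split(";")]
            entities[ann_id] = {"label": label, "spans": spans,
                                "text": text, "note": None}
        elif line.startswith("#"):
            # Layout: #ID <TAB> AnnotatorNotes TARGET_ID <TAB> NOTE
            target = fields[1].split()[-1]
            notes[target] = fields[2]
    # Attach notes once all lines are read, so line order does not matter.
    for target, note in notes.items():
        if target in entities:
            entities[target]["note"] = note
    return entities


# Made-up sample, not taken from the corpus.
sample = (
    "T1\tMORFOLOGIA_NEOPLASIA 10 19\tcarcinoma\n"
    "#1\tAnnotatorNotes T1\t8010/3\n"
)
entities = parse_brat(sample)
```

In practice the ANN text would be read from disk with UTF-8 encoding and the offsets checked against the matching plain text file.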

For subtask cantemist-coding, codes are grouped in a TSV file with the following columns (this follows the format used in the CodiEsp shared task):

filename    code
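A minimal sketch of loading such a two-column TSV and grouping the codes per document is shown below. Whether the distributed file starts with a header row is not stated here, so the hypothetical `has_header` flag is left to the reader; the function name `read_codes`, the file names and the codes in the sample are illustrative only.

```python
import csv
import io
from collections import defaultdict


def read_codes(tsv_file, has_header=False):
    """Group codes by filename from a tab-separated (filename, code) file."""
    codes_by_file = defaultdict(list)
    reader = csv.reader(tsv_file, delimiter="\t")
    for index, row in enumerate(reader):
        if index == 0 and has_header:
            continue  # skip the header row if the file has one
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        filename, code = row[0], row[1]
        codes_by_file[filename].append(code)
    return dict(codes_by_file)


# Made-up sample rows, not taken from the corpus.
sample = io.StringIO("doc_001\t8000/6\ndoc_001\t8140/3\ndoc_002\t8070/3\n")
codes = read_codes(sample)
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.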

 

Shared task goal:

In all three subtasks, the goal is to predict the annotations (either the ANN files or the TSV file with the codes) given only the plain text files.

 

Resources:

 

For further information, please visit https://temu.bsc.es/cantemist/ or email us at encargo-pln-life@bsc.es

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files (17.2 MB):

cantemist.zip (17.2 MB), md5:a219fcca9078f19490471c920a3b5816
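
The MD5 checksum listed for cantemist.zip can be used to check that a download is intact. The sketch below is one way to do this with the Python standard library; the local path passed to `verify_download` is whatever the reader saved the archive as, and the helper names are chosen for this example.

```python
import hashlib

# Checksum as listed for cantemist.zip on this record.
EXPECTED_MD5 = "a219fcca9078f19490471c920a3b5816"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("cantemist.zip")` returns `False` if the archive was truncated or altered in transit, in which case it should be downloaded again.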