MultiClinAI Shared Task Training Data

Lima López, Salvador; Rosell, Judith; Rodríguez Miret, Jan; Gallego-Donoso, Fernando; Krallinger, Martin

doi:10.5281/zenodo.18508039

Published February 6, 2026 | Version v1

Dataset Open

1. Barcelona Supercomputing Center

The MultiClinAI (Multilingual Clinical Entity Annotation Projection and Extraction) shared task challenges participants to build systems that automatically generate multilingual versions of Gold Standard corpora, transferring annotations from a seed language (in our case, Spanish) to six target languages (Czech, Dutch, English, Italian, Romanian and Swedish). This process is known as annotation projection. In parallel, participants are also challenged to build systems for clinical concept extraction in the seven languages of the task. Both must be done for three clinical information types: DISEASE, SYMPTOM and PROCEDURE.

This repository currently contains the Training Data of the task, which comprises part of the DisTEMIST, SympTEMIST and MedProcNER corpora, as well as the CardioCCC corpus (expanded from its original use in the MultiCardioNER shared task), in the seven languages mentioned above. For more information about the creation process and context of this dataset, please visit the Data section of the task's website (linked below). Note that each dataset in each language contains a slightly different version of the text due to translation revisions made during the annotation projection process, meaning they cannot be used for multilabel approaches.

In short, the task is divided into two sub-tasks:

- Sub-task MultiClinNER (Multilingual Comparable Clinical Entity Recognition). This is a common Named Entity Recognition task; using the texts in each language, try to extract the entities contained in each text (Spanish, Czech, Dutch, English, Italian, Romanian and Swedish).

- Sub-task MultiClinCorpus (Multilingual Comparable Clinical Corpus Generation). This is an annotation projection task; starting from the original Spanish, try to obtain the annotations in the other languages (Czech, Dutch, English, Italian, Romanian and Swedish).

Both sub-tasks will be evaluated using common classification metrics: precision, recall and F1. An official evaluation library will be released soon. In both sub-tasks, teams can submit results for any target language; submitting for all languages is not mandatory. Participants are free to build their systems in any way they want (e.g. monolingual or multilingual models, word alignment or generative models, unilabel or multilabel, …), and creative solutions are encouraged.
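Until the official evaluation library is available, the metrics can be approximated locally. The following is a minimal sketch, not the official scorer: it assumes strict matching (an entity counts as correct only if document, start offset, end offset and label all agree), which the task description does not confirm. The `Entity` tuple layout and the function name `precision_recall_f1` are illustrative choices.

```python
from typing import Iterable, Set, Tuple

# Assumed entity key: (document id, start offset, end offset, label).
Entity = Tuple[str, int, int, str]


def precision_recall_f1(
    gold: Iterable[Entity], pred: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 under strict matching."""
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(pred)
    # A true positive is a predicted entity identical to a gold entity.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: one of two predictions matches one of two gold entities.
gold = [("doc1", 0, 8, "DISEASE"), ("doc1", 20, 25, "SYMPTOM")]
pred = [("doc1", 0, 8, "DISEASE"), ("doc1", 30, 35, "PROCEDURE")]
print(precision_recall_f1(gold, pred))
```

Scores from this sketch may differ from the official ones if the released library uses a different matching criterion or averaging scheme.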

File structure:

train_set:
- CardioCCC_diseases
  - cz
    - ann and txt files
  - en
  - es
  - it
  - nl
  - ro
  - sv
- CardioCCC_procedures
  - cz
  - en
  - ...
- CardioCCC_symptoms
  - cz
  - en
  - ...
- SPACCC_DisTEMIST_diseases
  - cz
  - en
  - ...
- SPACCC_MedProcNER_procedures
  - cz
  - en
  - ...
- SPACCC_SympTEMIST_symptoms
  - cz
  - en
  - ...
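As a rough illustration of how this layout could be traversed, the sketch below pairs each txt file with its ann file for one corpus and language. It assumes that the ann files follow the brat standoff convention (tab-separated lines such as `T1<TAB>DISEASE 0 8<TAB>mention`), that files use the `.txt` and `.ann` extensions, and that they are UTF-8 encoded; none of this is stated explicitly above, so please check it against the actual files. The names `Annotation`, `parse_ann` and `iter_documents` are hypothetical helpers, not part of any official tooling.

```python
from pathlib import Path
from typing import Iterator, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_ann(ann_text: str) -> List[Annotation]:
    """Parse text-bound ('T') lines, assuming brat standoff format."""
    annotations = []
    for line in ann_text.splitlines():
        # Skip notes, relations and anything that is not a text-bound entity.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, meta, mention = parts[0], parts[1], parts[2]
        label, offsets = meta.split(" ", 1)
        spans = []
        # Discontinuous entities list several fragments separated by ';'.
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        annotations.append(Annotation(ann_id, label, spans, mention))
    return annotations


def iter_documents(
    root: Path, corpus: str, lang: str
) -> Iterator[Tuple[str, str, List[Annotation]]]:
    """Yield (document id, text, annotations) for one corpus/language folder."""
    folder = root / corpus / lang
    for txt_path in sorted(folder.glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        # A missing .ann file is treated as a document without annotations.
        ann_text = ann_path.read_text(encoding="utf-8") if ann_path.exists() else ""
        yield txt_path.stem, txt_path.read_text(encoding="utf-8"), parse_ann(ann_text)
```

For example, `iter_documents(Path("train_set"), "CardioCCC_diseases", "es")` would walk the Spanish CardioCCC disease files, assuming the archive was extracted into the current directory.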

Resources

If you use this dataset, please cite:

@article{distemist2022overview,
title={Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources},
author={Miranda-Escalada, Antonio and Gascó, Luis and Lima-López, Salvador and Farré-Maduell, Eulàlia and Estrada, Darryl and Nentidis, Anastasios and Krithara, Anastasia and Katsimpras, Georgios and Paliouras, Georgios and Krallinger, Martin},
booktitle={Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings},
year={2022}
}

@inproceedings{symptemist2023overview,
title={Overview of SympTEMIST at BioCreative VIII: corpus, guidelines and evaluation of systems for the detection and normalization of symptoms, signs and findings from text},
author={Lima-L{\'o}pez, Salvador and Farr{\'e}-Maduell, Eul{\`a}lia and Gasco-S{\'a}nchez, Luis and Rodr{\'\i}guez-Miret, Jan and Krallinger, Martin}
}

@inproceedings{medprocner2023overview,
title={{Overview of MedProcNER task on medical procedure detection and entity linking at BioASQ 2023}},
author={Lima-L{\'o}pez, Salvador and Farr{\'e}-Maduell, Eul{\`a}lia and Gasc{\'o}, Luis and Nentidis, Anastasios and Krithara, Anastasia and Katsimpras, Georgios and Paliouras, Georgios and Krallinger, Martin},
booktitle={{Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum}},
year={2023}
}

@inproceedings{multicardioner2024overview,
title={{Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation of Clinical NER Systems for Spanish, English and Italian}},
author={Salvador Lima-López and Eulàlia Farré-Maduell and Jan Rodríguez-Miret and Miguel Rodríguez-Ortega and Livia Lilli and Jacopo Lenkowicz and Giovanna Ceroni and Jonathan Kossoff and Anoop Shah and Anastasios Nentidis and Anastasia Krithara and Georgios Katsimpras and Georgios Paliouras and Martin Krallinger},
booktitle={CLEF Working Notes},
year={2024},
editor={Faggioli, Guglielmo and Ferro, Nicola and Galuščáková, Petra and García Seco de Herrera, Alba}
}

Additional resources and corpora

At the NLP for Biomedical Information Analysis group (formerly Text Mining Unit), one of our missions is the open publication of datasets to train and benchmark biomedical information extraction, normalization and indexing systems. For that reason, we have released multiple datasets as part of shared tasks over the years. If you are interested in MultiClinAI, you might want to take a look at some of our resources and competitions about:

Clinical content extraction: DisTEMIST (diseases), MedProcNER/ProcTEMIST (clinical procedures), SympTEMIST (signs and findings), CANTEMIST (tumour morphology), CodiEsp (coding to ICD), PharmaCoNER (chemicals and proteins), LivingNER (species and humans), MultiCardioNER (diseases and medications, includes the DrugTEMIST corpus as well as cardiology-specific data)
Socio-demographic / Social Determinants of Health content extraction: MEDDOPLACE (locations and more), MEDDOCAN (sensitive data), MEDDOPROF (occupations), ToxHabits (extraction of substance use-related content)
Information extraction in social media: SocialDisNER (diseases), ProfNER (occupations)
Linguistic aspects: BARR1 and BARR2 (abbreviation resolution)
Machine Translation: ClinSpEn (EN<->ES clinical content translation)
Summarization: MultiClinSUM (multilingual summarization of clinical content)

Contact:

- Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Files

MultiClinAI-training_data-260206.zip (81.9 MB)
md5:6a82fa15555fcdc39aedff1d0493d414
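The listed MD5 checksum can be used to confirm that the archive downloaded intact. The snippet below is a small sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are illustrative, and the archive is assumed to sit in the current working directory under its original file name.

```python
import hashlib
from pathlib import Path

# Checksum as published alongside the archive.
EXPECTED_MD5 = "6a82fa15555fcdc39aedff1d0493d414"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify(Path("MultiClinAI-training_data-260206.zip"))` should return `True` for an uncorrupted download.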

