MultiClinAI Shared Task Training Data
Authors/Creators
- 1. Barcelona Supercomputing Center
Description
The MultiClinAI (Multilingual Clinical Entity Annotation Projection and Extraction) shared task challenges participants to build systems that automatically produce multilingual versions of Gold Standard corpora, starting from a seed language (in our case, Spanish) and targeting six languages (Czech, Dutch, English, Italian, Romanian and Swedish). This process is known as annotation projection. In parallel, participants are also challenged to build systems for clinical concept extraction in the seven languages of the task.
This repository currently contains the Training Data of the task, comprising parts of the DisTEMIST, SympTEMIST and MedProcNER corpora, as well as the CardioCCC corpus, in the seven languages mentioned above. For more information about the creation process and context of this dataset, please visit the Data section of the task's website (linked below). As a note, the versions of the
In short, the task is divided into two sub-tasks:
- Sub-task MultiClinNER (Multilingual Comparable Clinical Entity Recognition). This is a standard Named Entity Recognition task: given the texts in each language (Spanish, Czech, Dutch, English, Italian, Romanian and Swedish), extract the entities contained in each text.
- Sub-task MultiClinCorpus (Multilingual Comparable Clinical Corpus Generation). This is an annotation projection task: starting from the original Spanish annotations, obtain the corresponding annotations in the other languages (Czech, Dutch, English, Italian, Romanian and Swedish).
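Both sub-tasks work with entity mentions stored in the `ann` files. As a minimal sketch, assuming those files follow the brat standoff convention (an assumption; this description only says "ann and txt files"), the text-bound annotations could be read as shown here. The `Entity` type, the `parse_ann` helper and the `DISEASE` label in the sample are hypothetical names used for illustration only.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_content: str) -> list[Entity]:
    """Parse text-bound ("T") lines of a brat-style .ann file (assumed format)."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations, attributes and blank lines
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line, ignore it
        label, span = parts[1].split(" ", 1)
        # Discontinuous spans look like "30 35;40 45"; keep the outer bounds.
        fragments = [fragment.split() for fragment in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(label, start, end, parts[2]))
    return entities


# Hypothetical sample content, not taken from the dataset.
sample = (
    "T1\tDISEASE 10 18\tdiabetes\n"
    "#1\tAnnotatorNotes T1\tsome note\n"
    "T2\tDISEASE 30 35;40 45\tfoo bar\n"
)
entities = parse_ann(sample)
```

Collapsing discontinuous spans to their outer bounds is a simplification; a system that must reproduce such spans exactly would need to keep every fragment.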
Both sub-tasks will be evaluated using common classification metrics: precision, recall and F1-score. An official evaluation library will be released soon. In both sub-tasks, teams can submit results for any target language; submitting for all languages is not mandatory. Participants are free to build their systems in any way they want (e.g. monolingual or multilingual models, word alignment or generative models, unilabel or multilabel), and the use of creative solutions is encouraged.
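Until the official evaluation library is available, a rough local estimate of these metrics can be computed as follows. This is only a sketch: it assumes strict matching (an entity counts as correct only if document, offsets and label all agree), which may differ from the official scoring. The function name, the tuple layout and the `DISEASE` label are hypothetical.

```python
def precision_recall_f1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Strict-match precision, recall and F1 over sets of entity tuples."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical entities as (document id, start, end, label) tuples.
gold = {
    ("doc1", 10, 18, "DISEASE"),
    ("doc1", 30, 45, "DISEASE"),
    ("doc2", 0, 5, "DISEASE"),
}
predicted = {
    ("doc1", 10, 18, "DISEASE"),  # exact match
    ("doc2", 0, 6, "DISEASE"),    # wrong end offset, counted as an error
}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In this toy case one of two predictions is correct and one of three gold entities is found, so precision is 0.5, recall is one third and F1 is 0.4.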
File structure:
- train_set:
  - CardioCCC_diseases
    - cz
      - ann and txt files
    - en
    - es
    - it
    - nl
    - ro
    - sv
  - CardioCCC_procedures
    - cz
    - en
    - ...
  - CardioCCC_symptoms
    - cz
    - en
    - ...
  - SPACCC_DisTEMIST_diseases
    - cz
    - en
    - ...
  - SPACCC_MedProcNER_procedures
    - cz
    - en
    - ...
  - SPACCC_SympTEMIST_symptoms
    - cz
    - en
    - ...
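The layout above (corpus folder, then language folder, then `ann` and `txt` files) can be traversed with a short script. The sketch below assumes that each text file and its annotation file share the same file name apart from the extension, which this description does not state explicitly. It searches recursively inside each language folder in case the files sit in a further subfolder. The helper name `iter_document_pairs` is hypothetical.

```python
from pathlib import Path
from typing import Iterator


def iter_document_pairs(root: Path) -> Iterator[tuple[str, str, Path, Path]]:
    """Yield (corpus, language, txt_path, ann_path) for each paired document."""
    for corpus_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for lang_dir in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
            for txt_path in sorted(lang_dir.rglob("*.txt")):
                # Assumption: the annotation file has the same stem as the text.
                ann_path = txt_path.with_suffix(".ann")
                if ann_path.exists():
                    yield corpus_dir.name, lang_dir.name, txt_path, ann_path
```

For example, `list(iter_document_pairs(Path("train_set")))` would collect every paired document after the archive has been extracted into the current directory. Text files without a matching `ann` file are skipped silently.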
Resources
- Task Web
- Annotation guidelines
- Task Registration
- Citation: Lima López, S., Rosell, J., Rodríguez Miret, J., Gallego-Donoso, F., & Krallinger, M. (2026). MultiClinAI Shared Task Training Data [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18508039
Additional resources and corpora
At the NLP for Biomedical Information Analysis group (formerly Text Mining Unit), one of our missions is the open publication of datasets to train and benchmark biomedical information extraction, normalization and indexing systems. For that reason, we have released multiple datasets as part of shared tasks over the years. If you are interested in MultiClinAI, you might want to take a look at some of our resources and competitions about:
- Clinical content extraction: DisTEMIST (diseases), MedProcNER/ProcTEMIST (clinical procedures), SympTEMIST (signs and findings), CANTEMIST (tumour morphology), CodiEsp (coding to ICD), PharmaCoNER (chemicals and proteins), LivingNER (species and humans), MultiCardioNER (diseases and medications, includes the DrugTEMIST corpus as well as cardiology-specific data)
- Socio-demographic / Social Determinants of Health content extraction: MEDDOPLACE (locations and more), MEDDOCAN (sensitive data), MEDDOPROF (occupations), ToxHabits (extraction of substance use-related content)
- Information extraction in social media: SocialDisNER (diseases), ProfNER (occupations)
- Linguistic aspects: BARR1 and BARR2 (abbreviation resolution)
- Machine Translation: ClinSpEn (EN<->ES clinical content translation)
- Summarization: MultiClinSUM (multilingual summarization of clinical content)
Contact:
- Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Files
| Name | Size | MD5 |
|---|---|---|
| MultiClinAI-training_data-260206.zip | 81.9 MB | 6a82fa15555fcdc39aedff1d0493d414 |
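After downloading, the archive can be checked against the MD5 checksum listed for it. A minimal sketch using Python's standard `hashlib` module follows; the helper name `md5_of_file` and the local path in the usage note are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksum listed for MultiClinAI-training_data-260206.zip in this record.
EXPECTED_MD5 = "6a82fa15555fcdc39aedff1d0493d414"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file(Path("MultiClinAI-training_data-260206.zip"))` with `EXPECTED_MD5` indicates whether the download completed intact.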