Published February 6, 2026 | Version v1.1
Dataset Open

MultiClinAI Shared Task Training Data

Description

The MultiClinAI (Multilingual Clinical Entity Annotation Projection and Extraction) shared task challenges participants to build systems that automatically create multilingual versions of Gold Standard corpora, transferring the annotations from a seed language (in our case, Spanish) to six target languages (Czech, Dutch, English, Italian, Romanian and Swedish). This process is known as annotation projection. In parallel, participants are also challenged to build systems for clinical concept extraction in the seven languages of the task.

This repository currently contains the Training Data of the task, which comprises part of the DisTEMIST, SympTEMIST and MedProcNER corpora, as well as the CardioCCC corpus, in the seven languages mentioned above. For more information about the creation process and context of this dataset, please visit the Data section of the task's website (linked below). As a note, the versions of the 

In short, the task is divided into two sub-tasks:

- Sub-task MultiClinNER (Multilingual Comparable Clinical Entity Recognition). This is a common Named Entity Recognition task; using the texts in each language, try to extract the entities contained in each text (Spanish, Czech, Dutch, English, Italian, Romanian and Swedish).

- Sub-task MultiClinCorpus (Multilingual Comparable Clinical Corpus Generation). This is an annotation projection task; starting from the original Spanish, try to obtain the annotations in the other languages (Czech, Dutch, English, Italian, Romanian and Swedish).

Both sub-tasks will be evaluated using common classification metrics: precision, recall and F1-score. An official evaluation library will be released soon. In both sub-tasks, teams can submit results for any subset of the target languages; submitting for all languages is not mandatory. Participants are free to build their systems in any way they want (e.g. monolingual or multilingual models, word alignment or generative models, single-label or multi-label, …), and creative solutions are encouraged.
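As an illustration of the metrics mentioned above, here is a minimal sketch of how precision, recall and F1 can be computed over sets of entities. It is not the official evaluation library. It assumes strict matching, where a predicted entity only counts as correct if its document, character offsets and label all match a gold entity, and it assumes micro-averaging over all entities. The `Entity` record, the `prf1` function and the `DISEASE` label are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Set, Tuple


class Entity(NamedTuple):
    """Hypothetical entity record: document id, character offsets and label."""
    doc_id: str
    start: int
    end: int
    label: str


def prf1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 under strict matching (an assumption)."""
    true_positives = len(gold & pred)  # entities that match exactly
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: one of two predictions matches one of two gold entities.
# "DISEASE" is a placeholder label, not necessarily the one used in the corpora.
gold = {Entity("doc1", 0, 8, "DISEASE"), Entity("doc1", 20, 27, "DISEASE")}
pred = {Entity("doc1", 0, 8, "DISEASE"), Entity("doc1", 30, 35, "DISEASE")}
precision, recall, f1 = prf1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The official library may use different matching criteria (for instance, partial overlaps), so scores from this sketch should only be treated as a rough development aid.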

File structure:

  • train_set:
    • CardioCCC_diseases
      • cz
        • ann and txt files
      • en
      • es
      • it
      • nl
      • ro
      • sv
    • CardioCCC_procedures
      • cz
      • en
      • ...
    • CardioCCC_symptoms
      • cz
      • en
      • ...
    • SPACCC_DisTEMIST_diseases
      • cz
      • en
      • ...
    • SPACCC_MedProcNER_procedures
      • cz
      • en
      • ...
    • SPACCC_SympTEMIST_symptoms
      • cz
      • en
      • ...
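The layout above can be traversed with a short script. The following is a sketch under several assumptions that the description does not confirm: that each `.txt` file has an `.ann` file with the same name stem, that both sit directly inside the language folder, and that the `.ann` files follow the brat standoff format (`T1<TAB>LABEL start end<TAB>mention`). The function names `iter_documents` and `parse_ann` are hypothetical.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_documents(root: Path) -> Iterator[Tuple[str, str, Path, Path]]:
    """Yield (corpus, language, txt_path, ann_path) for every paired document.

    Assumes train_set/<corpus>/<language>/<name>.txt with a sibling <name>.ann.
    """
    for corpus_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for lang_dir in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
            for txt_path in sorted(lang_dir.glob("*.txt")):
                ann_path = txt_path.with_suffix(".ann")
                if ann_path.exists():
                    yield corpus_dir.name, lang_dir.name, txt_path, ann_path


def parse_ann(ann_text: str) -> List[Tuple[str, int, int, str]]:
    """Parse text-bound annotations, assuming brat standoff lines.

    Returns (label, start, end, mention) tuples; other line types are skipped.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        label, offsets = parts[1].split(" ", 1)
        # Discontinuous spans ("0 5;10 15") are collapsed to their outer bounds.
        numbers = [int(n) for n in offsets.replace(";", " ").split()]
        entities.append((label, numbers[0], numbers[-1], parts[2]))
    return entities


root = Path("train_set")  # adjust to wherever the archive was extracted
if root.is_dir():
    for corpus, lang, txt_path, ann_path in iter_documents(root):
        entities = parse_ann(ann_path.read_text(encoding="utf-8"))
        print(corpus, lang, txt_path.name, len(entities))
```

If the annotation files turn out to use a different layout, only `parse_ann` needs to change; the directory traversal depends only on the folder structure listed above.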

Resources 

  • Task Web
  • Annotation guidelines
  • Task Registration
  • Citation: Lima López, S., Rosell, J., Rodríguez Miret, J., Gallego-Donoso, F., & Krallinger, M. (2026). MultiClinAI Shared Task Training Data [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18508039

 

Additional resources and corpora

At the NLP for Biomedical Information Analysis group (formerly Text Mining Unit), one of our missions is the open publication of datasets to train and benchmark biomedical information extraction, normalization and indexing systems. For that reason, we have released multiple datasets as part of shared tasks over the years. If you are interested in MultiClinAI, you might want to take a look at some of our other resources and competitions.

Contact: 

- Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

 

Files

MultiClinAI-training_data-260206.zip (81.9 MB)
md5:6a82fa15555fcdc39aedff1d0493d414
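After downloading, the archive can be checked against the MD5 checksum listed above. The snippet below is a minimal sketch using Python's standard `hashlib` module. It assumes the archive was saved in the current working directory under its original file name, and `md5_of_file` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path

# Checksum copied from the file listing of this record.
EXPECTED_MD5 = "6a82fa15555fcdc39aedff1d0493d414"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("MultiClinAI-training_data-260206.zip")  # assumed download location
if archive.is_file():
    actual = md5_of_file(archive)
    if actual == EXPECTED_MD5:
        print("checksum OK")
    else:
        print(f"checksum mismatch: {actual}")
```

A mismatch usually points to an incomplete or corrupted download, in which case downloading the archive again is the simplest fix.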