There is a newer version of the record available.

Published May 16, 2025 | Version v2
Dataset Open

MultiClinSum Dataset: Summarization of Clinical Case Reports in English, Spanish, French and Portuguese

Description

MultiClinSum Shared Task Dataset

MultiClinSum is a shared task about the automatic summarization of clinical case reports in English, Spanish, French and Portuguese held as part of the BioASQ workshop at CLEF 2025. The task relies on a corpus of manually selected full clinical case reports and their corresponding clinical case report summaries derived from case report publications written in the previously mentioned languages. In addition, participants are allowed to use any other data source available online as long as they report it.

This repository includes the available datasets for the multilingual clinical summarization task. Each dataset contains pairs of full-text documents and their corresponding summaries.

  • multiclinsum_gs_train_en: Gold-standard training dataset in English, containing 592 full-text and summary pairs.
  • multiclinsum_gs_train_es: Gold-standard training dataset in Spanish, containing 592 full-text and summary pairs.
  • multiclinsum_gs_train_fr: Gold-standard training dataset in French, containing 592 full-text and summary pairs.
  • multiclinsum_gs_train_pt: Gold-standard training dataset in Portuguese, containing 592 full-text and summary pairs.

For each dataset, full texts and summaries are organised in separate folders of UTF-8 encoded .txt files. Within a language, a full text and its summary have matching filenames, and the summary's filename is marked by the _sum suffix.
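The folder layout described above lends itself to pairing each full text with its summary by filename. The following is a minimal sketch rather than an official loader. It assumes that a summary's filename is the full text's filename with _sum appended before the .txt extension. The folder paths are passed in by the caller because the folder names inside the archives are not stated here, and the function names are hypothetical.

```python
from pathlib import Path


def pair_documents(fulltext_dir, summary_dir, suffix="_sum"):
    """Return (full_text_path, summary_path) pairs matched by filename.

    Assumption: summary stem == full-text stem + suffix.
    """
    fulltext_dir, summary_dir = Path(fulltext_dir), Path(summary_dir)

    # Index summaries by their stem with the suffix stripped.
    summaries = {}
    for path in summary_dir.glob("*.txt"):
        if path.stem.endswith(suffix):
            summaries[path.stem[: -len(suffix)]] = path

    # Keep only full texts that have a matching summary.
    pairs = []
    for path in sorted(fulltext_dir.glob("*.txt")):
        summary = summaries.get(path.stem)
        if summary is not None:
            pairs.append((path, summary))
    return pairs


def read_pair(fulltext_path, summary_path):
    """Read both files of a pair as UTF-8 text."""
    return (
        Path(fulltext_path).read_text(encoding="utf-8"),
        Path(summary_path).read_text(encoding="utf-8"),
    )
```

If the naming assumption holds, len(pair_documents(...)) should come out at 592 for each language, which makes for a quick sanity check after unpacking an archive.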

 

Resources:

  • MultiClinSum website
  • BioASQ website

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contact

If you have any questions or suggestions, please contact us at:

  • Miguel Rodríguez Ortega (<miguel [dot] rod [dot] bsc [at] gmail [dot] com>)
  • Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
  • Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

Additional resources and corpora

If you are interested in MultiClinSum, you might want to check out these corpora and resources:

  • DisTEMIST (Corpus of disease mentions and normalization to SNOMED CT)
  • MedProcNER (Corpus of clinical procedure mentions and normalization to SNOMED CT)
  • SympTEMIST (Corpus of clinical findings and normalization to SNOMED CT)
  • DrugTEMIST (Corpus of medication mentions)
  • CardioCCC (Corpus of diseases and medication mentions in cardiology texts)
  • PharmaCoNER (Corpus of medications, drugs, chemical substances, genes, proteins and vaccine mentions and normalization)
  • MEDDOPROF (Corpus of mentions of professions, occupations and working status and normalization)
  • MEDDOPLACE (Corpus of place-related entity mentions, including departments, nationalities, patient movements, etc., and normalization)
  • MEDDOCAN (Corpus of mentions of Personal Health Identifiers (PHI))
  • CANTEMIST (Corpus of cancer tumor morphology mentions and normalization)
  • CodiESP (Corpus of clinical case reports with assigned clinical codes from ICD-10, Spanish version)
  • LivingNER (Corpus of mentions of species, including human/family members, pathogens, food, etc., and normalization to NCBI Taxonomy)
  • SPACCC-POS (Corpus of clinical case reports in Spanish annotated with POS-tags)
  • SPACCC-TOKEN (Corpus of clinical case reports in Spanish annotated with token-tags (word mention boundaries))
  • SPACCC-SPLIT (Corpus of clinical case reports in Spanish annotated with sentence boundary-tags)
  • MESINESP-2 (Corpus of manually indexed records with DeCS/MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts)

 

Files

Total size: 8.3 MB

  • 1.5 MB, md5:d92e821747c1a39ada10e16955002f89
  • 1.7 MB, md5:2e4cdd4f10398f8e979e7024bc34ec60
  • 3.4 MB, md5:37daf39fe1465113a970a29635e7f6dc
  • 1.6 MB, md5:728e2f9ce7234b62be3b51c04c83ef7e
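The MD5 checksums listed above can be used to confirm that a downloaded archive arrived intact. The following is a minimal sketch using Python's standard hashlib module. Because the listing does not say which checksum belongs to which archive, the check only tests membership in the set of listed values. The function names are hypothetical.

```python
import hashlib

# Checksums copied from the file listing; the archive each one
# belongs to is not stated there, so they are kept as a plain set.
LISTED_MD5 = {
    "d92e821747c1a39ada10e16955002f89",
    "2e4cdd4f10398f8e979e7024bc34ec60",
    "37daf39fe1465113a970a29635e7f6dc",
    "728e2f9ce7234b62be3b51c04c83ef7e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """Return True if the file's MD5 matches one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

For example, is_listed("multiclinsum_gs_train_en.zip") should return True for an uncorrupted download of this version of the record, assuming that archive name matches the dataset name given in the description.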