Published May 29, 2025 | Version v7
Dataset Open

MultiClinSum Dataset: Summarization of Clinical Case Reports in English, Spanish, French and Portuguese

Description

MultiClinSum Shared Task Dataset

MultiClinSum is a shared task on the automatic summarization of clinical case reports in English, Spanish, French and Portuguese, held as part of the BioASQ workshop at CLEF 2025. The task relies on a corpus of manually selected full clinical case reports and their corresponding summaries, derived from case report publications written in these four languages. In addition, participants are allowed to use any other data source available online as long as they report it.

This repository includes the available datasets for the multilingual clinical summarization task. Each dataset contains pairs of full-text documents and their corresponding summaries.

  • multiclinsum_gs_train_en: Gold-standard training dataset in English, containing 592 full-text and summary pairs.
  • multiclinsum_gs_train_es: Gold-standard training dataset in Spanish, containing 592 full-text and summary pairs.
  • multiclinsum_gs_train_fr: Gold-standard training dataset in French, containing 592 full-text and summary pairs.
  • multiclinsum_gs_train_pt: Gold-standard training dataset in Portuguese, containing 592 full-text and summary pairs.
  • multiclinsum_large-scale_train_en: Large-scale training dataset in English, containing 25,902 full-text and summary pairs.
  • multiclinsum_large-scale_train_es: Large-scale training dataset in Spanish, containing 25,902 full-text and summary pairs.
  • multiclinsum_large-scale_train_fr: Large-scale training dataset in French, containing 25,902 full-text and summary pairs.
  • multiclinsum_large-scale_train_pt: Large-scale training dataset in Portuguese, containing 25,902 full-text and summary pairs.
  • multiclinsum_test_en: Test dataset in English, containing 3,396 full-text cases.
  • multiclinsum_test_es: Test dataset in Spanish, containing 3,406 full-text cases.
  • multiclinsum_test_fr: Test dataset in French, containing 3,469 full-text cases.
  • multiclinsum_test_pt: Test dataset in Portuguese, containing 3,442 full-text cases.

For each dataset, full-texts and summaries are organised in separate folders containing .txt files encoded in UTF-8. For a given language, files have nearly identical filenames, with summaries marked by the _sum suffix.
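The folder layout described above can be loaded with a short script. The following is a minimal sketch, not an official loader: the folder names passed to `pair_documents` are hypothetical, and it assumes that a summary file's name is the full-text file's name with `_sum` appended before the `.txt` extension (for example `case_1.txt` and `case_1_sum.txt`). Since the filenames are only described as "nearly identical", check this assumption against the extracted archives before relying on it.

```python
from pathlib import Path


def pair_documents(fulltext_dir, summary_dir, suffix="_sum"):
    """Map each base filename to (full-text path, summary path).

    Assumption: the summary stem equals the full-text stem plus `suffix`.
    Files without a counterpart in the other folder are skipped.
    """
    fulltexts = {p.stem: p for p in Path(fulltext_dir).glob("*.txt")}
    pairs = {}
    for summary in Path(summary_dir).glob("*.txt"):
        stem = summary.stem
        if not stem.endswith(suffix):
            continue  # not a summary file under the assumed convention
        base = stem[: -len(suffix)]
        if base in fulltexts:
            pairs[base] = (fulltexts[base], summary)
    return pairs


def read_pair(pair):
    """Read both files of a pair as UTF-8 text."""
    fulltext_path, summary_path = pair
    return (
        fulltext_path.read_text(encoding="utf-8"),
        summary_path.read_text(encoding="utf-8"),
    )
```

Calling `pair_documents("fulltext", "summaries")` on an extracted gold-standard archive (folder names are placeholders) and comparing `len(pairs)` with the pair count listed for that dataset is a quick sanity check on the naming assumption.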

 

Resources:

  • MultiClinSum website
  • BioASQ website

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contact

If you have any questions or suggestions, please contact us at:

  • Miguel Rodríguez Ortega (<miguel [dot] rod [dot] bsc [at] gmail [dot] com>)
  • Eduard Rodríguez López (<edu4bsc [at] gmail [dot] com>)
  • Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
  • Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

Additional resources and corpora

If you are interested in MultiClinSum, you might want to check out these corpora and resources:

  • DisTEMIST (Corpus of disease mentions and normalization to SNOMED CT)
  • MedProcNER (Corpus of clinical procedure mentions and normalization to SNOMED CT)
  • SympTEMIST (Corpus of clinical findings and normalization to SNOMED CT)
  • DrugTEMIST (Corpus of medication mentions)
  • CardioCCC (Corpus of diseases and medication mentions in cardiology texts)
  • PharmaCoNER (Corpus of medications, drugs, chemical substances, genes, proteins and vaccine mentions and normalization)
  • MEDDOPROF (Corpus of mentions of professions, occupations and working status and normalization)
  • MEDDOPLACE (Corpus of place-related entity mentions, including departments, nationalities, patient movements, etc., and normalization)
  • MEDDOCAN (Corpus of mentions of Personal Health Identifiers (PHI))
  • CANTEMIST (Corpus of cancer tumor morphology mentions and normalization)
  • CodiESP (Corpus of clinical case reports with assigned clinical codes from ICD10, Spanish version)
  • LivingNER (Corpus of mentions of species, including human/family members, pathogens, food, etc., and normalization to NCBI Taxonomy)
  • SPACCC-POS (Corpus of clinical case reports in Spanish annotated with POS-tags)
  • SPACCC-TOKEN (Corpus of clinical case reports in Spanish annotated with token-tags (word mention boundaries))
  • SPACCC-SPLIT (Corpus of clinical case reports in Spanish annotated with sentence boundary-tags)
  • MESINESP-2 (Corpus of manually indexed records with DeCS/MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts)

 

Files

multiclinsum_gs_train_en.zip

Files (316.7 MB)

Checksum and size of each file:

  • md5:d92e821747c1a39ada10e16955002f89 (1.5 MB)
  • md5:2e4cdd4f10398f8e979e7024bc34ec60 (1.7 MB)
  • md5:37daf39fe1465113a970a29635e7f6dc (3.4 MB)
  • md5:728e2f9ce7234b62be3b51c04c83ef7e (1.6 MB)
  • md5:a11ea9572313beaa036ecf69ec82dd9c (66.6 MB)
  • md5:6d1444ff038d1fcc04321546ed9b6dcb (71.3 MB)
  • md5:3a169403826c04c0e34bc328cbb5d703 (73.5 MB)
  • md5:8de528cde4a3532f1db972dcc70209c2 (70.0 MB)
  • md5:fae7f2c9792bf296f173e39f9f43257d (6.3 MB)
  • md5:f260d85dbcb91f318e2e0bbb70c69d73 (6.8 MB)
  • md5:de410894f6e119b77e39ee03de92ed50 (7.1 MB)
  • md5:514ad96055bc4953b73e1ed35c31ee2a (6.7 MB)
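
The MD5 checksums listed above can be used to confirm that an archive was downloaded intact. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_checksum` are illustrative, and because the listing here does not show which checksum belongs to which archive, the expected value has to be taken from the entry displayed next to the file on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:d92e...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("multiclinsum_gs_train_en.zip", "md5:<value from the record page>")` returns `True` only if the local file is byte-identical to the published one. MD5 is adequate for detecting transfer corruption, which is its purpose here, but it is not a safeguard against deliberate tampering.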