Published April 9, 2026 | Version v2
Dataset Open

MultiGraSCCo - A Multilingual version of the Graz Synthetic Clinical text Corpus with Annotations of Personal Identifiers

  • 1. Technische Universität Berlin
  • 2. German Research Center for Artificial Intelligence (DFKI)
  • 3. German Research Center for Artificial Intelligence
  • 4. Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
  • 5. DFKI

Description

MultiGraSCCo - A Multilingual version of the Graz Synthetic Clinical text Corpus with Annotations of Personal Information

This repository is an external resource accompanying:

  • Baroud, I., Otto, C., Czehmann, V., Hovhannisyan, C., Raithel, L., Möller, S., Roller, R. (2026). MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers. arXiv preprint arXiv:2603.08879.

The Graz Synthetic Clinical text Corpus (GraSCCo) is a dataset that contains artificially generated semi-structured and unstructured German-language clinical summaries. These summaries are formulated as letters from the hospital to the patient's GP after in-patient or out-patient care. Further details:

  • Stefan Schulz. (2022). GraSCCo (Version v1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6539131
  • Modersohn L, Schulz S, Lohr C, Hahn U. GRASCCO - The First Publicly Shareable, Multiply-Alienated German Clinical Text Corpus. Stud Health Technol Inform. 2022;296:66-72. doi:10.3233/SHTI220805

This work extends the Protected Health Information (PHI) annotations of GraSCCo, introduced in the resource below, with annotations of Indirect Personal Identifiers (IPI), such as information about the patient's family, lifestyle, and socioeconomic and criminal history:

  • Lohr, C., Matthies, F., Jakob, F., Modersohn, L., Riedel, A., Hahn, U., Kiser, R., Boeker, M., & Meineke, F. (2024). GraSCCo_PHI - Graz Synthetic Clinical text Corpus with Protected Health Information Annotations [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11502329

We use and further develop the following guidelines for annotating IPIs in GraSCCo:

  • Baroud, I., Raithel, L., Möller, S., & Roller, R. (2025). MIMIC_III_IPI - Discharge Summaries from MIMIC-III with Indirect Personal Identifiers Annotations [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15372705

In this work, GraSCCo, together with its annotations of direct and indirect personal information, was translated into 9 languages; together with the German original, the corpus spans 10 languages from 6 language families and 3 scripts. MultiGraSCCo includes the following language families and languages: German, English (Germanic); Italian, French (Romance); Arabic (Semitic); Polish, Russian, Ukrainian (Slavic); Turkish (Turkic); and Persian (Indo-Iranian).

The repository contains the PHI and IPI annotations in JSON format for all 10 languages, as well as the IPI annotation guidelines.
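This record does not describe the layout of the JSON annotation files, so a reasonable first step after downloading them is to inspect their structure. The following is a minimal sketch that makes no assumption about the schema: it lists every key path found in a JSON file and counts how often each occurs. The function names key_paths and summarize are illustrative and not part of the dataset.

```python
import json
from collections import Counter
from pathlib import Path


def key_paths(node, prefix=""):
    """Yield a dotted path for every key found in a nested JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            # All list items share one "[]" segment, so repeated records
            # collapse onto the same path and are simply counted.
            yield from key_paths(item, f"{prefix}[]")


def summarize(json_file):
    """Count how often each key path occurs in one JSON file."""
    with Path(json_file).open(encoding="utf-8") as handle:
        return Counter(key_paths(json.load(handle)))
```

Calling summarize on one of the downloaded files (the path is whatever name the file has locally) returns a Counter whose keys reveal where documents, text spans, and labels are stored, without relying on any field name guessed in advance.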

Files

IPI - Annotation Guidelines - GraSCCo.pdf

Files (2.9 MB)

Name Size
md5:00514fff5ecb2bedbf2b096a3f9dad70 198.4 kB
md5:fdbbd13b4adf73cf61808347118196b2 2.7 MB
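
The MD5 checksums listed for the two files can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the two checksum values are copied from the listing above, while the helper names md5_of and matches_listing are illustrative, and the local file path is whatever name the downloaded file has.

```python
import hashlib
from pathlib import Path

# Checksums as listed in the Files section of this record.
LISTED_MD5 = {
    "00514fff5ecb2bedbf2b096a3f9dad70",  # 198.4 kB file
    "fdbbd13b4adf73cf61808347118196b2",  # 2.7 MB file
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """True if the file's MD5 digest equals one of the listed checksums."""
    return md5_of(path) in LISTED_MD5
```

A file for which matches_listing returns False was either corrupted in transit or belongs to a different version of the record. MD5 is adequate for detecting accidental corruption but is not a safeguard against deliberate tampering.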

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.6539131 (DOI)
Dataset: 10.5281/zenodo.11502329 (DOI)