MultiGraSCCo - A Multilingual version of the Graz Synthetic Clinical text Corpus with Annotations of Personal Identifiers
Description
This repository is an external resource of:
- Baroud, I., Otto, C., Czehmann, V., Hovhannisyan, C., Raithel, L., Möller, S., Roller, R. (2026). MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers. arXiv preprint arXiv:2603.08879.
The Graz Synthetic Clinical text Corpus (GraSCCo) is a dataset that contains artificially generated semi-structured and unstructured German-language clinical summaries. These summaries are formulated as letters from the hospital to the patient's GP after in-patient or out-patient care. Further details:
- Stefan Schulz. (2022). GraSCCo (Version v1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6539131
- Modersohn L, Schulz S, Lohr C, Hahn U. GRASCCO - The First Publicly Shareable, Multiply-Alienated German Clinical Text Corpus. Stud Health Technol Inform. 2022;296:66-72. doi:10.3233/SHTI220805
This work extends the annotations of Protected Health Information (PHI) in GraSCCo, introduced in the following resource, with annotations of Indirect Personal Identifiers (IPI) such as information about the patient's family, lifestyle, and socioeconomic and criminal history:
- Lohr, C., Matthies, F., Jakob, F., Modersohn, L., Riedel, A., Hahn, U., Kiser, R., Boeker, M., & Meineke, F. (2024). GraSCCo_PHI - Graz Synthetic Clinical text Corpus with Protected Health Information Annotations [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11502329
We use and further develop the following guidelines for annotating IPIs in GraSCCo:
- Baroud, I., Raithel, L., Möller, S., & Roller, R. (2025). MIMIC_III_IPI - Discharge Summaries from MIMIC-III with Indirect Personal Identifiers Annotations [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15372705
In this work, GraSCCo, together with its annotations of direct and indirect personal information, was translated from German into 9 languages. Together with the German original, the corpus spans 10 languages from 6 language families and 3 scripts. MultiGraSCCo includes the following language families/languages: German, English (Germanic); Italian, French (Romance); Arabic (Semitic); Polish, Russian, Ukrainian (Slavic); Turkish (Turkic); and Persian (Indo-Iranian).
The repository contains the PHI and IPI annotations in JSON format for all 10 languages, as well as the IPI annotation guidelines.
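Since the record does not describe the layout of the JSON annotation files, a first step after downloading them could be to inspect their structure. The following is a minimal sketch that makes no assumption about the schema: it only counts how often each key path occurs in a nested JSON document. The function names `summarize_json` and `summarize_file` are hypothetical helpers introduced here for illustration and are not part of the dataset.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_json(node, counts=None, prefix=""):
    """Count how often each key path occurs in a nested JSON structure."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            counts[path] += 1
            summarize_json(value, counts, path)
    elif isinstance(node, list):
        for item in node:
            # All items of a list share one path, marked with "[]".
            summarize_json(item, counts, f"{prefix}[]")
    return counts


def summarize_file(path):
    """Load one JSON file (hypothetical path) and summarize its key paths."""
    # Read as UTF-8 explicitly: the corpus includes non-Latin scripts.
    with Path(path).open(encoding="utf-8") as handle:
        return summarize_json(json.load(handle))
```

Calling `summarize_file` on one of the downloaded annotation files and printing the resulting counter should reveal which fields the annotations actually use before any label-specific processing is written.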
Files
IPI - Annotation Guidelines - GraSCCo.pdf
(2.9 MB)
| MD5 checksum | Size |
|---|---|
| 00514fff5ecb2bedbf2b096a3f9dad70 | 198.4 kB |
| fdbbd13b4adf73cf61808347118196b2 | 2.7 MB |
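The MD5 checksums listed for the two files can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The dictionary keys are descriptive placeholders (the file names are not shown in the listing above), and `md5_of_file` and `matches_listed_checksum` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing; keys are placeholders, not file names.
EXPECTED_MD5 = {
    "198.4 kB file": "00514fff5ecb2bedbf2b096a3f9dad70",
    "2.7 MB file": "fdbbd13b4adf73cf61808347118196b2",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Check whether a local file matches either of the listed checksums."""
    return md5_of_file(path) in EXPECTED_MD5.values()
```

A downloaded file whose digest matches neither listed value was most likely truncated or altered in transit and should be downloaded again.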
Additional details
Related works
- Continues
- Dataset: 10.5281/zenodo.6539131 (DOI)
- Dataset: 10.5281/zenodo.11502329 (DOI)