Published November 21, 2023 | Version v1
Other Open

CARMEN-I: Anonymization Protocol for Clinical Reports in Spanish

Description

CARMEN-I is a corpus of 2,000 de-identified clinical records generated at the Hospital Clínic of Barcelona (HCB) from March 2020 to March 2022, during the height of the COVID-19 pandemic, and developed in collaboration with the Barcelona Supercomputing Center (BSC). It consists of discharge letters, referrals and radiology reports written mainly in Spanish, with some sections in Catalan. The corpus covers patients admitted with COVID-19 and includes a wide variety of comorbidities, such as kidney failure, chronic cardiovascular and respiratory diseases, malignancies and immunosuppression. CARMEN-I has been exhaustively anonymized and validated by hospital physicians, natural language processing experts and linguists, following detailed annotation guidelines and replacing original sensitive data elements with synthetic equivalents. A subset of the corpus has been annotated by experts with key medical concepts, namely symptoms, diseases, procedures, medications, species and humans (including family members), using an annotation scheme based on previously released biomedical corpora such as DisTEMIST, ProcTEMIST or LivingNER.

This repository includes the anonymization protocol, written in Spanish. The document describes the protocol created for the data anonymization process, as well as the control mechanisms put in place to support it. It also includes addenda to the MEDDOCAN guidelines for the annotation of sensitive data, criteria for the inclusion and exclusion of documents, and a list of indirect identifiers.

CARMEN-I is available on PhysioNet upon request.

If you use this document, please cite:

@article{LimaLopez2025,
author = {Salvador Lima-López and Eulàlia Farré-Maduell and Luis Gasco and Jan Rodríguez-Miret and Santiago Frid and Xavier Pastor and Xavier Borrat and Martin Krallinger},
title = {A textual dataset of de-identified health records in Spanish and Catalan for medical entity recognition and anonymization},
journal = {Scientific Data},
volume = {12},
pages = {Article 1088},
year = {2025},
publisher = {Nature Publishing Group},
doi = {10.1038/s41597-025-05320-1},
url = {https://www.nature.com/articles/s41597-025-05320-1}
}

Files

[HCB-BSC] Protocolo y criterios anonimización.pdf

Files (2.1 MB)

Additional details

Related works

Is variant form of
Other: 10.5281/zenodo.10171681 (DOI)