De-identifying Spanish medical texts - Named Entity Recognition applied to radiology reports

Irene Pérez-Díez; Raúl Pérez-Moraga; Adolfo López-Cerdán; Marisa Caparrós Redondo; Jose-Maria Salinas-Serrano; María de la Iglesia-Vayá

doi:10.1101/2020.04.09.20058958

Published October 5, 2020 | Version v1

Preprint Open

1. FISABIO-CIPF Joint Research Unit in Biomedical Imaging
2. Health Informatics Department, Hospital San Juan de Alicante

Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information about both patients and medical staff. Although several anonymization strategies currently exist for the English language, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts that is translatable to other languages. We tested four neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. In addition, we developed a randomization algorithm that substitutes the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. The proposed strategy, which combines named entity recognition with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it can be easily extended to other languages and medical texts, such as electronic health records.
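
The substitution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the category labels, the surrogate lists, the example sentence, and the function name `randomize_entities` are all assumptions made for illustration, and the entity spans are located by string search rather than by a trained recognizer.

```python
import random

# Hypothetical surrogate pools, one per entity category (assumed labels;
# the categories used in the paper are not listed in this record).
SURROGATES = {
    "NAME": ["María García", "José Martínez", "Lucía Fernández"],
    "HOSPITAL": ["Hospital General de Ejemplo", "Clínica Ficticia del Norte"],
}


def randomize_entities(text, entities, rng):
    """Replace each (start, end, label) span with a random surrogate
    drawn from the pool of the same category."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])      # untouched text before the entity
        pool = SURROGATES.get(label)
        # Fall back to a category placeholder when no pool is defined.
        pieces.append(rng.choice(pool) if pool else f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])               # remaining text after the last entity
    return "".join(pieces)


def span_of(text, fragment, label):
    """Helper standing in for a recognizer: locate a known fragment."""
    start = text.index(fragment)
    return (start, start + len(fragment), label)


report = "Paciente Juan Pérez atendido en Hospital San Juan."
detected = [
    span_of(report, "Juan Pérez", "NAME"),
    span_of(report, "Hospital San Juan", "HOSPITAL"),
]

rng = random.Random(0)  # seeded only to make the sketch reproducible
deidentified = randomize_entities(report, detected, rng)
print(deidentified)
```

Because each replacement comes from the same category as the detected entity, the output keeps the shape of a real report. In practice the spans would come from the named entity recognition model rather than from `span_of`.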

Files

Name	Size
DISMED.pdf (md5:3d6b28eb1a4cdf9fe5b608cf4cbc121b)	855.1 kB
