Published September 25, 2025 | Version v2
Dataset Open

Pixel-level Protected Health Information (PHI) - Supplement to Exploring AI-Based System Design for Pixel-Level Protected Health Information Detection in Medical Images

  • 1. Bayer (Germany)
  • 2. Bayer AG

Description

This dataset comprises two collections, RadPHI-test and MIDI. Both are derived datasets created by overlaying synthetically generated text on publicly available medical images. All source images originate from the open-access resources cited in the References section [1]–[6]. The overlaid text forms synthetic imprints representing Protected Health Information (PHI) categories, intended for research on medical image de-identification and related tasks.

Intended Use: This dataset is provided to support research and development in: (1) medical image de-identification, computer vision, and related machine learning tasks; (2) method benchmarking and validation for academic, industrial, and non-profit research; and (3) education and reproducible science.
 
Disclaimer: This dataset contains images derived from open-access datasets under their respective licenses. The authors make no claim of ownership over the original images and have released this derivative work in accordance with the terms of those licenses. The synthetic overlays do not correspond to any real individuals. No real patient-identifiable information is present.
 

If you plan to use this dataset, please cite the following paper:

Truong, T., Baltruschat, I.M., Klemens, M. et al. Exploring AI-Based System Design for Pixel-Level Protected Health Information Detection in Medical Images. J Digit Imaging. Inform. med. (2025). https://doi.org/10.1007/s10278-025-01619-y

 

References

[1] Wasserthal J, Breit HC, Meyer MT, Pradella M, Hinck D, Sauter AW, Heye T, Boll DT, Cyriac J, Yang S, et al.: TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiol Artif Intell 5(5), 2023.

[2] Huang Z, Pu X, Tang G, Ping M, Jiang G, Wang M, Wei X, Ren Y: BS-80K: The first large open-access dataset of bone scan images. Comput Biol Med 151:106221, 2022.

[3] Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM: ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106, 2017.

[4] Antonelli M, Reinke A, Bakas S, Farahani K, Kopp-Schneider A, Landman BA, Litjens G, Menze B, Ronneberger O, Summers RM, et al.: The medical segmentation decathlon. Nat Commun 13(1):4128, 2022.

[5] Farahani K, Clunie D, Klenk J, Kopchick B, Diaz M, Pan Q, Pei L, Prior F, Rutherford M, Singh A, Sutton G, Wagner U: Medical Image De-Identification Benchmark (MIDI-B). Available at https://www.synapse.org/Synapse:syn53065760. Accessed 16 April 2025.

[6] Rutherford MW, Nolan T, Pei L, Wagner U, Pan Q, Farmer P, Smith K, Kopchick B, Opsahl-Ong L, Sutton G, Clunie DA, Farahani K, Prior F: Data in support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/cf2p-aw56, 2025.

Files

Files (337.2 MB)

Name       Size
data.zip   337.2 MB
md5:e0500555301475e89c9fe0d6bc35086a
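
The MD5 checksum listed for data.zip can be used to confirm that a download completed without corruption. The following is a minimal sketch, assuming the archive was saved as data.zip in the current working directory; the helper names md5_of_file and verify_download are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum copied from the file listing for data.zip.
EXPECTED_MD5 = "e0500555301475e89c9fe0d6bc35086a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("data.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again before extraction. MD5 is suitable here as an integrity check only, not as a security guarantee.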

Additional details

Related works

Is supplement to
Journal: 10.1007/s10278-025-01619-y (DOI)