Published May 5, 2025 | Version v1.0
Dataset Open

MIPHEI-ViT Dataset: Processed Restained H&E and mIF Data from Orion-CRC and HEMIT

  • 1. Sanofi (France)
  • 2. Centre for Computational Biology (CBIO), Mines Paris
  • 3. InstaDeep
  • 4. Sanofi (Spain)
  • 5. Centre de Morphologie Mathématique - CMM
  • 6. Institut Curie
  • 7. Inserm

Description

Overview

This dataset is released alongside our paper:
“MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models”,
where we introduce a model for predicting mIF marker expression directly from H&E morphology using vision transformer (ViT) foundation models.

We provide a carefully preprocessed version of two existing public datasets, specifically tailored for the task of H&E-to-mIF image translation. The dataset contains aligned and restained Hematoxylin & Eosin (H&E) and multiplex immunofluorescence (mIF) image tiles. These preprocessed tiles, along with associated metadata, enable full reproducibility of our experiments—including model training, evaluation, and cell-level analysis—and can serve as a ready-to-use resource for future work in H&E-to-mIF translation and multimodal learning.

Source Datasets

The dataset is derived from the following open-source datasets, containing aligned restained H&E and mIF images:

  • ORION-CRC
    Source: labsyspharm/ORION-CRC – Zenodo
    Citation:
    Lin J. labsyspharm/ORION-CRC [dataset]. Zenodo. 2023. doi:10.5281/zenodo.7637988
    Lin, J., et al. "High-plex immunofluorescence imaging and traditional histology of the same tissue section for discovering image-based biomarkers," in Nature Cancer, vol. 4, no. 7, pp. 1036–1052, 2023.
    License: MIT License

  • HEMIT
    Source: Mendeley Data
    Citation:
    Bian C, Philips B, Cootes T, et al. HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator. arXiv preprint arXiv:2403.18501, 2024.
    License: CC BY 4.0

We also used data from the IMMUcan dataset for validation purposes; however, it is not redistributed here as the dataset remains private.

Preprocessing Pipeline

Our preprocessing steps include:

  • On ORION-CRC and HEMIT
    • Nucleus segmentation with Cellpose on the DAPI channel

    • Single-cell pseudo-labeling using Gaussian Mixture Models (GMM) based on mIF marker expression

  • On ORION-CRC only
    • Whole-slide image (WSI) to tile extraction at 20× magnification

    • Artifact filtering using:

      • Channel-based noise detection on mIF

      • Foundation model (H-optimus-0) feature clustering to remove H&E artifact tiles

    • Autofluorescence subtraction using a custom Napari tool with marker-specific correction formulas

    • Channel normalization via percentile clipping and log-transformation
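
As an illustration of the last step, here is a minimal sketch of percentile clipping followed by a log transform and 8-bit rescaling. The percentile bounds, the order of operations, and the function name `normalize_channel` are assumptions made for this example; the released pipeline in the repository may use different values per marker.

```python
import numpy as np


def normalize_channel(channel, low_pct=1.0, high_pct=99.9):
    """Clip one mIF channel to a percentile range, log-transform it, rescale to 8-bit.

    low_pct and high_pct are illustrative defaults, not the values used by the authors.
    """
    channel = np.asarray(channel, dtype=np.float64)
    lo, hi = np.percentile(channel, [low_pct, high_pct])
    if hi <= lo:
        # Flat channel: nothing to stretch, return all zeros.
        return np.zeros(channel.shape, dtype=np.uint8)
    clipped = np.clip(channel, lo, hi) - lo   # values now in [0, hi - lo]
    logged = np.log1p(clipped)                # compress the bright tail
    scaled = logged / np.log1p(hi - lo)       # values now in [0, 1]
    return np.round(scaled * 255).astype(np.uint8)


rng = np.random.default_rng(0)
# Synthetic intensities standing in for a raw mIF marker channel.
raw = rng.gamma(shape=2.0, scale=500.0, size=(64, 64))
out = normalize_channel(raw)
print(out.dtype, out.min(), out.max())
```

Clipping before the log transform keeps a handful of saturated pixels from dominating the dynamic range of the 8-bit output.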

These steps aim to produce high-confidence marker-positive cell annotations from noisy mIF data, enabling robust learning and evaluation on paired H&E images.
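
To make the pseudo-labeling idea concrete, the sketch below fits a two-component one-dimensional Gaussian mixture to per-cell marker intensities with a plain EM loop and marks cells as positive when the posterior of the brighter component exceeds a threshold. This is an illustration under stated assumptions, not the authors' implementation: the two-component setup, the initialisation, the 0.9 threshold, and all function names are hypothetical.

```python
import numpy as np


def _responsibilities(x, weights, means, stds):
    """Posterior probability of each component for every value in x (shape: n x 2)."""
    z = (x[:, None] - means) / stds
    dens = weights * np.exp(-0.5 * z ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return dens / (dens.sum(axis=1, keepdims=True) + 1e-300)


def fit_two_gaussians(x, n_iter=200):
    """Fit a two-component 1D Gaussian mixture with EM; returns (weights, means, stds)."""
    x = np.asarray(x, dtype=np.float64)
    # Start one component in the dim tail and one in the bright tail.
    means = np.percentile(x, [5, 95])
    stds = np.full(2, x.std() + 1e-6)
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        resp = _responsibilities(x, weights, means, stds)   # E-step
        nk = resp.sum(axis=0) + 1e-12                        # M-step
        weights = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk) + 1e-6
    return weights, means, stds


def pseudo_label(x, threshold=0.9):
    """Label a cell positive when the brighter component's posterior exceeds threshold."""
    x = np.asarray(x, dtype=np.float64)
    weights, means, stds = fit_two_gaussians(x)
    prob_pos = _responsibilities(x, weights, means, stds)[:, np.argmax(means)]
    return prob_pos > threshold


rng = np.random.default_rng(0)
# Synthetic per-cell intensities: 800 dim (negative) and 200 bright (positive) cells.
intensity = np.concatenate([rng.normal(1.0, 0.3, 800), rng.normal(4.0, 0.5, 200)])
labels = pseudo_label(intensity)
print(labels[:800].mean(), labels[800:].mean())
```

A strict posterior threshold trades recall for precision, which fits the stated goal of high-confidence positive annotations.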

File Structure

The dataset is organized into the following archives and directories:

  • HEMIT_nuclei_analysis.zip
    Preprocessed HEMIT data containing:

    • Nuclei segmentation masks generated using Cellpose (40× resolution)

    • Corresponding single-cell data in .csv format (per-cell intensities and cell types)

    • ⚠️ This archive does not include the raw images from the original HEMIT dataset, which is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You must download the original dataset separately from Mendeley Data. The internal structure of this archive matches the original, so our nuclei and single-cell annotations can be merged directly into it.
  • ORION_dataset_20x.zip
    ORION tile dataset at 20× magnification. Folder structure:

    • he/ – JPEG tiles of H&E images

    • if/ – 8-bit TIFF cleaned mIF images

    • nuclei/ – Label TIFF nucleus masks

    • csv_nuclei_pos/ – Per-WSI CSV files containing single-cell data: nucleus positions and cell types

    • slide_dataframe.csv
      Dataframe that maps each slide (identified by slide_name) to its corresponding H&E, mIF, and nuclei WSI and CSV names.
      Columns include:

      • slide_name: Unique slide IDs

      • he_path: Paths to the H&E WSI

      • if_path: Paths to the mIF WSI

      • nuclei_path: Paths to the nucleus label WSI

      • nuclei_csv_path: Paths to the CSV file containing single-cell data (nucleus positions and cell types)
    • train_dataframe.csv, val_dataframe.csv, test_dataframe.csv
      Dataframes containing tile-level metadata for each dataset split. Each row corresponds to a tile used during model training or evaluation.
      Columns include:

      • slide_name: IDs of the associated slide

      • image_path: Paths to the H&E tile

      • target_path: Paths to the corresponding mIF tile

      • nuclei_path: Paths to the nucleus label tile

  • ORION_dataset_20x_he_norm.zip
    CycleGAN-normalized H&E images from ORION, transformed to match the staining style of IMMUcan data (20× resolution). These images can be used as augmentation during the training of MIPHEI-ViT.
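
The split dataframes described above can be read with the Python standard library. The sketch below loads one split, resolves the three tile paths for each row, and checks whether two splits share any slides. It assumes the paths stored in the CSV files are relative to the extraction directory, which should be verified against the actual files; the helper names and the demo file names are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def load_tiles(csv_path, root):
    """Read a split dataframe and resolve each tile path against the extraction root."""
    root = Path(root)
    tiles = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            tiles.append({
                "slide_name": row["slide_name"],
                "image": root / row["image_path"],     # H&E tile
                "target": root / row["target_path"],   # mIF tile
                "nuclei": root / row["nuclei_path"],   # nucleus label tile
            })
    return tiles


def shared_slides(split_a, split_b):
    """Return slide names present in both splits (empty if they are slide-disjoint)."""
    return {t["slide_name"] for t in split_a} & {t["slide_name"] for t in split_b}


# Demo on a throwaway CSV; the slide and tile names below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "train_dataframe.csv"
    demo_csv.write_text(
        "slide_name,image_path,target_path,nuclei_path\n"
        "slide_A,he/t0.jpeg,if/t0.tiff,nuclei/t0.tiff\n"
        "slide_B,he/t1.jpeg,if/t1.tiff,nuclei/t1.tiff\n"
    )
    train = load_tiles(demo_csv, root=tmp)

print(len(train), shared_slides(train[:1], train[1:]))
```

Running `shared_slides` on the real train, validation, and test splits is a quick way to confirm that no slide contributes tiles to more than one split.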

To extract an archive, use the following command:

7z x <path.zip>

Code & Tools

The full preprocessing pipeline used to produce this dataset — including tile extraction, autofluorescence correction, artifact removal, and single-cell analysis — is available at:

https://github.com/Sanofi-Public/MIPHEI-ViT

This code allows you to reproduce our results and adapt the workflow to new datasets.

Citation

Please cite the associated paper and Zenodo DOI when using this dataset:

G. Balezo, et al., "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models," 2025 [DOI TO UPDATE]

Files

Files (146.1 GB)

md5:365f95b1101f46379cc31ed32bd24cdb  582.2 MB
md5:2c95aadd8c5fff0185cd0a055a08d94d  18.5 GB
md5:fdc3188206ac68576b4195cd039d9061  127.0 GB
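
Given the size of the archives, it is worth checking a download against its MD5 checksum before extracting. The sketch below uses only the standard library; the function names are illustrative, and the pairing of each checksum with its archive should be read from the record's file listing.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 matches; accepts the 'md5:<hex>' form used in the listing."""
    return md5sum(path) == expected.removeprefix("md5:").lower()


# Usage (the local path is a placeholder for wherever the archive was saved):
# verify("downloads/archive.zip", "md5:<checksum from the file listing>")
```

`str.removeprefix` requires Python 3.9 or later.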

Additional details

Related works

Is derived from
Dataset: 10.5281/zenodo.7637988 (DOI)
Is supplement to
Dataset: 10.17632/3gx53zm49d.1 (DOI)

Software

Repository URL
https://github.com/Sanofi-Public/MIPHEI-ViT
Programming language
Python

References

  • Balezo, G., et al. "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models," 2025.
  • Lin, J., et al. "High-plex immunofluorescence imaging and traditional histology of the same tissue section for discovering image-based biomarkers," in Nature Cancer, vol. 4, no. 7, pp. 1036–1052, 2023.
  • Bian, C., et al. "HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024, pp. 184–197.