MIPHEI-ViT Dataset: Processed Restained H&E and mIF Data from Orion-CRC and HEMIT

Balezo, Guillaume; Trullo, Roger; Pla Planas, Albert; Decencière, Etienne; Walter, Thomas

doi:10.5281/zenodo.15340874

Published May 5, 2025 | Version v1.0

Dataset Open

1. Sanofi (France)
2. Centre for Computational Biology (CBIO), Mines Paris
3. InstaDeep
4. Sanofi (Spain)
5. Centre de Morphologie Mathématique - CMM
6. Institut Curie
7. Inserm

Contributors

Researcher:

Balezo, Guillaume^{1, 2}

Supervisors:


Overview

This dataset is released alongside our paper:
“MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models”,
where we introduce a model for predicting mIF marker expression directly from H&E morphology using vision transformer (ViT) foundation models.

We provide a carefully preprocessed version of two existing public datasets, specifically tailored for the task of H&E-to-mIF image translation. The dataset contains aligned and restained Hematoxylin & Eosin (H&E) and multiplex immunofluorescence (mIF) image tiles. These preprocessed tiles, along with associated metadata, enable full reproducibility of our experiments—including model training, evaluation, and cell-level analysis—and can serve as a ready-to-use resource for future work in H&E-to-mIF translation and multimodal learning.

Source Datasets

The dataset is derived from the following open-source datasets, containing aligned restained H&E and mIF images:

ORION-CRC
Source: labsyspharm/ORION-CRC – Zenodo
Citation:
Lin J. labsyspharm/ORION-CRC [dataset]. Zenodo. 2023. doi:10.5281/zenodo.7637988
Lin, J., et al. "High-plex immunofluorescence imaging and traditional histology of the same tissue section for discovering image-based biomarkers," in Nature Cancer, vol. 4, no. 7, pp. 1036–1052, 2023.
License: MIT License
HEMIT
Source: Mendeley Data
Citation:
Bian C, Philips B, Cootes T, et al. HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator. arXiv preprint arXiv:2403.18501, 2024.
License: CC BY 4.0

We also used data from the IMMUcan dataset for validation purposes; however, it is not redistributed here as the dataset remains private.

Preprocessing Pipeline

Our preprocessing steps include:

On ORION-CRC and HEMIT
- Nucleus segmentation with Cellpose on the DAPI channel
- Single-cell pseudo-labeling using Gaussian Mixture Models (GMM) based on mIF marker expression

On ORION-CRC only
- WSI-to-tile extraction at 20× magnification
- Artifact filtering using:
  - Channel-based noise detection on mIF
  - Foundation model (H-optimus-0) feature clustering to remove H&E artifact tiles
- Autofluorescence subtraction using a custom Napari tool with marker-specific correction formulas
- Channel normalization via percentile clipping and log-transformation
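
The channel normalization step can be illustrated with a minimal sketch. The source only states that channels are normalized by percentile clipping and a log-transformation, and that the cleaned mIF tiles are 8-bit; the percentile values (1 and 99), the use of `log1p`, and the rescaling to 0..255 below are assumptions for illustration, not the parameters used to build the dataset. The function names are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("cannot take a percentile of an empty list")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def normalize_channel(pixels, low_q=1.0, high_q=99.0):
    """Clip to [low_q, high_q] percentiles, log-transform, rescale to 0..255.

    `pixels` is a flat list of raw intensities for one mIF channel.
    The default percentiles are assumed values, not taken from the source.
    """
    ordered = sorted(pixels)
    low = percentile(ordered, low_q)
    high = percentile(ordered, high_q)
    if high <= low:
        # Constant channel: nothing to stretch.
        return [0 for _ in pixels]
    scale = math.log1p(high - low)
    normalized = []
    for value in pixels:
        clipped = min(max(value, low), high)
        normalized.append(round(255.0 * math.log1p(clipped - low) / scale))
    return normalized
```

The log-transformation compresses the long bright tail typical of fluorescence intensities, so dim but real signal keeps more of the 8-bit range than it would under a linear rescale.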

These steps aim to produce high-confidence marker-positive cell annotations from noisy mIF data, enabling robust learning and evaluation on paired H&E images.
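
As a rough illustration of GMM-based pseudo-labeling, the sketch below fits a two-component, one-dimensional Gaussian mixture to per-cell intensities of a single marker and treats the higher-mean component as "marker-positive". The number of components, the per-marker 1-D fit, the initialization, and the function names are all assumptions; the source only states that Gaussian Mixture Models are applied to mIF marker expression. The code uses the standard library only to stay self-contained; a library implementation would normally be preferred in practice.

```python
import math


def _log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def _posteriors(x, weights, means, variances):
    """Return (component posteriors, log-likelihood) for a single value."""
    log_p = [math.log(weights[k]) + _log_gauss(x, means[k], variances[k])
             for k in (0, 1)]
    top = max(log_p)
    log_norm = top + math.log(sum(math.exp(lp - top) for lp in log_p))
    return [math.exp(lp - log_norm) for lp in log_p], log_norm


def fit_two_component_gmm(values, n_iter=200, tol=1e-8):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances), ordered so that index 1 is the
    higher-mean component (assumed here to be the marker-positive one).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = n // 2
    # Initialise the means from the lower and upper halves of the data.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    grand_mean = sum(values) / n
    grand_var = sum((v - grand_mean) ** 2 for v in values) / n
    var_floor = max(grand_var * 1e-6, 1e-12)  # avoids collapsing variances
    variances = [max(grand_var, var_floor)] * 2
    weights = [0.5, 0.5]
    previous = -math.inf
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each cell.
        resp, total = [], 0.0
        for v in values:
            post, log_norm = _posteriors(v, weights, means, variances)
            resp.append(post)
            total += log_norm
        # M-step: re-estimate weights, means, and variances.
        for k in (0, 1):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            spread = sum(r[k] * (v - means[k]) ** 2
                         for r, v in zip(resp, values)) / nk
            variances[k] = max(spread, var_floor)
        if abs(total - previous) < tol:
            break
        previous = total
    if means[0] > means[1]:
        weights.reverse()
        means.reverse()
        variances.reverse()
    return weights, means, variances


def positive_probability(x, weights, means, variances):
    """Posterior probability that intensity `x` belongs to the higher-mean component."""
    return _posteriors(x, weights, means, variances)[0][1]
```

Thresholding `positive_probability` at a high value (for example 0.9, an arbitrary choice) would keep only cells that the mixture assigns confidently to the positive component, in the spirit of the high-confidence annotations described above.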

File Structure

The dataset is organized into the following archives and directories:

HEMIT_nuclei_analysis.zip
Preprocessed HEMIT data containing:
- Nuclei segmentation masks generated using Cellpose (40× resolution)
- Corresponding single-cell data in .csv format (per-cell intensities and cell types)
- ⚠️ This archive does not include the raw images from the original HEMIT dataset, which is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Download the original dataset separately from Mendeley Data. The internal structure of this archive matches the original, so our nuclei and single-cell annotations can be merged directly into it.
ORIONCRC_dataset_tile_20x.zip
ORION tile dataset at 20× magnification. Folder structure:
- he/ – JPEG tiles of H&E images
- if/ – 8-bit TIFF cleaned mIF images
- nuclei/ – Label TIFF nucleus masks
- csv_nuclei_pos/ – Per-WSI CSV files containing single-cell data: nucleus positions and cell types
- slide_dataframe.csv
  Dataframe that maps each slide (identified by slide_name) to its corresponding H&E, mIF, and nuclei WSI and CSV names.
  Columns include:
  - slide_name: Unique slide IDs
  - he_path: Paths to the H&E WSI
  - if_path: Paths to the mIF WSI
  - nuclei_path: Paths to the nucleus label WSI
  - nuclei_csv_path: Paths to the CSV file containing single-cell data (nucleus positions and cell types)
- train_dataframe.csv, val_dataframe.csv, test_dataframe.csv
  Dataframes containing tile-level metadata for each dataset split. Each row corresponds to a tile used during model training or evaluation.
  Columns include:
  - slide_name: IDs of the associated slide
  - image_path: Paths to the H&E tile
  - target_path: Paths to the corresponding mIF tile
  - nuclei_path: Paths to the nucleus label tile
ORIONCRC_dataset_20x_he_norm.zip
CycleGAN-normalized H&E images from ORION, transformed to match the staining style of IMMUcan data (20× resolution). These images can be used as augmentation during the training of MIPHEI-ViT.
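
The split dataframes described above can be read with the standard library alone. The sketch below parses one split file and resolves the three tile paths against an extraction directory. The column names come from the description above; that the stored paths are relative to the extracted archive root is an assumption, and `read_tile_rows`, `tiles_per_slide`, and `extracted_root` are hypothetical names.

```python
import csv
from pathlib import Path

# Tile-level path columns listed for train/val/test dataframes.
TILE_COLUMNS = ("image_path", "target_path", "nuclei_path")


def read_tile_rows(lines, root):
    """Parse a split dataframe and resolve tile paths against `root`.

    `lines` is any iterable of CSV lines (an open text file works).
    Assumes the paths stored in the CSV are relative to `root`.
    """
    root = Path(root)
    reader = csv.DictReader(lines)
    required = ("slide_name",) + TILE_COLUMNS
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for record in reader:
        row = {"slide_name": record["slide_name"]}
        for column in TILE_COLUMNS:
            row[column] = root / record[column]
        rows.append(row)
    return rows


def tiles_per_slide(rows):
    """Count how many tiles each slide contributes to a split."""
    counts = {}
    for row in rows:
        counts[row["slide_name"]] = counts.get(row["slide_name"], 0) + 1
    return counts


# Example usage (extracted_root is a placeholder for your extraction directory):
# with open(Path(extracted_root) / "train_dataframe.csv", newline="") as handle:
#     train_rows = read_tile_rows(handle, extracted_root)
# print(tiles_per_slide(train_rows))
```

Each returned row pairs an H&E tile with its mIF target and nucleus mask, which is the unit a training or evaluation loop would iterate over.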

To extract an archive, run:

7z x <path.zip>

Code & Tools

The full preprocessing pipeline used to produce this dataset — including tile extraction, autofluorescence correction, artifact removal, and single-cell analysis — is available at:

GitHub repository: https://github.com/Sanofi-Public/MIPHEI-ViT

This code allows you to reproduce our results and adapt the workflow to new datasets.

Citation

Please cite the associated paper and Zenodo DOI when using this dataset:

G. Balezo et al., "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models," 2025 [DOI TO UPDATE]

Files

Files (146.1 GB)

Name	Size	MD5
HEMIT_nuclei_analysis.zip	582.2 MB	365f95b1101f46379cc31ed32bd24cdb
ORIONCRC_dataset_20x_he_norm.zip	18.5 GB	2c95aadd8c5fff0185cd0a055a08d94d
ORIONCRC_dataset_tile_20x.zip	127.0 GB	fdc3188206ac68576b4195cd039d9061
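
Given the archive sizes, it may be worth checking each download against the MD5 values listed above before extracting. The following is a minimal sketch using Python's `hashlib`; the checksums are copied from the file listing, while the function names and the chunk size are illustrative choices.

```python
import hashlib

# MD5 checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "HEMIT_nuclei_analysis.zip": "365f95b1101f46379cc31ed32bd24cdb",
    "ORIONCRC_dataset_20x_he_norm.zip": "2c95aadd8c5fff0185cd0a055a08d94d",
    "ORIONCRC_dataset_tile_20x.zip": "fdc3188206ac68576b4195cd039d9061",
}


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives never sit fully in memory."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def is_intact(path, archive_name):
    """Compare a downloaded file against the published checksum for `archive_name`."""
    return md5_of_file(path) == EXPECTED_MD5[archive_name]


# Example usage (the local path is a placeholder):
# print(is_intact("downloads/HEMIT_nuclei_analysis.zip", "HEMIT_nuclei_analysis.zip"))
```

Streaming the file in chunks matters here because the largest archive is 127.0 GB.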

Additional details

Related works

Is derived from: Dataset: 10.5281/zenodo.7637988 (DOI)
Is supplement to: Dataset: 10.17632/3gx53zm49d.1 (DOI)

Software

Repository URL: https://github.com/Sanofi-Public/MIPHEI-ViT
Programming language: Python

References

G. Balezo et al., "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models," 2025.
J. Lin et al., "High-plex immunofluorescence imaging and traditional histology of the same tissue section for discovering image-based biomarkers," in Nature Cancer, vol. 4, no. 7, pp. 1036–1052, 2023.
C. Bian et al., "HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024, pp. 184–197.

