MIPHEI-ViT Dataset: Processed Restained H&E and mIF Data from Orion-CRC and HEMIT
Description
Overview
This dataset is released alongside our paper:
“MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models”,
where we introduce a model for predicting mIF marker expression directly from H&E morphology using vision transformer (ViT) foundation models.
We provide a carefully preprocessed version of two existing public datasets, specifically tailored for the task of H&E-to-mIF image translation. The dataset contains aligned and restained Hematoxylin & Eosin (H&E) and multiplex immunofluorescence (mIF) image tiles. These preprocessed tiles, along with associated metadata, enable full reproducibility of our experiments—including model training, evaluation, and cell-level analysis—and can serve as a ready-to-use resource for future work in H&E-to-mIF translation and multimodal learning.
Source Datasets
The dataset is derived from the following open-source datasets, containing aligned restained H&E and mIF images:
- ORION-CRC
  Source: labsyspharm/ORION-CRC – Zenodo
  Citation:
  Lin J. labsyspharm/ORION-CRC [dataset]. Zenodo. 2023. doi:10.5281/zenodo.7637988
  Lin, J., et al. "High-plex immunofluorescence imaging and traditional histology of the same tissue section for discovering image-based biomarkers," in Nature Cancer, vol. 4, no. 7, pp. 1036–1052, 2023.
  License: MIT License
- HEMIT
  Source: Mendeley Data
  Citation:
  Bian C, Philips B, Cootes T, et al. HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator. arXiv preprint arXiv:2403.18501, 2024.
  License: CC BY 4.0
We also used data from the IMMUcan dataset for validation purposes; however, it is not redistributed here as the dataset remains private.
Preprocessing Pipeline
Our preprocessing steps include:
- On ORION-CRC and HEMIT
  - Nucleus segmentation with Cellpose on the DAPI channel
  - Single-cell pseudo-labeling using Gaussian Mixture Models (GMM) based on mIF marker expression
- On ORION-CRC only
  - WSI-to-tile extraction at 20× magnification (ORION-CRC)
  - Artifact filtering using:
    - Channel-based noise detection on mIF
    - Foundation model (H-optimus-0) feature clustering to remove H&E artifact tiles
  - Autofluorescence subtraction using a custom Napari tool with marker-specific correction formulas
  - Channel normalization via percentile clipping and log-transformation
These steps aim to produce high-confidence marker-positive cell annotations from noisy mIF data, enabling robust learning and evaluation on paired H&E images.
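As an illustration of the channel normalization step listed above, the following is a minimal sketch of percentile clipping followed by a log-transformation on a single mIF channel. The function name `normalize_channel`, the percentile bounds (1 and 99.9), the order of operations, and the rescaling to 8-bit are assumptions made for this example; they are not taken from the released pipeline, which may use different values per marker.

```python
import numpy as np


def normalize_channel(channel, low_pct=1.0, high_pct=99.9):
    """Clip one mIF channel to a percentile range, log-transform it,
    and rescale the result to 8-bit.

    The percentile bounds are illustrative defaults (an assumption),
    not the values used to build the dataset.
    """
    channel = np.asarray(channel, dtype=np.float64)
    lo, hi = np.percentile(channel, [low_pct, high_pct])
    if hi <= lo:
        # Flat channel: nothing to stretch, return an all-zero image.
        return np.zeros(channel.shape, dtype=np.uint8)
    # Clip outliers, then shift so the lower bound maps to zero.
    clipped = np.clip(channel, lo, hi) - lo
    # log1p compresses the dynamic range while keeping zero at zero.
    logged = np.log1p(clipped)
    # Divide by the largest possible value so the output spans [0, 1].
    scaled = logged / np.log1p(hi - lo)
    return np.round(scaled * 255).astype(np.uint8)
```

Applied channel by channel, a function of this kind maps raw intensities to a comparable range across markers, which is consistent with the 8-bit mIF tiles distributed in this dataset.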
File Structure
The dataset is organized into the following archives and directories:
- HEMIT_nuclei_analysis.zip
  Preprocessed HEMIT data containing:
  - Nuclei segmentation masks generated using Cellpose (40× resolution)
  - Corresponding single-cell data in .csv format (per-cell intensities and cell types)
  - ⚠️ This archive does not include the raw images from the original HEMIT dataset, which is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You must download the original dataset separately from Mendeley Data. The internal structure of this archive is designed to match the original, allowing direct integration of our nuclei and single-cell annotations.
- ORION_dataset_20x.zip
  ORION tile dataset at 20× magnification. Folder structure:
  - he/ – JPEG tiles of H&E images
  - if/ – 8-bit TIFF cleaned mIF images
  - nuclei/ – Label TIFF nucleus masks
  - csv_nuclei_pos/ – Per-WSI CSV files containing single-cell data: nucleus positions and cell types
  - slide_dataframe.csv
    Dataframe that maps each slide (identified by slide_name) to its corresponding H&E, mIF, and nuclei WSI and CSV names.
    Columns include:
    - slide_name: Unique slide IDs
    - he_path: Paths to the H&E WSI
    - if_path: Paths to the mIF WSI
    - nuclei_path: Paths to the nucleus label WSI
    - nuclei_csv_path: Paths to the CSV file containing single-cell data (nucleus positions and cell types)
  - train_dataframe.csv, val_dataframe.csv, test_dataframe.csv
    Dataframes containing tile-level metadata for each dataset split. Each row corresponds to a tile used during model training or evaluation.
    Columns include:
    - slide_name: IDs of the associated slide
    - image_path: Paths to the H&E tile
    - target_path: Paths to the corresponding mIF tile
    - nuclei_path: Paths to the nucleus label tile
- ORION_dataset_20x_he_norm.zip
  CycleGAN-normalized H&E images from ORION, transformed to match the staining style of IMMUcan data (20× resolution). These images can be used as augmentation during the training of MIPHEI-ViT.
To extract:
Use the following command:
7z x <path.zip>
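Once an archive is extracted, the split dataframes can be read with the Python standard library. The sketch below is only an illustration: the helper `load_split` is hypothetical, and it assumes that the path columns may be stored relative to the extraction folder, which should be checked against the actual CSV files. The column names are the ones listed in the file structure above.

```python
import csv
from pathlib import Path

# Columns documented for train/val/test_dataframe.csv.
TILE_COLUMNS = ("slide_name", "image_path", "target_path", "nuclei_path")


def load_split(csv_path, root=None):
    """Read a split dataframe into a list of dicts, one per tile.

    `root` is an optional extraction folder used to resolve relative
    paths (an assumption about how the paths are stored).
    """
    root = Path(root) if root is not None else None
    rows = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in TILE_COLUMNS if col not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        for row in reader:
            record = {"slide_name": row["slide_name"]}
            for key in TILE_COLUMNS[1:]:
                path = Path(row[key])
                # Only prefix the root when the stored path is relative.
                if root is not None and not path.is_absolute():
                    path = root / path
                record[key] = path
            rows.append(record)
    return rows
```

For example, `load_split("ORION_dataset_20x/train_dataframe.csv", root="ORION_dataset_20x")` would return one record per training tile, with the H&E, mIF, and nucleus label tile paths ready to pass to an image reader. The folder name in this call is a placeholder for wherever the archive was extracted.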
Code & Tools
The full preprocessing pipeline used to produce this dataset (tile extraction, autofluorescence correction, artifact removal, and single-cell analysis) is available at: https://github.com/Sanofi-Public/MIPHEI-ViT
This code allows you to reproduce our results and adapt the workflow to new datasets.
Citation
Please cite the associated paper and Zenodo DOI when using this dataset:
G. Balezo, et al., "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models," 2025 [DOI TO UPDATE]
Additional details
Related works
- Is derived from: Dataset: 10.5281/zenodo.7637988 (DOI)
- Is supplement to: Dataset: 10.17632/3gx53zm49d.1 (DOI)
Software
- Repository URL: https://github.com/Sanofi-Public/MIPHEI-ViT
- Programming language: Python
References
- G. Balezo, et al., "MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models," 2025.
- Lin, J., et al., "High-plex immunofluorescence imaging and traditional histology of the same tissue section for discovering image-based biomarkers," in Nature Cancer, vol. 4, no. 7, pp. 1036–1052, 2023.
- Bian, C., et al., "HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024, pp. 184–197.