Published February 2, 2026 | Version 1.0.0 | Dataset | Open
MHC-Diff 100K pMHC Structure Dataset (Multi-allele)
# MHC-Diff 100K Dataset: Multi-Allele pMHC Structures
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
## Overview
This dataset contains **100,742 peptide-MHC class I (pMHC-I) structures** spanning 110 diverse HLA-I alleles and peptide lengths from 8 to 13 amino acids. It is designed for training and evaluating machine learning models for cross-allele pMHC structure prediction.
| Property | Value |
|----------|-------|
| **Total structures** | 100,742 |
| **X-ray structures** | 802 (from PDB and IMGT) |
| **PANDORA structures** | 99,940 (computationally modeled) |
| **MHC alleles** | 110 diverse HLA-I alleles |
| **Peptide lengths** | 8–13 amino acids |
| **Unique G-domains** | 286 |
| **Number of clusters** | 10 |
| **Total size** | ~47 GB |
## Data Sources
- **X-ray structures**: Experimental structures from the [Protein Data Bank (PDB)](https://www.rcsb.org/) and [IMGT/3Dstructure-DB](http://www.imgt.org/3Dstructure-DB/)
- **PANDORA structures**: Computationally modeled structures from the [PANDORA database](https://github.com/X-lab-3D/PANDORA)
## Clustering Strategy
MHC alleles were clustered by **hierarchical clustering** of their G-domain sequences (the domain that forms the peptide-binding groove), using **BLOSUM62** similarity scores. Because whole clusters are held out for validation and testing, test alleles have low sequence similarity to training alleles, which allows a rigorous evaluation of cross-allele generalization.
The varying cluster sizes reflect the natural distribution of G-domain sequence families.
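To illustrate the general idea, the sketch below runs average-linkage agglomerative clustering over pairwise sequence distances in pure Python. The linkage method, the distance cut-off, the toy sequences, and the `identity_distance` function are all assumptions made for this example. The dataset itself was clustered on BLOSUM62 similarity scores, which would take the place of the stand-in distance used here.

```python
from itertools import combinations

def identity_distance(a: str, b: str) -> float:
    """Stand-in distance: fraction of mismatched positions between two
    equal-length sequences. This is NOT BLOSUM62; illustration only."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_linkage(seqs, distance, threshold):
    """Repeatedly merge the two closest clusters until the smallest
    average inter-cluster distance exceeds `threshold`."""
    clusters = [[i] for i in range(len(seqs))]
    # Precompute all pairwise distances, keyed by (i, j) with i < j
    d = {(i, j): distance(seqs[i], seqs[j])
         for i, j in combinations(range(len(seqs)), 2)}

    def between(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(d[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        (a, b), best = min(
            (((x, y), between(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy sequences (hypothetical, far shorter than real G-domains)
toy_seqs = ["GSHSMRYF", "GSHSMRYY", "GSHSLKYF", "APQTWRWE", "APQTWKWE"]
clusters = average_linkage(toy_seqs, identity_distance, threshold=0.5)
print(clusters)
```

Holding out an entire cluster produced this way keeps every test sequence at least moderately distant from the training sequences, which is the property the cluster-based splits rely on.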
### Cluster Composition
| Cluster | G-domains | PANDORA | X-ray | Total |
|---------|-----------|---------|-------|-------|
| 1 | 2 | 59 | 0 | 59 |
| 2 | 74 | 30,044 | 138 | 30,182 |
| 3 | 11 | 3,412 | 18 | 3,430 |
| 4 | 11 | 523 | 52 | 575 |
| 5 | 54 | 26,227 | 351 | 26,578 |
| 6 | 2 | 0 | 2 | 2 |
| 7 | 30 | 2,337 | 78 | 2,415 |
| 8 | 2 | 10,425 | 0 | 10,425 |
| 9 | 99 | 26,889 | 163 | 27,052 |
| 10 | 1 | 24 | 0 | 24 |
## Files
```
mhc-diff-100k-v1.0/
├── README.md                  # This file
├── LICENSE                    # CC-BY-4.0 license
├── SHA256SUMS                 # Checksums for all files
├── samples.parquet            # Sample index (recommended)
├── samples.tsv.gz             # Sample index (alternative format)
├── split_recipes/             # Split definitions
│   ├── paper_split.json       # Train/val/test as used in the paper
│   ├── fold_cluster1.json     # Leave cluster 1 out
│   ├── ...
│   └── README.json            # Split recipe documentation
└── structures/                # HDF5 structure files
    ├── cluster_1.hdf5
    ├── cluster_2.hdf5.gz      # Gzip compressed (decompress before use)
    ├── ...
    └── cluster_10.hdf5
```
**Note:** `cluster_2.hdf5.gz` is gzip-compressed to reduce download size. Decompress before use:
```bash
gunzip structures/cluster_2.hdf5.gz
```
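Before decompressing, the download can be checked against the `SHA256SUMS` file. The sketch below assumes that file follows the usual `sha256sum` output format (`<hex digest>  <relative path>` per line), which this README does not state explicitly. The function names are illustrative.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB HDF5 files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checksums(root: Path, sums_name: str = "SHA256SUMS") -> dict:
    """Return {relative_path: True/False} for every entry in the checksum file.

    ASSUMPTION: each line looks like '<hex digest>  <relative path>'.
    """
    results = {}
    for line in (root / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum marks binary mode with '*'
        target = root / rel_path
        results[rel_path] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling `verify_checksums(Path("mhc-diff-100k-v1.0"))` on the unpacked archive returns one entry per listed file. Run it before `gunzip`, since the checksum list presumably refers to `cluster_2.hdf5.gz` in its compressed form.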
## Paper Split (Recommended)
| Split | Clusters | Structures | X-ray |
|-------|----------|------------|-------|
| **Train** | 1, 2, 5, 8, 9, 10 | 94,320 | 652 |
| **Validation** | 7 | 2,415 | 78 |
| **Test** | 3, 4, 6 | 4,007 | 72 |
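The split totals follow directly from the Cluster Composition table. The short check below recomputes them from the per-cluster counts. The dictionary and function names are illustrative, and the numbers are copied from the tables in this README.

```python
# (PANDORA, X-ray) structure counts per cluster, from the Cluster Composition table
CLUSTER_COUNTS = {
    1: (59, 0), 2: (30_044, 138), 3: (3_412, 18), 4: (523, 52), 5: (26_227, 351),
    6: (0, 2), 7: (2_337, 78), 8: (10_425, 0), 9: (26_889, 163), 10: (24, 0),
}
# Cluster assignment of the paper split, from the table above
PAPER_SPLIT = {"train": [1, 2, 5, 8, 9, 10], "validation": [7], "test": [3, 4, 6]}

def split_totals(clusters):
    """Return (total structures, X-ray structures) for a list of cluster IDs."""
    pandora = sum(CLUSTER_COUNTS[c][0] for c in clusters)
    xray = sum(CLUSTER_COUNTS[c][1] for c in clusters)
    return pandora + xray, xray

totals = {name: split_totals(clusters) for name, clusters in PAPER_SPLIT.items()}
for name, (n_total, n_xray) in totals.items():
    print(f"{name:<10} structures={n_total:>7,}  x-ray={n_xray}")
```

The same counts should come out of `samples.parquet` when grouping by `cluster_id` and `source`, which makes this a convenient sanity check after download.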
## Data Format
### Sample Index (`samples.parquet`)
| Column | Description |
|--------|-------------|
| `sample_id` | Unique structure identifier |
| `cluster_id` | Cluster assignment (1–10) |
| `source` | `xray` or `pandora` |
| `structure_file` | HDF5 file containing the structure |
### HDF5 Structure Files
Each HDF5 file contains multiple structures indexed by `sample_id`:
**X-ray structures** (4-letter PDB codes):
```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    # Each X-ray entry stores the structure as a raw PDB-format string
    pdb_string = f['1AKJ'][()].decode('utf-8')
```
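Because X-ray entries are stored as raw PDB text, coordinates have to be parsed from the string. A minimal sketch that pulls Cα coordinates per chain using the fixed-width columns of PDB `ATOM` records is shown below. The function name is illustrative, and the sketch makes no assumption about which chain ID holds the peptide, since this README does not specify it.

```python
def ca_coords_by_chain(pdb_string: str) -> dict:
    """Collect Cα coordinates per chain ID from fixed-width ATOM records."""
    coords = {}
    for line in pdb_string.splitlines():
        if line.startswith("ENDMDL"):
            break  # stop after the first model
        # Atom name occupies columns 13-16 (0-based slice 12:16)
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        # Column 17 is the alternate-location flag; keep blank or 'A' only
        if line[16:17] not in (" ", "A"):
            continue
        # x, y, z occupy columns 31-38, 39-46, 47-54
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        coords.setdefault(line[21], []).append(xyz)  # column 22 is the chain ID
    return coords
```

With `pdb_string` loaded as in the snippet above, `ca_coords_by_chain(pdb_string)` returns a dictionary mapping each chain ID to a list of `(x, y, z)` tuples. A full-featured parser such as Biopython's `PDBParser` is the safer choice for anything beyond quick inspection.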
**PANDORA structures** (IDs starting with `BA-`):
```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    entry = f['BA-100003']
    # Index 1 along the atom axis selects the Cα atom
    peptide_coords = entry['peptide']['atom14_gt_positions'][:, 1, :]  # Cα coords
    protein_coords = entry['protein']['atom14_gt_positions'][:, 1, :]  # Cα coords
    peptide_seq = entry['peptide']['aatype'][:]  # Amino acid indices (0–19)
```
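The `aatype` array holds integer residue indices, not letters. The sketch below decodes it to a one-letter sequence, assuming the indices follow the AlphaFold/OpenFold residue ordering. That ordering is an assumption suggested by the `atom14` naming; this README only states that indices run from 0 to 19, so the mapping should be verified against a peptide whose sequence is known.

```python
# ASSUMPTION: indices follow the AlphaFold/OpenFold residue ordering.
# The README only states that indices run 0-19; verify before relying on this.
RESTYPES = "ARNDCQEGHILKMFPSTWYV"

def decode_aatype(aatype) -> str:
    """Map an iterable of integer residue indices to a one-letter sequence.

    Indices outside 0-19 are rendered as 'X'.
    """
    letters = []
    for index in aatype:
        index = int(index)  # also accepts NumPy integer scalars
        letters.append(RESTYPES[index] if 0 <= index < len(RESTYPES) else "X")
    return "".join(letters)

print(decode_aatype([10, 9, 19]))
```

Applied to the PANDORA entry above, `decode_aatype(peptide_seq)` would return the peptide sequence as a string, provided the ordering assumption holds.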
## Usage
### Paper Split
```python
import json

import pandas as pd

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Load paper split
with open('split_recipes/paper_split.json') as f:
    split = json.load(f)

# Create splits by selecting whole clusters
train = samples[samples['cluster_id'].isin(split['train_clusters'])]
val = samples[samples['cluster_id'].isin(split['validation_clusters'])]
test = samples[samples['cluster_id'].isin(split['test_clusters'])]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
```
### Leave-One-Cluster-Out Cross-Validation
```python
import json
import pandas as pd

samples = pd.read_parquet('samples.parquet')

for cluster_id in range(1, 11):  # one fold per held-out cluster
    with open(f'split_recipes/fold_cluster{cluster_id}.json') as f:
        fold = json.load(f)
    train = samples[samples['cluster_id'].isin(fold['train_clusters'])]
    test = samples[samples['cluster_id'].isin(fold['test_clusters'])]
```
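For reference, the structure of a leave-one-cluster-out scheme can be reproduced without the recipe files. The sketch below builds one fold per cluster using the `train_clusters` and `test_clusters` keys seen in the code above. Whether the shipped `fold_cluster*.json` files contain further keys, such as a validation set, is not stated here, so treat this as an approximation of their content.

```python
ALL_CLUSTERS = list(range(1, 11))

def leave_one_cluster_out(clusters):
    """Yield one fold per cluster: that cluster is the test set, the rest train."""
    for held_out in clusters:
        yield {
            "train_clusters": [c for c in clusters if c != held_out],
            "test_clusters": [held_out],
        }

folds = list(leave_one_cluster_out(ALL_CLUSTERS))
for fold in folds:
    print(fold["test_clusters"], "held out; train on", fold["train_clusters"])
```

Fold sizes are highly uneven: according to the Cluster Composition table, clusters 1, 8, and 10 contain no X-ray structures, and cluster 6 contains only 2 structures in total, so per-fold metrics are best reported alongside the test set size.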
## Related Datasets
The **MHC-Diff 8K Dataset** is a subset of this dataset, focusing specifically on HLA-A\*02:01 with 9-mer peptides.
- **MHC-Diff 8K Dataset**: [Zenodo DOI to be added]
## Citation
If you use this dataset, please cite:
```bibtex
@article{fruhbuss2025mhcdiff,
title={MHC-Diff: Fast and Accurate Peptide-MHC Structure Prediction via an Equivariant Diffusion Model},
author={Fr{\"u}hbu{\ss}, David and Baakman, Coos and Teusink, Siem and Bekkers, Erik and Jegelka, Stefanie and Xue, Li},
year={2025}
}
```
## References
1. Berman, H.M., et al. "The Protein Data Bank." *Nucleic Acids Research* 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235
2. Lefranc, M.-P., et al. "IMGT/3Dstructure-DB." *Nucleic Acids Research* 33(suppl 1), D593–D597 (2005). https://doi.org/10.1093/nar/gki010
3. Marzella, D.F., et al. "PANDORA: a fast, anchor-restrained modelling protocol for peptide:MHC complexes." *Frontiers in Immunology* 13, 878762 (2022). https://doi.org/10.3389/fimmu.2022.878762
4. Marzella, D.F., Crocioni, G., et al. "Geometric deep learning improves generalizability of MHC-bound peptide predictions." *Communications Biology* 7(1), 1661 (2024). https://doi.org/10.1038/s42003-024-07292-1
## License
This dataset is released under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).
## Contact
- Li Xue: Li.Xue@radboudumc.nl
Files
| Name | Size | Checksum |
|---|---|---|
| mhc-diff-100k-v1.0.zip | 27.5 GB | md5:4c29fe111bf4a081de91ba3e38d6224c |
Additional details
Related works
- Is published in: Journal article, DOI 10.1101/2025.04.28.650973
Software
- Repository URL: https://github.com/DavidFruehbuss/MHC-Diff
- Programming language: Python
- Development status: Active
References
- Marzella DF, Parizi FM, van Tilborg D, Renaud N, Sybrandi D, Buzatu R, Rademaker DT, 't Hoen PAC, Xue LC. PANDORA: A Fast, Anchor-Restrained Modelling Protocol for Peptide: MHC Complexes. Front Immunol. 2022 May 10;13:878762. doi: 10.3389/fimmu.2022.878762. PMID: 35619705; PMCID: PMC9127323.