Published February 2, 2026 | Version 1.0.0
Dataset Open

MHC-Diff 100K pMHC Structure Dataset (Multi-allele)

  • 1. Max-Planck Institute for Biochemistry
  • 2. Radboud University Medical Center
  • 3. University of Amsterdam
  • 4. Technical University of Munich
  • 5. Massachusetts Institute of Technology

Contributors

Researcher:

  • 1. Radboud University Medical Center

Description

# MHC-Diff 100K Dataset: Multi-Allele pMHC Structures

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

## Overview

This dataset contains **100,742 peptide-MHC class I (pMHC-I) structures** spanning 110 diverse HLA-I alleles and peptide lengths from 8 to 13 amino acids. It is designed for training and evaluating machine learning models for cross-allele pMHC structure prediction.

| Property | Value |
|----------|-------|
| **Total structures** | 100,742 |
| **X-ray structures** | 802 (from PDB and IMGT) |
| **PANDORA structures** | 99,940 (computationally modeled) |
| **MHC alleles** | 110 diverse HLA-I alleles |
| **Peptide lengths** | 8–13 amino acids |
| **Unique G-domains** | 286 |
| **Number of clusters** | 10 |
| **Total size** | ~47 GB |

## Data Sources

- **X-ray structures**: Experimental structures from the [Protein Data Bank (PDB)](https://www.rcsb.org/) and [IMGT/3Dstructure-DB](http://www.imgt.org/3Dstructure-DB/)
- **PANDORA structures**: Computationally modeled structures from the [PANDORA database](https://github.com/X-lab-3D/PANDORA)

## Clustering Strategy

MHC alleles were clustered using **hierarchical clustering** on G-domain sequences (the peptide-binding groove) with **BLOSUM62** similarity scores. Because every split holds out whole clusters, test alleles have low sequence similarity to training alleles, which enables rigorous evaluation of cross-allele generalization.

The varying cluster sizes reflect the natural distribution of G-domain sequence families.

### Cluster Composition

| Cluster | G-domains | PANDORA | X-ray | Total |
|---------|-----------|---------|-------|-------|
| 1 | 2 | 59 | 0 | 59 |
| 2 | 74 | 30,044 | 138 | 30,182 |
| 3 | 11 | 3,412 | 18 | 3,430 |
| 4 | 11 | 523 | 52 | 575 |
| 5 | 54 | 26,227 | 351 | 26,578 |
| 6 | 2 | 0 | 2 | 2 |
| 7 | 30 | 2,337 | 78 | 2,415 |
| 8 | 2 | 10,425 | 0 | 10,425 |
| 9 | 99 | 26,889 | 163 | 27,052 |
| 10 | 1 | 24 | 0 | 24 |
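
As a quick consistency check, the per-cluster counts can be summed and compared with the totals in the Overview table. The sketch below transcribes the rows of the table above by hand; the dictionary layout and the function name are illustrative choices, not part of the dataset.

```python
# Per-cluster counts copied from the Cluster Composition table:
# cluster_id -> (g_domains, pandora, xray)
CLUSTERS = {
    1: (2, 59, 0),
    2: (74, 30044, 138),
    3: (11, 3412, 18),
    4: (11, 523, 52),
    5: (54, 26227, 351),
    6: (2, 0, 2),
    7: (30, 2337, 78),
    8: (2, 10425, 0),
    9: (99, 26889, 163),
    10: (1, 24, 0),
}


def column_totals(clusters):
    """Sum G-domain, PANDORA and X-ray counts over all clusters."""
    g_domains = sum(g for g, _, _ in clusters.values())
    pandora = sum(p for _, p, _ in clusters.values())
    xray = sum(x for _, _, x in clusters.values())
    return {
        "g_domains": g_domains,
        "pandora": pandora,
        "xray": xray,
        "total": pandora + xray,
    }


totals = column_totals(CLUSTERS)
print(totals)
# {'g_domains': 286, 'pandora': 99940, 'xray': 802, 'total': 100742}
```

The sums match the Overview figures: 286 unique G-domains, 99,940 PANDORA structures, 802 X-ray structures, and 100,742 structures in total.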

## Files

```
mhc-diff-100k-v1.0/
├── README.md                # This file
├── LICENSE                  # CC-BY-4.0 license
├── SHA256SUMS               # Checksums for all files
├── samples.parquet          # Sample index (recommended)
├── samples.tsv.gz           # Sample index (alternative format)
├── split_recipes/           # Split definitions
│   ├── paper_split.json     # Train/val/test as used in the paper
│   ├── fold_cluster1.json   # Leave cluster 1 out
│   ├── ...
│   └── README.json          # Split recipe documentation
└── structures/              # HDF5 structure files
    ├── cluster_1.hdf5
    ├── cluster_2.hdf5.gz    # Gzip compressed (decompress before use)
    ├── ...
    └── cluster_10.hdf5
```
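
After downloading, the files can be checked against `SHA256SUMS`. The following is a minimal sketch that assumes the file uses the common `sha256sum` layout (one `<hex digest>  <relative path>` pair per line, paths relative to the directory holding `SHA256SUMS`). That layout is an assumption, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large HDF5 files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(sums_file):
    """Return the listed paths that are missing or whose digest differs.

    Assumes each non-empty line reads '<hex digest>  <relative path>'.
    """
    sums_file = Path(sums_file)
    root = sums_file.parent
    failures = []
    for line in sums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

Calling `verify_sha256sums('mhc-diff-100k-v1.0/SHA256SUMS')` should return an empty list for an intact download. If the checksum list refers to `cluster_2.hdf5.gz`, run the check before decompressing that file; otherwise it will be reported as missing.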

**Note:** `cluster_2.hdf5.gz` is gzip-compressed to reduce download size. Decompress before use:
```bash
gunzip structures/cluster_2.hdf5.gz
```
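
Where `gunzip` is not available (for example on Windows), the same step can be done with the Python standard library. This is a sketch, not part of the dataset tooling; the function name is illustrative. Unlike `gunzip`, it keeps the compressed file by default.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_original=True):
    """Stream-decompress 'name.ext.gz' to 'name.ext' and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_path = gz_path.with_suffix("")  # drops only the trailing '.gz'
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, not all at once
    if not keep_original:
        gz_path.unlink()
    return out_path
```

For this dataset the call would be `gunzip_file('structures/cluster_2.hdf5.gz')`, which writes `structures/cluster_2.hdf5` next to the archive.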

## Paper Split (Recommended)

| Split | Clusters | Structures | X-ray |
|-------|----------|------------|-------|
| **Train** | 1, 2, 5, 8, 9, 10 | 94,320 | 652 |
| **Validation** | 7 | 2,415 | 78 |
| **Test** | 3, 4, 6 | 4,007 | 72 |
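
Because the evaluation relies on whole clusters being held out, it can be worth confirming that a split recipe assigns each cluster to exactly one subset. The sketch below writes the paper split from the table above as a Python dictionary. The key names mirror those read in the Usage section; whether the JSON files contain further fields is not documented here, so the check only looks at these three keys.

```python
ALL_CLUSTERS = set(range(1, 11))  # the dataset defines clusters 1-10

# Paper split as listed in the table above.
PAPER_SPLIT = {
    "train_clusters": [1, 2, 5, 8, 9, 10],
    "validation_clusters": [7],
    "test_clusters": [3, 4, 6],
}

SPLIT_KEYS = ("train_clusters", "validation_clusters", "test_clusters")


def check_split(recipe, all_clusters=ALL_CLUSTERS):
    """Raise ValueError if a cluster is shared between subsets or left out."""
    seen = set()
    for key in SPLIT_KEYS:
        clusters = set(recipe.get(key, []))  # fold recipes may omit validation
        overlap = seen & clusters
        if overlap:
            raise ValueError(f"{key} reuses clusters {sorted(overlap)}")
        seen |= clusters
    missing = all_clusters - seen
    if missing:
        raise ValueError(f"clusters not assigned to any subset: {sorted(missing)}")
    return True


check_split(PAPER_SPLIT)  # passes: the three subsets partition clusters 1-10
```

The same function can be applied to a recipe loaded with `json.load`, for example to validate a custom split before training.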

## Data Format

### Sample Index (`samples.parquet`)

| Column | Description |
|--------|-------------|
| `sample_id` | Unique structure identifier |
| `cluster_id` | Cluster assignment (1-10) |
| `source` | `xray` or `pandora` |
| `structure_file` | HDF5 file containing the structure |
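
The index can also be read without pandas. The sketch below counts structures per cluster and source from `samples.tsv.gz` using only the standard library. It assumes the TSV is tab-separated with a header row carrying the same column names as the Parquet file; that layout is an assumption and the function name is illustrative.

```python
import csv
import gzip
from collections import Counter


def count_by_cluster_and_source(tsv_gz_path):
    """Count rows per (cluster_id, source) in the gzipped TSV sample index."""
    counts = Counter()
    with gzip.open(tsv_gz_path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[(int(row["cluster_id"]), row["source"])] += 1
    return counts
```

Iterating over `sorted(count_by_cluster_and_source('samples.tsv.gz').items())` should reproduce the PANDORA and X-ray columns of the Cluster Composition table.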

### HDF5 Structure Files

Each HDF5 file contains multiple structures indexed by `sample_id`:

**X-ray structures** (4-letter PDB codes):
```python
import h5py
with h5py.File('cluster_2.hdf5', 'r') as f:
    pdb_string = f['1AKJ'][()].decode('utf-8')  # Raw PDB format
```
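
Since X-ray entries are stored as raw PDB text, their coordinates have to be parsed before they can be compared with the PANDORA arrays. The sketch below pulls Cα positions out of such a string using the fixed-width columns of the PDB `ATOM` record. It is a simplification: it does not handle alternate locations or multi-model files, and which chain holds the peptide is not specified here. The function name is illustrative.

```python
def ca_coordinates(pdb_string):
    """Return [(chain_id, res_seq, (x, y, z)), ...] for every Cα ATOM record."""
    coords = []
    for line in pdb_string.splitlines():
        # Fixed-width PDB columns (1-based, inclusive): atom name 13-16,
        # chain 22, resSeq 23-26, x 31-38, y 39-46, z 47-54.
        # Requiring the ATOM record type skips HETATM calcium ions, whose
        # atom name is also 'CA'.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            res_seq = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords.append((chain_id, res_seq, xyz))
    return coords
```

With the snippet above, `ca_coordinates(pdb_string)` returns one tuple per residue, which can be grouped by `chain_id` to separate the peptide from the MHC chains. For anything beyond a quick look, a full parser such as Biopython's `PDBParser` is the safer choice.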

**PANDORA structures** (IDs starting with `BA-`):
```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    entry = f['BA-100003']
    # Atom index 1 along the second axis selects the Cα atom of each residue
    peptide_coords = entry['peptide']['atom14_gt_positions'][:, 1, :]  # Cα coords
    protein_coords = entry['protein']['atom14_gt_positions'][:, 1, :]  # Cα coords
    peptide_seq = entry['peptide']['aatype'][:]  # Amino acid indices (0-19)
```
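
The `aatype` array holds integer residue indices, not letters. The sketch below converts such indices to a one-letter sequence. The index-to-residue ordering is an assumption: the field names `aatype` and `atom14_gt_positions` match the AlphaFold/OpenFold conventions, so the AlphaFold residue ordering is used here, but this should be verified against the MHC-Diff repository before relying on it.

```python
# ASSUMED residue ordering (AlphaFold/OpenFold convention); not confirmed
# by this README.
RESTYPES = "ARNDCQEGHILKMFPSTWYV"


def aatype_to_sequence(aatype):
    """Map an iterable of integer residue indices (0-19) to a one-letter string."""
    letters = []
    for index in aatype:
        index = int(index)  # also accepts NumPy integer scalars
        if not 0 <= index < len(RESTYPES):
            raise ValueError(f"residue index {index} is outside 0-19")
        letters.append(RESTYPES[index])
    return "".join(letters)


print(aatype_to_sequence([7, 9, 10]))  # prints "GIL" under the assumed ordering
```

Applied to the snippet above, `aatype_to_sequence(peptide_seq)` would give the peptide sequence for `BA-100003`, provided the assumed ordering holds.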

## Usage

### Paper Split

```python
import pandas as pd
import json

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Load paper split
with open('split_recipes/paper_split.json') as f:
    split = json.load(f)

# Create splits
train = samples[samples['cluster_id'].isin(split['train_clusters'])]
val = samples[samples['cluster_id'].isin(split['validation_clusters'])]
test = samples[samples['cluster_id'].isin(split['test_clusters'])]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
```

### Leave-One-Cluster-Out Cross-Validation

```python
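# Reuses `json` and `samples` from the paper-split example above
for cluster_id in range(1, 11):
    with open(f'split_recipes/fold_cluster{cluster_id}.json') as f:
        fold = json.load(f)

    train = samples[samples['cluster_id'].isin(fold['train_clusters'])]
    test = samples[samples['cluster_id'].isin(fold['test_clusters'])]
    # Train and evaluate on this fold here; `train` and `test` are overwritten each iteration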
```

## Related Datasets

The **MHC-Diff 8K Dataset** is a subset of this dataset, focusing specifically on HLA-A\*02:01 with 9-mer peptides.

- **MHC-Diff 8K Dataset**: [Zenodo DOI to be added]

## Citation

If you use this dataset, please cite:

```bibtex
@article{fruhbuss2025mhcdiff,
  title={MHC-Diff: Fast and Accurate Peptide-MHC Structure Prediction via an Equivariant Diffusion Model},
  author={Fr{\"u}hbu{\ss}, David and Baakman, Coos and Teusink, Siem and Bekkers, Erik and Jegelka, Stefanie and Xue, Li},
  year={2025}
}
```

## References

1. Berman, H.M., et al. "The Protein Data Bank." *Nucleic Acids Research* 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235
2. Lefranc, M.-P., et al. "IMGT/3Dstructure-DB." *Nucleic Acids Research* 33(suppl 1), D593–D597 (2005). https://doi.org/10.1093/nar/gki010
3. Marzella, D.F., et al. "PANDORA: a fast, anchor-restrained modelling protocol for peptide:MHC complexes." *Frontiers in Immunology* 13, 878762 (2022). https://doi.org/10.3389/fimmu.2022.878762
4. Marzella, D.F., Crocioni, G., et al. "Geometric deep learning improves generalizability of MHC-bound peptide predictions." *Communications Biology* 7(1), 1661 (2024). https://doi.org/10.1038/s42003-024-07292-1

## License

This dataset is released under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).

## Contact

- Li Xue: Li.Xue@radboudumc.nl

 

Files

mhc-diff-100k-v1.0.zip (27.5 GB)
md5:4c29fe111bf4a081de91ba3e38d6224c

Additional details

Related works

Is published in
Journal article: 10.1101/2025.04.28.650973 (DOI)

Software

Repository URL
https://github.com/DavidFruehbuss/MHC-Diff
Programming language
Python
Development Status
Active

References

  • Marzella DF, Parizi FM, van Tilborg D, Renaud N, Sybrandi D, Buzatu R, Rademaker DT, 't Hoen PAC, Xue LC. PANDORA: A Fast, Anchor-Restrained Modelling Protocol for Peptide: MHC Complexes. Front Immunol. 2022 May 10;13:878762. doi: 10.3389/fimmu.2022.878762. PMID: 35619705; PMCID: PMC9127323.