MHC-Diff 100K pMHC Structure Dataset (Multi-allele)

Frühbuß, David; Baakman, Coos; Teusink, Siem; Bekkers, Erik; Jegelka, Stefanie; Xue, Li C.

doi:10.5281/zenodo.18456927

Published February 2, 2026 | Version 1.0.0

Dataset Open

1. Max-Planck Institute for Biochemistry
2. Radboud University Medical Center
3. University of Amsterdam
4. Technical University of Munich
5. Massachusetts Institute of Technology

Contributors

Researcher:

Marzella, Dario¹

1. Radboud University Medical Center

# MHC-Diff 100K Dataset: Multi-Allele pMHC Structures

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

## Overview

This dataset contains **100,742 peptide-MHC class I (pMHC-I) structures** spanning 110 diverse HLA-I alleles and peptide lengths from 8 to 13 amino acids. It is designed for training and evaluating machine learning models for cross-allele pMHC structure prediction.

| Property | Value |
|----------|-------|
| **Total structures** | 100,742 |
| **X-ray structures** | 802 (from PDB and IMGT) |
| **PANDORA structures** | 99,940 (computationally modeled) |
| **MHC alleles** | 110 diverse HLA-I alleles |
| **Peptide lengths** | 8–13 amino acids |
| **Unique G-domains** | 286 |
| **Number of clusters** | 10 |
| **Total size** | ~47 GB |

## Data Sources

- **X-ray structures**: Experimental structures from the [Protein Data Bank (PDB)](https://www.rcsb.org/) and [IMGT/3Dstructure-DB](http://www.imgt.org/3Dstructure-DB/)

- **PANDORA structures**: Computationally modeled structures from the [PANDORA database](https://github.com/X-lab-3D/PANDORA)

## Clustering Strategy

MHC alleles were clustered using **hierarchical clustering** on G-domain sequences (the peptide-binding groove) with **BLOSUM62** similarity scores. This ensures that test alleles have low sequence similarity to training alleles, enabling rigorous evaluation of cross-allele generalization.

The varying cluster sizes reflect the natural distribution of G-domain sequence families.
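As a rough illustration of the clustering idea, the sketch below runs agglomerative clustering on a toy set of pre-aligned sequences. It is not the pipeline used to build this dataset: the linkage criterion is not stated here, so average linkage is an assumption, and plain sequence identity stands in for BLOSUM62 scoring to keep the example free of dependencies. The function names `identity_distance` and `average_linkage` are hypothetical.

```python
from itertools import combinations


def identity_distance(a, b):
    """Distance = 1 - fraction of identical positions.

    Stand-in for a BLOSUM62-based score; assumes pre-aligned,
    equal-length sequences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)


def average_linkage(seqs, n_clusters, distance=identity_distance):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a sorted list of sequence indices.
    """
    clusters = [[i] for i in range(len(seqs))]
    # Pairwise distances, keyed by (smaller index, larger index).
    dist = {
        (i, j): distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def mean_distance(c1, c2):
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: mean_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]
```

To approximate the description above more closely, a BLOSUM62-derived distance could be passed through the `distance` argument; how similarity scores were converted to distances for this dataset is not documented here.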

### Cluster Composition

| Cluster | G-domains | PANDORA | X-ray | Total |
|---------|-----------|---------|-------|-------|
| 1 | 2 | 59 | 0 | 59 |
| 2 | 74 | 30,044 | 138 | 30,182 |
| 3 | 11 | 3,412 | 18 | 3,430 |
| 4 | 11 | 523 | 52 | 575 |
| 5 | 54 | 26,227 | 351 | 26,578 |
| 6 | 2 | 0 | 2 | 2 |
| 7 | 30 | 2,337 | 78 | 2,415 |
| 8 | 2 | 10,425 | 0 | 10,425 |
| 9 | 99 | 26,889 | 163 | 27,052 |
| 10 | 1 | 24 | 0 | 24 |
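The per-cluster counts can be cross-checked against the Overview figures with a few lines of arithmetic. The sketch below assumes the table columns are, in order, cluster ID, number of G-domains, PANDORA structures, X-ray structures and total; under that reading the column sums reproduce the 286 G-domains, 99,940 PANDORA structures and 802 X-ray structures quoted in the Overview. The names `CLUSTERS` and `column_totals` are illustrative.

```python
# Per-cluster counts copied from the table above:
# cluster -> (G-domains, PANDORA structures, X-ray structures)
CLUSTERS = {
    1: (2, 59, 0),
    2: (74, 30044, 138),
    3: (11, 3412, 18),
    4: (11, 523, 52),
    5: (54, 26227, 351),
    6: (2, 0, 2),
    7: (30, 2337, 78),
    8: (2, 10425, 0),
    9: (99, 26889, 163),
    10: (1, 24, 0),
}


def column_totals(clusters):
    """Sum each column over all clusters."""
    g_domains = sum(row[0] for row in clusters.values())
    pandora = sum(row[1] for row in clusters.values())
    xray = sum(row[2] for row in clusters.values())
    return g_domains, pandora, xray
```

Calling `column_totals(CLUSTERS)` and adding the PANDORA and X-ray sums gives the total structure count.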

## Files

```
mhc-diff-100k-v1.0/
├── README.md                  # This file
├── LICENSE                    # CC-BY-4.0 license
├── SHA256SUMS                 # Checksums for all files
├── samples.parquet            # Sample index (recommended)
├── samples.tsv.gz             # Sample index (alternative format)
├── split_recipes/             # Split definitions
│   ├── paper_split.json       # Train/val/test as used in the paper
│   ├── fold_cluster1.json     # Leave cluster 1 out
│   ├── ...
│   └── README.json            # Split recipe documentation
└── structures/                # HDF5 structure files
    ├── cluster_1.hdf5
    ├── cluster_2.hdf5.gz      # Gzip compressed (decompress before use)
    ├── ...
    └── cluster_10.hdf5
```

**Note:** `cluster_2.hdf5.gz` is gzip-compressed to reduce download size. Decompress before use:

```bash
gunzip structures/cluster_2.hdf5.gz
```
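For environments without `gunzip`, or to check the download against `SHA256SUMS` first, a standard-library sketch might look like the following. It assumes `SHA256SUMS` uses the usual `sha256sum` layout (one `<hex digest>  <relative path>` pair per line), which is not confirmed here; the helper names are hypothetical.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large HDF5 files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root, sums_file="SHA256SUMS"):
    """Return the listed paths that are missing or whose digest differs.

    Assumes the `sha256sum` text format: "<hex digest>  <path>".
    """
    root = Path(root)
    bad = []
    for line in (root / sums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # "*" marks binary mode in sha256sum output
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            bad.append(name)
    return bad


def gunzip_file(src, dst=None):
    """Decompress src (e.g. cluster_2.hdf5.gz); unlike gunzip, keeps the .gz."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_suffix("")  # drops the ".gz" suffix
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

If the checksums were computed on the compressed file, run `verify_checksums("mhc-diff-100k-v1.0")` before decompressing, since `gunzip` on the command line replaces the `.gz` file.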

## Paper Split (Recommended)

| Split | Clusters | Structures | X-ray |
|-------|----------|------------|-------|
| **Train** | 1, 2, 5, 8, 9, 10 | 94,320 | 652 |
| **Validation** | 7 | 2,415 | 78 |
| **Test** | 3, 4, 6 | 4,007 | 72 |
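The split sizes follow directly from the per-cluster totals in the Cluster Composition table. The sketch below recomputes them, assuming the last two columns of this table are the total and X-ray structure counts per split; `CLUSTER_COUNTS`, `PAPER_SPLIT` and `split_sizes` are illustrative names, not part of the dataset.

```python
# cluster -> (total structures, X-ray structures), from Cluster Composition
CLUSTER_COUNTS = {
    1: (59, 0),
    2: (30182, 138),
    3: (3430, 18),
    4: (575, 52),
    5: (26578, 351),
    6: (2, 2),
    7: (2415, 78),
    8: (10425, 0),
    9: (27052, 163),
    10: (24, 0),
}

# Cluster assignment of the paper split, as listed in the table above
PAPER_SPLIT = {
    "train": [1, 2, 5, 8, 9, 10],
    "validation": [7],
    "test": [3, 4, 6],
}


def split_sizes(split, counts):
    """Return {split name: (total structures, X-ray structures)}."""
    return {
        name: (
            sum(counts[c][0] for c in clusters),
            sum(counts[c][1] for c in clusters),
        )
        for name, clusters in split.items()
    }
```

`split_sizes(PAPER_SPLIT, CLUSTER_COUNTS)` can also serve as a quick sanity check after loading the sample index, by comparing against the row counts per split.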

## Data Format

### Sample Index (`samples.parquet`)

| Column | Description |
|--------|-------------|
| `sample_id` | Unique structure identifier |
| `cluster_id` | Cluster assignment (1–10) |
| `source` | `xray` or `pandora` |
| `structure_file` | HDF5 file containing the structure |
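When pandas or a Parquet engine is not available, the alternative index `samples.tsv.gz` can be read with the standard library. The sketch below assumes that file is tab-separated with a header row and the same four columns as listed above, which is not confirmed here; `load_samples_tsv` and `select` are hypothetical helpers.

```python
import csv
import gzip


def load_samples_tsv(path):
    """Read the gzip-compressed TSV index into a list of dicts.

    Assumes a header row with sample_id, cluster_id, source, structure_file.
    """
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    for row in rows:
        row["cluster_id"] = int(row["cluster_id"])
    return rows


def select(rows, clusters=None, source=None):
    """Filter rows by a set of cluster IDs and/or by source ('xray' or 'pandora')."""
    return [
        row
        for row in rows
        if (clusters is None or row["cluster_id"] in clusters)
        and (source is None or row["source"] == source)
    ]
```

For example, `select(load_samples_tsv("samples.tsv.gz"), clusters={3, 4, 6}, source="xray")` would list the experimental structures in the test clusters of the paper split.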

### HDF5 Structure Files

Each HDF5 file contains multiple structures indexed by `sample_id`:

**X-ray structures** (4-letter PDB codes):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    pdb_string = f['1AKJ'][()].decode('utf-8')  # Raw PDB format
```
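Because X-ray entries are stored as raw PDB text rather than coordinate arrays, a small parser is needed to obtain Cα coordinates comparable to those of the PANDORA entries. The sketch below relies only on the fixed-column layout of PDB `ATOM` records; `ca_coords_from_pdb` is a hypothetical helper, and which chain holds the peptide in these files is not specified here, so the chain filter is optional.

```python
def ca_coords_from_pdb(pdb_string, chain_id=None):
    """Extract Cα coordinates from PDB-format text.

    Uses the fixed-column layout of ATOM records: atom name in columns
    13-16, altLoc in 17, chain ID in 22, x/y/z in columns 31-54.
    Returns a list of (x, y, z) tuples in file order.
    """
    coords = []
    for line in pdb_string.splitlines():
        # HETATM records are skipped, so calcium ions ("CA") are not picked up.
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        if line[12:16].strip() != "CA":
            continue
        # Keep only the first alternate location, if any are present.
        if line[16] not in (" ", "A"):
            continue
        if chain_id is not None and line[21] != chain_id:
            continue
        coords.append(
            (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        )
    return coords
```

With `pdb_string` loaded as shown above, `ca_coords_from_pdb(pdb_string)` returns one coordinate triple per residue that has a Cα atom.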

**PANDORA structures** (IDs starting with `BA-`):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    entry = f['BA-100003']
    # Index 1 along the atom axis selects the Cα atom of each residue
    peptide_coords = entry['peptide']['atom14_gt_positions'][:, 1, :]  # Cα coords
    protein_coords = entry['protein']['atom14_gt_positions'][:, 1, :]  # Cα coords
    peptide_seq = entry['peptide']['aatype'][:]  # Amino acid indices (0-19)
```
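To turn the `aatype` indices back into a one-letter sequence, an index-to-residue mapping is needed. The sketch below assumes the AlphaFold/OpenFold residue ordering (`ARNDCQEGHILKMFPSTWYV`), which matches the `atom14` naming used in these files but is not stated explicitly here; check it against a structure with a known peptide before relying on it. `decode_aatype` is a hypothetical helper.

```python
# Assumed ordering (AlphaFold/OpenFold convention); not confirmed by this README.
RESTYPES = "ARNDCQEGHILKMFPSTWYV"


def decode_aatype(indices, unknown="X"):
    """Map amino acid indices (0-19) to a one-letter sequence string.

    Indices outside 0-19 are rendered as `unknown`.
    """
    return "".join(
        RESTYPES[i] if 0 <= i < len(RESTYPES) else unknown
        for i in map(int, indices)
    )
```

Applied to the array read above, `decode_aatype(peptide_seq)` gives the peptide sequence under that assumed ordering.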

## Usage

### Paper Split

```python
import json

import pandas as pd

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Load paper split
with open('split_recipes/paper_split.json') as f:
    split = json.load(f)

# Create splits
train = samples[samples['cluster_id'].isin(split['train_clusters'])]
val = samples[samples['cluster_id'].isin(split['validation_clusters'])]
test = samples[samples['cluster_id'].isin(split['test_clusters'])]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
```

### Leave-One-Cluster-Out Cross-Validation

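```python
import json

import pandas as pd

samples = pd.read_parquet('samples.parquet')

for cluster_id in range(1, 11):
    with open(f'split_recipes/fold_cluster{cluster_id}.json') as f:
        fold = json.load(f)

    train = samples[samples['cluster_id'].isin(fold['train_clusters'])]
    test = samples[samples['cluster_id'].isin(fold['test_clusters'])]
    # Train and evaluate on this fold here; train/test are overwritten each iteration
```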
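If the fold files are not at hand, equivalent fold definitions can be generated on the fly. The sketch below mirrors the `train_clusters` and `test_clusters` keys used above; whether the shipped fold files carry additional keys (for example a validation cluster) is not documented here, and `leave_one_cluster_out` is a hypothetical helper.

```python
def leave_one_cluster_out(cluster_ids):
    """Yield one fold per cluster: that cluster is held out, the rest train.

    Each fold is a dict with 'train_clusters' and 'test_clusters',
    matching the keys read from the fold JSON files above.
    """
    ids = sorted(cluster_ids)
    for held_out in ids:
        yield {
            "train_clusters": [c for c in ids if c != held_out],
            "test_clusters": [held_out],
        }
```

Note that folds differ greatly in test-set size: according to the Cluster Composition table, holding out cluster 6 leaves only 2 test structures, while cluster 2 leaves 30,182.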

## Related Datasets

The **MHC-Diff 8K Dataset** is a subset of this dataset, focusing specifically on HLA-A\*02:01 with 9-mer peptides.

- **MHC-Diff 8K Dataset**: [Zenodo DOI to be added]

## Citation

If you use this dataset, please cite:

```bibtex
@article{fruhbuss2025mhcdiff,
  title={MHC-Diff: Fast and Accurate Peptide-MHC Structure Prediction via an Equivariant Diffusion Model},
  author={Fr{\"u}hbu{\ss}, David and Baakman, Coos and Teusink, Siem and Bekkers, Erik and Jegelka, Stefanie and Xue, Li C.},
  year={2025},
  doi={10.1101/2025.04.28.650973}
}
```

## References

1. Berman, H.M., et al. "The Protein Data Bank." *Nucleic Acids Research* 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235

2. Lefranc, M.-P., et al. "IMGT/3Dstructure-DB." *Nucleic Acids Research* 33(suppl 1), D593–D597 (2005). https://doi.org/10.1093/nar/gki010

3. Marzella, D.F., et al. "PANDORA: a fast, anchor-restrained modelling protocol for peptide:MHC complexes." *Frontiers in Immunology* 13, 878762 (2022). https://doi.org/10.3389/fimmu.2022.878762

4. Marzella, D.F., Crocioni, G., et al. "Geometric deep learning improves generalizability of MHC-bound peptide predictions." *Communications Biology* 7(1), 1661 (2024). https://doi.org/10.1038/s42003-024-07292-1

## License

This dataset is released under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).

## Contact

- Li Xue: Li.Xue@radboudumc.nl

Files (27.5 GB)

| Name | Size | Checksum |
|------|------|----------|
| mhc-diff-100k-v1.0.zip | 27.5 GB | md5:4c29fe111bf4a081de91ba3e38d6224c |

Additional details

Is published in: Journal article: 10.1101/2025.04.28.650973 (DOI)

Repository URL: https://github.com/DavidFruehbuss/MHC-Diff
Programming language: Python
Development Status: Active

Marzella DF, Parizi FM, van Tilborg D, Renaud N, Sybrandi D, Buzatu R, Rademaker DT, 't Hoen PAC, Xue LC. PANDORA: A Fast, Anchor-Restrained Modelling Protocol for Peptide: MHC Complexes. Front Immunol. 2022 May 10;13:878762. doi: 10.3389/fimmu.2022.878762. PMID: 35619705; PMCID: PMC9127323.
