usiGrabber: Automating the curation of proteomics spectra data at scale, making large datasets ready for use in AI systems

Auge, Georg; Clausen, Matthis; Ketterer, Konstantin; Schaefer, Jacob; Schmitt, Nils; Altenburg, Tom; Hartmaring, Yannick; Raetz, Hendrik; Schlaffner, Christoph N.; Renard, Bernhard Y.

doi:10.5281/zenodo.18853258

Published March 16, 2026 | Version 1.0

Dataset Open

1. Hasso Plattner Institute
2. University of Potsdam

usiGrabber is a scalable framework for assembling large, diverse mass-spectrometry datasets that are ready for use in machine learning.

As a proof of concept, we used usiGrabber to construct a phosphorylation-specific training dataset of nearly 11 million spectra and used it to retrain a binary phosphorylation classifier. This dataset and the corresponding model weights are available in this record.

This publication also includes the complete database, which contains spectrum information and metadata for more than 800 million spectra in the PRIDE database. Because of its size, the database is split across multiple Zenodo records.

To reconstruct the entire database, download all of the related records listed below, extract the archives, and follow the reassembly instructions in usiGrabber - db_export.

Related records:

peptide_spectrum_matches table: https://zenodo.org/records/18890370
psm_peptide_evidence table: https://zenodo.org/records/18864164
Other, smaller tables: https://zenodo.org/records/18873214
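
The extraction step described above can be sketched in Python. This is a minimal sketch, not the project's own tooling: it assumes the archives from all related records have been downloaded into one directory and that they are gzip-compressed tar files like dataset-export.tar.gz. The function name extract_archives and the directory arguments are hypothetical. Reassembling the database itself still follows the db_export instructions and is not covered here.

```python
import tarfile
from pathlib import Path


def extract_archives(download_dir, target_dir):
    """Extract every *.tar.gz archive found in download_dir into target_dir.

    Assumption: all record archives are gzip-compressed tar files.
    Returns the names of the archives that were extracted, in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Newer Python versions: refuse unsafe members such as
                # absolute paths or entries that escape target_dir.
                tar.extractall(target, filter="data")
            else:
                # Older Python versions without extraction filters.
                tar.extractall(target)
        extracted.append(archive.name)
    return extracted
```

Calling extract_archives("downloads", "usigrabber_db") would unpack every archive in downloads into usigrabber_db; both directory names are placeholders.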

Files (37.9 GB)

Name	MD5	Size
dataset-export.tar.gz	abd24c97b96feab173eecfaccf940d92	37.9 GB
usigrabber.png	c89ff389489391a0d4958a42acd39851	934.1 kB
usigrabber_model_weights.hdf5	e8d6be2795790e858cd78db6a267f81c	26.9 MB
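
Because the largest file is close to 38 GB, it may be worth checking the downloads against the MD5 checksums listed above before extracting anything. The following is a minimal sketch using only the Python standard library. The checksums are copied from this record's file listing; the helper names md5_of_file and verify_downloads are hypothetical and not part of usiGrabber.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "dataset-export.tar.gz": "abd24c97b96feab173eecfaccf940d92",
    "usigrabber.png": "c89ff389489391a0d4958a42acd39851",
    "usigrabber_model_weights.hdf5": "e8d6be2795790e858cd78db6a267f81c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that large archives never have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True (checksum matches),
    False (checksum differs), or None (file not found in directory)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == want
        else:
            results[name] = None
    return results
```

A file reported as False was most likely truncated or corrupted during download and should be fetched again.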

Additional details

Funding: European Commission
explainProt - Explainable Machine Learning for Identifying the Full Heterogeneity of Peptidoforms and Proteoforms (grant 101124385)

Repository URL: https://github.com/usiGrabber/usiGrabber
Programming language: Python
Development Status: Concept
