Published March 16, 2026 | Version 1.0
Dataset Open

usiGrabber: Automating the curation of proteomics spectra data at scale, making large datasets ready for use in AI systems

Description

usiGrabber is a scalable framework for assembling large, diverse mass-spectrometry datasets that are ready for machine learning.

As a proof of concept, we used usiGrabber to construct a phosphorylation-specific training dataset of nearly 11 million spectra and used it to retrain a binary phosphorylation classifier. This dataset and the corresponding model weights are available in this record.

This publication also includes the complete database, which contains spectrum information and metadata for more than 800 million spectra from the PRIDE database. Because of its size, the database had to be split across multiple records.

To reconstruct the entire database, download all related records, extract the archives, and follow the reassembly instructions in usiGrabber - db_export.
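The extraction step can be sketched in Python as follows. This is only an illustration, assuming each downloaded file is a self-contained zip or tar archive that the standard library recognizes by its extension; the function name `extract_all` and both directory arguments are hypothetical, and the actual reassembly procedure is the one described in usiGrabber - db_export.

```python
import shutil
from pathlib import Path


def extract_all(download_dir, target_dir):
    """Unpack every archive in download_dir that shutil recognizes into target_dir.

    Returns the names of the files that were unpacked, in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).iterdir()):
        if not archive.is_file():
            continue
        try:
            # The format is inferred from the file extension (.zip, .tar.gz, ...).
            shutil.unpack_archive(str(archive), str(target))
        except shutil.ReadError:
            # Not a recognized or readable archive; leave the file untouched.
            continue
        extracted.append(archive.name)
    return extracted
```

Calling `extract_all("downloads", "usigrabber_db")` would unpack every recognized archive from a hypothetical `downloads` folder into `usigrabber_db` and return the list of archives it handled, which makes it easy to spot a record that was missed.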

Related records:

Files

usigrabber.png

Files (37.9 GB)

Checksum  Size
md5:abd24c97b96feab173eecfaccf940d92  37.9 GB
md5:c89ff389489391a0d4958a42acd39851  934.1 kB
md5:e8d6be2795790e858cd78db6a267f81c  26.9 MB
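
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_checksum` are illustrative and not part of usiGrabber. Files are read in chunks so that the 37.9 GB file does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_checksum(local_path, "md5:c89ff389489391a0d4958a42acd39851")` should return `True` when `local_path` points to an intact copy of the 934.1 kB file from this record.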

Additional details

Funding

European Commission
explainProt - Explainable Machine Learning for Identifying the Full Heterogeneity of Peptidoforms and Proteoforms (grant 101124385)

Software

Repository URL
https://github.com/usiGrabber/usiGrabber
Programming language
Python
Development Status
Concept