usiGrabber: Automating the curation of proteomics spectra data at scale, making large datasets ready for use in AI systems
Authors/Creators
- Auge, Georg (Project member) 1, 2
- Clausen, Matthis (Project member) 1, 2
- Ketterer, Konstantin (Project member) 1, 2
- Schaefer, Jacob (Project member) 1, 2
- Schmitt, Nils (Project member) 1, 2
- Altenburg, Tom (Related person) 1, 2
- Hartmaring, Yannick (Supervisor) 1, 2
- Raetz, Hendrik (Supervisor) 1, 2
- Schlaffner, Christoph N. (Supervisor) 1, 2
- Renard, Bernhard Y. (Supervisor) 1, 2
Description
usiGrabber is a scalable framework for assembling large, diverse mass-spectrometry datasets that are ready for machine learning.
As a proof of concept, we used usiGrabber to construct a phosphorylation-specific training dataset of nearly 11 million spectra and used it to retrain a binary phosphorylation classifier. This dataset and the corresponding model weights are available in this record.
The publication also includes the complete database, which contains spectrum information and metadata for over 800 million spectra present in the PRIDE database. Because of its size, it had to be split into multiple uploads.
To reconstruct the entire database, download all related records listed below, extract the archives, and follow the reassembly instructions in usiGrabber - db_export.
Related records:
- peptide_spectrum_matches table: https://zenodo.org/records/18890370
- psm_peptide_evidence table: https://zenodo.org/records/18864164
- Other, smaller tables: https://zenodo.org/records/18873214
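The extraction step can be sketched as follows. This is a minimal illustration only: it assumes the downloaded archives are in a format that Python's `shutil` can unpack (for example zip or tar.gz), which this record does not state, and the function name `extract_all` and both directory arguments are hypothetical. The reassembly itself should follow the db_export instructions.

```python
import shutil
from pathlib import Path


def extract_all(download_dir: str, target_dir: str) -> list[Path]:
    """Unpack every recognised archive in download_dir below target_dir.

    Hypothetical helper: the archive format of the records is an assumption.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Suffixes that shutil can unpack on this system, e.g. ".zip", ".tar.gz".
    known = [ext for _, exts, _ in shutil.get_unpack_formats() for ext in exts]
    extracted = []
    for archive in sorted(Path(download_dir).iterdir()):
        if not archive.is_file():
            continue
        for ext in known:
            if archive.name.endswith(ext):
                # One sub-directory per archive, named after the archive
                # with its suffix removed, so contents cannot collide.
                dest = target / archive.name[: -len(ext)]
                shutil.unpack_archive(str(archive), str(dest))
                extracted.append(dest)
                break
    return extracted
```

Calling `extract_all("downloads", "extracted")` would leave one sub-directory per archive under `extracted`; files that are not recognised archives are skipped.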
Files
usigrabber.png
(37.9 GB)
| Checksum | Size |
|---|---|
| md5:abd24c97b96feab173eecfaccf940d92 | 37.9 GB |
| md5:c89ff389489391a0d4958a42acd39851 | 934.1 kB |
| md5:e8d6be2795790e858cd78db6a267f81c | 26.9 MB |
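The MD5 checksums in the file listing can be used to confirm that a download completed without corruption. The sketch below is one way to do that with Python's standard library. The function name `md5_matches` is hypothetical, and the file is read in chunks on the assumption that the largest download will not fit comfortably in memory.

```python
import hashlib


def md5_matches(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the MD5 digest of the file at `path` equals `expected`.

    `expected` may carry the "md5:" prefix used in the file listing.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    wanted = expected.strip().lower().removeprefix("md5:")
    return digest.hexdigest() == wanted
```

For example, `md5_matches(local_path, "md5:abd24c97b96feab173eecfaccf940d92")` would check the 37.9 GB file, where `local_path` stands for wherever that file was saved.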
Additional details
Funding
Software
- Repository URL: https://github.com/usiGrabber/usiGrabber
- Programming language: Python
- Development Status: Concept