Published December 13, 2025 | Version v1
Dataset Open

Code and Datasets for MassNet: Database Search Workflows, Retention Time Prediction, and PSM Rescoring

Creators

  • 1. Westlake University

Description

This repository compiles the core resources used to construct the MassNet dataset, including:

1) FASTA sequence files for each species, used for database searching;

2) Standardized database search workflows based on the FragPipe and Sage search engines, for unified processing of raw DDA-MS data and high-confidence peptide identification.
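
As a quick sanity check before running a database search, the FASTA files can be inspected with a few lines of Python. The following is a minimal sketch that assumes plain-text FASTA input (a `>` header line followed by one or more sequence lines); the function name `read_fasta` and the file name in the usage note are illustrative and are not part of the repository.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Assumes standard FASTA layout: each record starts with a '>' header
    line, followed by one or more lines of sequence.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

For example, `with open("species.fasta") as fh: n = sum(1 for _ in read_fasta(fh))` would count the protein entries in a file, where `species.fasta` is a hypothetical file name.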

Additionally, the repository provides the following data resources and supporting tools for downstream AI tasks:

1) Retention time (RT) prediction task: training and validation datasets constructed from FragPipe and Sage results, along with corresponding RT prediction model outputs;

2) Peptide-spectrum match (PSM) rescoring task: PSM datasets for training, along with the corresponding evaluation results;

3) Dataset construction tools: complete code and documentation for generating the above task-specific datasets.

For detailed model training procedures and usage instructions, please refer to the following official repositories:
DeepLC: https://github.com/CompOmics/DeepLC
DDA-BERT: https://github.com/guomics-lab/DDA-BERT

All resources provided in this repository enable full reproduction of the core experimental and analytical results reported in the manuscript "MassNet: billion-scale AI-ready mass spectrometry corpus enabling scalable deep Learning in proteomics".

Files

MassNet_PSM_rescoring.zip

Files (7.0 GB)

Checksum (MD5), Size
md5:64d398c401a11c3fd52cd4711ac56bc5, 136.2 MB
md5:33d2565aaec497a0c7c83b2d4a6d4ee6, 6.7 GB
md5:87fe690944853708b1f0b5ea270d23ef, 5.9 MB
md5:bb0f2aa26355106e645af12825c22b5d, 10.7 kB
md5:33a7a001f58883ef3d97db0e4d016609, 114.0 MB
md5:b9c02eb1c7054023153cd638f4f57e42, 5.6 kB
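
The MD5 checksums listed above can be used to confirm that downloaded files arrived intact. The following is a minimal sketch using Python's standard `hashlib` module. Because the listing does not pair checksums with file names, the check only tests whether a file's digest appears among the published values; the helper names `md5_of_file` and `check_downloads` are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "64d398c401a11c3fd52cd4711ac56bc5",
    "33d2565aaec497a0c7c83b2d4a6d4ee6",
    "87fe690944853708b1f0b5ea270d23ef",
    "bb0f2aa26355106e645af12825c22b5d",
    "33a7a001f58883ef3d97db0e4d016609",
    "b9c02eb1c7054023153cd638f4f57e42",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory):
    """Map each regular file in `directory` to True if its MD5 digest
    matches one of the published checksums, else False."""
    return {
        p.name: md5_of_file(p) in EXPECTED_MD5
        for p in sorted(Path(directory).iterdir())
        if p.is_file()
    }
```

Calling `check_downloads("downloads")` on a hypothetical `downloads` folder returns a dictionary of file names and pass/fail flags; any file flagged `False` is either incomplete or not part of this record.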