Published June 30, 2023 | Version v1
Presentation Open

Leveraging public mass spectrometry for molecular discovery from millions of mass spectra

  • 1. University of Antwerp


Despite an explosion of publicly available data in mass spectrometry repositories, mass spectra are typically still analyzed by each laboratory in isolation, treating each experiment as if it has no relationship to any others. This approach fails to exploit the wealth of existing, previously analyzed mass spectrometry data. Here I will present two approaches that harness repository-scale public mass spectrometry data to achieve a deeper understanding of the molecular composition of complex biological samples.

First, I will describe a deep neural network approach, called "GLEAMS," which learns to embed spectra across an entire data repository into a low-dimensional space such that spectra generated by the same peptide are close to one another. This learned embedding captures latent properties of the spectra, and the low-dimensional space can be used for the efficient clustering and identification of hundreds of millions of spectra. We have used GLEAMS to process 31 TB of human proteomics data belonging to the MassIVE Knowledge Base dataset, corresponding to 666 million spectra derived from 220 publicly available experiments. Using GLEAMS, we were able to investigate the "dark matter" of the human proteome by propagating peptide labels within high-quality clusters and by open modification searching, annotating 73% more mass spectra than the state of the art.
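The label propagation idea can be illustrated with a toy sketch. This is not the GLEAMS implementation: the embeddings are invented 2-D points, and the function names, the single-linkage clustering, the distance threshold, and the rule "propagate only when all labeled members of a cluster agree" are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cluster_embeddings(embeddings, max_dist):
    """Single-linkage clustering: link any two points within max_dist (hypothetical helper)."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is O(n^2); fine for a toy example only.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if math.dist(embeddings[i], embeddings[j]) <= max_dist:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(embeddings))]


def propagate_labels(cluster_ids, labels):
    """Give unlabeled spectra the peptide label of their cluster when that label is unambiguous."""
    labels_per_cluster = defaultdict(set)
    for cid, label in zip(cluster_ids, labels):
        if label is not None:
            labels_per_cluster[cid].add(label)

    propagated = []
    for cid, label in zip(cluster_ids, labels):
        if label is None and len(labels_per_cluster[cid]) == 1:
            label = next(iter(labels_per_cluster[cid]))
        propagated.append(label)
    return propagated


# Invented embeddings: two tight groups and one isolated spectrum.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
labels = ["PEPTIDEK", None, None, None, None]

clusters = cluster_embeddings(embeddings, max_dist=0.5)
propagated = propagate_labels(clusters, labels)
print(propagated)  # ['PEPTIDEK', 'PEPTIDEK', None, None, None]
```

Only the spectrum that shares a cluster with an identified spectrum receives a label; the fully unlabeled cluster and the isolated spectrum stay unannotated. The all-pairs distance loop would not scale to hundreds of millions of spectra, so a repository-scale analysis would need a more efficient neighbor search.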

Second, I will present the "nearest neighbor suspect spectral library" that can be used to identify molecules that are structurally related to previously known reference molecules. Based on spectral similarity, information can be propagated to neighboring mass spectra in a molecular network to increase the spectrum annotation rate. We have propagated annotations from molecular networks associated with 521 million mass spectra from 1,335 compatible untargeted metabolomics datasets in various metabolomics data repositories, including GNPS/MassIVE, MetaboLights, and Metabolomics Workbench, to create the GNPS nearest neighbor suspect spectral library. It consists of 87,916 novel reference spectra corresponding to modified molecules that are structurally related to known reference molecules. Using the suspect library for spectral library searching increases the spectrum annotation rate 2-fold on average, considerably boosting the interpretation rate of untargeted metabolomics beyond the state of the art.
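As a rough illustration of propagating an annotation along one edge of a molecular network, the sketch below scores two spectra with a plain cosine similarity and, if they are similar enough, annotates the unknown spectrum as a "suspect" related to the reference, recording the precursor m/z difference. The spectra, the compound name, the 0.8 score threshold, the 0.01 m/z tolerance, and the function names are all hypothetical, and the plain cosine score is a simplification chosen for brevity, not necessarily the scoring used to build the GNPS suspect library.

```python
import math


def cosine_similarity(spec_a, spec_b, tol=0.01):
    """Greedy cosine score between two peak lists [(mz, intensity), ...]."""
    # Candidate peak pairs whose m/z values agree within the tolerance.
    pairs = [
        (int_a * int_b, i, j)
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= tol
    ]
    used_a, used_b, dot = set(), set(), 0.0
    # Take the highest-scoring pairs first; each peak is matched at most once.
    for product, i, j in sorted(pairs, reverse=True):
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm_a = math.sqrt(sum(inten * inten for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten * inten for _, inten in spec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suspect_annotation(query, reference, min_score=0.8):
    """Return a suspect annotation for `query`, or None if the spectra are too dissimilar."""
    score = cosine_similarity(query["peaks"], reference["peaks"])
    if score < min_score:
        return None
    delta = query["precursor_mz"] - reference["precursor_mz"]
    return {"related_to": reference["name"], "delta_mz": round(delta, 4), "score": score}


# Invented toy spectra, not real measurements.
reference = {"name": "compound X", "precursor_mz": 201.0,
             "peaks": [(100.0, 10.0), (150.0, 5.0)]}
unknown = {"precursor_mz": 216.9949,
           "peaks": [(100.0, 10.0), (150.0, 5.0)]}
unrelated = {"precursor_mz": 350.0, "peaks": [(300.0, 1.0)]}

suspect = suspect_annotation(unknown, reference)
no_match = suspect_annotation(unrelated, reference)
print(suspect)
print(no_match)
```

The recorded mass difference is what makes the annotation a suspect rather than an identification: it states that the unknown resembles the reference while differing by some modification of that mass, without asserting which modification it is.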

In conclusion, these two powerful computational approaches highlight the enormous value of open data and provide significant breakthroughs in deriving biological insights from mass spectrometry proteomics and metabolomics experiments.

