Deep neural network embedding for efficient repository-scale analysis of hundreds of millions of mass spectra

DOI: 10.5281/zenodo.3831052
URL: https://zenodo.org/records/3831052
OAI identifier: oai:zenodo.org:3831052
Authors:
Wout Bittremieux (ORCID 0000-0002-3105-1359), University of California San Diego
Damon H. May (ORCID 0000-0001-6902-3153), University of Washington
Jeffrey Bilmes, University of Washington
William Stafford Noble (ORCID 0000-0001-7283-4715), University of Washington
Publisher: Zenodo (2020)
Dates: 2020-06-08, 2020-05-18
Resource type: Presentation
Related identifier: 10.5281/zenodo.3831051
License: Creative Commons Attribution 4.0 International

Introduction
Despite an explosion of publicly available data in mass spectrometry proteomics repositories, peptide mass spectra are typically still analyzed by each laboratory in isolation, treating each experiment as if it has no relationship to any others. This approach fails to exploit the wealth of existing, previously analyzed mass spectrometry data. Here, we describe a deep neural network approach, called “GLEAMS”, which learns to embed spectra across an entire data repository into a low-dimensional space such that spectra generated by the same peptide are close to one another. This learned embedding captures latent properties of the spectra, and the low-dimensional space can be used for the efficient clustering and identification of hundreds of millions of spectra.

Methods
We have trained the GLEAMS deep neural network using peptide-spectrum assignments to embed spectra in a low-dimensional space. The neural network takes as input three feature types (precursor attributes, binned fragment intensities, and similarities to a set of reference spectra selected via submodular optimization) and consists of a combination of multiple convolutional and fully connected layers.
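To make the "binned fragment intensities" input more concrete, the following is a minimal sketch of how a spectrum's peaks might be turned into a fixed-length vector. The function name, the m/z range, the bin width, and the unit-norm scaling are all illustrative assumptions; the abstract does not state the values or the normalization used by GLEAMS.

```python
import math
from typing import List, Sequence


def bin_fragment_intensities(
    mz: Sequence[float],
    intensity: Sequence[float],
    min_mz: float = 50.0,    # assumed lower m/z bound (not from the source)
    max_mz: float = 2500.0,  # assumed upper m/z bound (not from the source)
    bin_width: float = 1.0,  # assumed bin width (not from the source)
) -> List[float]:
    """Sum peak intensities into fixed-width m/z bins and scale to unit norm."""
    n_bins = math.ceil((max_mz - min_mz) / bin_width)
    binned = [0.0] * n_bins
    for peak_mz, peak_intensity in zip(mz, intensity):
        # Peaks outside the chosen m/z window are ignored.
        if min_mz <= peak_mz < max_mz:
            index = min(int((peak_mz - min_mz) / bin_width), n_bins - 1)
            binned[index] += peak_intensity
    # Scale to unit Euclidean norm so spectra of different total
    # intensity become comparable (an assumed normalization).
    norm = math.sqrt(sum(value * value for value in binned))
    if norm > 0.0:
        binned = [value / norm for value in binned]
    return binned


vector = bin_fragment_intensities(
    mz=[100.2, 100.7, 200.5, 3000.0],
    intensity=[3.0, 1.0, 3.0, 9.0],
    min_mz=100.0,
    max_mz=300.0,
    bin_width=1.0,
)
```

Every spectrum maps to a vector of the same length regardless of how many peaks it contains, which is what allows such a representation to be fed to convolutional layers.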
To train the embedder, a Siamese network containing two instances of the embedder with tied weights is optimized with a contrastive loss function. This pulls positive training pairs (spectra corresponding to the same peptide) together and pushes negative training pairs (spectra corresponding to different peptides) away from each other.

Preliminary data
We have used GLEAMS to process 31 TB of human HCD proteomics data belonging to the MassIVE Knowledge Base dataset, corresponding to 666 million spectra derived from 220 publicly available experiments. After training the Siamese neural network, we observe that spectra generated by the same peptide lie close to each other in the embedded space. Additionally, the learned embeddings capture latent properties of the spectra, such as precursor mass and charge, and protein modifications correspond to translations in the latent space.

Next, we investigate the “dark matter” of the human proteome using our large-scale and heterogeneous public dataset. First, we use DBSCAN density-based clustering to group repeatedly observed embeddings corresponding to similar spectra. By propagating peptide labels within high-quality clusters containing spectra that correspond to a single peptide, we achieve an 8% increase in identification rate. Second, clusters that contain only unidentified spectra are processed using the ANN-SoLo open modification spectral library search engine to identify modified peptides that are frequently observed but consistently remain unidentified. This yields an additional 22% increase in identified spectra. Combined, the two steps achieve a 30% increase in identifications relative to the MassIVE-KB standard database search results at repository scale, providing valuable new insights into previously unlabeled data.
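The contrastive objective used to train the Siamese network can be illustrated with a short sketch. It uses the commonly used margin-based form of the contrastive loss on the Euclidean distance between two embeddings; the margin value and the exact form of the loss are assumptions, since the abstract does not specify them, and the function names are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(
    embedding_a: Sequence[float],
    embedding_b: Sequence[float],
    same_peptide: bool,
    margin: float = 1.0,  # assumed margin (not from the source)
) -> float:
    """Margin-based contrastive loss for a single pair of embeddings.

    Positive pairs (same peptide) are penalized by their squared distance,
    which pulls them together. Negative pairs (different peptides) are
    penalized only while they are closer than the margin, which pushes
    them apart until they are at least `margin` away.
    """
    distance = euclidean_distance(embedding_a, embedding_b)
    if same_peptide:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


positive_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_peptide=True)
negative_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_peptide=False)
negative_near = contrastive_loss(
    [0.0, 0.0], [0.6, 0.8], same_peptide=False, margin=2.0
)
```

In a Siamese setup both embeddings would come from the same embedder with shared weights, so minimizing this loss over many pairs shapes a single embedding space.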
In conclusion, the GLEAMS neural network is a powerful, scalable method that enables us to efficiently process hundreds of millions of MS/MS spectra and explore the dark human proteome at an unprecedented depth and scale.

Novel aspect
Repository-scale deep learning of hundreds of millions of spectra. Clustering and identifying the spectrum embeddings to investigate the dark proteome.
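The cluster-based label propagation described in the preliminary data can be sketched as follows. The sketch assumes cluster assignments are already available (for example from DBSCAN, with -1 marking noise points, as is conventional) and treats a cluster as "high quality" when all of its identified spectra share a single peptide. That criterion, the function name, and the example peptide sequences are illustrative assumptions, not the authors' exact procedure.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Set

NOISE = -1  # conventional label for points that belong to no cluster


def propagate_labels(
    cluster_ids: Sequence[int],
    peptides: Sequence[Optional[str]],
) -> List[Optional[str]]:
    """Assign a cluster's peptide to its unidentified spectra.

    Labels are propagated only within clusters whose identified spectra
    all correspond to one peptide. Clusters with conflicting labels, with
    no labels at all, and noise points are left unchanged.
    """
    peptides_per_cluster: Dict[int, Set[str]] = defaultdict(set)
    for cluster_id, peptide in zip(cluster_ids, peptides):
        if cluster_id != NOISE and peptide is not None:
            peptides_per_cluster[cluster_id].add(peptide)

    propagated: List[Optional[str]] = []
    for cluster_id, peptide in zip(cluster_ids, peptides):
        cluster_peptides = peptides_per_cluster.get(cluster_id, set())
        if peptide is None and len(cluster_peptides) == 1:
            propagated.append(next(iter(cluster_peptides)))
        else:
            propagated.append(peptide)
    return propagated


cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, -1]
peptides = ["PEPTIDEK", None, "PEPTIDEK", "AAAK", "CCCK", None, None, None, None]
result = propagate_labels(cluster_ids, peptides)
```

In this toy input, only the unidentified spectrum in cluster 0 receives a label. Cluster 2 contains only unidentified spectra, which is the kind of cluster the abstract routes to an open modification search instead.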