Identifying millions of protein modifications using GPU-powered approximate nearest neighbor searching

doi:10.5281/zenodo.3831055

Published March 26, 2020 | Version v1

Presentation Open

Bittremieux, Wout¹

1. University of California San Diego

Session description

Learn how scientists are using GPUs to investigate the protein modification landscape at an unprecedented scale and depth.
Mass spectrometry is a high-throughput technique to measure proteins in complex biological samples. Unfortunately, a significant portion of the thousands of mass spectra generated during a proteomics experiment cannot be confidently identified. In many cases, unidentified spectra correspond to modified proteins, because only a limited number of modifications can be considered during spectrum identification without causing a search space explosion.
During this session you will learn how proteins carrying any type of modification can instead be identified efficiently using GPU-powered approximate nearest neighbor searching. The ANN-SoLo spectral library search engine makes it possible to identify millions of modified proteins while running orders of magnitude faster than alternative search tools.

Extended abstract & results

I will demonstrate how we use the ANN-SoLo open modification spectral library search engine to efficiently and accurately identify tandem mass spectra corresponding to modified proteins. ANN-SoLo includes several algorithmic and computational advances.
First, complex mass spectrometry data is vectorized. To avoid the curse of dimensionality, feature hashing is used to convert the high-dimensional, sparse vectors that represent high-resolution mass spectra into low-dimensional vectors that are amenable to nearest neighbor searching. Second, GPU-powered approximate nearest neighbor searching, provided by the Faiss library, is used to efficiently find the relevant candidate spectra in the spectral library that need to be matched against each query spectrum. Third, ANN-SoLo uses a multi-step search procedure to progressively expand the search space and an optimized similarity score to accurately identify both modified and unmodified proteins.
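The vectorization step can be pictured with a minimal sketch. It is not taken from the ANN-SoLo code base: the bin width, the output dimension, the use of MD5 as the hash function, and the names `hash_spectrum` and `cosine` are all assumptions chosen for illustration.

```python
import hashlib
import math


def hash_spectrum(mz, intensity, bin_width=0.05, dim=400):
    """Fold a finely binned, sparse spectrum into a dense vector of length dim.

    bin_width and dim are illustrative assumptions, not ANN-SoLo defaults.
    """
    vec = [0.0] * dim
    for m, i in zip(mz, intensity):
        # Index of the peak in the (conceptual) high-dimensional sparse vector.
        bin_index = int(m / bin_width)
        # A stable hash maps that index to one of dim slots (MD5 is a stand-in).
        digest = hashlib.md5(str(bin_index).encode()).digest()
        slot = int.from_bytes(digest[:8], "little") % dim
        vec[slot] += i
    # Scale to unit length so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine(u, v):
    """Dot product of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two hypothetical spectra with the same peaks but different intensities.
library_vec = hash_spectrum([175.119, 304.161, 433.204], [0.4, 1.0, 0.7])
query_vec = hash_spectrum([175.119, 304.161, 433.204], [0.5, 1.0, 0.6])
print(len(query_vec), cosine(library_vec, query_vec))
```

Peaks that fall into the same fine bin always land in the same slot, so similar spectra keep a high similarity after hashing, while occasional collisions between unrelated bins are the price paid for the much smaller vector.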

ANN-SoLo achieves state-of-the-art speed and identification performance, identifying a record number of peptides while running orders of magnitude faster than alternative search engines. On a benchmark dataset of 1.1 million spectra in 22 individual mass spectrometry data files, ANN-SoLo, using a single NVIDIA GeForce RTX 2080 GPU, significantly outperformed SpectraST, a popular spectral library search tool. The average runtime per file was 22 minutes for ANN-SoLo versus 2049 minutes for SpectraST, and across all files ANN-SoLo identified 782,003 spectra versus 619,237 for SpectraST.
This computational efficiency makes it possible to perform untargeted protein modification profiling via open searches at an unprecedented scale and depth. We have used ANN-SoLo to process a large dataset covering the human proteome, consisting of 25 million mass spectra, and identified over 14 million of those spectra, including 4.3 million modified spectra. Processing such a large volume of data with traditional search tools would be infeasible, whereas ANN-SoLo, thanks to its GPU-powered searching, needs only a matter of minutes per file. This makes it possible to investigate the protein modification landscape in great detail for the first time and to tackle important biological questions.
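To put the benchmark figures in proportion, the short calculation below derives the speedup and the relative gain in identifications from the numbers reported above. Only the reported values are used; the variable names are illustrative.

```python
# Figures as reported for the benchmark.
runtime_min = {"ANN-SoLo": 22, "SpectraST": 2049}          # average minutes per file
identified = {"ANN-SoLo": 782_003, "SpectraST": 619_237}   # spectra, all files

speedup = runtime_min["SpectraST"] / runtime_min["ANN-SoLo"]
gain = identified["ANN-SoLo"] / identified["SpectraST"] - 1

print(f"speedup: {speedup:.0f}x, additional identifications: {gain:.0%}")
```

This works out to roughly a 93-fold speedup per file and about 26% more identified spectra.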
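The multi-step procedure that ends in an open search can be sketched as follows. This is an illustration under stated assumptions, not the ANN-SoLo implementation: the two tolerance values, the fixed score threshold used to accept a match, and the names `LibraryEntry`, `candidates`, and `cascade_search` are all hypothetical, and the abstract does not specify how matches are accepted.

```python
from dataclasses import dataclass


@dataclass
class LibraryEntry:
    peptide: str
    precursor_mz: float
    charge: int


def candidates(query_mz, charge, library, tol, unit):
    """Library entries whose precursor lies within the given tolerance."""
    hits = []
    for entry in library:
        if entry.charge != charge:
            continue
        delta_mz = abs(query_mz - entry.precursor_mz)
        if unit == "ppm":
            within = delta_mz / entry.precursor_mz * 1e6 <= tol
        else:  # "Da": tolerance on the mass difference, not on m/z
            within = delta_mz * charge <= tol
        if within:
            hits.append(entry)
    return hits


def cascade_search(query_mz, charge, library, score, min_score=0.5):
    """Try a narrow precursor window first, then a wide (open) one.

    The tolerances and min_score are placeholder assumptions.
    """
    levels = [("standard", 20.0, "ppm"), ("open", 300.0, "Da")]
    for name, tol, unit in levels:
        hits = candidates(query_mz, charge, library, tol, unit)
        if not hits:
            continue
        best = max(hits, key=score)
        if score(best) >= min_score:
            # In an open search this difference hints at the modification mass.
            mass_diff = (query_mz - best.precursor_mz) * charge
            return name, best.peptide, mass_diff
    return None


# Hypothetical library and a query shifted by about +79.966 Da at charge 2.
library = [LibraryEntry("PEPTIDEK", 500.0, 2), LibraryEntry("SAMPLEK", 750.0, 2)]
toy_score = lambda entry: 0.8 if entry.peptide == "PEPTIDEK" else 0.1
result = cascade_search(539.983, 2, library, toy_score)
print(result)
```

In this toy run the narrow window finds no candidate, so the query is only matched in the open step, and the precursor mass difference is what points to the modification.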

The ANN-SoLo spectral library search engine is implemented in Python and is freely available as open source at https://github.com/bittremieux/ANN-SoLo.

What NVIDIA hardware, software, platform, or solution are you using?

ANN-SoLo performs GPU-powered nearest neighbor searching using the Faiss library. ANN-SoLo was benchmarked using a single NVIDIA GeForce RTX 2080 GPU.
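The idea behind approximate nearest neighbor searching can be illustrated without a GPU. The sketch below is plain Python and does not use Faiss; it shows an inverted-file style index, one common approach to approximate search, and whether ANN-SoLo uses this particular index type is an assumption. The class `InvertedFileIndex` and its `nprobe` parameter are illustrative names.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


class InvertedFileIndex:
    """Partition vectors by nearest centroid; search only a few partitions."""

    def __init__(self, centroids):
        self.centroids = [normalize(c) for c in centroids]
        self.lists = [[] for _ in centroids]

    def add(self, ident, vector):
        v = normalize(vector)
        cell = max(range(len(self.centroids)),
                   key=lambda c: dot(self.centroids[c], v))
        self.lists[cell].append((ident, v))

    def search(self, query, k=1, nprobe=1):
        q = normalize(query)
        # Visit only the nprobe partitions whose centroids best match the query.
        cells = sorted(range(len(self.centroids)),
                       key=lambda c: dot(self.centroids[c], q),
                       reverse=True)[:nprobe]
        scored = [(dot(v, q), ident) for c in cells for ident, v in self.lists[c]]
        return sorted(scored, reverse=True)[:k]


# Toy two-dimensional "spectra"; real vectors would have hundreds of dimensions.
index = InvertedFileIndex(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("a", [0.9, 0.1])
index.add("b", [0.2, 0.8])
index.add("c", [0.6, 0.4])

fast = index.search([0.45, 0.55], k=1, nprobe=1)
thorough = index.search([0.45, 0.55], k=1, nprobe=2)
print(fast[0][1], thorough[0][1])
```

Probing a single partition is fastest but can miss the true nearest neighbor, as happens here for the toy query; probing more partitions trades speed for accuracy.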

Presenter biography

Dr. Wout Bittremieux is a postdoctoral researcher at the University of California San Diego. His research deals with the application and development of advanced bioinformatics and data science techniques to analyze large-scale mass spectrometry proteomics and metabolomics datasets. Besides his main research focus on using deep learning to analyze mass spectrometry data, he has contributed to solutions covering a wide variety of bioinformatics problems. Dr. Bittremieux’s research has been published in leading scientific journals in the field, demonstrating how machine learning can be used to discover patterns in biological data and derive novel insights.

Files

Files (16.7 MB)

Name	Size
2020-03-26 - GTC 2020 - Identifying millions of protein modifications using GPU-powered approximate nearest neighbor searching.pptx (md5:75b506ad0ad440b2772e147e5c57c798)	16.7 MB

