ANN-SoLo: Extremely fast and accurate open modification spectral library searching
Open modification searching (OMS) is a powerful search strategy that identifies peptides carrying any type of modification by using a very wide precursor mass window, so that a modified spectrum can match its unmodified variant. A drawback of this strategy, however, is a large increase in search time. Although existing search engines can perform an open search simply by setting a wide precursor mass window, none of these tools has been optimized for OMS, leading to excessive runtimes and suboptimal identification results. Here we present updates to the ANN-SoLo spectral library search tool, which can efficiently and accurately identify any type of modification for millions of spectra.
ANN-SoLo can efficiently handle spectral libraries containing several million spectra by making use of techniques that have been optimized for web-scale similarity searching. Approximate nearest neighbor (ANN) indexing speeds up open modification searching by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. We apply the hashing trick to convert high-resolution mass spectra into low-dimensional, space-efficient vectors while still capturing detailed fragment mass information. Additionally, we show how specialized hardware, such as graphics processing units (GPUs), can be used for ANN searching to speed up candidate selection from the spectral library during OMS.
First, high-resolution, sparse spectrum vectors are converted to much shorter, dense vectors using the hashing trick: feature hashing maps mass bins in the high-resolution vectors to positions in much shorter hashed vectors. Importantly, the hashing trick approximately preserves cosine similarity, which allows us to use the hashed vectors instead of the original vectors for ANN indexing during spectrum identification. Using the hashed vectors leads to a five-fold reduction in memory requirements compared to spectrum vectors with unit mass bins. Additionally, the hashed vectors (using 800 hash bins) capture spectral similarities more accurately than vectors with unit mass bins (Pearson correlation of 0.99 versus 0.84, respectively, with a high-resolution, peak-by-peak dot product), allowing us to retrieve suitable candidates from the spectral library more accurately.
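As a minimal sketch of the hashing step described above (not ANN-SoLo's actual implementation), the following Python code bins fragment peaks at an assumed width of 0.05 m/z and hashes each bin index to one of 800 positions. The function name `hash_spectrum`, the bin width, and the use of CRC32 as the hash function are illustrative assumptions; only the 800 hash bins come from the text.

```python
import zlib

import numpy as np


def hash_spectrum(mz, intensity, n_hash_bins=800, bin_width=0.05):
    """Convert a peak list into a short, dense, unit-length hashed vector.

    bin_width and the CRC32 hash are assumptions made for illustration.
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    hashed = np.zeros(n_hash_bins, dtype=np.float32)

    # Index of each peak in the (very long, sparse) high-resolution vector.
    fine_bins = np.floor(mz / bin_width).astype(np.int64)

    for fine_bin, peak_intensity in zip(fine_bins, intensity):
        # Deterministically map the high-resolution bin to a hashed position.
        key = int(fine_bin).to_bytes(8, "little", signed=True)
        slot = zlib.crc32(key) % n_hash_bins
        # Peaks whose bins collide are summed in the same position.
        hashed[slot] += peak_intensity

    # Normalize so that a plain dot product equals the cosine similarity.
    norm = np.linalg.norm(hashed)
    return hashed / norm if norm > 0 else hashed


def cosine_similarity(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return float(np.dot(a, b))
```

Because colliding bins simply add their intensities, two spectra that share most fragment peaks will generally still have a high dot product after hashing, though the exact values depend on the chosen hash function and bin width.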
Second, a major speed improvement comes from the use of GPUs to accelerate OMS. As GPUs excel at data-parallel tasks, ANN-SoLo was modified to perform batch query processing, identifying multiple unknown query spectra simultaneously. These improvements lead to a speed-up of up to an order of magnitude over the initial version of ANN-SoLo, which itself already massively outperformed alternative spectral library search engines in terms of speed and identification performance. Additionally, ANN-SoLo identifies more spectra than sequence database search engines optimized for OMS, such as MSFragger, owing to its more sensitive spectrum-spectrum matching using high-quality and comprehensive spectral libraries.
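The batch processing idea can be illustrated with a small sketch, assuming NumPy on the CPU as a stand-in for GPU execution and an exact brute-force comparison in place of a true ANN index. The function name `top_k_candidates` and the parameter `k` are hypothetical; the point is only that a whole batch of query vectors is scored against the library in one matrix product rather than one query at a time.

```python
import numpy as np


def top_k_candidates(queries, library, k=5):
    """Return the k most similar library vectors for every query in a batch.

    queries: array of shape (n_queries, n_dims), one hashed vector per row.
    library: array of shape (n_library, n_dims), one hashed vector per row.
    All rows are assumed to be non-zero. This is an exact search used for
    illustration, not an approximate nearest neighbor index.
    """
    queries = np.asarray(queries, dtype=float)
    library = np.asarray(library, dtype=float)

    # Normalize rows so the matrix product yields cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)

    # One matrix product scores the entire batch against the library.
    similarities = q @ lib.T

    # Pick the k best library entries per query, then sort them by score.
    k = min(k, lib.shape[0])
    top = np.argpartition(-similarities, k - 1, axis=1)[:, :k]
    top_scores = np.take_along_axis(similarities, top, axis=1)
    order = np.argsort(-top_scores, axis=1)
    top = np.take_along_axis(top, order, axis=1)
    top_scores = np.take_along_axis(top_scores, order, axis=1)
    return top, top_scores
```

Only the returned candidates would then need to be compared to each query in full detail, which is what keeps the number of expensive spectrum-spectrum comparisons small.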
The new version of ANN-SoLo can accurately identify several thousand spectra per minute. This computational performance allows us to investigate post-translational modifications (PTMs) in large datasets. We will present results on PTM frequency in the human draft proteome.
Hashing trick to vectorize high-resolution spectra, GPU-assisted OMS, investigation of PTMs across several million spectra in large datasets.