Presentation Open Access
Bittremieux, Wout; Valkenborg, Dirk; Laukens, Kris
Optimized open modification spectral library searching using approximate nearest neighbor techniques
Generally, fewer than half of all spectra can be identified in a mass spectrometry proteomics experiment. Recent research has shown that a large proportion of the spectra that remain unassigned can be identified as peptides containing unexpected post-translational modifications (PTMs).
A common approach to identify arbitrary PTMs is to perform a so-called open modification search using a very wide precursor mass window to take potentially modified peptides into account. Unfortunately, by opening up the search space in such a fashion, an extremely high number of candidate peptide-spectrum matches has to be evaluated to obtain the identification for each spectrum, resulting in an excessive computation time.
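To make the cost of widening the precursor mass window concrete, the following minimal sketch counts the library candidates that fall inside a narrow and a wide window. The toy library, the function name `candidates_in_window`, and the tolerance values (0.01 Da and 300 Da) are illustrative assumptions and are not taken from the presentation.

```python
from bisect import bisect_left, bisect_right


def candidates_in_window(sorted_masses, query_mass, tolerance_da):
    """Return the index range of library masses within +/- tolerance_da of query_mass."""
    lo = bisect_left(sorted_masses, query_mass - tolerance_da)
    hi = bisect_right(sorted_masses, query_mass + tolerance_da)
    return range(lo, hi)


# Hypothetical toy library: 2000 precursor masses (Da), sorted, spaced 0.5 Da apart.
library_masses = [1000.0 + 0.5 * i for i in range(2000)]

# Standard search: narrow window (assumed tolerance of 0.01 Da).
narrow = candidates_in_window(library_masses, 1500.0, 0.01)
# Open modification search: wide window (assumed tolerance of 300 Da).
wide = candidates_in_window(library_masses, 1500.0, 300.0)

print(len(narrow), "candidate(s) in the narrow window")
print(len(wide), "candidate(s) in the wide window")
```

In this toy setting the wide window admits more than a thousand times as many candidates as the narrow one, and every candidate would need a full spectrum comparison.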
We have implemented a spectral library search strategy that uses an approximate nearest neighbor technique to restrict the search space: when identifying a query spectrum, only the most similar candidate library spectra are considered, irrespective of their precursor mass.
First, using spectral libraries limits the search space to peptides that have been observed previously. Even with this reduced search space, an open modification search with a wide precursor mass window still requires each query spectrum to be compared to a very large number of candidate library spectra. Second, we therefore use an approximate nearest neighbor indexing technique to quickly restrict the search space to only the most similar candidate matches for each query spectrum.
Traditional indexing techniques break down for very high-dimensional data due to the curse of dimensionality, which makes them unsuitable for restricting the search space in our case. However, using computationally cheap random projections, the data space consisting of vector representations of each spectrum in the spectral library can be iteratively partitioned to form an indexing tree of subspaces, each of which contains highly similar spectrum vectors. Instead of comparing every query spectrum to a multitude of irrelevant library spectra, only the most similar library spectra are retrieved from the approximate nearest neighbor tree index in logarithmic time, after which a more advanced matching score can be computed to determine the optimal spectrum-spectrum matches. Furthermore, the non-deterministic effects of the random space partitioning can largely be mitigated by constructing a ‘forest’ of multiple index trees and by searching multiple nodes in each index tree.
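The indexing idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it splits at the median along random Gaussian directions, visits only one leaf per tree (rather than multiple nodes), ranks candidates by a plain dot product instead of a more advanced matching score, and uses random non-negative vectors in place of real spectrum vectors. All function names and parameters (`build_forest`, `query_forest`, `n_trees`, `leaf_size`) are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_tree(vectors, ids, rng, leaf_size):
    """Recursively split the given vector ids along random directions."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(vectors[0]))]
    projections = sorted(dot(vectors[i], direction) for i in ids)
    threshold = projections[len(projections) // 2]  # median projection
    left = [i for i in ids if dot(vectors[i], direction) < threshold]
    right = [i for i in ids if dot(vectors[i], direction) >= threshold]
    if not left or not right:  # degenerate split (ties): stop here
        return {"leaf": ids}
    return {"direction": direction, "threshold": threshold,
            "left": build_tree(vectors, left, rng, leaf_size),
            "right": build_tree(vectors, right, rng, leaf_size)}


def query_tree(node, q):
    """Descend to a single leaf; cost is proportional to the tree depth."""
    while "leaf" not in node:
        go_left = dot(q, node["direction"]) < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def build_forest(vectors, n_trees=10, leaf_size=8, seed=0):
    rng = random.Random(seed)
    ids = list(range(len(vectors)))
    return [build_tree(vectors, ids, rng, leaf_size) for _ in range(n_trees)]


def query_forest(forest, vectors, q, top_k=5):
    """Pool the leaf candidates of all trees, then rank them by dot product."""
    candidates = set()
    for tree in forest:
        candidates.update(query_tree(tree, q))
    ranked = sorted(candidates, key=lambda i: dot(vectors[i], q), reverse=True)
    return ranked[:top_k], len(candidates)


# Toy "library": 500 random unit vectors of dimension 50 (stand-ins for spectra).
data_rng = random.Random(1)
library = [normalize([data_rng.random() for _ in range(50)]) for _ in range(500)]
forest = build_forest(library, n_trees=10, leaf_size=8, seed=0)

# Query with a copy of library entry 42; it follows the same path in every tree,
# so entry 42 is always among the pooled candidates.
top, n_candidates = query_forest(forest, library, list(library[42]), top_k=5)
print("best match:", top[0], "| candidates scored:", n_candidates, "of", len(library))
```

Note that no precursor mass appears anywhere in the index or the query. In this sketch, increasing `n_trees` or `leaf_size` enlarges the pooled candidate set, which trades speed for a higher chance of retrieving the true nearest neighbors.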
Another crucial advantage is that retrieving candidate library spectra from the approximate nearest neighbor index does not need to take the spectra’s precursor mass into account. Because the most similar spectra are retrieved directly, irrespective of their precursor mass, the approach inherently supports open modification searching.
We will show how this approximate nearest neighbor indexing can drastically reduce the search space when identifying unknown spectra, resulting in significant speed-ups. This is especially beneficial when performing an open modification search, which enables us to identify a multitude of peptides with previously unconsidered PTMs at limited computational expense. Specifically, like existing open modification approaches, we achieve up to a 50% increase in identifications corresponding to modified peptides, but at a significantly reduced computational expense. Furthermore, we will present details on how the achieved speed-up can be tuned at the expense of some accuracy.
Spectral library indexing independent of the precursor mass harnesses the full power of open modification searching at very high speed.