Presentation Open Access
Bittremieux, Wout; Laukens, Kris
Precursor-free and fast spectral library search using approximate nearest neighbor techniques
In mass-spectrometry proteomics experiments, only a minority of spectra can be confidently identified, and many of the unidentified spectra are due to unconsidered modifications. Here we present a spectral library search engine that uses an approximate nearest neighbor scheme to perform precursor-free searches, obtaining mass-tolerant spectrum identifications while significantly reducing computation time.
Generally, fewer than half of all spectra are identified in a mass-spectrometry proteomics experiment. However, recent research has shown that a mass-tolerant database search can identify a large proportion of the unassigned spectra as modified peptides [Chick2015]. Unfortunately, opening up the search space in this way means that an extremely large number of candidates has to be checked to determine a peptide-spectrum match (PSM) for each spectrum, resulting in excessive computation time.
Alternatively, spectral library search engines can be used instead of sequence database search engines to identify spectra. Because spectral libraries use previously observed spectra to determine the PSMs, their advantages are a reduced search space and very effective similarity matching. Here we apply the idea of mass-tolerant peptide-spectrum matching to a spectral library, using an approximate nearest neighbor technique to quickly and effectively narrow the enlarged search space.
Although spectral libraries by definition have a smaller search space than sequence databases, a mass-tolerant search still has to check tens to hundreds of thousands of candidate matches, up to almost the entire spectral library, as indicated in Figure 1. However, because all library spectra are known beforehand, we can leverage this limited search space to retrieve only the most relevant candidates.
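To illustrate how the candidate count grows with the precursor mass window, the following minimal sketch counts the library spectra that fall inside a narrow and a wide window. The function name `count_candidates`, the window widths, and the uniformly random toy library are assumptions made for illustration only; they are not taken from the presented search engine or from Figure 1.

```python
from bisect import bisect_left, bisect_right
import random


def count_candidates(sorted_masses, query_mass, tolerance_da):
    """Count library spectra whose precursor mass lies within
    query_mass +/- tolerance_da. `sorted_masses` must be sorted ascending."""
    lo = bisect_left(sorted_masses, query_mass - tolerance_da)
    hi = bisect_right(sorted_masses, query_mass + tolerance_da)
    return hi - lo


# Hypothetical toy library: 100,000 precursor masses drawn uniformly.
# Real libraries are not uniformly distributed; this only shows the trend.
rng = random.Random(0)
library_masses = sorted(rng.uniform(500.0, 5000.0) for _ in range(100_000))

narrow = count_candidates(library_masses, 1500.0, 0.02)   # standard search
wide = count_candidates(library_masses, 1500.0, 500.0)    # mass-tolerant search
print(f"narrow window: {narrow} candidates, wide window: {wide} candidates")
```

In this toy setting the wide window returns several orders of magnitude more candidates than the narrow one, and each of them would need a full spectrum comparison.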
Spectral library search engines mostly use the cosine similarity to determine valid matches, so each spectrum can be considered a vector in a (very) high-dimensional space. Normally, the similarity between a query spectrum and every library spectrum within the precursor mass window has to be computed. However, by using approximate nearest neighbor techniques in this vector space, the number of candidate matches to be considered can be drastically reduced. Approximate nearest neighbor techniques based on the locality-sensitive hashing principle partition the data into 'buckets' of very similar vectors. This is done by iteratively hashing vectors to buckets based on their position relative to random split vectors. In this way the data space is subdivided until only a few, very similar, vectors remain in each bucket. Then, for each query spectrum, only the bucket(s) containing the most similar library spectra have to be retrieved to determine the best PSM, instead of examining the whole data space.
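The bucketing idea can be sketched with random-hyperplane hashing, one simple locality-sensitive hashing variant for cosine similarity. This is a minimal illustration, not the presented implementation: the names `spectrum_to_vector` and `HyperplaneLSH`, the 1 m/z bin width, the m/z range, and the number of hash bits are all assumptions, and the actual engine may split the data differently (for example, hierarchically).

```python
import math
import random
from collections import defaultdict


def spectrum_to_vector(mz_values, intensities,
                       min_mz=100.0, max_mz=1500.0, bin_width=1.0):
    """Bin peaks into a fixed-length vector and scale it to unit length,
    so that a dot product between two vectors is their cosine similarity."""
    n_bins = math.ceil((max_mz - min_mz) / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in zip(mz_values, intensities):
        if min_mz <= mz < max_mz:
            idx = min(int((mz - min_mz) / bin_width), n_bins - 1)
            vec[idx] += intensity
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class HyperplaneLSH:
    """Hash each vector to a bucket given by the side of n_bits random
    hyperplanes it falls on; similar vectors tend to share a bucket."""

    def __init__(self, n_dims, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        return tuple(dot(plane, vec) > 0 for plane in self.planes)

    def add(self, spectrum_id, vec):
        self.buckets[self._key(vec)].append((spectrum_id, vec))

    def query(self, vec):
        """Return (best_id, cosine) from the query's bucket only.
        Returns (None, 0.0) if the bucket is empty: the search is approximate."""
        candidates = self.buckets.get(self._key(vec), [])
        if not candidates:
            return None, 0.0
        return max(((sid, dot(vec, lib)) for sid, lib in candidates),
                   key=lambda pair: pair[1])


# Hypothetical two-spectrum library; no precursor mass is used anywhere.
lib_a = spectrum_to_vector([200.1, 350.2, 500.3], [10.0, 50.0, 20.0])
lib_b = spectrum_to_vector([150.4, 420.7, 910.2], [30.0, 5.0, 60.0])
index = HyperplaneLSH(n_dims=len(lib_a), n_bits=8, seed=1)
index.add("lib_a", lib_a)
index.add("lib_b", lib_b)
best_id, score = index.query(lib_a)
```

In such a scheme, more hash bits give smaller buckets and faster queries but a higher chance that the true best match lands in a different bucket; querying several independently seeded indexes is a common way to recover accuracy at some cost in speed.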
Results & Discussion
We have implemented a mass-tolerant approximate nearest neighbor spectral library search engine in Python. Preliminary results show that approximate nearest neighbor techniques can drastically reduce the search space and speed up queries. Furthermore, this speed-up can be tuned at the expense of some accuracy.
Additionally, because candidate spectra are no longer filtered on precursor mass in this approach, precursor-free, mass-tolerant searches are implicitly supported. Figure 2 shows that most PSMs are due to unmodified peptides (a mass difference around 0 Da), while various modified peptides can be identified as well; for these, the modification(s) can be determined from the precursor mass difference.
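As an illustration of how a precursor mass difference can point to a modification, the sketch below converts precursor m/z and charge to neutral masses and compares the difference with a few well-known monoisotopic modification masses. The function names, the 0.01 Da tolerance, and the short lookup table are assumptions chosen for illustration; the sketch handles only a single modification per PSM and is not the presented engine's procedure.

```python
PROTON_MASS = 1.007276  # Da

# Small illustrative subset of monoisotopic modification masses (Da).
KNOWN_MODIFICATIONS = {
    "Oxidation": 15.994915,
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
    "Phospho": 79.966331,
}


def neutral_mass(precursor_mz, charge):
    """Neutral peptide mass from a precursor m/z and charge state."""
    return (precursor_mz - PROTON_MASS) * charge


def annotate_mass_difference(query_mz, query_charge,
                             library_mz, library_charge,
                             tolerance_da=0.01):
    """Return (delta_mass, label) for a query/library spectrum pair.
    label is 'unmodified', a known modification name, or 'unknown'."""
    delta = (neutral_mass(query_mz, query_charge)
             - neutral_mass(library_mz, library_charge))
    if abs(delta) <= tolerance_da:
        return delta, "unmodified"
    for name, mass in KNOWN_MODIFICATIONS.items():
        if abs(delta - mass) <= tolerance_da:
            return delta, name
    return delta, "unknown"


# Hypothetical doubly charged pair: the query is heavier by one oxidation.
library_mz = 500.0
query_mz = library_mz + KNOWN_MODIFICATIONS["Oxidation"] / 2
delta, label = annotate_mass_difference(query_mz, 2, library_mz, 2)
print(f"delta = {delta:.4f} Da -> {label}")
```

A fuller treatment would consult a complete modification database and consider combinations of modifications as well as negative mass differences.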
Using approximate nearest neighbor techniques to speed up spectral library search engines seems a promising way to perform mass-tolerant searches that identify modified peptides, so that more spectra can be identified in less time.
[Chick2015] Chick, J. M. et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nature Biotechnology 33, 743–749 (2015).