Published June 25, 2021 | Version v1
Journal article Open

Large-scale tandem mass spectrum clustering using fast nearest neighbor searching

  • 1. University of California San Diego
  • 2. University of Antwerp
  • 3. University of Washington

Description

Rationale: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data being generated. Here we describe falcon, a spectrum clustering tool for efficient clustering of millions of MS/MS spectra.

Methods: falcon efficiently clusters large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. These indexes are then used to efficiently compute a sparse pairwise distance matrix without exhaustively performing all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters.
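The steps described in the Methods can be illustrated with a minimal, standard-library-only sketch. This is not falcon's implementation: the function names, the bin width, the vector dimension, the CRC32 hash, and all thresholds are assumptions chosen for illustration. In particular, the nearest neighbor index is replaced here by an exact scan over spectra sorted by precursor m/z, which is exactly the exhaustive within-tolerance comparison that falcon avoids, and the clustering step is a simplified DBSCAN-style procedure.

```python
import math
import zlib
from collections import defaultdict


def hash_spectrum(mz, intensity, bin_width=0.05, dim=64):
    """Bin fragment peaks by m/z and hash each bin index into `dim` slots."""
    vec = [0.0] * dim
    for m, inten in zip(mz, intensity):
        bin_idx = int(m / bin_width)
        # CRC32 is a stand-in hash function (assumption); it is deterministic.
        slot = zlib.crc32(str(bin_idx).encode()) % dim
        vec[slot] += inten
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors, clamped at zero."""
    return max(0.0, 1.0 - sum(a * b for a, b in zip(u, v)))


def sparse_distances(precursor_mz, vectors, tol=0.5, max_dist=0.3):
    """Keep only pairs within the precursor tolerance and distance cutoff.

    Exact stand-in for an approximate nearest neighbor index (assumption).
    """
    order = sorted(range(len(vectors)), key=lambda k: precursor_mz[k])
    dist = {}
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if precursor_mz[j] - precursor_mz[i] > tol:
                break  # sorted order: all later spectra are too far away
            d = cosine_distance(vectors[i], vectors[j])
            if d <= max_dist:
                dist[(min(i, j), max(i, j))] = d
    return dist


def density_cluster(n, dist, min_samples=2):
    """Simplified DBSCAN on the sparse neighbor graph; -1 marks noise."""
    neighbors = defaultdict(set)
    for i, j in dist:
        neighbors[i].add(j)
        neighbors[j].add(i)
    # A core point has at least `min_samples` points in its neighborhood,
    # counting itself.
    core = {i for i in range(n) if len(neighbors[i]) + 1 >= min_samples}
    labels = [-1] * n
    cluster = 0
    for seed in range(n):
        if seed not in core or labels[seed] != -1:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if q in core:  # only core points expand the cluster
                        stack.append(q)
        cluster += 1
    return labels


# Toy data (hypothetical): spectra 0 and 1 share fragment bins and have
# close precursor m/z; spectrum 2 has the same peaks but a distant precursor.
precursors = [500.25, 500.26, 800.00]
peaks = [
    ([150.02, 250.02, 350.02], [1.0, 2.0, 3.0]),
    ([150.03, 250.03, 350.03], [1.0, 2.0, 3.0]),
    ([150.02, 250.02, 350.02], [1.0, 2.0, 3.0]),
]
vectors = [hash_spectrum(mz, inten) for mz, inten in peaks]
dist = sparse_distances(precursors, vectors)
labels = density_cluster(len(vectors), dist)
print(labels)  # [0, 0, -1]
```

Spectrum 2 is left unclustered even though its fragment peaks match, because it is never compared with spectra outside the precursor mass tolerance. At the scale of millions of spectra, the sorted scan would be swapped for an approximate nearest neighbor index so that not every pair within the tolerance window has to be compared.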

Results: Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome dataset consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger, highly pure clusters than alternative tools, leading to a larger reduction in data volume without loss of relevant information and thereby enabling more efficient downstream processing.

Conclusions: falcon is a highly efficient spectrum clustering tool. It is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.

Files

Files (1.1 MB)

Name                  Size
Bittremieux2021.pdf   1.1 MB
md5:c02a2620d7d064e0e62ab6a2caed011c

Additional details

Related works

Is new version of
Preprint: 10.1101/2021.02.05.429957 (DOI)
Is supplemented by
Dataset: 10.5281/zenodo.4721496 (DOI)