Journal article Open Access
Stein, Stephen E.; Scott, Donald R.
Five algorithms proposed in the literature for library search identification of unknown compounds from their low resolution mass spectra were optimized and tested by matching test spectra against reference spectra in the NIST-EPA-NIH Mass Spectral Database. The algorithms were probability-based matching (PBM), dot-product, Hertz et al. similarity index. Euclidean distance, and absolute value distance. The test set consisted of 12,592 alternate spectra of about 8000 compounds represented in the database. Most algorithms were optimized by varying their mass weighting and intensity scaling factors. Rank in the list of candidate compounds was used as the criterion for accuracy. The best performing algorithm (75% accuracy for rank 1) was the dot-product function that measures the cosine of the angle between spectra represented as vectors. Other methods in order of performance were the Euclidean distance (72%), absolute value distance (68%), PBM (65%), and Hertz et al. (64%). Intensity scaling and mass weighting were important in the optimized algorithms with the square root of the intensity scale nearly optimal and the square or cube the best mass weighting power. Several more complex schemes also were tested, but had little effect on the results. A modest improvement in the performance of the dot-product algorithm was made by adding a term that gave additional weight to relative peak intensities for spectra with many peaks in common.