Presentation Open Access
Bittremieux, Wout; Müller, Emmanuel; Valkenborg, Dirk; Martens, Lennart; Goethals, Bart; Laukens, Kris
Pattern mining of mass spectrometry quality control data
Mass spectrometry is widely used to identify proteins based on the mass distribution of their peptides. Unfortunately, because of its inherent complexity, the results of a mass spectrometry experiment can be subject to large variability. As a means of quality control, several quality metrics have recently been defined. Prior research evaluated these quality control metrics either individually, through traditional statistical means, or globally, using a multivariate statistics approach. Looking at specific metrics individually is insufficient, however, because the different stages of a mass spectrometry experiment do not function in isolation; instead, they can affect each other. A multivariate approach tries to overcome this shortcoming by evaluating all metrics simultaneously. However, this global analysis might obscure potentially interesting observations that are only expressed by a subset of metrics. As a result, specialized pattern mining techniques that take this duality into account can provide additional insights when analyzing quality control data.
Both traditional univariate statistics and multivariate statistics are insufficient to capture all interesting patterns in very high-dimensional data. Specifically, because mass spectrometry is a complicated process, a single quality metric is insufficient to detect problems in a timely manner. At the same time, the metrics for the different parts of a mass spectrometry experiment can be considered to be produced by different generating mechanisms, which means that specific sets of metrics can exhibit dissimilar behavior.
In contrast, subspace mining algorithms try to find a suitable subset of the original feature space in which (dis)similar items can be found. Unfortunately, current subspace mining algorithms originating from the theoretical data mining field are often not applicable to real-world data. For example, mass spectrometry quality control data is characterized by a high dimensionality. Due to the curse of dimensionality, most existing algorithms either fail, have an excessive run time, or detect only degenerate subspace clusters. Therefore, practical and sound algorithms are needed, which is why we have developed a novel subspace clustering algorithm that takes these considerations into account. Using this subspace clustering algorithm, we can detect prevalent subspace clusters indicating normal mass spectrometer behavior, as well as subspace outliers indicating unexpected or faulty behavior.
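The idea of a subspace outlier can be illustrated with a minimal sketch. It is not the algorithm developed in this work, whose details are not given here; it swaps in a plain k-nearest-neighbor distance score computed within a chosen subset of metrics. The function names, the data, and the three metric columns are hypothetical and chosen only for illustration.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Z-score every metric (column) so that different scales are comparable."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def knn_scores(rows, subspace, k=2):
    """Outlier score per run: distance to its k-th nearest neighbor,
    measured only in the metrics listed in `subspace`."""
    z = standardize(rows)
    proj = [[row[m] for m in subspace] for row in z]
    scores = []
    for i, p in enumerate(proj):
        dists = sorted(dist(p, q) for j, q in enumerate(proj) if j != i)
        scores.append(dists[k - 1])
    return scores


def most_outlying(rows, subspace, k=2):
    """Index of the run with the highest outlier score in `subspace`."""
    scores = knn_scores(rows, subspace, k)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical data: one row per run, one column per quality metric.
# Metrics 0 and 1 normally move together; metric 2 is unrelated.
# The last run breaks the relation between metrics 0 and 1, although
# each of its individual values also occurs in a normal run.
RUNS = [
    (1, 1, 5),
    (2, 2, 3),
    (3, 3, 6),
    (4, 4, 2),
    (5, 5, 4),
    (6, 6, 7),
    (7, 7, 1),
    (2, 6, 4),
]

if __name__ == "__main__":
    for subspace in [(0,), (1,), (0, 1)]:
        print(subspace, "most outlying run:", most_outlying(RUNS, subspace))
```

In this toy data, the last run is the most outlying one only when metrics 0 and 1 are considered jointly; looking at either metric alone flags a different run. A practical method would additionally have to search the space of candidate subspaces efficiently, which this sketch does not attempt.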
Awareness has grown that suitable quality control information is essential to assess the validity of a mass spectrometry experiment. Recently there have been efforts to standardize this quality control information, which will facilitate its dissemination along with experimental data. This will result in a large amount of as yet untapped information, which can be harnessed by means of specific data mining techniques.
We have developed a novel and practical subspace clustering algorithm that takes into account the high dimensionality of the data. Current results yield a wide range of quality control parameters indicating correct or faulty behavior. These patterns could subsequently be used to optimize mass spectrometry instrument settings.
1) Rudnick, P. A. et al. Performance metrics for liquid chromatography-tandem mass spectrometry systems in proteomics analyses. Molecular & Cellular Proteomics 9, 225–241 (2010).
2) Wang, X. et al. QC metrics from CPTAC raw LC-MS/MS data interpreted through multivariate statistics. Analytical Chemistry 86, 2497–2509 (2014).
3) Müller, E., Günnemann, S., Assent, I. & Seidl, T. Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment 2, 1270–1281 (2009).
4) Walzer, M. et al. qcML: An exchange format for quality control metrics from mass spectrometry experiments. Molecular & Cellular Proteomics 13, 1905–1913 (2014).