Dive deeper with Depthcharge: A Transformer Toolkit for Modeling Mass Spectrometry Data
Authors/Creators
- 1. Talus Bioscience
- 2. University of Antwerp
- 3. University of Washington
Description
Introduction
Deep learning has revolutionized the analysis of mass spectra; from predicting the tandem mass spectrum generated by an analyte, to sequencing peptides from mass spectra de novo, the neural network models that underpin deep learning are now ubiquitous. In recent years, a neural network architecture called the transformer has become the architecture of choice for developing state-of-the-art deep learning models, in domains including natural language processing, protein structure prediction, and importantly, the analysis of mass spectra. However, every new model developed for mass spectra has essentially been forced to start from scratch. Here, we introduce depthcharge, an open-source deep learning framework that provides the building blocks for transformer models of mass spectra and the analytes that generate them.
Methods
A tandem mass spectrum can be described as a bag of peaks where each peak is defined as a pair of m/z and intensity values. The distances between m/z values, the m/z values themselves, and their associated intensities provide structural information about the analyte; hence, we hypothesize that the self-attention mechanism which characterizes the transformer architecture would be ideal for learning the relationships among peaks within a mass spectrum, similar to the relationships among words within a sentence. Additionally, peptides and small molecules can be represented as sequences of tokens (either a peptide sequence or SMILES string). Depthcharge provides PyTorch modules to parse, batch, and encode these data structures and use them to build transformer models.
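To make the "bag of peaks" representation concrete, here is a minimal sketch of how variable-length peak lists might be padded into a single batch for a PyTorch-style model, with a boolean mask marking real peaks. This is an illustrative example in NumPy, not depthcharge's actual API; the function name and array layout are assumptions.

```python
import numpy as np

def batch_spectra(spectra):
    """Pad variable-length peak lists into one batch (hypothetical helper).

    Each spectrum is an (n_peaks, 2) array of (m/z, intensity) pairs.
    Returns a (batch, max_peaks, 2) array plus a boolean mask that is
    True for real peaks and False for padding positions.
    """
    max_peaks = max(s.shape[0] for s in spectra)
    batch = np.zeros((len(spectra), max_peaks, 2))
    mask = np.zeros((len(spectra), max_peaks), dtype=bool)
    for i, s in enumerate(spectra):
        batch[i, : s.shape[0]] = s
        mask[i, : s.shape[0]] = True
    return batch, mask

# Two toy spectra with different numbers of peaks:
spec_a = np.array([[100.1, 0.5], [200.2, 1.0]])
spec_b = np.array([[150.0, 0.2], [250.5, 0.8], [300.9, 0.4]])
batch, mask = batch_spectra([spec_a, spec_b])
print(batch.shape)  # (2, 3, 2)
print(int(mask.sum()))  # 5 real peaks
```

In a transformer, this mask would be passed as the padding mask so that self-attention ignores the zero-padded positions.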
Results
Depthcharge provides the building blocks for transformer models of mass spectra and common analytes, such as peptides and small molecules. Unlike previous architectures, such as recurrent neural networks, transformers lack a built-in representation for the order of elements in the input sequence; position in the sequence is generally encoded as a set of sinusoids that is summed with a representation of each element. We use this property of transformers to our advantage when modeling mass spectra: the m/z values are encoded as a series of sinusoids and summed with a learned representation of the intensity. We illustrate how this encoding works and demonstrate that it provides a high-fidelity representation of a mass spectrum.
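The sinusoidal m/z encoding described above can be sketched as follows. This is a minimal illustration in the spirit of the transformer positional encoding, applied to continuous m/z values; the wavelength bounds and dimensionality here are illustrative assumptions, not depthcharge's actual defaults.

```python
import numpy as np

def encode_mz(mz, d_model=8, min_wavelength=0.001, max_wavelength=10000.0):
    """Encode m/z values as sinusoids of geometrically spaced wavelengths.

    Analogous to the transformer positional encoding, but the "position"
    is a continuous m/z value. Wavelength bounds are illustrative choices.
    """
    n = d_model // 2
    # Divisors span min_wavelength/(2*pi) .. max_wavelength/(2*pi):
    base = min_wavelength / (2 * np.pi)
    scale = max_wavelength / min_wavelength
    div = base * scale ** (np.arange(n) / (n - 1))
    angles = np.asarray(mz, dtype=float)[..., None] / div
    # Half the features are sines, half are cosines:
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

enc = encode_mz([100.0, 500.0], d_model=8)
print(enc.shape)  # (2, 8)
```

In a full model, each peak's sinusoidal m/z encoding would be summed with a learned embedding of its intensity before being fed to the transformer's self-attention layers.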
We then present a series of case studies illustrating the ways that depthcharge can be used, demonstrating the configurations required for predicting peptide properties such as collisional cross section, predicting the b and y ion intensities generated from a peptide precursor, and co-embedding peptides and mass spectra into the same latent space. In each case, we build a minimal model atop depthcharge and outline the components required to build it. We then compare each against current tools in the field, demonstrating that even these minimal models are capable of achieving high-quality results. Finally, we show that these models require relatively few lines of code to implement due to the tools provided by depthcharge.
We aim for depthcharge to provide a user-friendly, foundational framework that will propel biological discovery through new models of mass spectrometry data. Depthcharge is open-source and available under the permissive Apache 2.0 license: https://github.com/wfondrie/depthcharge
Files
| Name | Size |
|---|---|
| 2023-07-16_depthcharge_not-animated.pdf (md5:29ccd930d1e00cfe01c0a2eac0c87ee5) | 12.6 MB |