Published November 21, 2021 | Version v1
Dataset
Open
molxspec: Deep learning models for predicting MS2 spectra from molecular structures
Creators
Description
This repository contains a pre-processed dataset derived from the GNPS public repository of natural product mass spectra, together with pretrained weights for four model architectures implemented in PyTorch (version 1.9.0). The contents are as follows:
- gnps_processed_data.tgz: Contains tab-separated files of molecule/MS2 spectrum pairs derived from GNPS. Entries with invalid structures and molecules that are too large (spectra extending beyond 2000 m/z) were filtered out, and only structures that yielded a valid 3D geometry optimization were retained. These processing steps were applied to the positive ionization mode data (pos_* files); negative ionization mode data is also included (neg_* files).
- models.tgz: Contains pretrained models in PyTorch format for four different architectures: MLP (a residual-block multilayer perceptron trained on ECFP molecular fingerprints), BERT (the same MLP, but trained on SMILES representations from the Zinc V1 pretrained ChemBERTa model), GCN (a graph convolution architecture), and EGNN (an equivariant graph neural network). Models were trained on pos_processed_gnps_shuffled_with_3d_train.tsv, found in the gnps_processed_data.tgz file described above.
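As a quick way to see which files belong to each ionization mode, the sketch below lists the members of a gzip-compressed tar archive and groups them by the pos_/neg_ prefix described above. It uses only the Python standard library. The function names `list_members` and `split_by_mode` are illustrative and not part of this repository, and the internal directory layout of the archive is not assumed.

```python
import tarfile
from pathlib import PurePosixPath


def list_members(archive_path):
    """Return the names of all regular files in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]


def split_by_mode(member_names):
    """Group file names by the pos_/neg_ prefix of their base name.

    Files matching neither prefix are collected under "other".
    """
    groups = {"pos": [], "neg": [], "other": []}
    for name in member_names:
        base = PurePosixPath(name).name  # ignore any leading directories
        if base.startswith("pos_"):
            groups["pos"].append(name)
        elif base.startswith("neg_"):
            groups["neg"].append(name)
        else:
            groups["other"].append(name)
    return groups
```

Assuming the archive has been downloaded to the working directory, `split_by_mode(list_members("gnps_processed_data.tgz"))` would return the grouped file names without extracting anything to disk.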
Files
(1.2 GB)
MD5 checksum | Size |
---|---|
md5:f37bb6830675fd462b77509e2d07d283 | 610.8 MB |
md5:a457c54c7e4f3ab44d56687e77e57599 | 626.8 MB |
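The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the listing above does not say which checksum belongs to which archive, so the pairing has to be checked against the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file's MD5 digest with a listed value such as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_checksum("models.tgz", "md5:...")` with the value copied from the file listing should return `True` for an intact download. Reading in chunks keeps memory use low for archives of several hundred megabytes.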