Published November 21, 2021 | Version v1
Dataset Open

molxspec: Deep learning models for predicting MS2 spectra from molecular structures

Creators

Description

This repository contains a pre-processed dataset derived from the GNPS public repository of natural product mass spectra, together with pretrained model weights for four model architectures implemented in PyTorch (version 1.9.0). The contents are as follows:

  • gnps_processed_data.tgz: Contains tab-separated files of molecule/MS2 spectrum pairs derived from GNPS after filtering out invalid structures and overly large molecules (bigger than 2000 m/z), keeping only structures that yielded a valid 3D geometry optimization. The processing steps were done for positive ionization mode (pos_* files), though negative ionization data is also included (neg_* files).
  • models.tgz: Contains pretrained models in PyTorch format for four different architectures: MLP (a residual block multilayer perceptron trained on ECFP molecular fingerprints), BERT (the same MLP, but trained on pretrained SMILES representations from the Zinc V1 pretrained ChemBERTa models), GCN (a graph convolution architecture), and EGNN (an equivariant graph neural network). Models were trained on pos_processed_gnps_shuffled_with_3d_train.tsv, found in the gnps_processed_data.tgz file described above.
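
As a rough sketch of how the tab-separated files inside gnps_processed_data.tgz might be inspected without unpacking the whole archive, the helper below streams rows from one archive member using only the Python standard library. The function name iter_tsv_rows is hypothetical, and the UTF-8 encoding is an assumption; the column layout of the files is not described here, so rows are returned as plain lists of strings rather than named fields.

```python
import csv
import io
import tarfile


def iter_tsv_rows(archive_path, name_suffix, limit=None):
    """Yield rows from the first archive member whose name ends with name_suffix.

    Hypothetical helper: each row is a list of strings, with no assumption
    about a header line or about what the columns mean.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(name_suffix):
                raw = archive.extractfile(member)
                # Assumption: the files are UTF-8 encoded text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # QUOTE_NONE keeps every field literal (no quote handling).
                reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
                for index, row in enumerate(reader):
                    if limit is not None and index >= limit:
                        return
                    yield row
                return
    raise FileNotFoundError(f"no member ending with {name_suffix!r}")
```

For example, iterating over iter_tsv_rows("gnps_processed_data.tgz", "pos_processed_gnps_shuffled_with_3d_train.tsv", limit=3) would be one way to look at the first few lines of the training file before deciding how to parse it.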

Files

Files (1.2 GB)

Checksum Size
md5:f37bb6830675fd462b77509e2d07d283 610.8 MB
md5:a457c54c7e4f3ab44d56687e77e57599 626.8 MB
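
The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's hashlib; the function names md5_hex and matches_checksum are hypothetical. The listing does not state which checksum belongs to which archive, so each downloaded file would need to be compared against both values.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a value such as 'md5:<hex digest>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_hex(path) == expected
```

A call such as matches_checksum("models.tgz", "md5:a457c54c7e4f3ab44d56687e77e57599") returns True only if the local file's digest equals that value; a False result may simply mean the other listed checksum is the one that applies to that file.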