Matchms and PubChem cleaned MS/MS dataset from GNPS
Dataset of MS/MS spectra retrieved from GNPS (https://gnps.ucsd.edu) on 25/01/2021 and subjected to extensive metadata cleaning.
Metadata was cleaned and processed using matchms (https://github.com/matchms/matchms) and matchmsextras (https://github.com/matchms/matchmsextras). This largely consisted of the following steps:
- Empty spectra were removed.
- Compound names were cleaned.
- Charge, adduct, formula, and ionmode fields were cleaned and corrected.
- Parent mass estimates were added (using precursor m/z and adduct information).
- InChIKey, InChI, and SMILES were checked and corrected.
- Spectra that remained without InChI/InChIKey/SMILES were searched against PubChem based on their mass and name.
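The parent mass step in the list above can be illustrated with a minimal Python sketch. It is not the matchms implementation: the small adduct table, the function name `estimate_parent_mass`, and the rounded mass values are assumptions chosen for illustration. It only shows the underlying relation, precursor m/z = (n × M + adduct mass shift) / |charge|, solved for the neutral mass M.

```python
# Hypothetical, illustrative adduct table (NOT the table used by matchms).
# Each entry: (mass shift in Da, absolute charge, number of molecules M in the ion)
PROTON_MASS = 1.007276  # rounded mass of a proton in Da

ADDUCTS = {
    "[M+H]+": (PROTON_MASS, 1, 1),
    "[M-H]-": (-PROTON_MASS, 1, 1),
    "[M+Na]+": (22.989218, 1, 1),
    "[M+2H]2+": (2 * PROTON_MASS, 2, 1),
    "[2M+H]+": (PROTON_MASS, 1, 2),
}


def estimate_parent_mass(precursor_mz, adduct):
    """Estimate the neutral parent mass from precursor m/z and adduct.

    Returns None when the adduct is not in the illustrative table.
    """
    entry = ADDUCTS.get(adduct)
    if entry is None:
        return None
    shift, charge, multiplier = entry
    # Invert: precursor_mz = (multiplier * M + shift) / charge
    return (precursor_mz * charge - shift) / multiplier


if __name__ == "__main__":
    # A protonated ion observed at m/z 181.0707
    print(estimate_parent_mass(181.0707, "[M+H]+"))
```

Returning `None` for unknown adducts, rather than guessing, keeps spectra with unclear adduct information distinguishable from those with a usable estimate.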
The cleaning resulted in 210,407 spectra, of which 184,698 are annotated with an InChIKey and with SMILES and/or InChI.
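The annotation criterion stated here (an InChIKey plus SMILES and/or InChI) can be written as a simple check. The sketch below is an assumption-laden illustration: it treats each spectrum's metadata as a plain dictionary with lowercase keys `inchikey`, `smiles`, and `inchi`, and the records shown are made-up placeholders, not entries from the dataset.

```python
def is_annotated(metadata):
    """True if the record has an InChIKey and at least one of SMILES or InChI."""
    # Missing keys, None, and whitespace-only strings all count as "not present".
    inchikey = (metadata.get("inchikey") or "").strip()
    smiles = (metadata.get("smiles") or "").strip()
    inchi = (metadata.get("inchi") or "").strip()
    return bool(inchikey) and (bool(smiles) or bool(inchi))


def count_annotated(records):
    """Count the records that satisfy the annotation criterion."""
    return sum(1 for record in records if is_annotated(record))


if __name__ == "__main__":
    # Placeholder records for illustration only (hypothetical values).
    records = [
        {"inchikey": "PLACEHOLDER-KEY-1", "smiles": "CCO", "inchi": ""},
        {"inchikey": "PLACEHOLDER-KEY-2", "smiles": "", "inchi": "InChI=1S/placeholder"},
        {"inchikey": "", "smiles": "CCO", "inchi": ""},
        {"inchikey": "PLACEHOLDER-KEY-3"},
    ]
    print(count_annotated(records), "of", len(records), "records are annotated")
```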
If you use this dataset for your research, please cite the following:
- GNPS, e.g. [Wang, M. et al. Sharing and community curation of mass spectrometry data with GNPS. Nat. Biotechnol. 34, 828–837 (2016)]
- matchms: [Huber, F. et al. matchms - processing and similarity evaluation of mass spectrometry data. J. Open Source Softw. 5, 2411 (2020)]
- PubChem: [Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, D1102–D1109 (2019)]