Published October 4, 2022 | Version v1
Dataset Open

Enzyme Substrate Classification Dataset for SDRs and SAM-MTases

  • 1. Weill Cornell Medicine
  • 2. University of California, Irvine

Description

This dataset contains sequence information, three-dimensional structures (predicted with AlphaFold2), and substrate classification labels for 358 short-chain dehydrogenases/reductases (SDRs) and 953 S-adenosylmethionine-dependent methyltransferases (SAM-MTases).

The amino acid sequences of these enzymes were obtained from the UniProt Knowledgebase (https://www.uniprot.org). The protein sets were retrieved by querying with the InterPro protein family/domain identifier for each family: IPR002347 (SDRs) and IPR029063 (SAM-MTases). The query results were filtered by UniProt annotation score, keeping only entries with a score above 4 out of 5, and deduplicated by exact sequence match.
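The filtering and deduplication steps described above can be sketched as follows. This is a minimal illustration and not the authors' script: the record layout (`accession`, `annotation_score`, `sequence`) and the function name are hypothetical, and "above 4 out of 5" is interpreted here as strictly greater than 4, which is an assumption.

```python
def filter_and_deduplicate(records, min_score=4):
    """Keep records whose annotation score exceeds min_score and
    drop any record whose sequence exactly matches one already kept.

    The dictionary keys used here are hypothetical placeholders,
    not field names taken from the dataset files.
    """
    seen_sequences = set()
    kept = []
    for record in records:
        # Assumption: "above 4" means strictly greater than 4.
        if record["annotation_score"] <= min_score:
            continue
        sequence = record["sequence"]
        # Exact-match deduplication: the first occurrence wins.
        if sequence in seen_sequences:
            continue
        seen_sequences.add(sequence)
        kept.append(record)
    return kept


# Toy records for illustration only (not real UniProt entries).
example_records = [
    {"accession": "A", "annotation_score": 5, "sequence": "MKT"},
    {"accession": "B", "annotation_score": 5, "sequence": "MKT"},
    {"accession": "C", "annotation_score": 3, "sequence": "MAA"},
    {"accession": "D", "annotation_score": 5, "sequence": "MGG"},
]
kept_records = filter_and_deduplicate(example_records)
```

In this toy input, record B is dropped as an exact duplicate of A, and record C is dropped for its low score.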

The sequences were submitted to the publicly available AlphaFold2 protein structure predictor (J. Jumper et al., Nature, 2021, 596, 583) using the ColabFold notebook (https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.1-premultimer/batch/AlphaFold2_batch.ipynb, M. Mirdita, S. Ovchinnikov, M. Steinegger, Nature Meth., 2022, 19, 679, https://github.com/sokrypton/ColabFold). The model settings used were msa_model = MMSeq2(Uniref+Environmental), num_models = 1, use_amber = False, use_templates = True, do_not_overwrite_results = True. The resulting PDB structures are included as ZIP archives.

The classification labels were obtained from the substrate and product annotations of the enzyme UniProtKB records. Two approaches were used: substrate clustering based on molecular fingerprints and manual substrate type classification. For the substrate clustering, Morgan fingerprints with radius = 3 were generated using RDKit (https://rdkit.org) for all enzymatic substrates and products with known structures (excluding cofactors). The fingerprints were projected onto two-dimensional space using the UMAP algorithm (L. McInnes, J. Healy, 2018, arXiv 1802.03426) with the Jaccard metric and then clustered using k-means. This procedure generated 9 clusters for SDR substrates and 13 clusters for SAM-MTases. The SMILES representations of the substrates are listed in the SDR_substrates_to_cluster_map_2DIMUMAP.csv and SAM_substrates_to_13clusters_map_2DIMUMAP.csv files.
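To illustrate the Jaccard metric mentioned above, the sketch below computes the Jaccard distance between two binary fingerprints, each represented as the set of its "on" bit indices. It is a standard-library illustration only: it does not use RDKit or UMAP, the function name is hypothetical, and the bit indices are made up rather than derived from real molecules.

```python
def jaccard_distance(bits_a, bits_b):
    """Jaccard distance between two binary fingerprints.

    Each fingerprint is given as an iterable of the indices of its
    set bits. The distance is 1 - |A and B| / |A or B|. Two empty
    fingerprints are treated as identical (distance 0.0), which is
    a convention chosen for this sketch.
    """
    set_a, set_b = set(bits_a), set(bits_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Made-up on-bit indices for two hypothetical fingerprints.
fingerprint_1 = {1, 5, 9, 12}
fingerprint_2 = {1, 5, 20}

# Shared bits: {1, 5}; all bits: {1, 5, 9, 12, 20}.
distance = jaccard_distance(fingerprint_1, fingerprint_2)
```

A distance of 0 means the two fingerprints have identical set bits, and a distance of 1 means they share none.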


The following manually defined classification tasks are included for SDRs: NADP/NAD cofactor classification, phenol substrates, sterol substrates, and coenzyme A (CoA) substrates. For SAM-MTases, the manually defined classification tasks are: biopolymer (protein/RNA/DNA) vs. small-molecule substrates, phenol substrates, sterol substrates, and nitrogen heterocycle substrates. The SMARTS strings used to define the substrate classes are listed in substructure_search_SMARTS.docx.

Files

README.txt

Files (67.1 MB)

Checksum Size
md5:2f56f846ad1eb20c2a96a5034eadc5f5 4.4 kB
md5:d3d0ab646f36c0ecf7a25c239ddd3c5b 51.8 MB
md5:ab33d6895c4e588305e3384f524a9c2a 15.4 kB
md5:67abbcca8ab8b11b8e7c821c51d690cc 10.6 kB
md5:1c114ef87f7adf2b2f57988f00a6a50c 387.9 kB
md5:27cec5e9664c3c0747b16c8676670d60 31.4 kB
md5:4c48b2158535de77bd0e556826c94404 5.0 kB
md5:75b8dcd9c79675fa8a6348e223dfbbb0 14.4 MB
md5:2013eb1793322266638f832f7eee0201 13.9 kB
md5:b8946183ac2e0eef4a7a2dffe3acce58 3.9 kB
md5:ea34d586f1e0c8f624b568f57c2ff7e5 109.5 kB
md5:83a4f060fc296701b1c2aeee8d9fadd0 44.8 kB
md5:67a8f60f79060597e0f89d4f917d8019 6.0 kB
md5:4400b15d16d1b10b5d2de4a02bcde597 308.8 kB
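
The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file listing here does not give file names, so the path passed in is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives do not need
    to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    `expected` may carry the "md5:" prefix used in the listing.
    """
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_checksum(local_path, "md5:2f56f846ad1eb20c2a96a5034eadc5f5")` would return True only if the file at `local_path` (a placeholder for wherever the download was saved) has that digest.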

Additional details

Related works

Is compiled by
Preprint: 10.1101/2022.06.14.496158 (DOI)
Is referenced by
Preprint: 10.1101/2022.09.28.509940 (DOI)