Published November 10, 2025 | Version v2
Dataset Open

In-silico transformation analysis for exposomics

  • 1. Icahn School of Medicine at Mount Sinai
  • 2. Icahn School of Medicine at Mt Sinai

Description

This record provides supporting information for the manuscript "Expanding the chemical exposome using in-silico transformation analysis – an example using insecticides".

We analyzed the chemical reaction data available in PubChem to build a workflow for in-silico transformation analysis of exposome compounds. The legend file contains details about each table.

Table details:

Table 1: PubChem Reactions contains chemical and biochemical reaction data downloaded from the PubChem database. Each entry represents a biochemical, pathway, or transformation reaction, together with metadata covering organism, compound, reaction, and bibliographic identifiers. The organism-specific fields taxid and taxname give the NCBI Taxonomy ID and the scientific name of the organism associated with each reaction. The compound and reaction fields cids, cidsreactant, and cidsproduct list the PubChem compound IDs (CIDs) associated with each reaction, including those on the reactant and product sides. The name of the pathway the reaction belongs to, the description of the reaction, the PubChem hyperlink, the HTML-formatted version of the reaction, the text representation of the reaction, and the directionality of the reaction (i.e., unidirectional or bidirectional) are captured in the name, definition, url, htmlequation, equation, and direction and otherdirections fields, respectively. Regulatory and enzymatic information, including enzyme names, EC numbers, gene and protein accessions (UniProt ID), and identifiers from specific sources relevant to each entry, is recorded in control, enzyme, ecs, geneids, protacxns, gid, and srcid. The PubChem CIDs relevant to transformation reactions are recorded in the predecessor, predecessorcid, transformation, successor, and successorcid fields. The biosystem and rhid fields provide the biological system context and the reaction knowledge base (RHEA) identifier for each entry, if any. Lastly, bibliographic and source information is included in source, externalid, file_name (local system), url, dois, pmids, pmcids, citations, evidencedoi, evidenceref, datasetdoi, datasetref, sourcecomment, and sourcecommentfull. This dataset has not been curated, so it is expected to contain redundant and incomplete entries.
These chemical reactions were utilized to generate a chemical transformation template library, used in template-guided in-silico transformation modeling.
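
Because Table 1 is uncurated, a first pass over it typically has to parse the CID lists and skip redundant or incomplete rows. The following is a minimal sketch of that step using only the Python standard library. The column names follow the description above, but the "|" separator inside cidsreactant and cidsproduct, the sample values, and the helper names parse_cids and iter_reactions are assumptions made for illustration; check the legend file and the actual CSV before relying on them.

```python
import csv
import io

# Hypothetical sample rows. Column names follow the Table 1 description;
# the values and the "|" separator inside the CID fields are assumptions.
SAMPLE = """taxid,name,cidsreactant,cidsproduct,direction
9606,Example pathway,991|962,9395,unidirectional
9606,Example pathway,991|962,9395,unidirectional
10090,Another pathway,,9395,bidirectional
"""

def parse_cids(field, sep="|"):
    """Split a CID list field into a tuple of ints, skipping empty or non-numeric tokens."""
    return tuple(int(tok) for tok in field.split(sep) if tok.strip().isdigit())

def iter_reactions(handle):
    """Yield unique, complete reactant/product records from a CSV stream, row by row."""
    seen = set()
    for row in csv.DictReader(handle):
        reactants = parse_cids(row.get("cidsreactant") or "")
        products = parse_cids(row.get("cidsproduct") or "")
        if not reactants or not products:
            continue  # incomplete entry: one side of the reaction is missing
        key = (reactants, products)
        if key in seen:
            continue  # redundant entry: same reactant and product CIDs already seen
        seen.add(key)
        yield {"taxid": row.get("taxid"), "reactants": reactants, "products": products}

reactions = list(iter_reactions(io.StringIO(SAMPLE)))
```

For the real file, the same generator can be fed an open file handle instead of io.StringIO, so rows are streamed rather than loaded into memory at once. Note that this sketch treats two rows as duplicates whenever their reactant and product CIDs agree, regardless of organism or source, which may be too aggressive for some analyses.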

Table 2: Chemical Reaction Analysis provides the results of chemical reaction analysis with rxn-insight, a Python library developed for automated chemical reaction analysis. Each reaction is encoded in SMILES format; the reaction, atom-mapped reaction, and sanitized atom-mapped reaction strings are given in REACTION, MAPPED_REACTION, and SANITIZED_MAPPED_REACTION. The CLASS column gives the reaction type, such as oxidation, reduction, or alkylation. N_REACTANTS and N_PRODUCTS give the number of reactants and products for each reaction. Functional groups found in each reactant and product are stored in FG_REACTANTS and FG_PRODUCTS, and any by-products are listed in BY-PRODUCTS. The presence of ring structures in reactants and products, and their involvement in the reaction center, is captured in PARTICIPATING_RINGS_REACTANTS, PARTICIPATING_RINGS_PRODUCTS, and ALL_RINGS_PRODUCTS. Any solvents, reagents, or catalysts that assist the reaction are summarized in SOLVENT, REAGENT, and CATALYST. The molecular structure of the compound, the roles of nitrogen, oxygen, and sulfur in the reaction center, and any changes in ring structures (such as formation, breaking, or modification) are reported in the SCAFFOLD, NOS_REACTION_CENTER, and RING_CHANGING columns. SANITIZED_TRANSFORMATION_MAPPING gives the atom map numbers of the atoms transformed between reactants and products, while ECS lists any associated Enzyme Commission numbers. Finally, transformation templates that describe the pattern of chemical change during the reaction are extracted using rdchiral and stored in the reac_temp column.
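
To make the reaction encoding concrete, the sketch below splits a reaction SMILES string into its reactant, agent, and product parts and reads the atom map numbers from an atom-mapped string. It relies only on the standard reaction SMILES layout (reactants>agents>products, molecules separated by "."); it does not call rxn-insight or rdchiral, and the counts it produces are not guaranteed to match how N_REACTANTS and N_PRODUCTS were computed for Table 2. The example reaction and the helper names split_reaction and map_numbers are invented for illustration.

```python
import re

# Atom map number inside a bracket atom, e.g. the "1" in [CH3:1].
MAP_NUM = re.compile(r":(\d+)\]")

def split_reaction(rxn_smiles):
    """Split 'reactants>agents>products' into three lists of molecule SMILES."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    return [[mol for mol in side.split(".") if mol] for side in parts]

def map_numbers(smiles):
    """Return the set of atom map numbers present in a SMILES string."""
    return {int(num) for num in MAP_NUM.findall(smiles)}

# Invented example: ethanol oxidized to acetaldehyde, atom-mapped by hand.
rxn = "[CH3:1][CH2:2][OH:3]>>[CH3:1][CH:2]=[O:3]"
reactants, agents, products = split_reaction(rxn)
n_reactants, n_products = len(reactants), len(products)

# Map numbers shared by both sides identify the same atom before and after.
shared = map_numbers(reactants[0]) & map_numbers(products[0])
```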

Table 3: Substructure Library provides atom-level information on reactive sites, that is, the atoms in reactant molecules that are modified during a reaction. Each reactant entry is linked to a unique rxn_id. The smiles column contains the SMILES representation of the molecule. The Changed Atom SMARTS column gives the SMARTS pattern of the atoms that undergo a chemical transformation during the reaction. The Changed Atom Tags column gives the map numbers of these atoms, which are used to match atoms between reactants and products. The Expanded Changed Atom Tags column gives the map numbers of the neighboring atoms directly connected to the changed atoms. If the neighboring atoms belong to a functional group, the map numbers for the entire functional group are included. The Changed Atom Index and Expanded Changed Atom Index columns give the atom indices of the changed atoms and the expanded changed atoms, respectively. The Fragment Smarts and Fragment SMILES columns give the SMARTS and SMILES patterns of the reactive-site substructures, capturing the changed and expanded changed atoms along with properties such as explicit hydrogen connections, atom connectivity, and explicit bonds.
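
The relationship between changed atoms and expanded changed atoms can be illustrated with a small graph example. The sketch below represents a molecule as a list of bonds between atom map numbers and adds every directly bonded neighbor to the set of changed atoms. It is a simplification under stated assumptions: it covers only the one-bond neighbor expansion and does not implement the functional-group extension described above, and the toy molecule and the function name expand_changed_atoms are hypothetical.

```python
def expand_changed_atoms(bonds, changed):
    """Return the changed atoms plus every atom directly bonded to one of them."""
    changed = set(changed)
    expanded = set(changed)
    for a, b in bonds:
        if a in changed:
            expanded.add(b)
        if b in changed:
            expanded.add(a)
    return sorted(expanded)

# Hand-built toy molecule: atom map numbers as nodes, bonds as pairs.
# Chain 1-2-3-4 with a branch 3-5; atom 3 is assumed to be the changed atom.
bonds = [(1, 2), (2, 3), (3, 4), (3, 5)]
expanded_tags = expand_changed_atoms(bonds, changed=[3])
```

Here atom 1 is left out because it is two bonds away from the changed atom, while atoms 2, 4, and 5 are included as direct neighbors.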

Table 4: Insecticides Selected for Template-Guided Transformation Analysis lists the insecticides chosen for template-guided transformation modeling. It includes the PubChem identifier (Compound CID), names, alternative names (synonyms), key molecular descriptors (Molecular Formula, Polar Area, Complexity, XLogP, Heavy Atom Count, H-Bond Donor Count, H-Bond Acceptor Count, and Rotatable Bond Count), structure descriptors (InChI, SMILES, InChIKey, and IUPAC Name), mass-related data (Exact Mass and Monoisotopic Mass), charge, isotopic properties (Isotopic Atom Count), and stereochemistry (Total Atom Stereo Count, Defined Atom Stereo Count, Undefined Atom Stereo Count, Total Bond Stereo Count, Defined Bond Stereo Count, and Undefined Bond Stereo Count). Links to literature (PubChem Literature Count), patents (PubChem Patent Count), biological relevance (bioassays), linked MeSH headings, and other metadata were also gathered from PubChem.

Table 5: Transformation Products of Insecticides gives an overview of the transformation products generated in silico. Each entry includes the query compound, the corresponding predicted products, and the transformation template, with its unique identifier (rxn_id), that was applied to generate those products. The match smiles column contains the reactant SMILES from the sanitized mapped reaction whose reactive sites matched the query compound. The rxn_smiles_with_query_cmpd column was constructed by replacing the matched SMILES with the query compound SMILES. The mapped_rxn_query_cmpd column contains the reactant side of the newly constructed reaction SMILES along with the transformed products. The query_cmpd_to_prdt_correspondence column ranks the generated products by their correspondence with the query compound. The cid column records the PubChem CID of a generated product if that product is already present in the PubChem database.
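
The substitution behind rxn_smiles_with_query_cmpd can be sketched as a plain string operation on the reactant side of a reaction SMILES. This is only an illustration of the replacement step as described above, not the workflow's code: the function name substitute_query and the example molecules are hypothetical, unmapped SMILES are used for brevity, and the product side is left untouched here, whereas in the actual workflow the products are generated by applying the transformation template.

```python
def substitute_query(template_rxn, match_smiles, query_smiles):
    """Replace the matched reactant in a reaction SMILES with the query compound."""
    reactant_side, agents, product_side = template_rxn.split(">")
    reactants = reactant_side.split(".")
    if match_smiles not in reactants:
        raise ValueError("matched SMILES is not a reactant of this reaction")
    swapped = [query_smiles if mol == match_smiles else mol for mol in reactants]
    return ">".join([".".join(swapped), agents, product_side])

# Invented example: the template reactant ethanol (CCO) is swapped for
# the query compound 1-propanol (CCCO); the co-reactant O=O is kept.
new_rxn = substitute_query("CCO.O=O>>CC=O", match_smiles="CCO", query_smiles="CCCO")
```

Exact string comparison only works when both SMILES strings are written identically, so a real implementation would need to canonicalize them with a cheminformatics toolkit first.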

 

Table 6: ADME-Tox Properties of Query and Product Metabolites presents an extensive profile of query compounds and their products, covering physicochemical properties, drug-likeness, toxicity, ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) predictions, and comparative percentile rankings against approved DrugBank compounds. Each entry includes a Parent Compound/Product name and its corresponding smiles representation. The table lists molecular descriptors such as molecular weight, logP, hydrogen bond acceptors, hydrogen bond donors, compliance with Lipinski's rules, QED (quantitative estimate of drug-likeness), stereo centers, and tpsa (topological polar surface area). It also provides toxicity and pharmacokinetic predictions through fields such as AMES, BBB Martins, Bioavailability Ma, and various enzyme interaction predictions (e.g., CYP1A2 Veith, CYP3A4 Substrate CarbonMangels). Additional safety and efficacy indicators include Carcinogens Lagunin, ClinTox (clinical toxicity), DILI (Drug-Induced Liver Injury), HIA (Human Intestinal Absorption) Hou, and predictions for nuclear receptor activity (e.g., NR-AR, NR-ER, NR-PPAR-gamma). The dataset also provides experimental and predicted values for permeability (PAMPA NCATS, Caco2 Wang), clearance (Clearance Hepatocyte AZ, Clearance Microsome AZ), half-life, solubility, and distribution volume (VDss Lombardo). Each property has a percentile score that compares the compound to approved DrugBank drugs, giving a reference for drug-likeness. Metadata columns such as cid, query cid, top 1 percentile count, and query product map help track compound identity and performance across queries.

Table 7: Transformation Products of the Pesticide Parathion. The column headers are the same as in Table 5.
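
Two of the quantities described for Table 6 are simple enough to sketch directly: a Lipinski rule-of-five check and a percentile score against a reference set. The Lipinski thresholds used below are the standard ones (molecular weight at most 500, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The percentile definition (share of reference values less than or equal to the compound's value) is an assumption, since this record does not state how the percentile columns were computed, and the descriptor values and reference list are invented for illustration.

```python
from bisect import bisect_right

def lipinski_violations(mol_weight, logp, hbd, hba):
    """Count how many of the four rule-of-five thresholds a compound exceeds."""
    checks = [mol_weight > 500, logp > 5, hbd > 5, hba > 10]
    return sum(checks)

def percentile_rank(value, reference):
    """Percentage of reference values that are less than or equal to `value`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)

# Illustrative descriptor values for a single compound (not taken from Table 6).
violations = lipinski_violations(mol_weight=291.3, logp=3.8, hbd=0, hba=6)

# Invented stand-in for the molecular weights of a reference drug set.
reference_weights = [150.0, 250.0, 300.0, 350.0, 500.0]
mw_percentile = percentile_rank(291.3, reference_weights)
```

In this toy case the compound violates none of the four thresholds, and two of the five reference weights fall at or below its molecular weight.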

 

Files

Table 1_PubChem_Reactions.csv

Files (2.8 GB)

Checksum Size
md5:cb6bc4b8099a8fa60fe72b9d8639d23e 1.2 GB
md5:65d7495750e31d0910a55dfd6b658392 365.3 MB
md5:f45dec177da766cc60555b625862984c 116.1 MB
md5:e4fb8ff0a0727dd46b011b4b4a34926b 1.5 MB
md5:3126e4b334849cdea37c4f1315eb5a2e 1.0 GB
md5:a043cf35f4f923d854ed05385a624ac9 39.5 MB
md5:8598feee3f326fe36918535d2a29eee6 1.6 MB
md5:db53edf44bf648e5026d4de5f19d0b87 21.0 kB

Additional details

Funding

National Institutes of Health
Exposome Correlation and Interpretation Database (ECID) 5U24ES035386-02