There is a newer version of the record available.

Published December 4, 2023 | Version v1.0.0
Dataset Open

Generative AI for designing and validating easily synthesizable and structurally novel antibiotics: Data and Models

  • 1. Stanford University
  • 2. McMaster University

Description

This repository contains data and models used in the following paper.

Swanson, K., Liu, G., Catacutan, D., Zou, J. & Stokes, J. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics. Nature Machine Intelligence, 2024.

The data and models are meant to be used with the SyntheMol code. More details about how to use the data and models with the code are available here.

The Data.zip file has the following structure. Note that the numbers for the Data subdirectories correspond to the supplementary data numbers in the paper (e.g., 1_training_data corresponds to Supplementary Data 1).

Data

  1_training_data: The Acinetobacter baumannii inhibition data used to train antibiotic property prediction models.

  2_chembl: Known antibiotic and antibacterial molecules from ChEMBL, which are used to compute the novelty of generated antibiotic candidates.

  4_real_space: Data files and statistics for the Enamine REAL Space. The molecular building blocks file is version 2021 q3-4 while all other REAL Space details are computed from the full enumerated REAL space version 2022 q1-2 (downloaded on August 30, 2022).

  5_generations_clogp: Compounds generated by SyntheMol using Chemprop models trained to predict cLogP.

  6_generations_chemprop: Compounds generated by SyntheMol using Chemprop models trained to predict A. baumannii inhibition.

  7_generations_chemprop_rdkit: Compounds generated by SyntheMol using Chemprop-RDKit models trained to predict A. baumannii inhibition.

  8_generations_random_forest: Compounds generated by SyntheMol using random forest models trained to predict A. baumannii inhibition.

  9_synthesized: Information on the 58 SyntheMol-generated compounds that were successfully synthesized by Enamine.
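After extracting Data.zip, a quick check that the listed subdirectories are present can catch an incomplete download or extraction. The following is a minimal sketch, assuming the archive extracts to a top-level Data directory whose subdirectories carry exactly the names listed above. The function name `missing_subdirs` and the example path are hypothetical and not part of the SyntheMol code.

```python
from pathlib import Path

# Subdirectory names copied from the record description.
# The listing has no entry numbered 3, so none is expected here.
EXPECTED_SUBDIRS = [
    "1_training_data",
    "2_chembl",
    "4_real_space",
    "5_generations_clogp",
    "6_generations_chemprop",
    "7_generations_chemprop_rdkit",
    "8_generations_random_forest",
    "9_synthesized",
]


def missing_subdirs(data_root):
    """Return the expected subdirectory names that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Calling `missing_subdirs("Data")` on a complete extraction should return an empty list; any names it returns point to directories that still need to be extracted.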

The Models.zip file contains one folder for each model used in the paper. Note that each model is technically an ensemble of ten individual models, so each directory contains ten model files.
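Since each model directory is described as holding ten model files, the extracted archive can be checked by counting files per directory. This is only a sketch: it assumes Models.zip extracts to a single root folder, that each model directory contains nothing besides its model files, and it makes no assumption about file extensions. The names `count_model_files` and `incomplete_ensembles` are hypothetical.

```python
from pathlib import Path

# From the description: each model is an ensemble of ten individual models.
EXPECTED_MODELS_PER_ENSEMBLE = 10


def count_model_files(models_root):
    """Map each model directory name to the number of files found beneath it."""
    counts = {}
    for model_dir in sorted(Path(models_root).iterdir()):
        if model_dir.is_dir():
            # Count files at any depth, in case model files are nested.
            counts[model_dir.name] = sum(
                1 for p in model_dir.rglob("*") if p.is_file()
            )
    return counts


def incomplete_ensembles(models_root):
    """Return the directories whose file count differs from the expected ten."""
    return {
        name: n
        for name, n in count_model_files(models_root).items()
        if n != EXPECTED_MODELS_PER_ENSEMBLE
    }
```

An empty dictionary from `incomplete_ensembles` suggests every ensemble is complete. A non-empty result may also mean that a directory holds extra non-model files, which this sketch does not distinguish.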

Files

Data.zip

Files (1.1 GB)

Size and checksum of each file:
885.3 MB  md5:75faa70ee7e7f155136c31f4c3cee99d
197.5 MB  md5:4cbb05926cebc0584b4093b5e94a8daa
2.2 kB  md5:71d6c44d6414a6c1ef88a6e1ea1a1fc1
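The MD5 checksums in the listing can be used to confirm that a download completed without corruption. Below is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_md5` are hypothetical, and because this listing does not show which checksum belongs to which file name, the expected value should be taken from the entry shown next to the file on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file's MD5 with an expected value, with or without an 'md5:' prefix."""
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_md5("Data.zip", "md5:...")` with the checksum copied from the listing returns `True` only when the local file is byte-for-byte identical to the published one.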

Additional details

Related works

Is published in
Journal article: 10.1038/s42256-024-00809-7 (DOI)