Published June 17, 2023 | Version 2.0.0
Dataset Open

Datasets for evaluating scalable supervised learning for synthesize-on-demand chemical libraries

Description

This repository contains datasets for the manuscript "Evaluating scalable supervised learning for synthesize-on-demand chemical libraries":

  • ams_all_preds.csv.gz: Predictions for the AMS dataset from the RF and baseline models trained on the training dataset. Includes the predicted score and rank from each model for each compound. We started with 8,434,707 AMS compounds and found that 247,025 were in the LC or MLPCN training data. These were removed from the AMS list, leaving 8,187,682 compounds to score. Compounds were matched on SMILES canonicalized with RDKit.
  • ams_order_results.csv.gz: Information about the 1,024 compounds purchased from the AMS library. Excludes the 4 AMS compounds that were incompletely dissolved. Includes the chemical feature representation, information from the vendor, RF and baseline model predictions, screening results, and clustering results.
  • baseline_weight.npy: The saved Similarity Baseline model, which consists of the active compounds in the training data. This model was used to score the AMS library. See the GitHub repository for code to load the model and make predictions on new compounds.
  • cdd_training_data.tar.gz: The LC1234 and MLPCN PriA-SSB screening data exported from CDD.
  • enamine_costs_clustered_v3_with_nneighbor.csv.gz: Contains 5,620 Enamine compounds that were selected based on the RF prediction score and availability. This file also contains the Taylor-Butina cluster ID when clustering the training compounds, 1,024 tested AMS compounds, and top-ranked Enamine compounds at a 0.4 threshold. The nearest neighbor compounds in the training and AMS sets are also included along with compound information from Enamine, RF model scores, and chemical feature representations.
  • enamine_dose_response_curve_plots.xlsx: Images of the dose response curves from all three runs on the 68 Enamine compounds. If a compound was tested multiple times, multiple curves are shown in the same plot. The compound structure images and SMILES are exported from CDD, not generated with RDKit.
  • enamine_dose_response_curves.tsv: The dose response curve summaries from all three runs on the 68 Enamine compounds. If a compound was tested multiple times, only the highest-quality dose response curve was used.
  • enamine_final_list.csv.gz: The final 100 filtered compounds from enamine_top_10000.csv.gz. Contains compound information from Enamine as well as RF model scores, chemical feature representations, and clustering results.
  • enamine_PriA-SSB_dose_response_data.tar.gz: The dose response screening data from all three runs on the 68 Enamine compounds. The run dated 2021-06-16 was originally screened on 2020-08-24; 2021-06-16 is the date the compound identities were corrected. This run contains two 1,536-well plates.
  • enamine_top_10000.csv.gz: Top 10,000 predictions from the Enamine REAL dataset using the selected RF model. Contains compound information from Enamine as well as RF model scores, chemical feature representations, and clustering results.
  • master_df.csv.gz: The output of preprocessing the files in cdd_training_data.tar.gz. Contains 441,900 rows.
  • random_forest_classification_139.pkl: The saved RF classification model with hyperparameter ID 139. This model was used to score the AMS and Enamine REAL libraries. See the GitHub repository directory for code to load the model and make predictions on new compounds.
  • train_ams_real_cluster.csv.gz: Contains cluster IDs for Taylor-Butina clustering at a 0.4 threshold applied to the training compounds, 1,024 tested AMS compounds, and top-ranked compounds from Enamine. Includes the chemical features, dataset to which the compound belongs, leader compound for each cluster, and whether the compound is a known hit.
  • training_df_single_fold.csv.gz: All ten folds in training_folds.tar.gz merged into a single file for convenience. Contains 427,300 compounds.
  • training_df_single_fold_with_ams_clustering.csv.gz: Contains cluster IDs for Taylor-Butina clustering applied to the 427,300 training compounds and the 1,024 tested AMS compounds. Different clustering results are shown at the 0.2, 0.3, and 0.4 thresholds. Includes the leader compound for each cluster. Although the training and AMS compounds were clustered jointly, only the training compounds' clusters are shown. The AMS compounds' clusters are in ams_order_results.csv.gz.
  • training_folds.tar.gz: The LC1234 and MLPCN training data split into ten folds. This dataset with 427,300 compounds was used for cross validation and model selection. This dataset is derived from master_df.csv.gz.
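
As a quick way to inspect any of the gzip-compressed CSV files listed above, the following is a minimal sketch that uses only the Python standard library. The helper name summarize_csv_gz is hypothetical and is not taken from the GitHub repository. It assumes each file is UTF-8 encoded and starts with a single header row. The sketch also checks that the AMS compound counts quoted above are consistent with each other.

```python
import csv
import gzip
import io

# Counts quoted in the ams_all_preds.csv.gz description above.
AMS_TOTAL = 8_434_707
AMS_IN_TRAINING = 247_025
AMS_SCORED = 8_187_682


def summarize_csv_gz(source):
    """Return (column_names, n_data_rows) for a gzip-compressed CSV.

    `source` may be a file path or an open binary file object.
    Assumption: the first row is a header row.
    """
    with gzip.open(source, "rb") as raw, io.TextIOWrapper(
        raw, encoding="utf-8", newline=""
    ) as text:
        reader = csv.reader(text)
        header = next(reader, [])
        # Stream the remaining rows so large files are never held in memory.
        n_rows = sum(1 for _ in reader)
    return header, n_rows


# Removing the compounds found in the training data leaves the scored set.
assert AMS_TOTAL - AMS_IN_TRAINING == AMS_SCORED

# Example call, assuming the file was downloaded to the working directory:
#     columns, n_rows = summarize_csv_gz("master_df.csv.gz")
```

Running the helper on a downloaded copy of master_df.csv.gz gives a row count that can be compared with the 441,900 rows stated above. Whether that figure counts the header row is not specified here.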

If you use these datasets in a publication, please cite:

Moayad Alnammi, Shengchao Liu, Spencer S. Ericksen, Gene E. Ananiev, Andrew F. Voter, Song Guo, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Evaluating scalable supervised learning for synthesize-on-demand chemical libraries. Journal of Chemical Information and Modeling 2023.

See PubChem AID 1272365, AID 1918986, and the associated publications for details about the PriA-SSB screening data. The screening datasets were compiled from three separate sources that should all be cited if the training dataset is used in a publication:

Files (875.1 MB)

MD5 checksum                      Size
ab949db8abccb41faf217574e21ea017  229.7 MB
d0fcf6010f201cead716490a5cfa443f  130.9 kB
2bbac3288c35b8538cf761f7b33f6051  1.1 MB
2478cf9d0eede07d20c2cb9b19c6d308  25.4 MB
8ea1d1012abea572e286cc819b6d85cc  816.8 kB
0433e789eef0cc0e52f8291d349fdeb2  835.6 kB
fd08a57ef4268d1badafbb7d9ce7905d  32.0 kB
eaec8ec2f89175e52b1674f1b6071256  14.2 kB
a80561b33c29cc0268b7088305c53d80  54.9 kB
a7a39ba606784b4ec2d9e3b344482b8b  1.3 MB
fb2852e0dec0e3380666a39b4684e144  47.0 MB
0d2c19dd27dfe1ad6d17a20bc27d580b  408.1 MB
880e157abae7f3347009263d57c4ba16  34.7 MB
a2126d57e2acfd6c626398fbc329e318  37.9 MB
a208030506ebd5d819f917024eb660db  46.8 MB
b04ec0b82cf2223449bf7d8d4a84c30c  41.3 MB

Additional details

References

  • Moayad Alnammi, Shengchao Liu, Spencer S. Ericksen, Gene E. Ananiev, Andrew F. Voter, Song Guo, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Evaluating scalable supervised learning for synthesize-on-demand chemical libraries. Journal of Chemical Information and Modeling 2023. doi:10.1021/acs.jcim.3c00912