# Machine learning prediction of novel anthelmintics
This repository contains scripts that were used to obtain the findings reported in the study "Prediction and prioritisation of novel anthelmintic candidates from public databases by using deep learning and available bioactivity data sets" by Taki et al.
Contents:
- [1. Small-molecule bioactivity data used for training and validation](#1)
- [2. Feature generation, model architecture and training](#2)
- [3. Classification model](#3)
- [4. Prediction of activities](#4)
- [5. Post-processing: MolPort availability](#5)
- [6. Clustering of compounds with predicted nematocidal activity](#6)
## 1. Small-molecule bioactivity data used for training and validation<a name="1"></a>
The dataset of 15,162 small-molecule compounds used for training and validation has been published separately as [DOI:10.5281/zenodo.10929251](https://doi.org/10.5281/zenodo.10929251).
## 2. Feature generation, model architecture and training<a name="2"></a>
The scripts used in this step are located in the folder [01_model_training](/01_model_training).
| Script | Description | Input files |
|--------------------------------|--------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| `dl_mlp_class_run_training.py` | Wrapper script to run `dl_mlp_class_v1.4.py` with variable network architectures | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| `dl_mlp_class_v1.4.py` | Main script to perform training or prediction (`mode`) of MLPs with specified architecture | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| `mordred_descriptors.txt` | Line-separated list of Mordred descriptors | n/a |
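
To illustrate the two kinds of input file listed above, the following minimal sketch reads a line-separated descriptor list and a CSV file of SMILES strings with labels, using only the Python standard library. The column names `smiles` and `label`, the function names and the sample content are assumptions made for illustration; the repository scripts may expect a different layout.

```python
import csv
import io


def read_descriptor_names(handle):
    """Return descriptor names from a line-separated text stream, skipping blank lines."""
    return [line.strip() for line in handle if line.strip()]


def read_labelled_compounds(handle, smiles_col="smiles", label_col="label"):
    """Return (SMILES, label) pairs from a CSV stream that has a header row.

    The column names are assumed; adjust them to match the real file.
    """
    reader = csv.DictReader(handle)
    return [(row[smiles_col], row[label_col]) for row in reader]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real files (made-up content).
    descriptor_file = io.StringIO("nAtom\nnHeavyAtom\n\nSLogP\n")
    compound_file = io.StringIO("smiles,label\nCCO,none\nc1ccccc1,active\n")

    print(read_descriptor_names(descriptor_file))
    print(read_labelled_compounds(compound_file))
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.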
## 3. Classification model<a name="3"></a>
The best classification model is located in folder [02_model](/02_model).
| File | Description |
|----------------------------------|-----------------------------------------|
| `m1002a_label_dictionary.json` | Classification labels used by the model |
| `m1002a_model_architecture.json` | MLP architecture |
| `m1002a_model_weights.h5` | The weights of the trained model |
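
As a rough sketch of how a label dictionary can be used, the snippet below maps a vector of class probabilities back to a label name. It assumes the JSON maps label names to integer class indices; the actual structure of `m1002a_label_dictionary.json` is not described here, and the example dictionary and function names are hypothetical.

```python
import json


def load_index_to_label(json_text):
    """Parse a label dictionary and return an index -> label mapping.

    Assumes the JSON maps label names to integer class indices.
    """
    label_to_index = json.loads(json_text)
    return {int(index): label for label, index in label_to_index.items()}


def decode_prediction(probabilities, index_to_label):
    """Return the label of the class with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_label[best_index]


if __name__ == "__main__":
    # Hypothetical dictionary content, for illustration only.
    example_json = '{"active": 0, "weakly active": 1, "none": 2}'
    index_to_label = load_index_to_label(example_json)
    print(decode_prediction([0.1, 0.7, 0.2], index_to_label))
```

If the real dictionary stores the mapping in the opposite direction (index to label), `load_index_to_label` would need to be adapted accordingly.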
## 4. Prediction of activities<a name="4"></a>
Compounds were classified by activity label using the wrapper script `dl_mlp_class_run_prediction.py` (in the folder `03_activity_prediction`) together with the main script `dl_mlp_class_v1.4.py` (in the folder `01_model_training`).
A dataset of 14.2 million compounds was downloaded from the ZINC15 database at <https://zinc15.docking.org/> and used as a search library. The downloaded ZINC15 data are not included in this repository.
Files in folder [03_activity_prediction](/03_activity_prediction):
| File | Description | Input files | Output files |
|----------------------------------|--------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------|
| `dl_mlp_class_run_prediction.py` | Wrapper script to run `dl_mlp_class_v1.4.py` in `prediction` mode | CSV-formatted file of library compounds with SMILES representation or h5-formatted file of encoded compounds; dictionary, architecture and weights of the model | n/a |
| `prepare_zinc_v3.py` | Script to encode compounds of the ZINC15 search library | CSV-formatted file of library compounds with SMILES representation | h5-formatted file of encoded compounds |
| `zinc_15_m1002a_active.csv.gz` | Compounds from the tested ZINC15 search library with predicted label `active` | n/a | n/a |
| `zinc_15_m1002a_weak.csv.gz` | Compounds from the tested ZINC15 search library with predicted label `weakly active` | n/a | n/a |
| `zinc_15_m1002a_inactive.csv.gz` | Compounds from the tested ZINC15 search library with predicted label `none` | n/a | n/a |
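
The prediction results are distributed as gzip-compressed CSV files, which can be read directly without unpacking them first. The following minimal sketch counts the data rows in such a file using only the standard library. It assumes that each file starts with a header row, which is not confirmed here; the function name and demo content are illustrative.

```python
import csv
import gzip
import os
import tempfile


def count_rows(path):
    """Count data rows (excluding the header) in a gzip-compressed CSV file."""
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, assumed to be present
        return sum(1 for _ in reader)


if __name__ == "__main__":
    # Write a small made-up file so the example runs without the real data.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "demo.csv.gz")
        with gzip.open(demo_path, mode="wt", newline="") as handle:
            handle.write("smiles\nCCO\nCCN\nc1ccccc1\n")
        print(count_rows(demo_path))
```

For the real data, `count_rows("zinc_15_m1002a_active.csv.gz")` would be called on the downloaded file. Because rows are streamed one at a time, the approach also works for files too large to hold in memory.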
## 5. Post-processing: MolPort availability<a name="5"></a>
The scripts used in this step are located in the folder [04_post_processing](/04_post_processing) and were executed using Jupyter Notebooks.
| File | Description |
|---------------------|------------------------------------------------------------------------|
| `molport_search.py` | Searches MolPort for availability of compounds |
| `patent_scraper.py` | Scrapes the site <https://patents.google.com> for a list of compounds |
## 6. Clustering of compounds with predicted nematocidal activity<a name="6"></a>
The scripts used in this step are located in the folder [05_clustering](/05_clustering) and were executed using Jupyter Notebooks.
| File | Description |
|--------------|-------------------------------------------------------------------------|
| `k_means.py` | k-means clustering script; appears to take a pandas DataFrame as input (how to generate this input is not yet documented) |
| ??? | ??? |
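
Because the clustering step is not yet documented, the following is only a generic sketch of k-means clustering (Lloyd's algorithm) on numeric feature vectors, written in pure Python. It is not the code in `k_means.py`; the function names, the choice of random initial centroids and the toy data are all assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Cluster points with Lloyd's algorithm; return (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments unchanged, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


if __name__ == "__main__":
    # Toy two-dimensional data with two well-separated groups.
    toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels = k_means(toy_points, k=2)
    print(centroids)
    print(labels)
```

In practice, the feature vectors would be derived from the compounds (for example, molecular descriptors or fingerprints), and a library implementation would usually be preferred over a hand-written loop.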
Funding
- Australian Research Council: New Anti-Parasitic Drugs for a Global Veterinary Market (LP190101209)
- Australian Research Council: Illuminating Genomic Dark Matter to Develop New Interventions for Parasites (LP180101085)
- Australian Research Council: Artificial intelligence to explore and combat eukaryotic pathogens (LP220200614)