There is a newer version of the record available.

Published November 18, 2024 | Version v1
Technical note Open

Machine learning prediction of novel anthelmintics

  • 1. University of Melbourne
  • 2. Max Rubner Institut

Description

# Machine learning prediction of novel anthelmintics

This repository contains scripts that were used to obtain the findings reported in the study "Prediction and prioritisation of novel anthelmintic candidates from public databases by using deep learning and available bioactivity data sets" by Taki et al.


Contents:

1. [Small-molecule bioactivity data used for training and validation](#1)
2. [Feature generation, model architecture and training](#2)
3. [Classification model](#3)
4. [Prediction of activities](#4)
5. [Post-processing: MolPort availability](#5)
6. [Clustering of compounds with predicted nematocidal activity](#6)


## 1. Small-molecule bioactivity data used for training and validation<a name="1"></a>

The dataset of 15,162 small-molecule compounds used for training and validation is published at [DOI:10.5281/zenodo.10929251](https://doi.org/10.5281/zenodo.10929251).


## 2. Feature generation, model architecture and training<a name="2"></a>

The scripts used for this step are in the folder [01_model_training](/01_model_training).

| Script                         | Description                                                                                | Input files                                                                                                                     |
|--------------------------------|--------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| `dl_mlp_class_run_training.py` | Wrapper script to run `dl_mlp_class_v1.4.py` with variable network architectures           | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| `dl_mlp_class_v1.4.py`         | Main script to perform training or prediction (`mode`) of MLPs with specified architecture | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| `mordred_descriptors.txt`      | Line-separated list of Mordred descriptors                                                 | n/a                                                                                                                             |
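
The two input files described in the table can be read with the Python standard library alone. The following is a minimal sketch, not code from the repository: the function names, the column names `smiles` and `label`, and the sample descriptor entries are assumptions made for illustration, since the actual CSV headers are not stated here.

```python
import csv
import io


def read_descriptor_names(handle):
    """Return descriptor names from a line-separated list, skipping blank lines."""
    return [line.strip() for line in handle if line.strip()]


def read_labelled_compounds(handle, smiles_col="smiles", label_col="label"):
    """Return (SMILES, label) pairs from a CSV file with a header row.

    The default column names are assumptions, not taken from the repository.
    """
    reader = csv.DictReader(handle)
    return [(row[smiles_col], row[label_col]) for row in reader]


if __name__ == "__main__":
    # In-memory stand-ins for the descriptor list and the training CSV.
    descriptors = read_descriptor_names(io.StringIO("nAtom\nnHeavyAtom\n\nSLogP\n"))
    compounds = read_labelled_compounds(
        io.StringIO("smiles,label\nCCO,none\nc1ccccc1,active\n")
    )
    print(descriptors)
    print(compounds)
```

With real files, pass an open file handle (for example `open("mordred_descriptors.txt", encoding="utf-8")`) in place of the `io.StringIO` objects.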


## 3. Classification model<a name="3"></a>

The best classification model is located in folder [02_model](/02_model).

| File                             | Description                             |
|----------------------------------|-----------------------------------------|
| `m1002a_label_dictionary.json`   | Classification labels used by the model |
| `m1002a_model_architecture.json` | MLP architecture                        |
| `m1002a_model_weights.h5`        | The weights of the trained model        |
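
A label dictionary is typically used to translate the class scores produced by a classifier back into label names. The sketch below illustrates that step only. It assumes the JSON file maps each label name to an integer class index; the real layout of `m1002a_label_dictionary.json` may differ (for example, index to label), and the function names and the example ordering of labels are hypothetical. Loading the network itself requires a deep-learning framework and is not shown.

```python
import json


def load_label_dictionary(path):
    """Read a JSON label dictionary from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def decode_prediction(probabilities, label_to_index):
    """Return the label whose class index has the highest score.

    Assumes label_to_index maps label name -> integer class index.
    """
    index_to_label = {index: label for label, index in label_to_index.items()}
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_label[best_index]


if __name__ == "__main__":
    # Hypothetical ordering of the three labels mentioned in this README.
    labels = {"active": 0, "weakly active": 1, "none": 2}
    print(decode_prediction([0.1, 0.7, 0.2], labels))
```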


## 4. Prediction of activities<a name="4"></a>

Compounds were classified by activity label using the wrapper script `dl_mlp_class_run_prediction.py` (folder `03_activity_prediction`) and the main script `dl_mlp_class_v1.4.py` (folder `01_model_training`).

A dataset of 14.2 million compounds was downloaded from the ZINC15 database at <https://zinc15.docking.org/> and used as a search library. The downloaded ZINC15 data are not included in this repository.

Files in folder [03_activity_prediction](/03_activity_prediction):

| File                             | Description                                                                          | Input files                                                                                                                                                     | Output files                           |
|----------------------------------|--------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------| 
| `dl_mlp_class_run_prediction.py` | Wrapper script to run `dl_mlp_class_v1.4.py` in `prediction` mode                    | CSV-formatted file of library compounds with SMILES representation or h5-formatted file of encoded compounds; dictionary, architecture and weights of the model | n/a                                    |
| `prepare_zinc_v3.py`             | Script to encode compounds of the ZINC15 search library                              | CSV-formatted file of library compounds with SMILES representation                                                                                              | h5-formatted file of encoded compounds |
| `zinc_15_m1002a_active.csv.gz`   | Compounds from the tested ZINC15 search library with predicted label `active`        | n/a                                                                                                                                                             | n/a                                    |
| `zinc_15_m1002a_weak.csv.gz`     | Compounds from the tested ZINC15 search library with predicted label `weakly active` | n/a                                                                                                                                                             | n/a                                    |
| `zinc_15_m1002a_inactive.csv.gz` | Compounds from the tested ZINC15 search library with predicted label `none`          | n/a                                                                                                                                                             | n/a                                    |
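
The gzip-compressed result files can be streamed row by row without decompressing them to disk first, which matters for the larger files. This is a minimal sketch under the assumption that each file is a CSV with a header row; the function names and the column names in the demo are invented for illustration.

```python
import csv
import gzip
import os
import tempfile


def iter_predicted_compounds(path):
    """Yield one dict per data row of a gzip-compressed CSV file with a header."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def count_rows(path):
    """Count data rows without loading the whole file into memory."""
    return sum(1 for _ in iter_predicted_compounds(path))


if __name__ == "__main__":
    # Self-contained demo on a tiny stand-in file; the column names are invented.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "demo.csv.gz")
        with gzip.open(demo_path, mode="wt", encoding="utf-8", newline="") as handle:
            handle.write("compound_id,smiles\nC1,CCO\nC2,CCN\n")
        print(count_rows(demo_path))
```

For the repository files, a call such as `count_rows("zinc_15_m1002a_active.csv.gz")` would report the number of compounds in that file.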


## 5. Post-processing: MolPort availability<a name="5"></a>

The scripts used in this step are located in the folder [04_post_processing](/04_post_processing) and were executed using Jupyter Notebooks.

| File                | Description                                                            |
|---------------------|------------------------------------------------------------------------|
| `molport_search.py` | Searches MolPort for availability of compounds                         |
| `patent_scraper.py` | Scrapes the site <https://patents.google.com> for a list of compounds  |


## 6. Clustering of compounds with predicted nematocidal activity<a name="6"></a>

The scripts used in this step are located in the folder [05_clustering](/05_clustering) and were executed using Jupyter Notebooks.

| File         | Description                                                             |
|--------------|-------------------------------------------------------------------------|
| `k_means.py` | ??? Uses a pandas dataframe as input? Need to describe how to get this. |
| ???          | ???                                                                     |
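
The file name `k_means.py` suggests k-means clustering. As a generic illustration of that technique, and not a reproduction of the repository's script, the sketch below implements Lloyd's algorithm in pure Python. The function name, the seed, and the toy 2D vectors (standing in for per-compound feature vectors) are all assumptions.

```python
import random


def k_means(points, k, iterations=100, seed=0):
    """Cluster points (sequences of floats) into k groups with Lloyd's algorithm.

    Returns (centroids, assignments), where assignments[i] is the cluster
    index of points[i].
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


if __name__ == "__main__":
    toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                  (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, assignments = k_means(toy_points, k=2)
    print(assignments)
```

Because the initial centroids are drawn at random, cluster indices are arbitrary; only the grouping of points is meaningful.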

 


Additional details

Funding

Australian Research Council: New Anti-Parasitic Drugs for a Global Veterinary Market (LP190101209)
Australian Research Council: Illuminating Genomic Dark Matter to Develop New Interventions for Parasites (LP180101085)
Australian Research Council: Artificial intelligence to explore and combat eukaryotic pathogens (LP220200614)