Machine learning prediction of novel anthelmintics

Taki, Aya; Kapp, Louis; Hall, Ross; Gasser, Robin; Hofmann, Andreas

doi:10.5281/zenodo.14511148

Published December 19, 2024 | Version v2

Technical note Open

1. University of Melbourne
2. Max Rubner Institut

# Machine learning prediction of novel anthelmintics

This repository contains scripts that were used to obtain the findings reported in the study "Prediction and prioritisation of novel anthelmintic candidates from public databases by using deep learning and available bioactivity data sets" by Taki et al.

## Table of Contents

1. [Small-molecule bioactivity data used for training and validation](#1)
2. [Feature generation, model architecture and training](#2)
3. [Classification model](#3)
4. [Prediction of activities](#4)
5. [Clustering of compounds with predicted nematocidal activity](#5)
6. [Post-processing](#6)

## 1. Small-molecule bioactivity data used for training and validation<a name="1"></a>

The dataset of 15,162 small-molecule compounds used for training and validation has been published as [DOI:10.5281/zenodo.10929251](https://doi.org/10.5281/zenodo.10929251).

## 2. Feature generation, model architecture and training<a name="2"></a>

The scripts used are compiled in the folder [02_model_training](02_model_training).

| Script | Description | Input files |
|--------------------------------|--------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| [dl_mlp_class_run_training.py](02_model_training/dl_mlp_class_run_training.py) | Wrapper script to run [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) with variable network architectures | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) | Main script to perform training or prediction (`mode`) of MLPs with specified architecture | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| [mordred_descriptors.txt](02_model_training/mordred_descriptors.txt) | Line-separated list of Mordred descriptors | n/a |
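
Both scripts take a CSV file of labelled SMILES strings and a line-separated descriptor list. The following is a minimal sketch of how such inputs could be read with the Python standard library. The column names `smiles` and `label`, the function names and the sample descriptor names are assumptions for illustration, not taken from the repository's scripts.

```python
import csv
import io


def read_descriptor_names(handle):
    """Return descriptor names from a line-separated list, skipping blank lines."""
    return [line.strip() for line in handle if line.strip()]


def read_labelled_compounds(handle, smiles_col="smiles", label_col="label"):
    """Return (SMILES, label) pairs from a CSV file with a header row.

    The column names are hypothetical; adjust them to the actual file.
    """
    reader = csv.DictReader(handle)
    return [(row[smiles_col], row[label_col]) for row in reader]


# Small in-memory stand-ins for the two input files (illustrative content only).
descriptor_file = io.StringIO("nAtom\nnHeavyAtom\n\nSLogP\n")
compound_file = io.StringIO("smiles,label\nCCO,none\nc1ccccc1O,active\n")

descriptors = read_descriptor_names(descriptor_file)
compounds = read_labelled_compounds(compound_file)
print(len(descriptors), "descriptors;", len(compounds), "compounds")
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(...)`.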

## 3. Classification model<a name="3"></a>

The best classification model is located in the folder [03_model](03_model).

| File | Description |
|----------------------------------|-----------------------------------------|
| [m1002a_label_dictionary.json](03_model/m1002a_label_dictionary.json) | Classification labels used by the model |
| [m1002a_model_architecture.json](03_model/m1002a_model_architecture.json) | MLP architecture |
| [m1002a_model_weights.h5](03_model/m1002a_model_weights.h5) | The weights of the trained model |
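
An architecture JSON file paired with an HDF5 weights file matches the usual Keras convention, so the sketch below assumes the model was saved with Keras (`model.to_json` and `model.save_weights`). This is an assumption, not something the README states. The layout of the label dictionary (label name mapped to class index) is also hypothetical and should be checked against the JSON file.

```python
import json


def load_keras_model(architecture_path, weights_path):
    """Rebuild a model from an architecture JSON file and an HDF5 weights file.

    Assumes the files were written by Keras. TensorFlow is imported lazily so
    that the label helper below can be used without it.
    """
    from tensorflow.keras.models import model_from_json

    with open(architecture_path) as handle:
        model = model_from_json(handle.read())
    model.load_weights(weights_path)
    return model


def decode_predictions(class_scores, label_to_index):
    """Map each row of class scores to the label with the highest score."""
    index_to_label = {index: label for label, index in label_to_index.items()}
    return [
        index_to_label[max(range(len(row)), key=row.__getitem__)]
        for row in class_scores
    ]


# Hypothetical dictionary layout (label -> class index); check the JSON file.
label_to_index = json.loads('{"none": 0, "weakly active": 1, "active": 2}')
scores = [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]  # made-up model outputs
print(decode_predictions(scores, label_to_index))
```

Under these assumptions, `load_keras_model("03_model/m1002a_model_architecture.json", "03_model/m1002a_model_weights.h5")` would return the trained network, provided the installed Keras version is compatible with the one used for saving.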

## 4. Prediction of activities<a name="4"></a>

Compounds were classified by activity label using the wrapper script [dl_mlp_class_run_prediction.py](04_activity_prediction/dl_mlp_class_run_prediction.py) and the main script [dl_mlp_class_v1.py](04_activity_prediction/dl_mlp_class_v1.py), both located in the folder [04_activity_prediction](04_activity_prediction).

A dataset of 14.2 million compounds was downloaded from the [ZINC15 database](https://zinc15.docking.org) and used as a search library. The downloaded ZINC15 data is not included in this repository.

Files in folder [04_activity_prediction](04_activity_prediction):

| File | Description | Input files | Output files |
|----------------------------------|--------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------|
| [dl_mlp_class_run_prediction.py](04_activity_prediction/dl_mlp_class_run_prediction.py) | Wrapper script to run [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) in `prediction` mode | CSV-formatted file of library compounds with SMILES representation or h5-formatted file of encoded compounds; dictionary, architecture and weights of the model | n/a |
| [prepare_zinc_v3.py](04_activity_prediction/prepare_zinc_v3.py) | Script to encode compounds of the ZINC15 search library | CSV-formatted file of library compounds with SMILES representation | h5-formatted file of encoded compounds |
| [zinc_15_m1002a_active.csv.gz](04_activity_prediction/zinc_15_m1002a_active.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `active` | n/a | n/a |
| [zinc_15_m1002a_weak.csv.gz](04_activity_prediction/zinc_15_m1002a_weak.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `weakly active` | n/a | n/a |
| [zinc_15_m1002a_inactive.csv.gz](04_activity_prediction/zinc_15_m1002a_inactive.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `none` | n/a | n/a |
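
The three result files are gzip-compressed CSV files and can be streamed without unpacking them first. The sketch below shows one way to do this with the standard library. Since the README does not document the columns, the code makes no assumption about them, and the demo file with its two rows is made up for illustration.

```python
import csv
import gzip
import os
import tempfile


def iter_rows(path):
    """Yield the rows of a gzip-compressed CSV file as lists of strings."""
    with gzip.open(path, mode="rt", newline="") as handle:
        yield from csv.reader(handle)


def count_rows(path):
    """Count rows without loading the whole file into memory."""
    return sum(1 for _ in iter_rows(path))


# Write a tiny demo file so the sketch runs without the repository data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_active.csv.gz")
with gzip.open(demo_path, mode="wt", newline="") as handle:
    csv.writer(handle).writerows([["compound_1", "CCO"], ["compound_2", "CCN"]])

rows = list(iter_rows(demo_path))
print(count_rows(demo_path), "rows; first row:", rows[0])
```

To inspect the published predictions, `demo_path` would be replaced by a path such as `04_activity_prediction/zinc_15_m1002a_active.csv.gz`.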

## 5. Clustering of compounds with predicted nematocidal activity<a name="5"></a>

The scripts used in this step are located in the folder [05_clustering](05_clustering).

| File | Description | Input file | Output file |
|------|-------------|------------|-------------|
| [preprocess.py](05_clustering/preprocessing/preprocess.py) | Script that follows the approach of [Hadipour, H., Liu, C., Davis, R. et al.](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04667-1) to compute feature vectors of compounds. It calls `mol2global` from [global_feature_generation.py](05_clustering/preprocessing/global_feature_generation.py), `mol2local` from [local_feature_generation.py](05_clustering/preprocessing/local_feature_generation.py) and [`combine_and_drop_features`](05_clustering/preprocessing/combine_and_drop_features.py) | [Feather](https://arrow.apache.org/docs/python/feather.html)-formatted file with the SMILES representation of all compounds to be preprocessed | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contains the combined final features for each compound |
| [vae/train.py](05_clustering/vae/train.py) | Script that trains the variational autoencoder (VAE) defined in [vae.py](05_clustering/vae/vae.py) | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format containing the final features for each compound | A [PyTorch model checkpoint](https://pytorch.org/tutorials/beginner/saving_loading_models.html) containing the hyperparameter configuration and weights |
| [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py) | Script that generates embeddings from preprocessed features using the trained VAE. The embeddings are subsequently used for clustering. | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format containing the final features for each compound, plus the weights of the trained model from a given [checkpoint](05_clustering/vae/checkpoint/model.pt) | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contains the embeddings for all compounds |
| [vae/checkpoint/config.json](05_clustering/vae/checkpoint/config.json) | Hyperparameter configuration of the trained VAE model whose embeddings showed the best performance on the activity_prediction task | n/a | n/a |
| [vae/checkpoint/model.pt](05_clustering/vae/checkpoint/model.pt) | Checkpoint containing the weights of the trained VAE model whose embeddings showed the best performance on the activity_prediction task | n/a | n/a |
| [k_means.py](05_clustering/k_means.py) | Computes a label (1 to k) for each compound via k-means clustering and stores it along with evaluation metrics for different hyperparameters | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contains the embeddings for all compounds (output of [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py)) | Multiple files in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format. One file contains the computed label (1 to k) for each compound; the other files contain the evaluation scores (silhouette, Davies-Bouldin, Calinski-Harabasz) for different hyperparameter configurations |
| [tsne.py](05_clustering/tsne.py) | Visualizes the k-means-clustered embeddings after reducing them to two dimensions with t-SNE | Files in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contain the computed label (1 to k) for each compound (output of [k_means.py](05_clustering/k_means.py)) and the embedding for each compound (output of [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py)). The path to a [Feather](https://arrow.apache.org/docs/python/feather.html)-formatted file containing the SMILES strings of all compounds in the correct order is also needed. | A [Feather](https://arrow.apache.org/docs/python/feather.html) file containing the t-SNE coordinates of every compound and a graphical visualization (can be saved in any image format) |
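
To illustrate the clustering step, here is a minimal plain-Python sketch of k-means (Lloyd's algorithm) that returns labels in the range 1 to k, as described for `k_means.py`. It is not the repository's implementation: the function name, the explicit initial centroids and the toy two-dimensional "embeddings" are assumptions chosen to keep the example self-contained and deterministic.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Cluster points with Lloyd's algorithm; return labels in 1..k and centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j])) + 1
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments did not change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j + 1]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy stand-in for VAE embeddings: two well-separated groups.
embeddings = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels, centroids = k_means(embeddings, centroids=[embeddings[0], embeddings[2]])
print(labels)
```

For data at the scale of the predicted actives, an optimized library implementation would be the practical choice; the sketch only makes the assignment and update steps explicit.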

## 6. Post-processing<a name="6"></a>

The scripts used in this step are located in the folder [06_post_processing](06_post_processing).

| File | Description |
|---------------------|------------------------------------------------------------------------|
| [molport_search.py](06_post_processing/molport_search.py) | Searches for availability of compounds for purchase on [Molport](https://www.molport.com/shop/index) |
| [patent_scraper.py](06_post_processing/patent_scraper.py) | Scrapes [Google Patents](https://patents.google.com) for a list of compounds and searches for keywords in the patent titles/snippets |
| [lipinski_checker.py](06_post_processing/lipinski_checker.py) | Checks a list of compounds for adherence to Lipinski's Rule of Five |
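
As a reminder of what the Rule of Five tests, the sketch below counts violations of its four criteria (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) from precomputed properties. The class and function names, the tolerance of one violation and the property values are assumptions for illustration; how `lipinski_checker.py` computes the properties and which threshold it applies is not stated in this README.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mol_weight: float       # molecular weight in Da
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(props):
    """Count how many of the four Rule of Five criteria are violated."""
    checks = [
        props.mol_weight > 500,
        props.log_p > 5,
        props.h_bond_donors > 5,
        props.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(props, max_violations=1):
    """Treat a compound as compliant if it has at most `max_violations` violations."""
    return lipinski_violations(props) <= max_violations


# Made-up property values, not real compounds.
small = Properties(mol_weight=310.4, log_p=2.8, h_bond_donors=2, h_bond_acceptors=5)
bulky = Properties(mol_weight=720.0, log_p=6.3, h_bond_donors=2, h_bond_acceptors=8)
print(passes_rule_of_five(small), passes_rule_of_five(bulky))
```

Setting `max_violations=0` gives the strict reading in which every criterion must be met.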

Files

data_publication_scripts.zip (127.7 MB), md5:bfa251deda7d673f54fc2dce6a33c084

Additional details

Is supplemented by: Dataset: 10.5281/zenodo.10929251 (DOI)

Funding

Australian Research Council: New Anti-Parasitic Drugs for a Global Veterinary Market (LP190101209)
Australian Research Council: Illuminating Genomic Dark Matter to Develop New Interventions for Parasites (LP180101085)
Australian Research Council: Artificial intelligence to explore and combat eukaryotic pathogens (LP220200614)
