Published December 3, 2021 | Version v1
Dataset Open

A new ChEMBL dataset for FastTargetPred and target fishing for an exhaustive list of linear tetrapeptides


  • 1. Inserm


Project leader:

  • 1. Inserm


A ChEMBL-v29 dataset was generated for use with FastTargetPred, a ligand-based similarity-search target prediction engine. Using this new dataset, we attempted to predict macromolecular targets for a published dataset of 160,000 tetrapeptides.

The dbchembl29 directory contains all the files needed by FastTargetPred. This command-line tool compares, using different types of fingerprints, a file of small query molecules in SDF format (one molecule or a collection) with molecules extracted from ChEMBL29. A match indicates that the query molecule is similar to a ChEMBL compound. Because that ChEMBL compound has bioactivity data against one or more macromolecular targets, the query compound could bind to the same targets. This is the so-called similarity principle in chemistry. Fingerprints are computed with a collection of Perl and Python scripts for chemoinformatics and (structural) bioinformatics.
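As a rough illustration of the similarity comparison described above, the sketch below computes a Tanimoto coefficient between two fingerprints represented as sets of "on" bit positions. It is not FastTargetPred's own code; the function name and the toy bit positions are made up for the example, and real ECFP4 or MACCS fingerprints would come from a chemoinformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints.

    Each fingerprint is a set of 'on' bit positions:
    shared bits divided by the bits set in either fingerprint.
    """
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Toy fingerprints (made-up bit positions, not real ECFP4 or MACCS output).
query = {1, 4, 7, 9, 12}
reference = {1, 4, 7, 9, 15}

score = tanimoto(query, reference)
is_hit = score >= 0.6  # 0.6 is the default threshold mentioned in this record
print(f"Tanimoto = {score:.3f}, hit = {is_hit}")
```

A pair scoring at or above the chosen threshold would count as a match, and the targets annotated for the reference compound would then be proposed for the query.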

To run FastTargetPred with the new ChEMBL 29 data, you just need to unzip the dbchembl29 directory into the FastTargetPred main directory.

The default FastTargetPred commands (e.g., python3 rivaroxaban.sdf) use ECFP4 fingerprints and a Tanimoto coefficient of 0.6. Rivaroxaban is the query compound here; it is provided in the extra_data directory. These default commands use the data in the default db directory, which is a curated version of the ChEMBL-25 release, the version of the ChEMBL database available when FastTargetPred was developed. Many new molecules have been added since then, which is why we generated the ChEMBL-29 dataset (the latest ChEMBL version at the time of writing).

To use the new ChEMBL-29 data, you can run the following command:

python3 rivaroxaban.sdf -fp MACCS -tc 0.9 -db dbchembl29/chembl29_active

This runs a similarity search for the query compound (here rivaroxaban; you can, for instance, move this SDF file into the FastTargetPred main directory) using MACCS fingerprints and a Tanimoto coefficient threshold of 0.9. The -db option forces the program to use the curated ChEMBL29 data instead of the default ChEMBL-25 data. A threshold of 0.9 restricts the search to ChEMBL molecules that are very similar to rivaroxaban. To find more distantly related compounds, a value of 0.7 can be used. Users can try different values or try consensus scoring. See: FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases. Chaput et al., Bioinformatics. 2020 Aug 15;36(14):4225-4226.
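To try several thresholds in one go, the command line can be assembled programmatically. The sketch below only builds and prints the commands; it does not run them. The -fp, -tc and -db options and the database path are taken from the command above, whereas the SCRIPT path is a placeholder (the script name is not given here), and placing it between python3 and the query file is an assumption.

```python
import shlex

# Placeholder: set this to the actual FastTargetPred script in your installation.
SCRIPT = "path/to/fasttargetpred_script.py"  # hypothetical path


def build_command(script, query_sdf, fingerprint, threshold, db):
    """Return the argument list for one similarity search."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("Tanimoto threshold must be in (0, 1]")
    return ["python3", script, query_sdf,
            "-fp", fingerprint, "-tc", str(threshold), "-db", db]


# One command per threshold: strict (0.9) and more permissive (0.7).
commands = [
    build_command(SCRIPT, "rivaroxaban.sdf", "MACCS", tc,
                  "dbchembl29/chembl29_active")
    for tc in (0.9, 0.7)
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run once the script path is set.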

The chembl29 directory also contains 714,780 compounds (canonical SMILES strings) extracted from ChEMBL29 that have bioactivity data. Fingerprints could not be computed for 19 molecules that have unusual chemistry. Compounds were selected using thresholds such as the following: binding assays, activity against targets below 20 micromolar, and a ChEMBL confidence_score of 6 or above (maximum = 9).
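The selection thresholds listed above can be pictured as a simple record filter. This is a sketch only, not the extraction procedure actually used: the Activity class, its field names and the example records are hypothetical, and "B" is ChEMBL's assay_type code for binding assays.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """One bioactivity record (hypothetical, simplified layout)."""
    compound_id: str
    assay_type: str        # "B" = binding assay in ChEMBL
    value_um: float        # activity value in micromolar
    confidence_score: int  # ChEMBL target confidence, 0 to 9


def passes_thresholds(a):
    """Binding assay, activity below 20 micromolar, confidence_score >= 6."""
    return a.assay_type == "B" and a.value_um < 20.0 and a.confidence_score >= 6


# Made-up records for illustration only.
records = [
    Activity("CPD-1", "B", 0.5, 9),   # kept
    Activity("CPD-2", "B", 50.0, 9),  # too weak
    Activity("CPD-3", "F", 0.5, 9),   # not a binding assay
    Activity("CPD-4", "B", 5.0, 5),   # confidence too low
]

kept = [r.compound_id for r in records if passes_thresholds(r)]
print(kept)
```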

With this new dataset, we used FastTargetPred to predict potential targets for 160,000 input query peptides (4 amino acids each, i.e. 20 x 20 x 20 x 20 combinations) previously reported by Dewi Prasasty and Perdana Istyastono, Data in brief 27 (2019) 104607. The peptides for which a putative target was predicted are listed in two DataWarrior files. Each entry gives the amino acid sequence of the query peptide, the similar compounds found in the ChEMBL29 dataset (fingerprints = ECFP4, Tanimoto 0.6) with their compound IDs, the target ChEMBL IDs, the mapping to the UniProt database when available, information about disease involvement, and Reactome pathway database identifiers. Both DataWarrior files can be searched and sorted, and they include hyperlinks to the ChEMBL, UniProt and Reactome databases.
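The figure of 160,000 query peptides follows from taking every ordered combination of the 20 standard amino acids at four positions. A minimal sketch of that enumeration, using one-letter codes, is shown below. The alphabetical ordering and the variable names are choices made for this example and need not match the published peptide dataset.

```python
from itertools import product

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every ordered sequence of length 4: 20 x 20 x 20 x 20 combinations.
tetrapeptides = ["".join(p) for p in product(AMINO_ACIDS, repeat=4)]

print(len(tetrapeptides))
print(tetrapeptides[:3])
```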



Files (481.0 MB)