A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics

Isigkeit, Laura; Chaikuad, Apirat; Merk, Daniel

doi:10.5281/zenodo.6398019

Published March 30, 2022 | Version 1.1

Dataset Open

1. Goethe University Frankfurt
2. Goethe University Frankfurt, Structural Genomics Consortium
3. Goethe University Frankfurt, Ludwig-Maximilians-Universität München

This is an updated version of the dataset published under doi:10.5281/zenodo.6320761.

Information

The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting that a consensus dataset would be beneficial. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144648 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell lines and phenotypic readouts). The dataset also provides simplified information on the assay types underlying the bioactivity data, and on bioactivity confidence, which is derived by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them to facilitate generic use across applications such as chemogenomics and data-driven drug design.

The consensus dataset provides increased target coverage and contains a higher number of molecules than any of the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to models built on a single source database. In addition, semi-automated structure and bioactivity annotation checks, with flags for divergent data from different sources, may aid data selection and further curation.

This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513

Structure and content of the dataset

**Dataset structure**
ChEMBL ID	PubChem ID	IUPHAR ID	Target	Activity type	Assay type	Unit	Mean C (0)	...	Mean PC (0)	...	Mean B (0)	...	Mean I (0)	...	Mean PD (0)	...	Activity check annotation	Ligand names	Canonical SMILES C	...	Structure check (Tanimoto)	Source

The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV file and as a compressed CSV file.

Except for the canonical SMILES columns, all columns have the data type ‘string’. The data type of the canonical SMILES columns is the SMILES format. We recommend the File Reader node for using the dataset in KNIME. This node allows the data types of the columns to be adjusted exactly. In addition, it is the only node that can read the compressed format.
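Outside KNIME, the same "everything is a string" convention can be kept when reading the file. The following is a minimal sketch using only the Python standard library. The function name `iter_rows`, the member name passed to it, the comma delimiter and the UTF-8 encoding are assumptions for illustration; none of them is documented on this page, so check them against the actual archive contents.

```python
import csv
import io
import zipfile
from typing import Dict, Iterator


def iter_rows(zip_path: str, member: str, delimiter: str = ",") -> Iterator[Dict[str, str]]:
    """Yield each row of a CSV file stored inside a zip archive as a dict of strings.

    `member` is the name of the CSV file inside the archive (hypothetical here;
    list the real names with zipfile.ZipFile(zip_path).namelist()).
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # csv works on text, so wrap the binary stream (UTF-8 is an assumption).
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # DictReader leaves every value as str, mirroring the 'string' column type.
            yield from csv.DictReader(text, delimiter=delimiter)
```

Because the function is a generator, rows are streamed one at a time instead of loading the whole file into memory, which may matter for a dataset of this size. Numeric conversion is deliberately left to the caller.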

Column content:

ChEMBL ID, PubChem ID, IUPHAR ID: compound identifiers from the respective databases
Target: biological target of the molecule expressed as the HGNC gene symbol
Activity type: for example, pIC₅₀
Assay type: simplified classification of the assay as cell-free, cellular, functional or unspecified
Unit: unit of bioactivity measurement
Mean columns of the databases: mean of the bioactivity values, or the activity comments, followed by the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) means that the value for this compound-target pair occurs 15 times in the ChEMBL database
Activity check annotation: bioactivity values from the different sources were compared, and an activity check annotation was added to provide automated activity validation for additional confidence:
- no comment: bioactivity values are within one log unit;
- check activity data: bioactivity values are not within one log unit;
- only one data point: only one value was available, no comparison and no range calculated;
- no activity value: no precise numeric activity value was available;
- no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
Ligand names: all unique names contained in the five source databases are listed
Canonical SMILES columns: molecular structure of the compound from each database
Structure check (Tanimoto): denotes whether the compound structures from the different source databases match or differ:
- match: molecule structures are the same between different sources;
- no match: the structures differ. We calculated the Jaccard-Tanimoto similarity coefficient from Morgan Fingerprints to reveal true differences between sources and reported the minimum value;
- 1 structure: no structure comparison is possible, because there was only one structure available;
- no structure: no structure comparison is possible, because there was no structure available.
Source: the databases from which the data originate
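The column conventions described above can be illustrated with a short Python sketch. It is not the authors' KNIME workflow, and all function names are hypothetical. It assumes that a mean cell looks exactly like the example `7.5 *(15)`, that "within one log unit" means the spread between the largest and smallest negative-log value is at most 1.0, that an empty list of values corresponds to "no activity value", and that fingerprints are given as sets of on-bit indices (generating Morgan fingerprints is left to a cheminformatics toolkit).

```python
import re
from itertools import combinations
from typing import Optional, Sequence, Set, Tuple, Union

# Assumed cell layout, taken from the example "7.5 *(15)": <value> *(<count>)
_MEAN_CELL = re.compile(r"^\s*(?P<value>.+?)\s*\*\((?P<count>\d+)\)\s*$")


def parse_mean_cell(cell: str) -> Optional[Tuple[Union[float, str], int]]:
    """Split a mean cell into (value, frequency); return None if it does not fit."""
    match = _MEAN_CELL.match(cell)
    if match is None:
        return None
    raw = match.group("value")
    try:
        value: Union[float, str] = float(raw)
    except ValueError:
        value = raw  # an activity comment rather than a number
    return value, int(match.group("count"))


def activity_check(p_values: Sequence[float], tolerance: float = 1.0) -> str:
    """Label negative-log activity values from different sources (assumed rule)."""
    if len(p_values) == 0:
        return "no activity value"
    if len(p_values) == 1:
        return "only one data point"
    spread = max(p_values) - min(p_values)
    return "no comment" if spread <= tolerance else "check activity data"


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Jaccard-Tanimoto coefficient of two fingerprints given as on-bit sets."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(bits_a & bits_b) / len(union)


def min_pairwise_tanimoto(fingerprints: Sequence[Set[int]]) -> Optional[float]:
    """Minimum similarity over all pairs; None if fewer than two structures."""
    if len(fingerprints) < 2:
        return None
    return min(tanimoto(a, b) for a, b in combinations(fingerprints, 2))
```

For instance, `parse_mean_cell("7.5 *(15)")` returns `(7.5, 15)`, and `activity_check([5.0, 7.2])` returns `"check activity data"` because the two values differ by more than one log unit.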

Files

Name	Size
Dataset_v1.1.zip (md5:cd3870b97fb86c19ea91b394e7f45983)	454.6 MB
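The MD5 checksum listed for Dataset_v1.1.zip can be used to confirm that a download is complete and unaltered. A minimal sketch with Python's `hashlib` follows; the function name `md5_of_file` and the local file path in the usage note are illustrative assumptions.

```python
import hashlib

# Checksum as listed for Dataset_v1.1.zip on this page.
EXPECTED_MD5 = "cd3870b97fb86c19ea91b394e7f45983"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so the archive never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download saved as `Dataset_v1.1.zip` in the working directory could then be checked with `md5_of_file("Dataset_v1.1.zip") == EXPECTED_MD5`.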

Additional details

Is supplement to: Journal article: 10.3390/molecules27082513 (DOI)

Funding: European Commission
EUbOPEN: Enabling and Unlocking biology in the OPEN (grant 875510)

