A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics
- 1. Goethe University Frankfurt
- 2. Goethe University Frankfurt, Structural Genomics Consortium
- 3. Goethe University Frankfurt, Ludwig-Maximilians-Universität München
Description
This is the updated version of the dataset from 10.5281/zenodo.6320761
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1,144,648 compounds with 10,915,362 bioactivities on 5,613 targets (including defined macromolecular targets as well as cell lines and phenotypic readouts). The dataset also provides simplified information on the assay types underlying the bioactivity data, and on bioactivity confidence obtained by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, so that the dataset can be used generically in multiple applications such as chemogenomics and data-driven drug design.
The consensus dataset provides increased target coverage and contains a higher number of molecules than any of the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks, with flags for divergent data from different sources, may aid data selection and further curation.
This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513
Structure and content of the dataset
| ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check (Tanimoto) | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.
All columns are stored with the datatype ‘string’, except for the canonical SMILES columns, which use the SMILES datatype. We recommend the File Reader node for using the dataset in KNIME. This node allows the data types of the columns to be set exactly. In addition, it is the only node that can read the compressed format.
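Outside KNIME, the CSV export can in principle be read with any CSV parser. The following is a minimal Python sketch, not part of the original workflow. It assumes a comma-delimited, UTF-8 encoded file with a header row that contains a `Target` column; the function name `iter_target_rows` is hypothetical. Rows are streamed one at a time, so the full table does not need to fit in memory, and every value stays a string, in line with the column types described above.

```python
import csv


def iter_target_rows(csv_path, target, delimiter=","):
    """Yield rows (dicts of strings) whose 'Target' column equals `target`.

    Assumptions: the file is UTF-8 encoded, has a header row, and uses
    `delimiter` as the field separator. Adjust these if the export differs.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        for row in reader:
            # Values are kept as strings; no numeric conversion happens here.
            if row.get("Target") == target:
                yield row
```

For example, `list(iter_target_rows(path, "EGFR"))` would collect all rows for one gene symbol, where `path` points to the unpacked CSV file.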
Column content:
- ChEMBL ID, PubChem ID, IUPHAR ID: compound identifiers in the respective databases
- Target: biological target of the molecule expressed as the HGNC gene symbol
- Activity type: for example, pIC50
- Assay type: simplified classification of the assay as cell-free, cellular, functional or unspecified
- Unit: unit of bioactivity measurement
- Mean columns of the databases: mean of the bioactivity values, or the activity comments, each followed by the frequency of occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in the ChEMBL database
- Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
  - no comment: bioactivity values are within one log unit;
  - check activity data: bioactivity values are not within one log unit;
  - only one data point: only one value was available, no comparison and no range calculated;
  - no activity value: no precise numeric activity value was available;
  - no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration.
- Ligand names: all unique names contained in the five source databases are listed
- Canonical SMILES columns: Molecular structure of the compound from each database
- Structure check (Tanimoto): denotes whether the compound structures match or differ between the source databases
  - match: molecule structures are the same between different sources;
  - no match: the structures differ. We calculated the Jaccard-Tanimoto similarity coefficient from Morgan fingerprints to reveal true differences between sources and reported the minimum value;
  - 1 structure: no structure comparison is possible, because there was only one structure available;
  - no structure: no structure comparison is possible, because there was no structure available.
- Source: the databases from which the data originate
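The format of the mean columns and the logic of the activity check annotation can be illustrated with a short Python sketch. This is not the KNIME workflow used to build the dataset. The helper names `parse_mean_cell` and `activity_check` are hypothetical, the cell pattern is inferred only from the example `7.5 *(15)`, and "within one log unit" is interpreted here as a difference of at most 1.0 between the largest and smallest value. The annotation "no log-value could be calculated" depends on the reported unit and is not covered.

```python
import re

# Pattern inferred from the documented example "7.5 *(15)":
# a value or comment, an optional asterisk, then the count in parentheses.
MEAN_CELL = re.compile(r"^\s*(?P<value>.+?)\s*\*?\s*\((?P<count>\d+)\)\s*$")


def parse_mean_cell(cell):
    """Split a cell such as '7.5 *(15)' into (7.5, 15).

    Returns None for empty or unrecognised cells. Non-numeric values
    (activity comments) are returned unchanged as strings.
    """
    if not cell:
        return None
    match = MEAN_CELL.match(cell)
    if match is None:
        return None
    text = match.group("value")
    try:
        value = float(text)
    except ValueError:
        value = text  # an activity comment rather than a number
    return value, int(match.group("count"))


def activity_check(cells):
    """Assign an annotation from the mean cells of the different sources.

    Assumption: values are on a log scale, and 'within one log unit'
    means max - min <= 1.0 over all sources with a numeric value.
    """
    parsed = [parse_mean_cell(cell) for cell in cells]
    numeric = [p[0] for p in parsed if p is not None and isinstance(p[0], float)]
    if not numeric:
        return "no activity value"
    if len(numeric) == 1:
        return "only one data point"
    spread = max(numeric) - min(numeric)
    return "no comment" if spread <= 1.0 else "check activity data"
```

Under these assumptions, a call such as `activity_check(["7.5 *(15)", "9.0 *(1)"])` compares the two source means and flags the pair for manual inspection, because they differ by more than one log unit.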
Files
| Name | Size |
|---|---|
| Dataset_v1.1.zip (md5:cd3870b97fb86c19ea91b394e7f45983) | 454.6 MB |
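The MD5 checksum listed above can be used to confirm that the archive was downloaded completely. The following Python sketch shows one way to do this with the standard library; the function names `md5_of_file` and `is_intact` are hypothetical, and the local path of the downloaded archive is whatever the user chose.

```python
import hashlib

# Checksum as listed for Dataset_v1.1.zip in the file table.
EXPECTED_MD5 = "cd3870b97fb86c19ea91b394e7f45983"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so the archive is never fully held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

A call such as `is_intact("Dataset_v1.1.zip")` should return `True` for an undamaged download, assuming the file was saved under that name in the working directory.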
Additional details
Related works
- Is supplement to
- Journal article: 10.3390/molecules27082513 (DOI)