Kinodata-3D: an in silico kinase-ligand complex dataset for kinase-focused machine learning.
Description
Project Description
Drug discovery pipelines nowadays rely on machine learning models to explore and evaluate large chemical spaces. While the inclusion of 3D complex information is considered to be beneficial, structural ML for affinity prediction suffers from data scarcity.
We provide kinodata-3D, a dataset of ~138 000 docked complexes to enable more robust training of 3D-based ML models for kinase activity prediction (see github.com/volkamerlab/kinodata-3D-affinity-prediction).
Dataset
1. Data
This dataset consists of three-dimensional protein-ligand complexes generated by computational docking with the OpenEye toolkit. The modeled proteins belong to the kinase family, for which a fair amount of structural data is available in the form of co-crystallized protein-ligand complexes in the PDB, enriched with KLIFS annotations. This enables template docking (OpenEye's POSIT functionality), in which ligand placement is guided by the pose of a similar co-crystallized ligand. The kinase-ligand pairs to dock are sourced from binding assay data in the public ChEMBL archive, version 33. In particular, we use kinase activity data as curated by the OpenKinome kinodata project. The final protein-ligand complexes are annotated with a predicted RMSD of the docked pose. The RMSD model is a simple neural network trained on a kinase-docking benchmark dataset using ligand (fingerprint) similarity, docking score (Chemgauss4), and POSIT probability (see the kinodata-3D repository).
The final dataset contains a total of 138 286 deduplicated kinase-ligand pairs, covering ~98 000 distinct compounds and ~271 distinct kinase structures.
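Because every complex carries a predicted RMSD, a common first step is to keep only the poses that the RMSD model considers reliable. The following is a minimal sketch of such a filter, not the authors' procedure. It assumes each record has already been read into a dictionary of string properties. The property name `predicted_rmsd` and the 2.0 Å cutoff are illustrative assumptions, so check the tags in the SDF file before relying on either.

```python
from typing import Iterable, Iterator, Mapping

# Hypothetical property name: inspect the SDF tags in the archive for the real one.
RMSD_KEY = "predicted_rmsd"


def filter_by_predicted_rmsd(
    records: Iterable[Mapping[str, str]],
    max_rmsd: float = 2.0,  # assumed cutoff in angstrom, not taken from the dataset
    key: str = RMSD_KEY,
) -> Iterator[Mapping[str, str]]:
    """Yield the records whose predicted RMSD is at most max_rmsd."""
    for record in records:
        raw = record.get(key)
        if raw is None:
            continue  # record carries no RMSD annotation
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # annotation is not a number
        if value <= max_rmsd:
            yield record
```

Records without a usable annotation are dropped rather than kept, which is the conservative choice when the goal is a high-confidence training subset.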
2. File structure
The archive kinodata_3d.zip uses the following file structure:
data/raw
|   kinodata_docked_with_rmsd.sdf.gz
|   pocket_sequences.csv
|   mol2/pocket
|   |   1_pocket.mol2
|   |   ...
The file kinodata_docked_with_rmsd.sdf.gz contains the docked ligand poses and the information on each protein-ligand pair inherited from kinodata. The protein pockets located in mol2/pocket are stored in the MOL2 file format.
The pocket structures were sourced from KLIFS (klifs.net) and complement the ligand poses in the aforementioned SDF file. The files are named {klifs_structure_id}_pocket.mol2. The structure ID is given in the SDF file along with the ligand poses.
The file pocket_sequences.csv contains all KLIFS pocket sequences relevant to the kinodata-3D dataset.
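The pairing between ligand poses and pocket files described above can be sketched with the Python standard library alone. This is an illustrative reader, not the loader used in the kinodata-3D repositories. The helper names `iter_sdf_records` and `pocket_path` are hypothetical, and the SDF tag that holds the KLIFS structure ID is not stated here, so it has to be looked up in the file. For chemistry-aware parsing, a cheminformatics toolkit is the better choice.

```python
import gzip
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def _split_record(lines: List[str]) -> Tuple[str, Dict[str, str]]:
    """Split one SDF record into its mol block and its data fields."""
    # The mol block ends with the "M  END" line; data fields follow it.
    end = next(
        (i + 1 for i, text in enumerate(lines) if text.startswith("M  END")),
        len(lines),
    )
    molblock = "\n".join(lines[:end])
    fields: Dict[str, str] = {}
    name = None
    for text in lines[end:]:
        if text.startswith(">"):
            # A field header looks like "> <NAME>".
            start, stop = text.find("<"), text.rfind(">")
            name = text[start + 1:stop] if 0 <= start < stop else None
            if name is not None:
                fields[name] = ""
        elif name is not None:
            if text == "":
                name = None  # a blank line closes the current field
            else:
                fields[name] = f"{fields[name]}\n{text}" if fields[name] else text
    return molblock, fields


def iter_sdf_records(path: Path) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (molblock, fields) for every record in a gzipped SDF file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        lines: List[str] = []
        for raw in handle:
            text = raw.rstrip("\r\n")
            if text.strip() == "$$$$":  # SDF record separator
                yield _split_record(lines)
                lines = []
            else:
                lines.append(text)


def pocket_path(raw_dir: Path, klifs_structure_id) -> Path:
    """Build the pocket file path from the documented naming scheme."""
    return Path(raw_dir) / "mol2" / "pocket" / f"{klifs_structure_id}_pocket.mol2"
```

A typical use is to iterate over `iter_sdf_records(Path("data/raw") / "kinodata_docked_with_rmsd.sdf.gz")`, read the structure ID from the appropriate field of each record, and pass it to `pocket_path` to locate the matching pocket.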
3. Related code
The code used to create the poses can be found in the kinodata-3D repository. The docking pipeline makes heavy use of the kinoml framework, which in turn uses OpenEye's POSIT template docking implementation. The details of the original pipeline are described in the manuscript by Schaller et al. (2023), "Benchmarking Cross-Docking Strategies for Structure-Informed Machine Learning in Kinase Drug Discovery", bioRxiv.
Files
Name | Size |
---|---|
kinodata_3d.zip (md5:76fe232d7b4b1dbbcd302808599c1c3b) | 233.5 MB |
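The MD5 checksum listed for kinodata_3d.zip can be used to confirm that a download is complete and uncorrupted. The sketch below uses only Python's standard `hashlib` module. The function names and the local file path are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as given in the file listing of this record.
EXPECTED_MD5 = "76fe232d7b4b1dbbcd302808599c1c3b"


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected: str = EXPECTED_MD5) -> bool:
    """Check whether the file at path matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("kinodata_3d.zip")  # assumed download location
    if archive.exists():
        print("checksum ok" if verify_archive(archive) else "checksum mismatch")
```

A mismatch usually indicates an interrupted or corrupted download, in which case the archive should be fetched again before unpacking.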
Additional details
Related works
- Is derived from
  - Software: https://github.com/volkamerlab/kinodata-3D
- Is required by
  - Software: https://github.com/volkamerlab/kinodata-3D-affinity-prediction
Dates
- Updated: 2024-03-22