Kinodata-3D: an in silico kinase-ligand complex dataset for kinase-focused machine learning.

Backenköhler, Michael; Groß, Joschka; Wolf, Verena; Volkamer, Andrea

doi:10.5281/zenodo.10852507

Published March 22, 2024 | Version 2.0.0

Dataset Open

1. Saarland University

Contributors

Researchers:

Supervisors:

1. Saarland University
2. Helmholtz Institute for Pharmaceutical Research Saarland
3. Saarland University
4. German Research Centre for Artificial Intelligence

Project Description

Drug discovery pipelines nowadays rely on machine learning models to explore and evaluate large chemical spaces. While the inclusion of 3D complex information is considered to be beneficial, structural ML for affinity prediction suffers from data scarcity.
We provide kinodata-3D, a dataset of ~138 000 docked complexes to enable more robust training of 3D-based ML models for kinase activity prediction (see github.com/volkamerlab/kinodata-3D-affinity-prediction).

Dataset

1. Data

This dataset consists of three-dimensional protein-ligand complexes generated by computational docking with the OpenEye toolkit. The modeled proteins belong to the kinase family, for which a fair amount of structural data is available, i.e. co-crystallized protein-ligand complexes in the PDB, enriched through KLIFS annotations. This enables us to use template docking (OpenEye’s POSIT functionality), in which ligand placement is guided by the pose of a similar co-crystallized ligand. The kinase-ligand pairs to dock are sourced from binding assay data in the public ChEMBL archive, version 33. In particular, we use kinase activity data as curated by the OpenKinome kinodata project. The final protein-ligand complexes are annotated with a predicted RMSD of the docked poses. The RMSD model is a simple neural network trained on a kinase-docking benchmark dataset using ligand (fingerprint) similarity, docking score (Chemgauss4), and POSIT probability (see the kinodata-3D repository).

The final dataset contains a total of 138 286 deduplicated kinase-ligand pairs, covering ~98 000 distinct compounds and ~271 distinct kinase structures.

2. File structure

The archive kinodata_3d.zip uses the following file structure:

data/raw
|   kinodata_docked_with_rmsd.sdf.gz
|   pocket_sequences.csv
|   mol2/pocket
|   |   1_pocket.mol2
|   |   ...

The file kinodata_docked_with_rmsd.sdf.gz contains the docked ligand poses and the information on the protein-ligand pair inherited from kinodata. The protein pockets located in mol2/pocket are stored according to the MOL2 file format.
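As a minimal sketch of how the docked poses and their annotations could be read, the following iterates over the records of the gzipped SDF file using only the Python standard library. The helper names iter_sdf_records and _split_record are illustrative, not part of kinodata-3D, and the data field names stored in the file are not listed on this page, so the usage part only prints them for inspection. A cheminformatics toolkit would be the more robust choice for actual structure handling.

```python
import gzip
from pathlib import Path


def _split_record(lines):
    """Split one SDF record into its mol block and a dict of data fields."""
    # The mol block ends at the "M  END" line; data fields follow it.
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")),
               len(lines) - 1)
    molblock = "\n".join(lines[:end + 1])
    fields, name, values = {}, None, []
    for line in lines[end + 1:]:
        if line.startswith(">"):
            if name is not None:
                fields[name] = "\n".join(values).strip()
            # Field headers look like "> <field_name>".
            start, stop = line.find("<"), line.rfind(">")
            name = line[start + 1:stop] if 0 <= start < stop else line[1:].strip()
            values = []
        elif name is not None:
            values.append(line)
    if name is not None:
        fields[name] = "\n".join(values).strip()
    return molblock, fields


def iter_sdf_records(path):
    """Yield (molblock, fields) for each record in a gzipped SDF file."""
    lines = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line.strip() == "$$$$":  # SDF record separator
                if lines:
                    yield _split_record(lines)
                lines = []
            else:
                lines.append(line)
    if any(l.strip() for l in lines):  # trailing record without separator
        yield _split_record(lines)


# Path follows the archive layout described above, relative to the unzip location.
sdf_path = Path("data/raw/kinodata_docked_with_rmsd.sdf.gz")
if sdf_path.exists():
    for molblock, fields in iter_sdf_records(sdf_path):
        print(sorted(fields))  # inspect which annotations are available
        break
```

Streaming the file record by record avoids loading all ~138 000 poses into memory at once.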

The pocket structures were sourced from KLIFS (klifs.net) and complete the poses in the aforementioned SDF file. The files are named {klifs_structure_id}_pocket.mol2. The structure ID is given in the SDF file along with the ligand poses.
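The naming scheme can be expressed as a small path helper, sketched here under the assumption that the archive was extracted into the current directory. The function name pocket_path and the root parameter are hypothetical; the name of the SDF field that carries the KLIFS structure ID is not stated on this page and has to be looked up in the file itself.

```python
from pathlib import Path


def pocket_path(klifs_structure_id, root="data/raw"):
    """Build the path {root}/mol2/pocket/{klifs_structure_id}_pocket.mol2."""
    # The ID is normalized to a stripped string so that values read from
    # SDF text fields (e.g. "1\n") and integers give the same file name.
    structure_id = str(klifs_structure_id).strip()
    return Path(root) / "mol2" / "pocket" / f"{structure_id}_pocket.mol2"


print(pocket_path(1).as_posix())
```

Pairing each ligand pose with the file returned for its structure ID yields the full protein-ligand complex.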

The file pocket_sequences.csv contains all KLIFS pocket sequences relevant to the kinodata-3D dataset.
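Since the column layout of pocket_sequences.csv is not documented on this page, a cautious first step is to load the file and inspect its header. The sketch below does this with the standard csv module; the function name read_pocket_sequences is illustrative, and the file is assumed to be comma-separated with a header row.

```python
import csv
from pathlib import Path


def read_pocket_sequences(path):
    """Return (column_names, rows), where rows is a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # assumes a header row
        rows = list(reader)
        return list(reader.fieldnames or []), rows


# Path follows the archive layout, relative to the unzip location.
csv_path = Path("data/raw/pocket_sequences.csv")
if csv_path.exists():
    columns, rows = read_pocket_sequences(csv_path)
    print(columns)
    print(len(rows), "rows")
```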

3. Related code

The code used to create the poses can be found in the kinodata-3D repository. The docking pipeline makes heavy use of the kinoml framework, which in turn uses OpenEye's POSIT template docking implementation. Details of the original pipeline can also be found in Schaller et al. (2023), "Benchmarking Cross-Docking Strategies for Structure-Informed Machine Learning in Kinase Drug Discovery", bioRxiv.

Files

Files (233.5 MB)

Name	Size
kinodata_3d.zip (md5:76fe232d7b4b1dbbcd302808599c1c3b)	233.5 MB
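A downloaded copy of the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using hashlib; the function name md5sum and the local file location are assumptions.

```python
import hashlib
from pathlib import Path

# Checksum as listed for kinodata_3d.zip on this record.
EXPECTED_MD5 = "76fe232d7b4b1dbbcd302808599c1c3b"


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("kinodata_3d.zip")  # assumed download location
if archive.exists():
    actual = md5sum(archive)
    print("checksum OK" if actual == EXPECTED_MD5 else f"mismatch: {actual}")
```

Reading in chunks keeps memory use constant for the 233.5 MB archive.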

Additional details

Is derived from: Software: https://github.com/volkamerlab/kinodata-3D (URL)
Is required by: Software: https://github.com/volkamerlab/kinodata-3D-affinity-prediction (URL)

Updated: 2024-03-22

	All versions	This version
Views	679	291
Downloads	155	99
Data volume	46.7 GB	32.9 GB
