Multi-objective Kinodata3D: Activity and Pose-Quality Datasets for Kinase–Ligand Binding Affinity Prediction
Authors/Creators
Description
This dataset accompanies the manuscript “No Pose Left Behind: Integrating Activity and Structural Data with Uncertainty-Aware Multiobjective Learning for Kinase Inhibitor Prediction.”
The upload contains the datasets used to train and evaluate multi-objective-Kinodata-3D, a multi-objective E(3)-invariant graph neural network for kinase–ligand binding affinity prediction. The model combines experimental activity data with computationally generated kinase–ligand structures of varying pose quality. It jointly predicts binding affinity, activity uncertainty, and pose quality, allowing structurally reliable complexes to contribute more strongly to the activity objective while lower-quality poses inform uncertainty estimation.
The archive contains two complementary datasets:
- Kinodata activity dataset (kinodata): Kinase–ligand complexes paired with experimental activity labels, used to train the binding-affinity and activity-uncertainty objective.
- Cross-docked pose-quality dataset (davidsdocked): Computationally generated cross-docked kinase–ligand poses paired with RMSD-derived pose-quality labels, used to train the pose-quality objective. The cross-docking workflow is a modified version of the pipeline presented in Schaller, David A., et al. "Benchmarking cross-docking strategies in kinase drug discovery." Journal of Chemical Information and Modeling 64.23 (2024): 8848-8858.
Together, these datasets support multi-objective training on activity and structural reliability, enabling uncertainty-aware structure-based machine learning for kinase inhibitor prediction.
Folder structure
Two archives are provided: raw_datasets_multiobjective_kinodata.tar.gz, which contains the raw data, and processed_datasets_multiobjective_kinodata_10rmsd.tar.gz, which contains the processed datasets. The processed datasets are included for transparency and reproducibility, but users can generate their own splits from the raw data. Both archives should be extracted into a folder called data so that the full dataset is organized as follows:
data/
├── raw/
│   ├── mol2/
│   ├── klifs_ids.csv
│   ├── kinodata_docked_v2.sdf.gz
│   ├── pocket_sequences.csv
│   ├── posit_combined.sdf
│   ├── posit_results.csv
│   └── docking_benchmark_dataset.csv
│
└── processed/
    ├── kinodata/
    │   ├── kinodata_docked_v2.pt
    │   ├── pre_transform.pt
    │   ├── pre_filter.pt
    │   ├── post_filter.pt
    │   └── filter_predicted_rmsd_le10.00/
    │       ├── filter.log
    │       ├── kinodata_docked_v2.pt
    │       ├── pre_transform.pt
    │       ├── pre_filter.pt
    │       ├── post_filter.pt
    │       ├── random-k-fold/
    │       └── original-scaffold-k-fold-act-1000/
    │
    └── davidsdocked/
        ├── posit_combined.pt
        ├── pre_transform.pt
        ├── pre_filter.pt
        ├── post_filter.pt
        └── filter_predicted_rmsd_le10.00/
            ├── filter.log
            ├── posit_combined.pt
            ├── pre_transform.pt
            ├── pre_filter.pt
            ├── post_filter.pt
            ├── random-k-fold/
            └── original-scaffold-k-fold-3000/
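One way to confirm that both archives were extracted as expected is to check for the entries shown in the tree above. The following is only a minimal sketch: the helper name `missing_paths` and the selection of entries to check are illustrative assumptions and are not part of the associated repository.

```python
from pathlib import Path

# Relative paths copied from the folder tree above; the selection is illustrative.
EXPECTED = [
    "raw/mol2",
    "raw/klifs_ids.csv",
    "raw/kinodata_docked_v2.sdf.gz",
    "raw/pocket_sequences.csv",
    "raw/posit_combined.sdf",
    "raw/posit_results.csv",
    "raw/docking_benchmark_dataset.csv",
    "processed/kinodata/kinodata_docked_v2.pt",
    "processed/kinodata/filter_predicted_rmsd_le10.00",
    "processed/davidsdocked/posit_combined.pt",
    "processed/davidsdocked/filter_predicted_rmsd_le10.00",
]


def missing_paths(data_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("data/ layout looks complete")
```

Run from the directory that contains data/, the script prints any entries that are absent, which helps distinguish an incomplete download from an archive extracted into the wrong location.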
Contents
raw/
The raw/ directory contains the raw data input files used to construct the processed datasets.
- mol2/: ligand structure files.
- klifs_ids.csv: KLIFS identifiers used to map kinase structures.
- kinodata_docked_v2.sdf.gz: docked Kinodata structures used for the activity dataset.
- pocket_sequences.csv: kinase binding-site sequence information.
- posit_combined.sdf: cross-docked ligand poses used for the pose-quality dataset.
- posit_results.csv: docking output metadata for the cross-docked poses.
- docking_benchmark_dataset.csv: benchmark metadata used to associate generated poses with reference structures and RMSD-derived pose-quality labels.
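As a quick sanity check on the raw structure files, the number of records in an SDF file can be counted without any cheminformatics dependency, because each SDF record ends with a line containing only `$$$$`. The sketch below assumes nothing beyond that format rule; the function name `count_sdf_records` is hypothetical, and parsing the molecules themselves would need a toolkit such as RDKit, which is not shown here.

```python
import gzip
from pathlib import Path


def count_sdf_records(path):
    """Count molecule records in an SDF file, plain or gzip-compressed.

    Each SDF record is terminated by a line containing only "$$$$".
    """
    path = Path(path)
    # Choose the opener from the file extension (.sdf.gz vs .sdf).
    opener = gzip.open if path.suffix == ".gz" else open
    count = 0
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip() == "$$$$":
                count += 1
    return count
```

For example, `count_sdf_records("data/raw/kinodata_docked_v2.sdf.gz")` and `count_sdf_records("data/raw/posit_combined.sdf")` give the number of docked and cross-docked poses stored in each file.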
processed/kinodata/
The processed/kinodata/ directory contains the processed activity dataset. The main .pt files are PyTorch/PyTorch Geometric objects generated from the raw activity data and used by the training pipeline.
The filter_predicted_rmsd_le10.00/ subdirectory contains the processed activity dataset after applying the RMSD-based filtering used for the paper-release experiments. This folder includes the corresponding processed dataset object and data splits.
processed/davidsdocked/
The processed/davidsdocked/ directory contains the processed cross-docked pose-quality dataset. The main .pt files are PyTorch/PyTorch Geometric objects generated from the cross-docked structures and used to train the pose-quality prediction objective.
The filter_predicted_rmsd_le10.00/ subdirectory contains the processed pose-quality dataset after applying the RMSD-based filtering used for the paper-release experiments. This folder includes the corresponding processed dataset object and data splits.
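The processed objects are normally consumed through the repository's own dataset classes, but they can also be inspected by hand. The sketch below is an assumption-laden illustration, not the repository's loading code: it assumes PyTorch and PyTorch Geometric are installed in versions compatible with the repository, that the files are pickled objects requiring `weights_only=False` on recent PyTorch releases, and the helper names `list_pt_files` and `load_pt` are hypothetical. The structure of the loaded object is not documented here, so the example only lists and loads.

```python
from pathlib import Path


def list_pt_files(dataset_dir):
    """Return the .pt files directly inside dataset_dir, sorted by name."""
    return sorted(Path(dataset_dir).glob("*.pt"))


def load_pt(path):
    """Load one processed object with torch.load.

    Assumption: the file is a pickled PyTorch/PyTorch Geometric object, so
    full unpickling (weights_only=False) is needed. Only do this for files
    you trust, because unpickling can execute arbitrary code.
    """
    import torch  # imported lazily so the listing helper works without PyTorch

    return torch.load(path, map_location="cpu", weights_only=False)


if __name__ == "__main__":
    for pt_file in list_pt_files("data/processed/kinodata"):
        print(pt_file.name, pt_file.stat().st_size, "bytes")
```

Calling `type(load_pt(...))` on one of the listed files is a reasonable first step before relying on any particular attribute of the loaded object.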
Included data splits
The processed folders include split directories used for model training and evaluation, including:
- random-k-fold/ for both datasets
- original-scaffold-k-fold-act-1000/ for the activity dataset
- original-scaffold-k-fold-3000/ for the pose-quality dataset
These splits were used to train and evaluate the multi-objective model while keeping the activity and pose-quality objectives aligned across training, validation, and test partitions.
How to use
- Download the compressed archives from Zenodo.
- Extract them into the root directory of the multi-objective-kinodata-3D repository:
  tar -xzf <archive_name>.tar.gz
  After extraction, the repository should contain a data/ directory with the raw/ and processed/ subfolders described above.
- Install the repository following the instructions in the GitHub repository README.
- Use the processed .pt files and split folders for training and evaluation of the multi-objective model.
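For users who prefer to script the extraction step, the `tar -xzf` command can be mirrored with Python's standard `tarfile` module. This is a minimal sketch under stated assumptions: the helper name `extract_archive` is hypothetical, and because the top-level layout inside each archive is not stated here, the destination is left as a parameter and the function reports the top-level entries it found.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir.

    Returns the sorted top-level entry names found in the archive, which
    shows whether the archive already contains a data/ directory.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    # Member names use POSIX separators; "." entries have no parts and are skipped.
    tops = {PurePosixPath(n).parts[0] for n in names if PurePosixPath(n).parts}
    return sorted(tops)
```

Checking the returned names after the first extraction indicates whether the remaining archive should go into the repository root or directly into data/.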
Related software
The code used to process these datasets, train the multi-objective model, and reproduce the experiments is available at:
https://github.com/openkinome/multi-objective-kinodata-3D
For reproducibility, please use the tagged paper release associated with this dataset.
Notes
The processed datasets are provided for direct use with the training and evaluation scripts in the associated GitHub repository. They are included to support reproducibility of the experiments reported in the manuscript.
The raw files are provided to support transparency and allow users to reproduce or modify the dataset construction pipeline. When the data-curation code in the GitHub repository is run with user-defined settings, the corresponding processed/ directory is generated automatically. Therefore, users who wish to reproduce the exact paper experiments can use the processed datasets directly, while users who wish to apply different filtering, splitting, or curation settings can regenerate the processed files from the raw data.
Files
(2.5 GB)
| Name | Size |
|---|---|
| md5:4a487b44ce1bc08ad0b175767e2d8e59 | 2.2 GB |
| md5:0fb9f4bdd9aa2bc027ae49148e90f8bf | 316.7 MB |
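The two MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the standard library; the function names are hypothetical, and because the listing above does not pair each checksum with a file name, the check accepts a file whose digest matches either value.

```python
import hashlib

# Checksums as listed in the Files table of this record.
LISTED_MD5 = {
    "4a487b44ce1bc08ad0b175767e2d8e59",
    "0fb9f4bdd9aa2bc027ae49148e90f8bf",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 equals one of the checksums listed above."""
    return md5_of_file(path) in LISTED_MD5
```

For example, `matches_listed_checksum("raw_datasets_multiobjective_kinodata.tar.gz")` should return `True` for an intact download of that archive.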
Additional details
Software
- Repository URL: https://github.com/openkinome/multi-objective-kinodata-3D