Multi-objective Kinodata3D: Activity and Pose-Quality Datasets for Kinase–Ligand Binding Affinity Prediction

López-Ríos de Castro, Raquel; Backenköhler, Michael; Groß, Joschka; Chodera, John D.; Volkamer, Andrea

doi:10.5281/zenodo.20433386

Published May 28, 2026 | Version v1

Dataset Open

1. Saarland University
2. Freie Universität Berlin
3. Memorial Sloan Kettering Cancer Center
4. Charité – Universitätsmedizin Berlin
5. Helmholtz Institute for Pharmaceutical Research Saarland

Description

This dataset accompanies the manuscript “No Pose Left Behind: Integrating Activity and Structural Data with Uncertainty-Aware Multiobjective Learning for Kinase Inhibitor Prediction.”

The upload contains the datasets used to train and evaluate multi-objective-Kinodata-3D, a multi-objective E(3)-invariant graph neural network for kinase–ligand binding affinity prediction. The model combines experimental activity data with computationally generated kinase–ligand structures of varying pose quality. It jointly predicts binding affinity, activity uncertainty, and pose quality, allowing structurally reliable complexes to contribute more strongly to the activity objective while lower-quality poses inform uncertainty estimation.

The archive contains two complementary datasets:

Kinodata activity dataset (kinodata)
Kinase–ligand complexes paired with experimental activity labels, used to train the binding-affinity and activity-uncertainty objective.
Cross-docked pose-quality dataset (davidsdocked)
Computationally generated cross-docked kinase–ligand poses paired with RMSD-derived pose-quality labels, used to train the pose-quality objective. The cross-docking workflow is a modified version of the pipeline presented in Schaller, David A., et al. "Benchmarking cross-docking strategies in kinase drug discovery." Journal of Chemical Information and Modeling 64.23 (2024): 8848-8858.

Together, these datasets support multi-objective training on activity and structural reliability, enabling uncertainty-aware structure-based machine learning for kinase inhibitor prediction.

Folder structure

Two archives are provided: raw_datasets_multiobjective_kinodata.tar.gz, which contains the raw data, and processed_datasets_multiobjective_kinodata_10rmsd.tar.gz, which contains the processed datasets. The processed datasets are included for transparency and reproducibility, but users can generate their own splits from the raw data. Both archives should be extracted into a folder called data so that the full dataset is organized as follows:

data/
├── raw/
│   ├── mol2/
│   ├── klifs_ids.csv
│   ├── kinodata_docked_v2.sdf.gz
│   ├── pocket_sequences.csv
│   ├── posit_combined.sdf
│   ├── posit_results.csv
│   └── docking_benchmark_dataset.csv
│
└── processed/
    ├── kinodata/
    │   ├── kinodata_docked_v2.pt
    │   ├── pre_transform.pt
    │   ├── pre_filter.pt
    │   ├── post_filter.pt
    │   └── filter_predicted_rmsd_le10.00/
    │       ├── filter.log
    │       ├── kinodata_docked_v2.pt
    │       ├── pre_transform.pt
    │       ├── pre_filter.pt
    │       ├── post_filter.pt
    │       ├── random-k-fold/
    │       └── original-scaffold-k-fold-act-1000/
    │
    └── davidsdocked/
        ├── posit_combined.pt
        ├── pre_transform.pt
        ├── pre_filter.pt
        ├── post_filter.pt
        └── filter_predicted_rmsd_le10.00/
            ├── filter.log
            ├── posit_combined.pt
            ├── pre_transform.pt
            ├── pre_filter.pt
            ├── post_filter.pt
            ├── random-k-fold/
            └── original-scaffold-k-fold-3000/
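
After extraction, it can be useful to confirm that the layout matches the tree above. The following is a minimal sketch using only the Python standard library. The helper name missing_paths and the choice of which entries to check are illustrative assumptions and are not part of the associated repository.

```python
from pathlib import Path

# Relative paths copied from the folder tree above (a representative subset).
EXPECTED_PATHS = [
    "raw/mol2",
    "raw/klifs_ids.csv",
    "raw/kinodata_docked_v2.sdf.gz",
    "raw/pocket_sequences.csv",
    "raw/posit_combined.sdf",
    "raw/posit_results.csv",
    "raw/docking_benchmark_dataset.csv",
    "processed/kinodata/kinodata_docked_v2.pt",
    "processed/kinodata/filter_predicted_rmsd_le10.00/random-k-fold",
    "processed/kinodata/filter_predicted_rmsd_le10.00/original-scaffold-k-fold-act-1000",
    "processed/davidsdocked/posit_combined.pt",
    "processed/davidsdocked/filter_predicted_rmsd_le10.00/random-k-fold",
    "processed/davidsdocked/filter_predicted_rmsd_le10.00/original-scaffold-k-fold-3000",
]


def missing_paths(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Calling missing_paths("data") from the repository root returns an empty list when every listed entry is present, and otherwise the relative paths that still need to be extracted or regenerated.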

Contents

`raw/`

mol2/: ligand structure files.
klifs_ids.csv: KLIFS identifiers used to map kinase structures.
kinodata_docked_v2.sdf.gz: docked Kinodata structures used for the activity dataset.
pocket_sequences.csv: kinase binding-site sequence information.
posit_combined.sdf: cross-docked ligand poses used for the pose-quality dataset.
posit_results.csv: docking output metadata for the cross-docked poses.
docking_benchmark_dataset.csv: benchmark metadata used to associate generated poses with reference structures and RMSD-derived pose-quality labels.
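
As a quick sanity check on the raw structure files, the number of records in an SDF file can be counted without a cheminformatics toolkit, because each SDF record is terminated by a line containing only `$$$$`. The sketch below assumes the files are readable as text; the function names are illustrative and do not come from the associated repository. Parsing the molecules themselves would need a toolkit such as RDKit.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def count_sdf_records(path):
    """Count SDF records by counting '$$$$' terminator lines."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.strip() == "$$$$")
```

For example, count_sdf_records("data/raw/kinodata_docked_v2.sdf.gz") and count_sdf_records("data/raw/posit_combined.sdf") give the number of docked structures and cross-docked poses stored in each file.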

`processed/kinodata/`

The processed/kinodata/ directory contains the processed activity dataset. The main .pt files are PyTorch/PyTorch Geometric objects generated from the raw activity data and used by the training pipeline.

The filter_predicted_rmsd_le10.00/ subdirectory contains the processed activity dataset after applying the RMSD-based filtering used for the paper-release experiments. This folder includes the corresponding processed dataset object and data splits.

`processed/davidsdocked/`

The processed/davidsdocked/ directory contains the processed cross-docked pose-quality dataset. The main .pt files are PyTorch/PyTorch Geometric objects generated from the cross-docked structures and used to train the pose-quality prediction objective.

The filter_predicted_rmsd_le10.00/ subdirectory contains the processed pose-quality dataset after applying the RMSD-based filtering used for the paper-release experiments. This folder includes the corresponding processed dataset object and data splits.

Included data splits

The processed folders include split directories used for model training and evaluation, including:

random-k-fold/
original-scaffold-k-fold-act-1000/ for the activity dataset
original-scaffold-k-fold-3000/ for the pose-quality dataset

These splits were used to train and evaluate the multi-objective model while keeping the activity and pose-quality objectives aligned across training, validation, and test partitions.

How to use

1. Download the compressed archives from Zenodo.
2. Extract them into the root directory of the multi-objective-kinodata-3D repository:

   tar -xzf <archive_name>.tar.gz

   After extraction, the repository should contain a data/ directory with the raw/ and processed/ subfolders described above.

3. Install the repository following the instructions in the GitHub repository README.
4. Use the processed .pt files and split folders for training and evaluation of the multi-objective model.
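
The extraction step can also be scripted in Python. The sketch below is an illustrative helper, not part of the repository. It extracts one archive into a destination directory and returns the archive's top-level entries. This record does not state whether the archives unpack to data/ or directly to raw/ and processed/, so the returned names help decide whether the destination should be the repository root or the data/ folder.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir; return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        top_level = set()
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if parts:
                top_level.add(parts[0])
        # Use the safer "data" extraction filter where this Python version has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(top_level)
```

A cautious workflow is to extract into a scratch directory first, inspect the returned entries, and then move the contents so that data/raw/ and data/processed/ end up in the repository root.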

Related software

The code used to process these datasets, train the multi-objective model, and reproduce the experiments is available at:

https://github.com/openkinome/multi-objective-kinodata-3D

For reproducibility, please use the tagged paper release associated with this dataset.

Notes

The processed datasets are provided for direct use with the training and evaluation scripts in the associated GitHub repository. They are included to support reproducibility of the experiments reported in the manuscript.

The raw files are provided to support transparency and allow users to reproduce or modify the dataset construction pipeline. When the data-curation code in the GitHub repository is run with user-defined settings, the corresponding processed/ directory is generated automatically. Therefore, users who wish to reproduce the exact paper experiments can use the processed datasets directly, while users who wish to apply different filtering, splitting, or curation settings can regenerate the processed files from the raw data.

Files (2.5 GB)

Name	MD5	Size
processed_datasets_multiobjective_kinodata_10rmsd.tar.gz	4a487b44ce1bc08ad0b175767e2d8e59	2.2 GB
raw_datasets_multiobjective_kinodata.tar.gz	0fb9f4bdd9aa2bc027ae49148e90f8bf	316.7 MB
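
Because the archives are large, it is worth verifying a download against the MD5 checksums listed above before extracting. The following sketch uses Python's hashlib and reads the file in chunks, so the whole archive is never held in memory. The function names are illustrative; the checksum values are copied from the file listing.

```python
import hashlib

# MD5 checksums as published in the file listing above.
EXPECTED_MD5 = {
    "processed_datasets_multiobjective_kinodata_10rmsd.tar.gz": "4a487b44ce1bc08ad0b175767e2d8e59",
    "raw_datasets_multiobjective_kinodata.tar.gz": "0fb9f4bdd9aa2bc027ae49148e90f8bf",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file at path has the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()
```

For instance, verify_download(name, EXPECTED_MD5[name]) can be called for each archive name in EXPECTED_MD5 from the download directory; a False result suggests an incomplete or corrupted download.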

Additional details

Repository URL: https://github.com/openkinome/multi-objective-kinodata-3D

