TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset

Ribes, Stefano; Dunlop, Nils; Mercado Oropeza, Rocío

doi:10.5281/zenodo.15691822

Published May 17, 2025 | Version v1.0.0

Journal article Open

1. Chalmers University of Technology

Overview

Proteolysis-targeting chimeras (PROTACs) are bifunctional molecules that recruit an E3 ubiquitin ligase to a target protein of interest (POI), directing it for proteasomal degradation. Predicting the extent and potency of PROTAC-induced degradation, quantified as D_max (maximal degradation) and DC₅₀ (concentration at half-maximal degradation), is a key challenge in targeted protein degradation drug design.

To address this, we developed TACK (TArgeting Chimeras Knowledge), a statistical machine learning framework trained and evaluated on the largest publicly available curated dataset of PROTAC degradation measurements. TACK integrates data from three sources, PROTAC-DB, PROTACpedia, and TPDdb, and trains MLP and XGBoost models under a rigorous repeated 5×5 cross-validation scheme. Ensemble models are constructed via Caruana's greedy forward selection method.
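The ensemble construction mentioned above can be illustrated with a minimal sketch of Caruana-style greedy forward selection. This is not the repository's implementation: the mean-squared-error loss, the number of rounds, selection with replacement, and the function names `caruana_selection` and `ensemble_weights` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Loss to minimise (an assumption; any task-appropriate metric works)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def caruana_selection(
    model_preds: Sequence[Sequence[float]],
    y_true: Sequence[float],
    n_rounds: int = 10,
    loss: Callable[[Sequence[float], Sequence[float]], float] = mean_squared_error,
) -> Counter:
    """Greedy forward selection with replacement.

    Each round adds the model whose inclusion gives the lowest loss for the
    averaged ensemble prediction. Returns how often each model was picked.
    """
    running_sum = [0.0] * len(y_true)  # summed predictions of selected models
    counts: Counter = Counter()
    for round_idx in range(1, n_rounds + 1):
        best_idx, best_loss = None, float("inf")
        for idx, preds in enumerate(model_preds):
            # Ensemble prediction if this model were added in this round.
            candidate = [(s + p) / round_idx for s, p in zip(running_sum, preds)]
            candidate_loss = loss(y_true, candidate)
            if candidate_loss < best_loss:
                best_idx, best_loss = idx, candidate_loss
        counts[best_idx] += 1
        running_sum = [s + p for s, p in zip(running_sum, model_preds[best_idx])]
    return counts


def ensemble_weights(counts: Counter) -> dict:
    """Turn selection counts into averaging weights that sum to 1."""
    total = sum(counts.values())
    return {idx: picked / total for idx, picked in counts.items()}
```

Because models may be picked more than once, the selection counts act as ensemble weights: a model chosen twice in three rounds contributes two thirds of the averaged prediction. In practice the loss would be computed on validation predictions rather than on training data.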

This repository contains the pre-trained ensemble models and ensemble weight specifications required to reproduce the results reported in:

"TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset" by Stefano Ribes, Nils Dunlop, and Rocío Mercado

The curated dataset is available on Hugging Face at: https://huggingface.co/datasets/ailab-bio/TACK

The accompanying code is available at: https://github.com/ribesstefano/TACK

Data Available

Ensemble Models: The trained models as checkpoint files, organized in directories for ensemble prediction of degradation activity (bin), D_max, and DC₅₀.
Prediction Values: The evaluation results from running the trained models.
Cache: Additional data to reproduce training and run the ensemble models.

File Descriptions

Checkpoints of ensemble models:

bin_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting binary degradation activity.
bin_caruana_ensemble: Ensemble of 18 models selected via Caruana selection for predicting binary degradation activity.
dmax_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting maximum degradation activity (D_max).
dmax_caruana_ensemble: Ensemble of 33 models selected via Caruana selection for predicting maximum degradation activity (D_max).
dc50_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting DC₅₀.
dc50_caruana_ensemble: Ensemble of 22 models selected via Caruana selection for predicting DC₅₀.

Evaluation results:

predictions/: Contains the prediction files of all 5×5 trained models. Each model's predictions cover its own validation fold and the common hold-out set. Predictions are stored per sample as CSV files.
predictions_protac_stan/: Contains the checkpoints and binary predictions of the PROTAC-STAN model trained on the TACK dataset.
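To make the origin of the 25 models per architecture concrete, the following is a minimal sketch of a repeated cross-validation split, reading "5×5" as five repeats of five-fold cross-validation. It assumes plain shuffled folds and that the common hold-out set has already been removed; the actual splitting strategy used for TACK (for example grouped or stratified folds) may differ, and `repeated_kfold` is a hypothetical helper name.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(
    n_samples: int, n_splits: int = 5, n_repeats: int = 5, seed: int = 0
) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yield (repeat, fold, train_indices, val_indices) for every split."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for each repeat
        for fold in range(n_splits):
            val = indices[fold::n_splits]  # every n_splits-th shuffled sample
            val_set = set(val)
            train = [i for i in indices if i not in val_set]
            yield repeat, fold, train, val
```

Each of the 25 yielded splits would train one model, which is then evaluated on its own validation fold and on the shared hold-out set.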

Cached data generated during training:

cell2cell_id.json: Mapping from cell line names to unique cell IDs in Cellosaurus.
cell2description.json: Mapping from cell line names to their aggregated textual descriptions (e.g., tissue of origin, disease state).
cell2data.json: Mapping from cell line names to information from Cellosaurus, stored in JSON format.
cell_embeddings_model=sentence-transformer_pooling=sum.npz: Precomputed cell line embeddings using a sentence transformer model. The file contains a mapping from cell line IDs to their corresponding embedding vectors.
morgan_fp_radius16_size512.npz: Mapping from SMILES strings to their corresponding Morgan fingerprints with radius 16 and size 512.
rdkit_descriptors.npz: Mapping from SMILES strings to their corresponding RDKit descriptors (of size 217).
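The internal layout of the cache files is not documented here, so it can help to inspect them before relying on any particular key. The sketch below uses only the standard library and rests on one known fact: an .npz file is a zip archive of .npy arrays. The helper names `list_npz_arrays` and `load_json_mapping` are hypothetical.

```python
import json
import zipfile


def list_npz_arrays(npz_path) -> list:
    """Names of the arrays stored in an .npz archive (a zip of .npy files)."""
    suffix = ".npy"
    with zipfile.ZipFile(npz_path) as archive:
        return [
            name[: -len(suffix)]
            for name in archive.namelist()
            if name.endswith(suffix)
        ]


def load_json_mapping(json_path) -> dict:
    """Load one of the JSON caches (e.g. cell2cell_id.json) as a dict."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)
```

Once the array names are known, the arrays themselves can be read with `numpy.load`.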

Original PROTAC data used for data curation can be obtained from:

PROTAC-DB-v3: http://cadd.zju.edu.cn/protacdb/
PROTACpedia: https://protacpedia.weizmann.ac.il/ptcb/main
TPDdb: https://tpddb.idrblab.net

License

This repository is open-source and available under the MIT License.

Contact Information

For questions or feedback, please contact Rocío Mercado: rocio.mercado@chalmers.se.

Files

Files (1.0 GB)

Name	MD5	Size
cache.zip	ba4f18ff6bfbd3d87a26973fb566c7c9	29.3 MB
ensembles.zip	162afcddc5a83129950f1c3b035d8cac	574.4 MB
predictions.zip	dce508b5d998ac9a1699c71f8eeac252	213.1 MB
predictions_protac_stan.zip	5fee00a8932c21d0d94bd8cf65ab1b0f	200.1 MB
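Since each archive is listed with an MD5 checksum, downloads can be checked before unpacking. The following is a minimal sketch using Python's standard `hashlib`; the checksums are copied from the listing above, while the function names and the assumption that all archives sit in one directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "cache.zip": "ba4f18ff6bfbd3d87a26973fb566c7c9",
    "ensembles.zip": "162afcddc5a83129950f1c3b035d8cac",
    "predictions.zip": "dce508b5d998ac9a1699c71f8eeac252",
    "predictions_protac_stan.zip": "5fee00a8932c21d0d94bd8cf65ab1b0f",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5) -> dict:
    """Return {filename: True/False/None}; None means the file is missing."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == checksum if path.exists() else None
    return results
```

Calling `verify_downloads("downloads")` on a hypothetical `downloads` folder reports, per archive, whether the checksum matches, does not match, or the file is absent.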

Additional details

Funding

Chalmers University of Technology
Chalmers Gender Initiative for Excellence (Genie)
Knut and Alice Wallenberg Foundation
WASP
Swedish Research Council
Vetenskapsrådet

Dates

Accepted: 2026-05-17

Accepted to Knowledge Discovery and Data Mining - KDD '26

Software

Repository URL: https://github.com/ribesstefano/TACK
Programming language: Python
Development Status: Active
