Published May 17, 2025 | Version v1.0.0

TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset

  • 1. Chalmers University of Technology

Description

Overview

Proteolysis-targeting chimeras (PROTACs) are bifunctional molecules that recruit an E3 ubiquitin ligase to a target protein of interest (POI), marking it for proteasomal degradation. Predicting the extent and potency of PROTAC-induced degradation, quantified as Dmax (maximal degradation) and DC50 (concentration at half-maximal degradation), is a key challenge in targeted protein degradation drug design.
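To make the two quantities concrete, the sketch below evaluates a simple Hill-type dose-response curve in which Dmax is the plateau and DC50 is the concentration giving half of that plateau. The Hill model, the function name degradation_fraction, and the numeric values are illustrative assumptions and are not taken from the TACK paper; real PROTAC curves can deviate from this shape, for example through the hook effect at high concentrations.

```python
def degradation_fraction(conc, dmax, dc50, hill=1.0):
    """Fraction of target protein degraded at concentration `conc`.

    Assumes a plain Hill curve (an illustrative choice, not the authors' model):
    the response rises from 0 towards the plateau `dmax` and reaches
    dmax / 2 exactly at conc == dc50.
    """
    if conc < 0 or dc50 <= 0:
        raise ValueError("conc must be >= 0 and dc50 must be > 0")
    return dmax * conc**hill / (dc50**hill + conc**hill)


# Hypothetical compound: 90% maximal degradation, DC50 of 50 nM.
dmax, dc50 = 0.9, 50.0
for conc in (5.0, 50.0, 500.0):
    frac = degradation_fraction(conc, dmax, dc50)
    print(f"{conc:7.1f} nM -> {frac:.3f} degraded")
```

Under this assumed model, a compound with a low DC50 is potent and a compound with a high Dmax is efficacious; the two properties are independent, which is why they are predicted as separate targets.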

To address this, we developed TACK (TArgeting Chimeras Knowledge), a statistical machine learning framework trained and evaluated on the largest publicly available curated dataset of PROTAC degradation measurements. TACK integrates data from three sources, PROTAC-DB, PROTACpedia, and TPDdb, and trains MLP and XGBoost models under a rigorous repeated 5×5 cross-validation scheme. Ensemble models are constructed via Caruana's greedy forward selection method.
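As a rough illustration of Caruana-style greedy forward selection, the sketch below repeatedly adds (with replacement) whichever model most reduces the validation loss of the averaged ensemble, then converts selection counts into weights. It is a minimal generic version, not the authors' implementation: the mean-squared-error loss, the number of rounds, the truncation to the best-scoring prefix, and the names caruana_selection and val_preds are all assumptions made for this example.

```python
from collections import Counter


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def caruana_selection(val_preds, target, n_rounds=10, loss=mse):
    """Greedy forward selection with replacement.

    val_preds: dict mapping a model name to its validation predictions.
    Returns a dict mapping each selected model name to its ensemble weight.
    """
    running_sum = [0.0] * len(target)
    chosen = []
    best_loss, best_size = float("inf"), 0
    for step in range(1, n_rounds + 1):
        # Score every candidate as if it were added to the current ensemble.
        scored = []
        for name, preds in val_preds.items():
            avg = [(s + p) / step for s, p in zip(running_sum, preds)]
            scored.append((loss(avg, target), name))
        step_loss, name = min(scored)  # ties are broken by model name
        chosen.append(name)
        running_sum = [s + p for s, p in zip(running_sum, val_preds[name])]
        if step_loss < best_loss:
            best_loss, best_size = step_loss, step
    # Keep only the prefix that gave the lowest validation loss (assumed variant).
    counts = Counter(chosen[:best_size])
    return {name: count / best_size for name, count in counts.items()}


# Toy example: two models with opposite biases and one poor model.
target = [1.0, 2.0, 3.0]
val_preds = {
    "a": [1.5, 2.5, 3.5],
    "b": [0.5, 1.5, 2.5],
    "c": [3.0, 3.0, 3.0],
}
weights = caruana_selection(val_preds, target)
print(weights)
```

In the toy example the two oppositely biased models cancel each other's error, so the procedure weights them equally and leaves the poor model out. Selecting with replacement is what allows a single strong model to receive a larger weight than the others.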

This repository contains the pre-trained ensemble models and ensemble weight specifications required to reproduce the results reported in:

"TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset" by Stefano Ribes, Nils Dunlop, and Rocío Mercado

The curated dataset is available on Hugging Face at: https://huggingface.co/datasets/ailab-bio/TACK

The accompanying code is available at: https://github.com/ribesstefano/TACK

Data Available

  • Ensemble Models: The trained models as checkpoint files, organized in directories for ensemble prediction of degradation activity (bin), Dmax, and DC50.
  • Prediction Values: The evaluation results from running the trained models.
  • Cache: Additional data to reproduce training and run the ensemble models.

File Descriptions

Checkpoints of ensemble models:

  • bin_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting binary degradation activity.
  • bin_caruana_ensemble: Ensemble of 18 models selected via Caruana selection for predicting binary degradation activity.
  • dmax_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting maximum degradation activity (Dmax).
  • dmax_caruana_ensemble: Ensemble of 33 models selected via Caruana selection for predicting maximum degradation activity (Dmax).
  • dc50_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting DC50.
  • dc50_caruana_ensemble: Ensemble of 22 models selected via Caruana selection for predicting DC50.

Evaluation results:

  • predictions/: Contains the per-sample predictions of all models trained under the 5×5 cross-validation scheme, stored as CSV files. Each model’s predictions cover its own validation fold and the common hold-out set.
  • predictions_protac_stan/: Contains the checkpoints and binary predictions of the PROTAC-STAN model trained on the TACK dataset.
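For readers unfamiliar with the repeated 5×5 scheme behind the predictions/ directory, the sketch below generates the index splits for 5 repeats of 5-fold cross-validation, giving 25 train/validation pairs per architecture, which is consistent with the 25-model best-architecture ensembles listed above. It uses plain random shuffling and assumes the hold-out set has already been removed; the authors' actual splitting strategy (for example stratified or grouped folds) is not described here, and the function name repeated_kfold is hypothetical.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, val_indices) tuples.

    Each repeat reshuffles the sample indices and partitions them into
    n_splits disjoint validation folds; the remaining samples form the
    training set for that fold.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_splits):
            val = sorted(indices[fold::n_splits])
            val_set = set(val)
            train = [i for i in range(n_samples) if i not in val_set]
            yield repeat, fold, train, val


splits = list(repeated_kfold(n_samples=20))
print(f"{len(splits)} train/validation splits")
```

Within one repeat, every sample appears in exactly one validation fold, so each repeat yields one out-of-fold prediction per sample, while the hold-out set is predicted by every model.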

Cached data generated during training:

  • cell2cell_id.json: Mapping from cell line names to unique cell IDs in Cellosaurus.
  • cell2description.json: Mapping from cell line names to their aggregated textual descriptions (e.g., tissue of origin, disease state).
  • cell2data.json: Mapping from cell line names to information from Cellosaurus, stored in JSON format.
  • cell_embeddings_model=sentence-transformer_pooling=sum.npz: Precomputed cell line embeddings using a sentence transformer model. The file contains a mapping from cell line IDs to their corresponding embedding vectors.
  • morgan_fp_radius16_size512.npz: Mapping from SMILES strings to their corresponding Morgan fingerprints with radius 16 and size 512.
  • rdkit_descriptors.npz: Mapping from SMILES strings to their corresponding RDKit descriptors (of size 217).

Original PROTAC data used for data curation can be obtained from the three sources integrated into TACK: PROTAC-DB, PROTACpedia, and TPDdb.

License

This repository is open-source and available under the MIT License.

Contact Information

For questions or feedback, please contact Rocío Mercado: rocio.mercado@chalmers.se.

Files

cache.zip

Files (1.0 GB)

Checksum Size
md5:ba4f18ff6bfbd3d87a26973fb566c7c9 29.3 MB
md5:162afcddc5a83129950f1c3b035d8cac 574.4 MB
md5:dce508b5d998ac9a1699c71f8eeac252 213.1 MB
md5:5fee00a8932c21d0d94bd8cf65ab1b0f 200.1 MB
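After downloading an archive, its integrity can be checked against the MD5 checksums listed above. The sketch below is one way to do this with the Python standard library; the helper names md5_of_file and matches_checksum are hypothetical, and because this listing does not show which checksum belongs to which file, the pairing has to be read from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Example call (path and pairing are placeholders, not confirmed by the listing):
# matches_checksum("cache.zip", "md5:<checksum shown next to cache.zip>")
```

Reading in fixed-size chunks keeps memory use flat even for the largest archive in the record.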

Additional details

Funding

Chalmers University of Technology
Chalmers Gender Initiative for Excellence (Genie)
Knut and Alice Wallenberg Foundation
WASP
Swedish Research Council
Vetenskapsrådet

Dates

Accepted
2026-05-17
Accepted to Knowledge Discovery and Data Mining - KDD '26

Software

Repository URL
https://github.com/ribesstefano/TACK
Programming language
Python
Development Status
Active