TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset
Description
Overview
Proteolysis-targeting chimeras (PROTACs) are bifunctional molecules that recruit an E3 ubiquitin ligase to a target protein of interest (POI), marking it for proteasomal degradation. Predicting the extent and potency of PROTAC-induced degradation, quantified as Dmax (maximal degradation) and DC50 (concentration at half-maximal degradation), is a key challenge in targeted protein degradation drug design.
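As a rough illustration of how Dmax and DC50 relate, the sketch below evaluates a simple Hill-type dose-response curve. The Hill form, the function name `degradation_fraction`, and the example values are assumptions made for illustration and are not part of TACK; real PROTAC curves often show a hook effect at high concentrations, which this model ignores.

```python
import numpy as np

def degradation_fraction(conc, dmax, dc50, hill=1.0):
    """Fraction of target degraded at concentration `conc` (simple Hill model).

    Illustrative assumption only: this is not the model used by TACK.
    `conc` and `dc50` must share the same concentration unit.
    """
    conc = np.asarray(conc, dtype=float)
    return dmax * conc**hill / (dc50**hill + conc**hill)

# Hypothetical compound: 90% maximal degradation, DC50 of 10 nM.
concs_nM = np.array([1.0, 10.0, 100.0, 1000.0])
curve = degradation_fraction(concs_nM, dmax=0.9, dc50=10.0)
```

Under this model, the degradation at a concentration equal to DC50 is exactly half of Dmax, and the curve approaches Dmax as the concentration grows.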
To address this, we developed TACK (TArgeting Chimeras Knowledge), a statistical machine learning framework trained and evaluated on the largest publicly available curated dataset of PROTAC degradation measurements. TACK integrates data from three sources (PROTAC-DB, PROTACpedia, and TPDdb) and trains multilayer perceptron (MLP) and XGBoost models under a rigorous repeated 5×5 cross-validation scheme. Ensemble models are constructed via Caruana's greedy forward selection method.
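The idea behind Caruana-style greedy forward selection can be sketched as follows: starting from an empty ensemble, repeatedly add (with replacement) the model whose inclusion gives the best averaged prediction on validation data. This is a minimal sketch, not the TACK implementation; the function name `caruana_selection`, the use of mean squared error as the selection metric, and the toy data are all assumptions.

```python
import numpy as np

def caruana_selection(val_preds, y_val, n_rounds=10):
    """Greedy forward selection with replacement.

    val_preds: array (n_models, n_samples) of validation predictions.
    y_val:     array (n_samples,) of validation targets.
    Returns per-model weights that sum to 1.
    Assumption: mean squared error is the selection metric.
    """
    val_preds = np.asarray(val_preds, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    counts = np.zeros(val_preds.shape[0], dtype=int)
    running_sum = np.zeros_like(y_val)

    for step in range(1, n_rounds + 1):
        # Score the averaged ensemble obtained by adding each candidate in turn.
        candidate_means = (running_sum + val_preds) / step
        mse = ((candidate_means - y_val) ** 2).mean(axis=1)
        best = int(np.argmin(mse))
        counts[best] += 1
        running_sum += val_preds[best]

    return counts / counts.sum()

# Toy example: two models with opposite biases and one poor model.
y_val = np.array([0.0, 1.0, 2.0, 3.0])
val_preds = np.array([y_val + 1.0, y_val - 1.0, y_val + 5.0])
weights = caruana_selection(val_preds, y_val, n_rounds=4)
ensemble_pred = weights @ val_preds
```

In the toy example the two oppositely biased models receive equal weight and the poor model is never selected, so the weighted average recovers the targets. Because selection is with replacement, a model can be picked several times, which is what turns selection counts into ensemble weights.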
This repository contains the pre-trained ensemble models and ensemble weight specifications required to reproduce the results reported in:
"TACK: A statistical evaluation of degradation activity on a novel TArgeting Chimeras Knowledge dataset" by Stefano Ribes, Nils Dunlop, and Rocío Mercado
The curated dataset is available on Hugging Face at: https://huggingface.co/datasets/ailab-bio/TACK
The accompanying code is available at: https://github.com/ribesstefano/TACK
Data Available
- Ensemble Models: The trained models as checkpoint files, organized in directories by prediction task: binary degradation activity (bin), Dmax, and DC50.
- Prediction Values: The evaluation results from running the trained models.
- Cache: Additional data to reproduce training and run the ensemble models.
File Descriptions
Checkpoints of ensemble models:
- bin_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting binary degradation activity.
- bin_caruana_ensemble: Ensemble of 18 models selected via Caruana selection for predicting binary degradation activity.
- dmax_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting maximum degradation activity (Dmax).
- dmax_caruana_ensemble: Ensemble of 33 models selected via Caruana selection for predicting maximum degradation activity (Dmax).
- dc50_best_arch_ensemble: Ensemble of 25 models of the best architecture for predicting DC50.
- dc50_caruana_ensemble: Ensemble of 22 models selected via Caruana selection for predicting DC50.
Evaluation results:
- predictions/: Contains the per-sample predictions, stored as CSV files, of all models trained under the repeated 5×5 cross-validation scheme. Each model is evaluated on its own validation fold and on the common hold-out set.
- predictions_protac_stan/: Contains the checkpoints and binary predictions of the PROTAC-STAN model trained on the TACK dataset.
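For readers unfamiliar with the repeated 5×5 scheme behind the files in predictions/, the sketch below generates the 25 train/validation splits (5 repeats of 5-fold cross-validation). It is an illustration only: the function name `repeated_kfold_indices`, the plain random shuffling, and the seed are assumptions, and the actual TACK splits may be stratified or grouped differently.

```python
import numpy as np

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, val_idx) for repeated K-fold CV.

    Assumption: folds come from a plain random shuffle per repeat.
    """
    rng = np.random.default_rng(seed)
    for repeat in range(n_repeats):
        # Each repeat reshuffles the samples, giving a new partition.
        shuffled = rng.permutation(n_samples)
        folds = np.array_split(shuffled, n_splits)
        for fold, val_idx in enumerate(folds):
            train_idx = np.concatenate(
                [f for i, f in enumerate(folds) if i != fold]
            )
            yield repeat, fold, train_idx, val_idx

# 5 repeats x 5 folds = 25 splits, i.e. one trained model per split.
splits = list(repeated_kfold_indices(n_samples=100))
```

Within each repeat every sample appears in exactly one validation fold, while a separate hold-out set (not shown here) stays untouched so that all models can be compared on the same samples.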
Cached data generated during training:
- cell2cell_id.json: Mapping from cell line names to unique cell IDs in Cellosaurus.
- cell2description.json: Mapping from cell line names to their aggregated textual descriptions (e.g., tissue of origin, disease state).
- cell2data.json: Mapping from cell line names to information from Cellosaurus, stored in JSON format.
- cell_embeddings_model=sentence-transformer_pooling=sum.npz: Precomputed cell line embeddings using a sentence transformer model. The file contains a mapping from cell line IDs to their corresponding embedding vectors.
- morgan_fp_radius16_size512.npz: Mapping from SMILES strings to their corresponding Morgan fingerprints with radius 16 and size 512.
- rdkit_descriptors.npz: Mapping from SMILES strings to their corresponding RDKit descriptors (of size 217).
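Before relying on the .npz cache files, it can help to check which arrays they contain. The sketch below lists the array names, shapes, and dtypes of an .npz archive using NumPy. The helper name `summarize_npz` is hypothetical, and the toy file with its `smiles` and `fingerprints` keys is only a stand-in: the internal layout of the real cache files is not documented here and may differ.

```python
import os
import tempfile
import numpy as np

def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }

# Toy stand-in for a cache file; these key names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_fingerprints.npz")
    np.savez(
        toy_path,
        smiles=np.array(["CCO", "c1ccccc1"]),
        fingerprints=np.zeros((2, 512), dtype=np.uint8),
    )
    summary = summarize_npz(toy_path)
```

If an archive stores pickled Python objects, `np.load` refuses to read them unless `allow_pickle=True` is passed, which should only be enabled for files from a trusted source.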
Original PROTAC data used for data curation can be obtained from:
- PROTAC-DB-v3: http://cadd.zju.edu.cn/protacdb/
- PROTACpedia: https://protacpedia.weizmann.ac.il/ptcb/main
- TPDdb: https://tpddb.idrblab.net
License
This repository is open-source and available under the MIT License.
Contact Information
For questions or feedback, please contact Rocío Mercado: rocio.mercado@chalmers.se.
Files
cache.zip
Additional details
Funding
- Chalmers University of Technology
- Chalmers Gender Initiative for Excellence (Genie)
- Knut and Alice Wallenberg Foundation
- WASP
- Swedish Research Council
- Vetenskapsrådet
Dates
- Accepted: 2026-05-17. Accepted to Knowledge Discovery and Data Mining (KDD '26).
Software
- Repository URL: https://github.com/ribesstefano/TACK
- Programming language: Python
- Development Status: Active