PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures

Ribes, Stefano; Zhang, Ranxuan; Cropsal, Télio; Källberg, Anders; Tyrchan, Christian; Nittinger, Eva; Mercado Oropeza, Rocío

doi:10.5281/zenodo.15797310

Published July 3, 2025 | Version v1.0.0

Journal article | Open access

1. Chalmers University of Technology
2. AstraZeneca

Overview

Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules composed of an E3 ligase ligand, a linker, and a warhead targeting a protein of interest. Despite their modular structure, accurately identifying and annotating these components in PROTACs is challenging and typically relies on manual curation and predefined substructure matching.

To address this, we developed PROTAC-Splitter, a machine learning framework designed for automated annotation of PROTAC substructures.

This repository contains the datasets and pre-trained models necessary to replicate our work and apply PROTAC-Splitter to custom data, as reported in the publication:

“PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures” by Stefano Ribes, Ranxuan Zhang, Télio Cropsal, Anders Källberg, Christian Tyrchan, Eva Nittinger, Rocío Mercado

The accompanying code is available at: https://github.com/ribesstefano/PROTAC-Splitter.

Data Available

Synthetic Dataset: A dataset containing approximately 1.3 million PROTAC structures with annotated ligand splits.
Transformer Model: A sequence-to-sequence model achieving high exact-match accuracy (86%) on public data.
XGBoost Model: An XGBoost model trained on graph features; its predicted splits are chemically valid and reassemble exactly into the original PROTAC.

File Descriptions

Datasets used for training, testing, and validating the models:

SMILES-based*:
- dataset-curated-held-out.csv
- dataset-synthetic-test.csv
- dataset-synthetic-train.csv
- dataset-synthetic-validation.csv
Graph-based:
- dataset-graph-based-test.csv
- dataset-graph-based-train.csv
- dataset-graph-based-validation.csv

*NOTE: The full list of generated synthetic PROTAC SMILES is in the "PROTAC SMILES" column of the SMILES-based CSV files: dataset-synthetic-test.csv, dataset-synthetic-train.csv, and dataset-synthetic-validation.csv. The graph-based datasets are used only to train and evaluate the XGBoost model.
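As a minimal sketch of reading that column, the snippet below uses only the Python standard library. It assumes the CSV files have been extracted from Datasets.zip into one directory (the archive layout is not described here) and that the column header is exactly "PROTAC SMILES" as stated in the note. The function names read_protac_smiles and collect_synthetic_smiles are hypothetical helpers, not part of the PROTAC-Splitter code.

```python
import csv
from pathlib import Path

# File names as listed in this record; the directory they live in is an assumption.
SYNTHETIC_SPLITS = {
    "train": "dataset-synthetic-train.csv",
    "validation": "dataset-synthetic-validation.csv",
    "test": "dataset-synthetic-test.csv",
}


def read_protac_smiles(csv_path, column="PROTAC SMILES"):
    """Return the non-empty values of one column from a dataset CSV."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        # Skip rows where the cell is empty or missing.
        return [row[column] for row in reader if row[column]]


def collect_synthetic_smiles(dataset_dir):
    """Read the PROTAC SMILES of every synthetic split present in dataset_dir."""
    found = {}
    for split, filename in SYNTHETIC_SPLITS.items():
        path = Path(dataset_dir) / filename
        if path.exists():
            found[split] = read_protac_smiles(path)
    return found
```

Calling collect_synthetic_smiles on the extraction directory returns a dictionary keyed by split name. Splits whose file is missing are omitted rather than raising an error. With roughly 1.3 million structures in total, the lists fit in memory on most machines, but a streaming reader may be preferable for constrained environments.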

Pre-trained models:

PROTAC-Splitter-Transformer.zip: Transformer-based sequence-to-sequence model.
PROTAC-Splitter-XGBoost.joblib: Graph-based XGBoost model.
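The two model files can be unpacked and opened roughly as sketched below. This is an illustration under assumptions: the internal layout of PROTAC-Splitter-Transformer.zip is not documented in this record, so the sketch only extracts it and lists its members, and the .joblib file is assumed to be loadable with joblib.load in an environment where joblib and xgboost are installed. The helper names extract_archive and load_xgboost_model are hypothetical. For actual inference, refer to the accompanying code repository.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Unpack a ZIP archive and return the sorted list of its member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


def load_xgboost_model(joblib_path):
    """Deserialize the XGBoost model file.

    Assumption: the file was written with joblib.dump, so joblib.load
    can restore it when joblib and xgboost are installed.
    """
    # Third-party import kept local so extract_archive works without joblib.
    import joblib

    return joblib.load(joblib_path)
```

Because joblib files are pickle-based, loading one executes code stored in the file, so only files downloaded from a trusted source should be loaded this way.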

Original PROTAC data used for data curation can be obtained from:

PROTAC-DB-v3: http://cadd.zju.edu.cn/protacdb/.
PROTACpedia: https://protacpedia.weizmann.ac.il/ptcb/main.

License

This repository is open-source and available under the MIT License.

Contact Information

For questions or feedback, please contact Rocío Mercado: rocio.mercado@chalmers.se.

Files

Files (621.2 MB)

Name	MD5	Size
Datasets.zip	62ff51460e75b857461328e1d54c97b9	223.6 MB
PROTAC-Splitter-Transformer.zip	2c84a6aaa52e30ff2aef63ea102589ce	380.2 MB
PROTAC-Splitter-XGBoost.joblib	574b3913382ba371d0412b14dd32edef	17.4 MB
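The MD5 checksums listed for the three files can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's hashlib. The checksum values are copied from the file listing, while the helper names md5_of_file and verify_download are hypothetical, and the files are assumed to keep their original names on disk.

```python
import hashlib
from pathlib import Path

# Checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "Datasets.zip": "62ff51460e75b857461328e1d54c97b9",
    "PROTAC-Splitter-Transformer.zip": "2c84a6aaa52e30ff2aef63ea102589ce",
    "PROTAC-Splitter-XGBoost.joblib": "574b3913382ba371d0412b14dd32edef",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the published value for its name."""
    name = Path(path).name
    if name not in EXPECTED_MD5:
        raise KeyError(f"no published checksum for {name!r}")
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Reading in chunks keeps memory use flat even for the 380.2 MB Transformer archive. A False result from verify_download suggests the file should be downloaded again.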

Additional details

Funding: Knut and Alice Wallenberg Foundation

Repository URL: https://github.com/ribesstefano/PROTAC-Splitter
Programming language: Python

