Published July 3, 2025 | Version v1.0.0
Journal article Open

PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures

Description

Overview

Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules composed of an E3 ligase ligand, a linker, and a warhead targeting a protein of interest. Despite their modular structure, accurately identifying and annotating these components in PROTACs is challenging and typically relies on manual curation and predefined substructure matching.

To address this, we developed PROTAC-Splitter, a machine learning framework designed for automated annotation of PROTAC substructures.

This repository contains the datasets and pre-trained models necessary to replicate our work and apply PROTAC-Splitter to custom data, as reported in the publication:

“PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures” by Stefano Ribes, Ranxuan Zhang, Télio Cropsal, Anders Källberg, Christian Tyrchan, Eva Nittinger, Rocío Mercado

The accompanying code is available at: https://github.com/ribesstefano/PROTAC-Splitter.

Data Available

  • Synthetic Dataset: A dataset containing approximately 1.3 million PROTAC structures with annotated ligand splits.
  • Transformer Model: A sequence-to-sequence model achieving high exact-match accuracy (86%) on public data.
  • XGBoost Model: An XGBoost model trained on graph features, ensuring chemical validity and perfect reassembly accuracy.

File Descriptions

Datasets used for training, testing, and validating the models:

  • SMILES-based*:
    • dataset-curated-held-out.csv
    • dataset-synthetic-test.csv
    • dataset-synthetic-train.csv
    • dataset-synthetic-validation.csv
  • Graph-based:
    • dataset-graph-based-test.csv
    • dataset-graph-based-train.csv
    • dataset-graph-based-validation.csv

*NOTE: For the full list of generated synthetic PROTAC SMILES, please refer to the "PROTAC SMILES" column of the SMILES-based CSV files: dataset-synthetic-test.csv, dataset-synthetic-train.csv, and dataset-synthetic-validation.csv. The graph-based datasets are used only to train and evaluate the XGBoost model.
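As an illustration of the note above, the following is a minimal sketch of how the "PROTAC SMILES" column could be read from one of the SMILES-based CSV files using only the Python standard library. The function name `read_protac_smiles` is hypothetical and is not part of the PROTAC-Splitter code. Only the "PROTAC SMILES" column name is taken from this record. The sketch assumes the CSV files are comma-separated, UTF-8 encoded, and have a header row.

```python
import csv
from pathlib import Path


def read_protac_smiles(csv_path, column="PROTAC SMILES"):
    """Return the non-empty values of `column` from a CSV file, in file order.

    Hypothetical helper: assumes a comma-separated, UTF-8 file with a header row.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a clear message if the expected column is absent.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        # Skip rows where the column is empty or missing.
        return [row[column] for row in reader if row[column]]
```

Assuming the dataset archive has been extracted into the working directory, a call such as `read_protac_smiles("dataset-synthetic-train.csv")` would return the synthetic training SMILES as a list of strings. For the larger files, a streaming or pandas-based reader may be more practical than holding every row in memory.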

Pre-trained models:

  • PROTAC-Splitter-Transformer.zip: Transformer-based sequence-to-sequence model.
  • PROTAC-Splitter-XGBoost.joblib: Graph-based XGBoost model.

Original PROTAC data used for data curation can be obtained from:

License

This repository is open-source and available under the MIT License.

Contact Information

For questions or feedback, please contact Rocío Mercado: rocio.mercado@chalmers.se.

Files

Datasets.zip

Files (621.2 MB)

Size and MD5 checksum of each file:

  • 223.6 MB, md5:62ff51460e75b857461328e1d54c97b9
  • 380.2 MB, md5:2c84a6aaa52e30ff2aef63ea102589ce
  • 17.4 MB, md5:574b3913382ba371d0412b14dd32edef
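
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `file_md5` and `matches_checksum` are hypothetical, and the pairing of a local file with its checksum is left to the user, since this listing does not state which checksum belongs to which file name.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value.

    `expected` may be given with or without the "md5:" prefix used in the listing.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return file_md5(path) == expected
```

For example, `matches_checksum(local_path, "md5:574b3913382ba371d0412b14dd32edef")` returns `True` only if the file at `local_path` (a placeholder for wherever the 17.4 MB file was saved) has that digest. MD5 is adequate for detecting accidental corruption but should not be treated as a security guarantee.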

Additional details

Funding

Knut and Alice Wallenberg Foundation

Software

Repository URL
https://github.com/ribesstefano/PROTAC-Splitter
Programming language
Python