PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures
Description
Overview
Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules composed of an E3 ligase ligand, a linker, and a warhead targeting a protein of interest. Although PROTACs are modular, accurately identifying and annotating these three components is challenging and typically relies on manual curation and predefined substructure matching.
To address this, we developed PROTAC-Splitter, a machine learning framework designed for automated annotation of PROTAC substructures.
This repository contains the datasets and pre-trained models necessary to replicate our work and apply PROTAC-Splitter to custom data, as reported in the publication:
“PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures” by Stefano Ribes, Ranxuan Zhang, Télio Cropsal, Anders Källberg, Christian Tyrchan, Eva Nittinger, Rocío Mercado
The accompanying code is available at: https://github.com/ribesstefano/PROTAC-Splitter.
Data and Models Available
- Synthetic Dataset: A dataset containing approximately 1.3 million PROTAC structures with annotated ligand splits.
- Transformer Model: A sequence-to-sequence model achieving high exact-match accuracy (86%) on public data.
- XGBoost Model: An XGBoost model trained on graph features, whose predicted splits are chemically valid and reassemble into the original PROTAC with perfect accuracy.
File Descriptions
Datasets used for training, testing, and validating the models:
- SMILES-based*:
  - dataset-curated-held-out.csv
  - dataset-synthetic-test.csv
  - dataset-synthetic-train.csv
  - dataset-synthetic-validation.csv
- Graph-based:
  - dataset-graph-based-test.csv
  - dataset-graph-based-train.csv
  - dataset-graph-based-validation.csv
*NOTE: The full list of generated synthetic PROTAC SMILES is in the "PROTAC SMILES" column of the SMILES-based CSV files dataset-synthetic-test.csv, dataset-synthetic-train.csv, and dataset-synthetic-validation.csv. The graph-based datasets are used only to train and evaluate the XGBoost model.
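As an illustration of the note above, the following minimal sketch collects the "PROTAC SMILES" column from the three synthetic splits using only the Python standard library. The helper names `read_protac_smiles` and `collect_synthetic_smiles` are hypothetical, and the sketch assumes the CSV files have been extracted into a single local directory; only the file names and the "PROTAC SMILES" column name come from this record.

```python
import csv
from pathlib import Path


def read_protac_smiles(csv_path, column="PROTAC SMILES"):
    """Return the non-empty values of `column` from one dataset CSV file."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a clear message if the expected column is absent.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        return [row[column] for row in reader if row[column]]


def collect_synthetic_smiles(data_dir):
    """Gather PROTAC SMILES from the three synthetic splits, keyed by split name.

    Assumes (hypothetically) that the CSV files sit directly inside `data_dir`.
    """
    splits = ("train", "validation", "test")
    return {
        split: read_protac_smiles(Path(data_dir) / f"dataset-synthetic-{split}.csv")
        for split in splits
    }
```

For example, `collect_synthetic_smiles("path/to/extracted/files")["train"]` would return the training SMILES as a list of strings. Because the training split is large, a streaming variant that yields rows one at a time may be preferable when memory is limited.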
Pre-trained models:
- PROTAC-Splitter-Transformer.zip: Transformer-based sequence-to-sequence model.
- PROTAC-Splitter-XGBoost.joblib: Graph-based XGBoost model.
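A minimal sketch of how these two files might be prepared for use is given below. It is not the authors' loading procedure: the helper names are hypothetical, the internal layout of the Transformer archive is not documented in this record, and the actual inference API is provided by the accompanying GitHub code. The sketch only assumes that the `.zip` is a standard ZIP archive and that the `.joblib` file can be read with `joblib.load`, which requires the `joblib` and `xgboost` packages to be installed.

```python
import zipfile
from pathlib import Path


def extract_transformer(zip_path, target_dir):
    """Unpack the Transformer archive and return the paths of its entries."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    return [target / name for name in names]


def load_xgboost_model(joblib_path):
    """Deserialize the XGBoost model file.

    joblib is a third-party package, imported lazily so that
    extract_transformer works without it. Assumption: the file was
    saved with joblib.dump and xgboost is importable at load time.
    """
    import joblib

    return joblib.load(joblib_path)
```

Since `joblib.load` relies on pickle, it should only be used on files obtained from a trusted source, such as this record.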
Original PROTAC data used for data curation can be obtained from:
- PROTAC-DB-v3: http://cadd.zju.edu.cn/protacdb/.
- PROTACpedia: https://protacpedia.weizmann.ac.il/ptcb/main.
License
This repository is open-source and available under the MIT License.
Contact Information
For questions or feedback, please contact Rocío Mercado: rocio.mercado@chalmers.se.
Files
Datasets.zip
Additional details
Funding
- Knut and Alice Wallenberg Foundation
Software
- Repository URL: https://github.com/ribesstefano/PROTAC-Splitter
- Programming language: Python