CLASPP Training Testing Validation data (both initial prototype data and finalized data)

Gravel, Nathan

doi:10.5281/zenodo.16739128

Published August 4, 2025 | Version v2

Dataset Restricted


Gravel, Nathan (Data curator)¹

1. University of Georgia

This repo includes two versions of the data set: the prototype (24_05_07) and the finalized (25_05_11).

The major differences between the prototype and finalized data sets are as follows:

1. The prototype has more unsupervised clustering labels (60) than the finalized set (54)
2. The prototype has more clusters associated with Y-Phos, K-Sumo, and K-Malo
3. The prototype has K-Succ and the finalized set does not
4. The finalized set has PK-Hydr and the prototype does not
5. Both were put through the same curation pipeline but with different random seeds for sampling

Most differences are an outcome of testing the stability of each PTM type (class) in a multi-classification setting. Most PTMs (all that were tested) are stable (will converge) in a single binary classification setting. When K-Succ was added to the multi-classification setting, its performance and that of other PTM types dropped sharply (anecdotal behavior testing). The swap to a different finalized training set was primarily due to this clash in performance. In theory, one could fix this with different hyper-parameters. Difference 3 (the removal of K-Succ) is the major reason; all other differences serve to tell a better story and to clean up the data so that it coincides with the benchmark performance from Fig2 and S_Fig2.

Finalized (25_05_11)

train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-25_05_11.csv
- Used to train the final model; utilized in Fig4, Fig5, Fig6, S_Fig3, and S_Fig4
val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
- Used to help train the final model
test_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
- Used as the benchmark in Fig4 and S_Fig3
- Fig4 and S_Fig3 used the positive labels and, as the negative class, only the negative labels in this benchmark that share the same residue type
- Same positive labels as HUMAN_labs.txt

All data in the finalized data set are labeled using the unsupervised clustering labels (54) rather than final labels (20)
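The file names in this record encode the split, the negative sampling ratio, and the date tag of the data set version. As a minimal sketch, assuming the naming pattern can be inferred from the file names listed in this record (the pattern, the `describe` helper, and the returned field names are illustrative and not part of the CLASPP code), this metadata could be recovered as follows:

```python
import re

# Pattern inferred from the file names listed in this record (an assumption,
# not an official naming specification). "NegRaio" is spelled as in the files.
FILENAME_RE = re.compile(
    r"^(?P<split>train|val|test)_hd3_CustBL62SeqDistSpecClus_uniResRatio_"
    r"(?P<neg>1to\d+|Cust)NegRaio-(?P<date>\d{2}_\d{2}_\d{2})\.csv$"
)

# Date tags named in the record description.
VERSIONS = {"24_05_07": "prototype", "25_05_11": "finalized"}


def describe(filename):
    """Return split, negatives per positive, and version for one file name."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized file name: {filename}")
    neg = m.group("neg")
    # "1to60" -> 60; the custom-ratio file ("Cust") has no single number.
    negatives_per_positive = int(neg[3:]) if neg.startswith("1to") else None
    return {
        "split": m.group("split"),
        "negatives_per_positive": negatives_per_positive,
        "version": VERSIONS.get(m.group("date"), "unknown"),
    }
```

For example, `describe("test_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv")` identifies the file as the test split of the finalized version with one negative per positive.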

Prototype (24_05_07)

train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-24_05_07.csv
- Used to train the initial model that was used in Fig2, Fig3, S_Fig1, S_Fig2
- Segmented to make the Single Binary Classification models for Fig2 and S_Fig2
val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-24_05_07.csv
- Used to help train the initial model and used to benchmark Fig2, Fig3, S_Fig1, S_Fig2
- Fig3 and S_Fig1 used the positive labels, with all negative labels as the negative class, in this benchmark
- Fig2 and S_Fig2 used the positive labels, with all negative labels as the negative class, in this benchmark (Single Binary Classification)

All data in the prototype data set are labeled using the unsupervised clustering labels (60) rather than final labels (20)

Out-of-distribution

HUMAN_labs.txt
MOUSE_labs.txt
DROME_labs.txt
CAEEL_labs.txt
YEAST_labs.txt
ECOLI_labs.txt

All benchmarks here use final labels (20) rather than the unsupervised clustering labels (54)

Negative labels were only used if they share the same residue as the positive label

All positive and negative labels were under-sampled to a maximum of 500

A species-specific PTM type benchmark was only used if it has at least 100 positive examples and 50 negative examples
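The selection rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the curation code itself: the `(site_id, residue)` tuple layout, the `build_benchmark` name, and the seed handling are hypothetical, and the cap of 500 is assumed to apply to positives and negatives separately.

```python
import random

MAX_EXAMPLES = 500   # under-sampling cap stated in the record
MIN_POSITIVES = 100  # minimum positives for a benchmark to be used
MIN_NEGATIVES = 50   # minimum negatives for a benchmark to be used


def build_benchmark(positives, negatives, seed=0):
    """Apply the stated rules to one species-specific PTM type.

    positives, negatives: lists of (site_id, residue) tuples
    (a hypothetical layout chosen for this sketch).
    Returns (positives, negatives) after filtering and under-sampling,
    or None if the benchmark does not meet the minimum counts.
    """
    rng = random.Random(seed)
    # Keep only negatives whose residue matches a residue of the positives.
    pos_residues = {residue for _, residue in positives}
    matched = [n for n in negatives if n[1] in pos_residues]
    # Skip benchmarks with too few examples.
    if len(positives) < MIN_POSITIVES or len(matched) < MIN_NEGATIVES:
        return None
    # Under-sample each side to at most MAX_EXAMPLES.
    pos = rng.sample(positives, min(len(positives), MAX_EXAMPLES))
    neg = rng.sample(matched, min(len(matched), MAX_EXAMPLES))
    return pos, neg
```

Because the cap (500) is larger than both minimum counts, checking the minimums before or after under-sampling gives the same keep-or-skip decision.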

Different Res and Neg ratios

(performance was not good in practice, but these could work in theory)

train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to70NegRaio-24_05_07.csv
- Pos/Neg ratio is 1/70 (each class) - Medium/Easy neg Res ratio is uniform 1/20 - tot res ratio NOT uniform
train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to100NegRaio-24_05_07.csv
- Pos/Neg ratio is 1/100 (each class) - Medium/Easy neg Res ratio is uniform 1/20 - tot res ratio NOT uniform
train_hd3_CustBL62SeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
- Pos/Neg ratio is 1/1000 (each class) - Medium/Easy neg Res ratio is NOT uniform - tot res ratio is uniform
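To make the per-class Pos/Neg ratios concrete, the sketch below draws negatives for one class so that there are at most N negatives per positive, using a fixed random seed. It is only an illustration of the arithmetic: the `sample_negatives` function and its arguments are hypothetical, and the actual curation pipeline (including the Medium/Easy negative split and residue balancing) is not reproduced here.

```python
import random


def sample_negatives(n_positives, negatives, negatives_per_positive, seed=0):
    """Draw negatives for one class at a 1:negatives_per_positive ratio.

    If the pool holds fewer negatives than the ratio asks for, the whole
    pool is returned (in shuffled order) rather than raising an error.
    """
    rng = random.Random(seed)
    target = min(len(negatives), n_positives * negatives_per_positive)
    return rng.sample(negatives, target)
```

With 10 positives and a ratio of 1/70, this selects 700 negatives when the pool is large enough; changing `seed` changes which negatives are drawn, not how many.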

Other related repos

Repo	Link (will go live when submitted)	Description
GitHub	github_version_Data_cur	This version contains code but no data. You need to run the code to generate all the helper files (this will take some time)
GitHub	github_version_Forward	This version contains code but NOT any weights (files too big for GitHub)
Huggingface	huggingface_version_Forward	This version contains code and training weights
Zenodo	zenodo_version_training_data	Zenodo version of training/testing/validation data
webtool	webtool	webtool hosted on a server

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Other: NA

Is part of: Dataset: NA (Other)

Created: 2024-05-07

When the first version was created

NA

