Published August 4, 2025 | Version v2
Dataset Restricted

CLASPP Training Testing Validation data (both initial prototype data and finalized data)

  • 1. University of Georgia

Description

 

This repo includes two versions of the data sets: the prototype (24_05_07) and the finalized (25_05_11).

The major differences between the prototype and finalized data sets are as follows:

  1. The prototype has more unsupervised clustering labels (60) than the finalized set (54)
  2. The prototype has more clusters associated with Y-Phos, K-Sumo, and K-Malo
  3. The prototype has K-Succ and the finalized set does not
  4. The finalized set has PK-Hydr and the prototype does not
  5. Both were put through the same curation pipeline but with different random seeds for sampling

Most differences are an outcome of testing the stability of each PTM type (class) in a multi-class classification setting. Most PTM types (all that were tested) are stable (will converge) in a single binary classification setting. When K-Succ was added to the multi-class setting, its own performance and that of other PTM types dropped sharply (based on anecdotal behavior testing). The swap to a different finalized training set was primarily due to this clash in performance. In theory, different hyper-parameters could fix this. Difference 3 is the major reason for the swap; all other differences serve to tell a clearer story and to clean up the data so that it coincides with the benchmark performance from Fig2 and S_Fig2.

    

      

 Finalized (25_05_11) 

  • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-25_05_11.csv 
    • Used to train the final model, which was utilized in Fig4, Fig5, Fig6, S_Fig3, and S_Fig4
  • val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv  
    • Used to help train the final model  
  • test_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv  
    • Used as the benchmark in Fig4 and S_Fig3 
    • Fig4 and S_Fig3 used the positive labels, with only the negative labels that share the same residue type serving as the negative class in this benchmark
    • Same positive labels as HUMAN_labs.txt

All data in the finalized data set are labeled using the unsupervised clustering labels (54) rather than final labels (20)  
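
The file names in this record encode the split, the negative ratio, and the data set version. As a convenience, the minimal sketch below parses those fields from a file name. The pattern is inferred only from the names listed in this record, and the names `SplitFile`, `VERSIONS`, and `parse_split_filename` are hypothetical helpers, not part of the CLASPP code base.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the file names listed in this record.
# "NegRaio" is spelled exactly as it appears in the real file names.
_NAME_RE = re.compile(
    r"^(?P<split>train|val|test)_hd3_CustBL62SeqDistSpecClus_uniResRatio_"
    r"(?P<ratio>1to\d+|Cust)NegRaio-(?P<date>\d{2}_\d{2}_\d{2})\.csv$"
)

# Date tags used in this record for the two data set versions.
VERSIONS = {"24_05_07": "prototype", "25_05_11": "finalized"}


class SplitFile(NamedTuple):
    split: str                   # "train", "val", or "test"
    neg_per_pos: Optional[int]   # negatives per positive; None for the "Cust" file
    date: str                    # date tag, e.g. "25_05_11"
    version: str                 # "prototype", "finalized", or "unknown"


def parse_split_filename(name: str) -> SplitFile:
    """Extract split, negative ratio, and version from a data file name."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    ratio = m.group("ratio")
    neg = None if ratio == "Cust" else int(ratio.split("to")[1])
    date = m.group("date")
    return SplitFile(m.group("split"), neg, date, VERSIONS.get(date, "unknown"))
```

For example, `parse_split_filename("val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv")` yields the `val` split with one negative per positive from the finalized version.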

 


      

Prototype (24_05_07)   

  • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-24_05_07.csv  
    • Used to train the initial model that was used in Fig2, Fig3, S_Fig1, and S_Fig2 
    • Segmented to make Single Binary Classification models for Fig2 and S_Fig2 
  • val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-24_05_07.csv  
    • Used to help train the initial model and to benchmark Fig2, Fig3, S_Fig1, and S_Fig2 
    • Fig3 and S_Fig1 used the positive labels, with all negative labels as the negative class in this benchmark 
    • Fig2 and S_Fig2 used the positive labels, with all negative labels as the negative class in this benchmark (Single Binary Classification) 

All data in the prototype data set are labeled using the unsupervised clustering labels (60) rather than final labels (20)  
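
One way to picture the Single Binary Classification segmentation is as a relabeling of the multi-class data: rows of the target class become positives, rows of the negative class become negatives, and rows of every other PTM class are dropped. The sketch below is an illustration under assumptions, not the authors' pipeline. The `(sequence, label)` row layout, the `"NEG"` label string, and the function name `binary_subset` are all hypothetical, since the CSV column layout is not documented in this record.

```python
from typing import Iterable, List, Tuple

NEG_LABEL = "NEG"  # hypothetical name for the negative class


def binary_subset(
    rows: Iterable[Tuple[str, str]],
    target: str,
    neg_label: str = NEG_LABEL,
) -> List[Tuple[str, int]]:
    """Build a one-class binary task from multi-class (sequence, label) rows.

    Rows labeled `target` become 1, rows labeled `neg_label` become 0,
    and rows belonging to any other class are left out.
    """
    out: List[Tuple[str, int]] = []
    for seq, label in rows:
        if label == target:
            out.append((seq, 1))
        elif label == neg_label:
            out.append((seq, 0))
    return out


# Toy rows for illustration only; these are not real sequences from the data set.
toy_rows = [
    ("AAAYAAA", "Y-Phos"),
    ("AAAKAAA", "K-Sumo"),
    ("AAAYCCC", "NEG"),
    ("CCCKCCC", "NEG"),
]
y_phos_task = binary_subset(toy_rows, "Y-Phos")
```

In this sketch all negatives are kept regardless of residue, matching the "all negative labels" wording above.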

 

 

 

Out-of-distribution  

  •     HUMAN_labs.txt 
  •     MOUSE_labs.txt 
  •     DROME_labs.txt 
  •     CAEEL_labs.txt 
  •     YEAST_labs.txt 
  •     ECOLI_labs.txt  

    All benchmarks here use the final labels (20) rather than the unsupervised clustering labels (54)  

    Negative labels were only used if they share the same residue as the positive label 

    All positive and negative labels were under-sampled to a maximum of 500 

    Each species-specific PTM type benchmark was only used if it has at least 100 positive and 50 negative examples  
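
The selection rules above can be sketched as a small filter. This is a minimal illustration under assumptions: it treats the cap of 500 as applying to positives and negatives separately, represents each negative as a hypothetical `(site_id, residue)` pair, and uses the made-up name `build_benchmark`. The real `*_labs.txt` file layout is not described in this record.

```python
import random
from typing import List, Optional, Sequence, Tuple

MAX_PER_CLASS = 500  # under-sampling cap (assumed to apply to each class separately)
MIN_POS = 100        # minimum positives for a benchmark to be used
MIN_NEG = 50         # minimum negatives for a benchmark to be used


def build_benchmark(
    pos_sites: Sequence[str],
    neg_sites: Sequence[Tuple[str, str]],
    residue: str,
    seed: int = 0,
) -> Optional[Tuple[List[str], List[Tuple[str, str]]]]:
    """Apply the benchmark rules for one species and one PTM type.

    pos_sites: identifiers of positive sites for this PTM type.
    neg_sites: (site_id, residue) pairs for candidate negatives.
    residue:   residue letter of the positive class, e.g. "K".
    Returns None when the benchmark is too small to be used.
    """
    rng = random.Random(seed)
    # Keep only negatives that share the residue of the positive class.
    neg_same = [s for s in neg_sites if s[1] == residue]
    if len(pos_sites) < MIN_POS or len(neg_same) < MIN_NEG:
        return None
    # Under-sample each class to at most MAX_PER_CLASS examples.
    pos = rng.sample(list(pos_sites), min(len(pos_sites), MAX_PER_CLASS))
    neg = rng.sample(neg_same, min(len(neg_same), MAX_PER_CLASS))
    return pos, neg
```

A fixed `seed` keeps the under-sampled benchmark reproducible across runs.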

  

 

 

 Different residue and negative ratios

(performed poorly in practice but could work in theory)  

  • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to70NegRaio-24_05_07.csv  
    • Pos/Neg ratio is 1/70 (each class) - Medium/Easy neg residue ratio is uniform 1/20 - total residue ratio is NOT uniform 
  • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to100NegRaio-24_05_07.csv  
    • Pos/Neg ratio is 1/100 (each class) - Medium/Easy neg residue ratio is uniform 1/20 - total residue ratio is NOT uniform 
  • train_hd3_CustBL62SeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv  
    • Pos/Neg ratio is 1/1000 (each class) - Medium/Easy neg residue ratio is NOT uniform - total residue ratio is uniform 
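
To make the Pos/Neg ratios concrete, the sketch below draws negatives for one class at a fixed number of negatives per positive. It is a simplified illustration only: it does not model the Medium/Easy negative split or the residue ratios, whose definitions are not given in this record, and the function name `sample_negatives` is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_negatives(
    n_pos: int,
    negatives: Sequence[T],
    neg_per_pos: int,
    seed: int = 0,
) -> List[T]:
    """Sample negatives for one class at a 1:neg_per_pos Pos/Neg ratio.

    If fewer negatives are available than the ratio asks for,
    all available negatives are returned.
    """
    rng = random.Random(seed)
    k = min(len(negatives), n_pos * neg_per_pos)
    return rng.sample(list(negatives), k)


# With 3 positives, a 1/70 ratio asks for 210 negatives and 1/100 asks for 300.
pool = list(range(1000))
neg_70 = sample_negatives(3, pool, 70)
neg_100 = sample_negatives(3, pool, 100)
```

Sampling is done without replacement, so a class with few available negatives ends up below the nominal ratio rather than containing duplicates.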

 

 

Other related repos

Repo | Link (will go live when submitted) | Description
GitHub | github_version_Data_cur | This version contains code but no data. You need to run the code to generate all the helper files (this will take some time)
GitHub | github_version_Forward | This version contains code but NOT any weights (files too big for GitHub)
Huggingface | huggingface_version_Forward | This version contains code and training weights
Zenodo | zenodo_version_training_data | Zenodo version of the training/testing/validation data
webtool | webtool | Webtool hosted on a server

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Dates

Created
2024-05-07
When the first version was created
