Published November 21, 2025 | Version v1
Dataset Open

CLASPP: A unified model for predicting post-translational modifications (data and benchmarks)

Authors/Creators

Description

Two versions of the data sets are included in this repository: the prototype (24_05_07) and the finalized version (25_05_11).

The major differences between the prototype and finalized data sets are as follows:

  1. Prototype has more unsupervised clustering labels (60) than finalized (54)
    • Prototype has more clusters associated with Y-Phos, K-Sumo, and K-Malo
    • Prototype has K-Succ and finalized does not  
    • Finalized has PK-Hydr and Prototype does not
  2. They are put through the same curation pipeline but with different random seeds for sampling  

Most of the differences are an outcome of testing the stability of each PTM type (class) in a multi-class classification setting. Every PTM type that was tested is stable (will converge) in a single binary classification setting. When the PTM type "K-Succ" was added to the multi-class setting, it degraded its own performance and that of other PTM types (anecdotal behavior testing). The swap to a different finalized training set was primarily due to this clash in performance. In theory, different hyper-parameters could fix this.

Three types of files are used (benchmark, individual, and matrix). Benchmarks are constrained to residues of interest for both negative and positive data. “Individual” and “matrix” hold nearly identical data, but the “individual” datasets are the flattened version of the “matrix” file for easier data handling. The “matrix” version of the file is there for training and as an easier way to keep track of peptides associated with multiple PTMs. The “uniprot_IDs_Pos” column in each file can hold multiple associated locations, which are listed out and separated by “--”. The current mapping of the shared locations per peptide only holds true if you are looking at 21mers. Any increase in context window size might break up the shared peptide associations, in which case the peptide might no longer be considered a multi-PTM event. The uniprot IDs are from multiple older versions of uniprot, so uniprot IDs can be repeated, but repeats are denoted as such in the accession. Due to possible repeated uniprot IDs, differing uniprot IDs are denoted with the extension “-outOfDateV2” in the accession to maintain unique sequence mapping.
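As a minimal sketch of handling the “uniprot_IDs_Pos” column described above, the snippet below splits one cell on the “--” separator and flags entries that carry the “-outOfDateV2” extension. The internal layout of each location entry is not specified here, so the example value is hypothetical and every entry is treated as an opaque string.

```python
SEPARATOR = "--"                   # separator between locations, as described above
OUT_OF_DATE_TAG = "-outOfDateV2"   # accession extension, as described above


def split_locations(cell):
    """Split one "uniprot_IDs_Pos" cell into its individual location entries."""
    if cell is None:
        return []
    return [part.strip() for part in str(cell).split(SEPARATOR) if part.strip()]


def is_multi_location(cell):
    """True when a peptide is associated with more than one location."""
    return len(split_locations(cell)) > 1


def has_out_of_date_accession(entry):
    """True when an entry carries the "-outOfDateV2" extension."""
    return OUT_OF_DATE_TAG in entry


# Hypothetical cell value; the real entry layout may differ.
example = "P12345_101--Q67890-outOfDateV2_55"
for entry in split_locations(example):
    print(entry, has_out_of_date_accession(entry))
```

Splitting on the two-character separator leaves the single hyphen of “-outOfDateV2” intact, which is why the separator is matched as a whole string rather than character by character.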

 

Finalized Training, Testing, and Validation(25_05_11) 

  • individual_train_hd3_CustSeqDistSpecClus_1to60NegRaio-25_05_11.csv
  • matrix_train_hd3_CustSeqDistSpecClus_1to60NegRaio-25_05_11.csv 
    • Used to train the final model, which was utilized in all figures except Fig2 and S2_Fig

 

  • individual_val_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
  • matrix_val_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv  
    • Used to help train the final model

 

  • individual_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
  • matrix_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv 
    • Used as the benchmark in Fig4 and S_Fig4
    • Fig4 and S_Fig4 used the positive labels and, as the negative class, only the negative labels that share the same residue type in this benchmark
    • Same positive labels as HUMAN_labs.txt

All data in the finalized data set are labeled using the unsupervised clustering labels (54) rather than final labels (20) for the matrix files

 

 

Finalized Benchmarks

  • benchmark_test_HUMAN_hd3_Phosphorylation(ST)-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Phosphorylation(Y)-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Ubiquitination-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Acetylation(K)-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Acetylation(AM)-25_05_11.csv
  • benchmark_test_HUMAN_hd3_N-linked-Glycosylation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_O-linked-Glycosylation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Methylation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Sumoylation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Malonylation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Sulfoxidation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_S-palmitoylation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Glutathionylation-25_05_11.csv
  • benchmark_test_HUMAN_hd3_Hydroxylation-25_05_11.csv
    • All benchmarks are filtered for the residue of interest from individual_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
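The residue-of-interest filtering mentioned above could look roughly like the following sketch. It assumes the peptides are 21mers whose site of interest is the middle residue, and it uses a hypothetical column name ("peptide"); neither the column names nor the exact filtering code are stated here, so adjust both to the actual CSV headers.

```python
import csv
import io

WINDOW = 21            # peptide length mentioned in the description
CENTER = WINDOW // 2   # assumption: the site of interest is the middle residue


def center_residue(peptide):
    """Return the middle residue of a 21mer."""
    if len(peptide) != WINDOW:
        raise ValueError(f"expected a {WINDOW}mer, got length {len(peptide)}")
    return peptide[CENTER]


def filter_by_residue(rows, residues, peptide_column="peptide"):
    """Keep rows whose central residue is one of `residues` (e.g. "ST")."""
    wanted = set(residues)
    return [row for row in rows if center_residue(row[peptide_column]) in wanted]


# Tiny in-memory CSV with hypothetical column names and made-up peptides.
flank = "A" * 10
demo_csv = io.StringIO(
    "peptide,label\n"
    f"{flank}S{flank},1\n"
    f"{flank}K{flank},0\n"
    f"{flank}T{flank},0\n"
)
rows = list(csv.DictReader(demo_csv))
st_rows = filter_by_residue(rows, "ST")
print(len(st_rows), "rows kept for S/T")
```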

 

  • out_of_distribution.zip
    • -- YEAST/
      • --/YEAST/benchmark_YEAST_S_Phosphorylation-25_05_11.csv
      • --/YEAST/benchmark_YEAST_T_Phosphorylation-25_05_11.csv
      • --/YEAST/benchmark_YEAST_Y_Phosphorylation-25_05_11.csv
      • --/YEAST/benchmark_YEAST_K_Ubiquitination-25_05_11.csv
      • --/YEAST/...
    • -- MOUSE/
      • --/MOUSE/...
    • -- ECOLI/
      • --/ECOLI/...
    • -- DROME/
      • --/DROME/...
    • -- CAEEL/
      • --/CAEEL/...
    • Each species has similar files to the human Finalized Benchmarks but they are residue specific.
    • The benchmarks are separated by organism, and each is further separated by PTM type and residue of interest. Do note that some of the benchmarks do not have enough data to be accurate
    • Negative labels are drawn from other PTM types and are only used if they share the residue(s) of interest for the PTM
    • All positive and negative labels were under-sampled to have a max of 500
    • Each species-specific PTM type benchmark was only used if it has at least 100 positive examples and 50 negatives
    • Used in Fig. 5 and S6 Fig.
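The size rules listed above (a cap of 500 per label, and a minimum of 100 positives and 50 negatives) can be sketched as follows. This is an illustration under assumptions, not the curation pipeline itself: the function names are hypothetical, the minimum-size check is applied before under-sampling, and the sampling is plain uniform random sampling with a fixed seed.

```python
import random

MAX_PER_CLASS = 500   # cap on positives and on negatives, from the description
MIN_POSITIVES = 100   # minimum positives for a benchmark to be used
MIN_NEGATIVES = 50    # minimum negatives for a benchmark to be used


def undersample(items, cap, rng):
    """Return at most `cap` items, drawn uniformly at random without replacement."""
    items = list(items)
    if len(items) <= cap:
        return items
    return rng.sample(items, cap)


def build_benchmark(positives, negatives, seed=0):
    """Return (positives, negatives) after capping, or None if the set is too small."""
    positives, negatives = list(positives), list(negatives)
    if len(positives) < MIN_POSITIVES or len(negatives) < MIN_NEGATIVES:
        return None
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return (
        undersample(positives, MAX_PER_CLASS, rng),
        undersample(negatives, MAX_PER_CLASS, rng),
    )


result = build_benchmark(range(1200), range(80))
if result is not None:
    print(len(result[0]), len(result[1]))
```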

 

 

Prototype (24_05_07)     

  • individual_train_hd3_CustSeqDistSpecClus_1to60NegRaio-24_05_07.csv
  • matrix_train_hd3_CustSeqDistSpecClus_1to60NegRaio-24_05_07.csv
    • Used to train the initial model that was used in Fig2 and S_Fig2
    • Segmented and made into Single Binary Classification models for Fig2 and S_Fig2

 

  • individual_val_hd3_CustSeqDistSpecClus_1to1NegRaio-24_05_07.csv 
  • matrix_val_hd3_CustSeqDistSpecClus_1to1NegRaio-24_05_07.csv  
    • Used to help train the initial model and used to benchmark Fig2 and S_Fig2
    • Fig3 and S_Fig1 used the positive labels and all negative labels as the negative class in this benchmark
    • Fig2 and S_Fig2 used the positive labels and all negative labels as the negative class in this benchmark (Single Binary Classification)

All data in the prototype data set are labeled using the unsupervised clustering labels (60) rather than final labels (20) for the matrix files

 

 

Different Res and Neg ratios (24_05_07)

(did not perform well in practice but could work in theory)

  • individual_train_hd3_CustSeqDistSpecClus_1to70NegRaio-24_05_07.csv
  • matrix_train_hd3_CustSeqDistSpecClus_1to70NegRaio-24_05_07.csv
    • Pos/Neg ratio is 1/70 (each class) - Medium/Easy negative residue ratio is uniform 1/20 - total residue ratio NOT uniform 

 

  • individual_train_hd3_CustSeqDistSpecClus_1to100NegRaio-24_05_07.csv
  • matrix_train_hd3_CustSeqDistSpecClus_1to100NegRaio-24_05_07.csv
    • Pos/Neg ratio is 1/100 (each class) - Medium/Easy negative residue ratio is uniform 1/20 - total residue ratio NOT uniform

 

  • individual_train_hd3_CustSeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv 
  • matrix_train_hd3_CustSeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
    • Pos/Neg ratio is 1/1000 (each class) - Medium/Easy negative residue ratio is NOT uniform - total residue ratio is uniform 
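The training, validation, and test file names follow a regular pattern, so their properties can be read off programmatically. The sketch below is an assumption-based parser derived only from the file names listed on this page (including the "NegRaio" spelling used in the names); the function and field names are hypothetical.

```python
import re

# Pattern inferred from the file names listed on this page.
FILENAME_PATTERN = re.compile(
    r"^(?P<layout>individual|matrix)"
    r"_(?P<split>train|val|test)"
    r"_hd3_CustSeqDistSpecClus_"
    r"(?P<uniform_residue>uniResRatio_)?"
    r"(?:1to(?P<neg_ratio>\d+)|Cust)NegRaio"
    r"-(?P<date>\d{2}_\d{2}_\d{2})\.csv$"
)


def parse_dataset_filename(name):
    """Return a dict describing the file, or None if the name does not match."""
    match = FILENAME_PATTERN.match(name.strip())
    if match is None:
        return None
    ratio = match.group("neg_ratio")
    return {
        "layout": match.group("layout"),
        "split": match.group("split"),
        # negatives per positive; None for the "CustNegRaio" files
        "neg_per_pos": int(ratio) if ratio is not None else None,
        "uniform_residue_ratio": match.group("uniform_residue") is not None,
        "date": match.group("date"),
    }


print(parse_dataset_filename(
    "individual_train_hd3_CustSeqDistSpecClus_1to70NegRaio-24_05_07.csv"
))
```

Benchmark files use a different naming scheme and are deliberately not matched by this pattern.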

 

Mappable FASTA

--------------------

Here are the fasta files that can be used to get full sequence context. The uniprot IDs are from multiple older versions of uniprot, so uniprot IDs can repeat, but repeats are denoted as such in the accession. Due to possible repeated uniprot IDs, differing uniprot IDs are denoted with the extension “-outOfDateV2” in the accession to maintain unique sequence mapping.

  • uniprot_version_control_train_test_val_25_05_11.fasta
    • Used to map the "Finalized Training, Testing, and Validation(25_05_11)" and "Finalized Benchmarks" files. 
  • uniprot_version_control_train_test_val_24_05_07.fasta
    • Used to map the "Prototype (24_05_07)" and "Different Res and Neg ratios (24_05_07)" files.
  • uniprot_version_control_OOD_25_05_11.fasta
    • Used to map the "out_of_distribution.zip" files.
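To recover full sequence context from these FASTA files, one possible approach is to index the sequences by accession and cut a window around the site. The sketch below makes several assumptions that are not confirmed here: the accession is either the whole first header token or the second "|"-separated field of a UniProt-style header, positions are 1-based, and windows running off the sequence ends are padded with "-". The demo records are made up.

```python
def header_to_accession(header):
    """Extract an accession from a FASTA header (assumed layouts only)."""
    parts = header.split()
    token = parts[0] if parts else ""
    fields = token.split("|")
    # Assumption: "db|ACCESSION|NAME" headers keep the accession in field 2.
    return fields[1] if len(fields) >= 3 else token


def read_fasta(lines):
    """Parse FASTA lines into a {accession: sequence} dict."""
    sequences = {}
    accession, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                sequences[accession] = "".join(chunks)
            accession, chunks = header_to_accession(line[1:]), []
        elif accession is not None:
            chunks.append(line)
    if accession is not None:
        sequences[accession] = "".join(chunks)
    return sequences


def window(sequence, position, flank=10, pad="-"):
    """Return the (2 * flank + 1)-residue window centred on a 1-based position."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position is outside the sequence")
    left = sequence[max(0, index - flank):index]
    right = sequence[index + 1:index + 1 + flank]
    return (pad * (flank - len(left)) + left + sequence[index]
            + right + pad * (flank - len(right)))


# Made-up records; real headers in the FASTA files may look different.
demo = [">sp|P00001|DEMO_HUMAN demo", "MKTAYS", ">P00002-outOfDateV2", "MSS"]
index = read_fasta(demo)
print(window(index["P00001"], 2))
```

Raising `flank` gives a wider context window than the 21mers used in the data files, which is where the full sequences become useful.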

 

Files

Files (2.4 GB)

MD5 checksum  Size
md5:ddffdc14274967fb0930422e747f6789  43.1 kB
md5:f2d7ddf8113ea9ae5e6ccdb51e03bb6a  575.1 kB
md5:44f4c6cc72bf14371ab35e395734cb00  32.8 kB
md5:be69642a41e4ba08a3cabbfd47ff7a78  575.3 kB
md5:115877148f2bae7fb2404fb8887d198d  558.0 kB
md5:ce8385a5456468d9962bba4f677de7bf  594.6 kB
md5:01ed10ec6fc8bb9091a8faaafbbb52ec  25.4 kB
md5:3a1a389f6b34e5fb44ed0a9780b4a465  182.4 kB
md5:49d1370f08a0cce6b947b8eb9ca0a2f6  174.3 kB
md5:17ba4e7bbd429bd791066a7487b040ef  25.0 kB
md5:f8be6b9d6a6b13a652d58d2b83b70120  32.3 kB
md5:cad3cd31e20ff32f5201c9c6ca735934  28.7 kB
md5:6078ab6b5f71df2424a249695056f5de  549.4 kB
md5:cdc8d1e3d58f7a755714b6a5d2cd9892  575.1 kB
md5:7aa2841d1e567503a3a45502a5ebdb82  16.1 MB
md5:864ed6687d41846c7de73cd2c487ce87  122.0 MB
md5:212a94c30c7cf6552642e478211621fe  68.0 MB
md5:5afd72a53c25f0f87f10f79b73f77f13  68.1 MB
md5:988ea2b59f439b6f8f0cb77ee4d5ffd4  81.5 MB
md5:b6444ee65058daf29ce8423bf871d177  1.5 GB
md5:4ee312ce9060c1e9ab42f3cb805ed300  16.0 MB
md5:fa6dcffb3ff1e50bd8e1883a5ac4e41e  16.1 MB
md5:84285642e54c6a0b98ce0d38f389d8ca  2.4 MB
md5:f03ed302338bd8723b5b2ad2b20109f6  20.1 MB
md5:6a497762df92b292d942ee9bd5e66463  11.1 MB
md5:c5a4337a356f5aedd2bff412c9febf9a  9.9 MB
md5:d5d96e487effe897c81fb2340ec76328  13.3 MB
md5:7fe5622d4e3725b626d802e5a898359a  249.6 MB
md5:7f4314264258fecfffb9dbe64494c4d4  2.6 MB
md5:d8caa9544632846dea0dbe8cda96f6d0  2.4 MB
md5:4a19a214d5a2155051fd3abdbe20f8f7  1.5 MB
md5:60c6c47e7293e036a09d32e8125885ac  16.7 MB
md5:e8444f51989c26998bd0c778322a7b18  187.3 MB
md5:7b94743faaf02bb1e97e8f2236b76dbd  24.1 MB
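A downloaded file can be checked against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and which checksum belongs to which file has to be taken from the repository's file listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as "md5:ddffdc14...".

    The "md5:" prefix is optional and the comparison ignores case.
    """
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for the larger files in this record (the biggest is about 1.5 GB).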