CLASPP Training, Testing, and Validation data (both initial prototype data and finalized data)
Description
There are 2 versions of the data sets included in this repo: the prototype (24_05_07) and the finalized (25_05_11).
The major differences between the prototype and the finalized data sets are as follows:
- Prototype has more unsupervised clustering labels (60) than finalized (54)
- Prototype has more clusters associated with Y-Phos, K-Sumo, and K-Malo
- Prototype has K-Succ and finalized does not
- Finalized has PK-Hydr and prototype does not
- Both were put through the same curation pipeline but with different random seeds for sampling
Most of the differences are an outcome of testing the stability of each PTM type (class) in a multi-class classification setting. Most PTM types (all that were tested) are stable (they converge) in a single binary classification setting. When K-Succ was added to the multi-class setting, however, its performance and that of other PTM types dropped sharply (based on anecdotal behavior testing). The swap to a different finalized training set was primarily due to this clash in performance; in theory, different hyper-parameters could fix it. Difference 3 (K-Succ) is the major reason for the swap; the other differences serve to tell a clearer story and to clean up the data so that it coincides with the benchmark performance from Fig2 and S_Fig2.
Finalized (25_05_11)
- train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-25_05_11.csv
  - Used to train the final model, which was used in Fig4, Fig5, Fig6, S_Fig3, and S_Fig4
- val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
  - Used to help train the final model
- test_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv
  - Used as the benchmark in Fig4 and S_Fig3
  - Fig4 and S_Fig3 used the positive labels and, as the negative class, only the negative labels in this benchmark that share the same residue type
  - Same positive labels as HUMAN_labs.txt
All data in the finalized data set are labeled using the unsupervised clustering labels (54) rather than final labels (20)
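A quick way to inspect the finalized splits after downloading them is sketched below. This is only a minimal sketch: it assumes the three CSV files sit in one local directory and that each file starts with a header row. The column layout is not documented in this record, so the code reports only the header and the row count. The names `FINALIZED_FILES`, `summarize_csv`, and `summarize_finalized` are hypothetical helpers, not part of CLASPP.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple

# File names copied from the list above (the "NegRaio" spelling is part of the names).
FINALIZED_FILES = {
    "train": "train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-25_05_11.csv",
    "val": "val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv",
    "test": "test_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-25_05_11.csv",
}


def summarize_csv(path) -> Tuple[List[str], int]:
    """Return (header, number of data rows); assumes the first row is a header."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows


def summarize_finalized(data_dir) -> Dict[str, Tuple[List[str], int]]:
    """Summarize every finalized split that is present in data_dir."""
    summary = {}
    for split, name in FINALIZED_FILES.items():
        path = Path(data_dir) / name
        if path.exists():  # skip splits that have not been downloaded
            summary[split] = summarize_csv(path)
    return summary
```

Calling `summarize_finalized("path/to/data")` returns a dictionary keyed by split name, which makes it easy to check that the downloaded files are complete before training.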
Prototype (24_05_07)
- train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to60NegRaio-24_05_07.csv
  - Used to train the initial model that was used in Fig2, Fig3, S_Fig1, and S_Fig2
  - Segmented to build single binary classification models for Fig2 and S_Fig2
- val_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to1NegRaio-24_05_07.csv
  - Used to help train the initial model and used as the benchmark for Fig2, Fig3, S_Fig1, and S_Fig2
  - Fig3 and S_Fig1 used the positive labels and all negative labels as the negative class in this benchmark
  - Fig2 and S_Fig2 used the positive labels and all negative labels as the negative class in this benchmark (single binary classification)
All data in the prototype data set are labeled using the unsupervised clustering labels (60) rather than final labels (20)
Out-of-distribution
- HUMAN_labs.txt
- MOUSE_labs.txt
- DROME_labs.txt
- CAEEL_labs.txt
- YEAST_labs.txt
- ECOLI_labs.txt
All benchmarks here use final labels (20) rather than the unsupervised clustering labels (54)
Negative labels were used only if they share the same residue as the positive label
All positive and negative labels were under-sampled to a maximum of 500
Each species-specific PTM type benchmark was used only if it has at least 100 positive examples and 50 negative examples
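The three selection rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the curation code itself: the `(site_id, residue)` record layout and the function name `build_benchmark` are hypothetical, the 500 cap is assumed to apply to positives and negatives separately, and the size thresholds are assumed to be checked before under-sampling (the text does not state the order).

```python
import random
from typing import List, Optional, Tuple

Site = Tuple[str, str]  # (site identifier, residue letter); hypothetical layout


def build_benchmark(
    positives: List[Site],
    negatives: List[Site],
    max_per_class: int = 500,
    min_pos: int = 100,
    min_neg: int = 50,
    seed: int = 0,
) -> Optional[Tuple[List[Site], List[Site]]]:
    """Apply the residue-matching, threshold, and under-sampling rules."""
    # Keep only negatives whose residue also occurs among the positives.
    residues = {residue for _, residue in positives}
    matched_neg = [site for site in negatives if site[1] in residues]

    # Drop the benchmark if either class is too small (assumed to be
    # checked before under-sampling).
    if len(positives) < min_pos or len(matched_neg) < min_neg:
        return None

    # Under-sample each class to at most max_per_class examples.
    rng = random.Random(seed)
    pos = rng.sample(positives, min(len(positives), max_per_class))
    neg = rng.sample(matched_neg, min(len(matched_neg), max_per_class))
    return pos, neg
```

A return value of `None` means the species and PTM type combination would be skipped; otherwise the two lists hold the sampled positive and negative sites.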
Different residue and negative ratios
(These did not perform well in practice but could work in theory)
- train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to70NegRaio-24_05_07.csv
  - Pos/Neg ratio is 1/70 (each class); medium/easy negative residue ratio is uniform (1/20); total residue ratio is NOT uniform
- train_hd3_CustBL62SeqDistSpecClus_uniResRatio_1to100NegRaio-24_05_07.csv
  - Pos/Neg ratio is 1/100 (each class); medium/easy negative residue ratio is uniform (1/20); total residue ratio is NOT uniform
- train_hd3_CustBL62SeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
  - Pos/Neg ratio is 1/1000 (each class); medium/easy negative residue ratio is NOT uniform; total residue ratio is uniform
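Because the split, the positive-to-negative ratio, and the version date are all encoded in the file names, a small parser can help when scripting over several of these files. The sketch below infers the pattern from the file names listed in this record; it is an assumption, not an official naming specification, and `parse_split_name` and `SplitInfo` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the file names in this record (an assumption).
# "NegRaio" is spelled exactly as it appears in the file names.
_NAME_RE = re.compile(
    r"^(?P<split>train|val|test)_.+_"
    r"(?:1to(?P<neg>\d+)|(?P<custom>Cust))NegRaio-"
    r"(?P<version>\d{2}_\d{2}_\d{2})\.csv$"
)


class SplitInfo(NamedTuple):
    split: str                              # "train", "val", or "test"
    negatives_per_positive: Optional[int]   # None for the custom-ratio file
    version: str                            # date tag, e.g. "24_05_07"


def parse_split_name(filename: str) -> SplitInfo:
    """Extract split, negative ratio, and version tag from a data file name."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    neg = int(match.group("neg")) if match.group("neg") else None
    return SplitInfo(match.group("split"), neg, match.group("version"))
```

For the custom-ratio file the parser returns `None` for the ratio, since the file name does not carry a number and the ratio is given only in the description above.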
Other related repos
Repo | Link (will go live when submitted) | Description |
GitHub | github_version_Data_cur | This version contains code but no data. You need to run the code to generate all the helper files (this will take some time) |
GitHub | github_version_Forward | This version contains code but NOT any weights (the files are too big for GitHub) |
Huggingface | huggingface_version_Forward | This version contains code and training weights |
Zenodo | zenodo_version_training_data | Zenodo version of the training/testing/validation data |
webtool | webtool | Webtool hosted on a server |
Dates
- Created: 2024-05-07 (when the first version was created)