CLASPP: A unified model for predicting post-translational modifications (data and benchmarks)
Description
Two versions of the data sets are included in this repo: the prototype (24_05_07) and the finalized (25_05_11).
The major differences between the prototype and finalized data sets are as follows:
- The prototype has more unsupervised clustering labels (60) than the finalized set (54)
- The prototype has more clusters associated with Y-Phos, K-Sumo, and K-Malo
- The prototype has K-Succ and the finalized set does not
- The finalized set has PK-Hydr and the prototype does not
- Both were put through the same curation pipeline but with different random seeds for sampling
Most of the differences are an outcome of testing the stability of each PTM type (class) in a multi-class classification setting. Most PTM types (all that were tested) are stable (they converge) in a single binary classification setting. When the PTM type "K-Succ" was added to the multi-class setting, it degraded its own performance and that of other PTM types (based on anecdotal behavior testing). The switch to a different finalized training set was primarily due to this clash in performance. In theory, different hyper-parameters could fix this.
Three types of files are used: benchmark, individual, and matrix. Benchmarks are constrained to residues of interest for both negative and positive data. The “individual” and “matrix” files hold nearly identical data, but the “individual” data sets are the flattened version of the “matrix” files for easier data handling. The “matrix” files are intended for training and make it easier to keep track of peptides associated with multiple PTMs. The “uniprot_IDs_Pos” column in each file can hold multiple associated locations, which are listed and separated by “--”. The current mapping of shared locations per peptide only holds true for 21-mers. Increasing the context window size might break up the shared peptide associations, in which case the peptide might no longer be considered a multi-PTM event. The UniProt IDs come from multiple older versions of UniProt, so an ID can repeat; repeats are denoted as such in the accession. Where IDs repeat, the different entries carry the extension “-outOfDateV2” in the accession to maintain unique sequence mapping.
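As a small illustration of the “--” separator described above, the following sketch splits one “uniprot_IDs_Pos” cell into its individual entries. The example value and the layout of each entry (accession, underscore, position) are assumptions made for illustration only; check the actual files for the real layout.

```python
def split_locations(cell: str) -> list[str]:
    """Split one "uniprot_IDs_Pos" cell into its individual location entries."""
    # Entries are separated by a double hyphen. A single hyphen, as in the
    # "-outOfDateV2" accession suffix, is part of the entry and is preserved.
    return [part.strip() for part in cell.split("--") if part.strip()]


# Hypothetical cell value; the accession/position layout is an assumption.
example = "P12345_15--Q67890-outOfDateV2_204"
print(split_locations(example))
```

A peptide whose cell yields more than one entry is shared between several locations in the 21-mer mapping.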
Finalized Training, Testing, and Validation(25_05_11)
- individual_train_hd3_CustSeqDistSpecClus_1to60NegRaio-25_05_11.csv
- matrix_train_hd3_CustSeqDistSpecClus_1to60NegRaio-25_05_11.csv
- Used to train the final model, which was used in all figures except Fig2 and S2_Fig
- individual_val_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
- matrix_val_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
- Used to help train the final model
- individual_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
- matrix_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
- Used as the benchmark in Fig4 and S_Fig4
- Fig4 and S_Fig4 used the positive labels and, as the negative class, only the negative labels in this benchmark that share the same residue type
- Same positive labels as HUMAN_labs.txt
In the matrix files, all data in the finalized data set are labeled with the unsupervised clustering labels (54) rather than the final labels (20).
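Because the matrix files carry cluster labels rather than final labels, a user who wants final labels needs to collapse clusters that belong to the same PTM type. The sketch below shows one way this could be done, assuming each cluster maps to exactly one final label. The cluster names and the mapping are hypothetical placeholders and are not taken from the data set.

```python
def collapse_clusters(cluster_row, cluster_to_final, final_labels):
    """Collapse a multi-hot cluster row into a multi-hot final-label row."""
    # A final label is active if any of its clusters is active.
    active = {cluster_to_final[name] for name, flag in cluster_row.items() if flag}
    return [1 if label in active else 0 for label in final_labels]


# Hypothetical cluster names and mapping, for illustration only.
cluster_to_final = {
    "Y-Phos_c1": "Y-Phos",
    "Y-Phos_c2": "Y-Phos",
    "K-Sumo_c1": "K-Sumo",
}
final_labels = ["Y-Phos", "K-Sumo"]
row = {"Y-Phos_c1": 0, "Y-Phos_c2": 1, "K-Sumo_c1": 0}
print(collapse_clusters(row, cluster_to_final, final_labels))
```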
Finalized Benchmarks
- benchmark_test_HUMAN_hd3_Phosphorylation(ST)-25_05_11.csv
- benchmark_test_HUMAN_hd3_Phosphorylation(Y)-25_05_11.csv
- benchmark_test_HUMAN_hd3_Ubiquitination-25_05_11.csv
- benchmark_test_HUMAN_hd3_Acetylation(K)-25_05_11.csv
- benchmark_test_HUMAN_hd3_Acetylation(AM)-25_05_11.csv
- benchmark_test_HUMAN_hd3_N-linked-Glycosylation-25_05_11.csv
- benchmark_test_HUMAN_hd3_O-linked-Glycosylation-25_05_11.csv
- benchmark_test_HUMAN_hd3_Methylation-25_05_11.csv
- benchmark_test_HUMAN_hd3_Sumoylation-25_05_11.csv
- benchmark_test_HUMAN_hd3_Malonylation-25_05_11.csv
- benchmark_test_HUMAN_hd3_Sulfoxidation-25_05_11.csv
- benchmark_test_HUMAN_hd3_S-palmitoylation-25_05_11.csv
- benchmark_test_HUMAN_hd3_Glutathionylation-25_05_11.csv
- benchmark_test_HUMAN_hd3_Hydroxylation-25_05_11.csv
- All benchmarks are filtered for the residue of interest from individual_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
- out_of_distribution.zip
- -- YEAST/
- --/YEAST/benchmark_YEAST_S_Phosphorylation-25_05_11.csv
- --/YEAST/benchmark_YEAST_T_Phosphorylation-25_05_11.csv
- --/YEAST/benchmark_YEAST_Y_Phosphorylation-25_05_11.csv
- --/YEAST/benchmark_YEAST_K_Ubiquitination-25_05_11.csv
- --/YEAST/...
- -- MOUSE/
- --/MOUSE/...
- -- ECOLI/
- --/ECOLI/...
- -- DROME/
- --/DROME/...
- -- CAEEL/
- --/CAEEL/...
- Each species has files similar to the human Finalized Benchmarks, but they are residue specific.
- The benchmarks are separated by organism, and each is further separated by PTM type and residue of interest. Note that some of the benchmarks do not have enough data to be accurate.
- Negative labels are other PTM types and are only used if they share the residue(s) of interest for the PTM.
- All positive and negative labels were under-sampled to a maximum of 500.
- Each species-specific PTM type benchmark was only used if it had at least 100 positive examples and 50 negative examples.
- Used in Fig. 5 and S6 Fig.
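The benchmark rules listed above (negatives restricted to the residue(s) of interest, minimum counts, and under-sampling) can be sketched as follows. This is not the curation pipeline itself. It assumes that the site of interest sits at the center of each 21-mer, that the cap of 500 applies to positives and negatives separately, and that the minimum counts are checked before under-sampling; the function and parameter names are hypothetical.

```python
import random


def center_residue(peptide: str) -> str:
    """Return the middle residue of an odd-length peptide (index 10 of a 21-mer)."""
    return peptide[len(peptide) // 2]


def build_benchmark(positives, other_ptm_peptides, residues,
                    max_per_class=500, min_pos=100, min_neg=50, seed=0):
    """Return (positives, negatives) for one benchmark, or None if too small."""
    # Negatives come from other PTM types and must share the residue of interest.
    negatives = [p for p in other_ptm_peptides if center_residue(p) in residues]
    # Skip benchmarks that do not meet the minimum counts.
    if len(positives) < min_pos or len(negatives) < min_neg:
        return None
    # Under-sample each class to at most max_per_class examples.
    rng = random.Random(seed)
    pos = rng.sample(positives, min(len(positives), max_per_class))
    neg = rng.sample(negatives, min(len(negatives), max_per_class))
    return pos, neg
```

Fixing the seed keeps the sampled benchmark reproducible between runs.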
Prototype (24_05_07)
- individual_train_hd3_CustSeqDistSpecClus_1to60NegRaio-24_05_07.csv
- matrix_train_hd3_CustSeqDistSpecClus_1to60NegRaio-24_05_07.csv
- Used to train the initial model that was used in Fig2 and S_Fig2
- Segmented to build single binary classification models for Fig2 and S_Fig2
- individual_val_hd3_CustSeqDistSpecClus_1to1NegRaio-24_05_07.csv
- matrix_val_hd3_CustSeqDistSpecClus_1to1NegRaio-24_05_07.csv
- Used to help train the initial model and to benchmark Fig2 and S_Fig2
- Fig3 and S_Fig1 used the positive labels, with all negative labels as the negative class in this benchmark
- Fig2 and S_Fig2 used the positive labels, with all negative labels as the negative class in this benchmark (single binary classification)
In the matrix files, all data in the prototype data set are labeled with the unsupervised clustering labels (60) rather than the final labels (20).
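The training, validation, and test file names listed above share a common pattern, so the layout, split, negative ratio, and version can be read from the name. The sketch below is one possible parser; the field names are hypothetical, and the pattern keeps the "NegRaio" spelling exactly as it appears in the file names.

```python
import re

# Matches names such as
# individual_train_hd3_CustSeqDistSpecClus_1to60NegRaio-25_05_11.csv
NAME_PATTERN = re.compile(
    r"^(?P<layout>individual|matrix)_(?P<split>train|val|test)_hd3_"
    r"CustSeqDistSpecClus_1to(?P<neg_per_pos>\d+)NegRaio-"
    r"(?P<version>\d{2}_\d{2}_\d{2})\.csv$"
)


def parse_name(filename: str):
    """Return the fields encoded in a data set file name, or None if it does not match."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    info["neg_per_pos"] = int(info["neg_per_pos"])  # e.g. 60 for "1to60"
    return info


print(parse_name("matrix_val_hd3_CustSeqDistSpecClus_1to1NegRaio-24_05_07.csv"))
```

Names that do not follow this pattern, such as the benchmark files, return None.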
Different Res and Neg ratios (24_05_07)
(These did not perform well in practice but could work in theory.)
- individual_train_hd3_CustSeqDistSpecClus_1to70NegRaio-24_05_07.csv
- matrix_train_hd3_CustSeqDistSpecClus_1to70NegRaio-24_05_07.csv
- Pos/Neg ratio is 1/70 (each class) - Medium/Easy negative residue ratio is uniform 1/20 - total residue ratio NOT uniform
- individual_train_hd3_CustSeqDistSpecClus_1to100NegRaio-24_05_07.csv
- matrix_train_hd3_CustSeqDistSpecClus_1to100NegRaio-24_05_07.csv
- Pos/Neg ratio is 1/100 (each class) - Medium/Easy negative residue ratio is uniform 1/20 - total residue ratio NOT uniform
- individual_train_hd3_CustSeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
- matrix_train_hd3_CustSeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
- Pos/Neg ratio is 1/1000 (each class) - Medium/Easy negative residue ratio is NOT uniform - total residue ratio is uniform
Mappable FASTA
--------------------
These FASTA files can be used to get full sequence context. The UniProt IDs come from multiple older versions of UniProt, so an ID can repeat; repeats are denoted as such in the accession. Where IDs repeat, the different entries carry the extension “-outOfDateV2” in the accession to maintain unique sequence mapping.
- uniprot_version_control_train_test_val_25_05_11.fasta
- Used to map the "Finalized Training, Testing, and Validation(25_05_11)" and "Finalized Benchmarks" files.
- uniprot_version_control_train_test_val_24_05_07.fasta
- Used to map the "Prototype (24_05_07)" and "Different Res and Neg ratios (24_05_07)" files.
- uniprot_version_control_OOD_25_05_11.fasta
- Used to map the "out_of_distribution.zip" files.
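To illustrate how a FASTA file can supply sequence context, the sketch below parses FASTA lines into a dictionary and extracts a 21-mer around a 1-based position. It assumes the accession is the first whitespace-delimited token of each header line and pads with "-" where the window runs past the sequence ends; both are assumptions, so check them against the files before relying on them.

```python
def parse_fasta(lines):
    """Parse FASTA lines into {accession: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            # Assumption: the accession is the first token of the header.
            header = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def window(seq, pos, flank=10, pad="-"):
    """Return the (2 * flank + 1)-mer centered on 1-based position pos."""
    i = pos - 1
    left = seq[max(0, i - flank):i]
    right = seq[i + 1:i + 1 + flank]
    # Pad both sides so the site of interest stays in the center.
    return pad * (flank - len(left)) + left + seq[i] + right + pad * (flank - len(right))


# Usage (path from the list above):
# with open("uniprot_version_control_train_test_val_25_05_11.fasta") as handle:
#     sequences = parse_fasta(handle)
```

Passing a larger `flank` gives a wider context window, with the caveat noted earlier that shared peptide associations are only defined for 21-mers.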
Files (2.4 GB)
| MD5 checksum | Size |
|---|---|
| ddffdc14274967fb0930422e747f6789 | 43.1 kB |
| f2d7ddf8113ea9ae5e6ccdb51e03bb6a | 575.1 kB |
| 44f4c6cc72bf14371ab35e395734cb00 | 32.8 kB |
| be69642a41e4ba08a3cabbfd47ff7a78 | 575.3 kB |
| 115877148f2bae7fb2404fb8887d198d | 558.0 kB |
| ce8385a5456468d9962bba4f677de7bf | 594.6 kB |
| 01ed10ec6fc8bb9091a8faaafbbb52ec | 25.4 kB |
| 3a1a389f6b34e5fb44ed0a9780b4a465 | 182.4 kB |
| 49d1370f08a0cce6b947b8eb9ca0a2f6 | 174.3 kB |
| 17ba4e7bbd429bd791066a7487b040ef | 25.0 kB |
| f8be6b9d6a6b13a652d58d2b83b70120 | 32.3 kB |
| cad3cd31e20ff32f5201c9c6ca735934 | 28.7 kB |
| 6078ab6b5f71df2424a249695056f5de | 549.4 kB |
| cdc8d1e3d58f7a755714b6a5d2cd9892 | 575.1 kB |
| 7aa2841d1e567503a3a45502a5ebdb82 | 16.1 MB |
| 864ed6687d41846c7de73cd2c487ce87 | 122.0 MB |
| 212a94c30c7cf6552642e478211621fe | 68.0 MB |
| 5afd72a53c25f0f87f10f79b73f77f13 | 68.1 MB |
| 988ea2b59f439b6f8f0cb77ee4d5ffd4 | 81.5 MB |
| b6444ee65058daf29ce8423bf871d177 | 1.5 GB |
| 4ee312ce9060c1e9ab42f3cb805ed300 | 16.0 MB |
| fa6dcffb3ff1e50bd8e1883a5ac4e41e | 16.1 MB |
| 84285642e54c6a0b98ce0d38f389d8ca | 2.4 MB |
| f03ed302338bd8723b5b2ad2b20109f6 | 20.1 MB |
| 6a497762df92b292d942ee9bd5e66463 | 11.1 MB |
| c5a4337a356f5aedd2bff412c9febf9a | 9.9 MB |
| d5d96e487effe897c81fb2340ec76328 | 13.3 MB |
| 7fe5622d4e3725b626d802e5a898359a | 249.6 MB |
| 7f4314264258fecfffb9dbe64494c4d4 | 2.6 MB |
| d8caa9544632846dea0dbe8cda96f6d0 | 2.4 MB |
| 4a19a214d5a2155051fd3abdbe20f8f7 | 1.5 MB |
| 60c6c47e7293e036a09d32e8125885ac | 16.7 MB |
| e8444f51989c26998bd0c778322a7b18 | 187.3 MB |
| 7b94743faaf02bb1e97e8f2236b76dbd | 24.1 MB |
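The MD5 checksums listed above can be used to confirm that a download is intact. The following sketch computes a file's MD5 with Python's standard `hashlib` module and compares it with an expected value; the function names are hypothetical, and the file is read in chunks so that the larger files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as "md5:ddffdc14...". """
    return md5_of_file(path) == expected.strip().lower().removeprefix("md5:")
```

A mismatch usually means the download was truncated or corrupted and should be repeated.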