The path to my input files is ../../InputData/ThesaurusData/SearchResults/PD.
Parse the method (search engine and validation strategy) from folder names:
I keep only the tables with ptmRS data.
To homogenise the outputs of the different pipelines, I rename “Amanda.Score” and “XCorr” to “Ions.Score” and “Apex.RT.in.min” to “RT.in.min”, and I remove the columns “Identity.Strict”, “Identity.Relaxed”, “Expectation.Value”, “Homology.Threshold”, “Peptides.Matched”, “Percolator.SVMScore”, “Percolator.q.Value”, “Search.Space”, “MS2.Errorin.ppm” and “MS.Amanda.Rank” (there are two rankings for MS Amanda searches; I keep “Search.Engine.Rank”).
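The renaming and column removal described above can be sketched as follows. This is only an illustration in plain Python, assuming each PSM is held as a dictionary keyed by column name; the helper name `homogenise_columns` is hypothetical, while the column names are the ones listed above.

```python
# Columns renamed to a common name across pipelines (names taken from the text).
RENAME = {
    "Amanda.Score": "Ions.Score",
    "XCorr": "Ions.Score",
    "Apex.RT.in.min": "RT.in.min",
}

# Pipeline-specific columns that are removed (names taken from the text).
DROP = {
    "Identity.Strict", "Identity.Relaxed", "Expectation.Value",
    "Homology.Threshold", "Peptides.Matched", "Percolator.SVMScore",
    "Percolator.q.Value", "Search.Space", "MS2.Errorin.ppm",
    "MS.Amanda.Rank",
}


def homogenise_columns(row):
    """Return a copy of one PSM row with harmonised column names.

    Assumption: `row` is a dict mapping column name to value, and a single
    table never contains both "Amanda.Score" and "XCorr".
    """
    return {RENAME.get(name, name): value
            for name, value in row.items()
            if name not in DROP}


example = {"XCorr": 2.5, "MS.Amanda.Rank": 1, "Sequence": "PEPTIDE"}
harmonised = homogenise_columns(example)
```

Here `harmonised` keeps “Sequence”, carries the score under “Ions.Score”, and no longer contains “MS.Amanda.Rank”.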
I manually add a column indicating which search engine was used for the search.
I keep only the high-confidence PSMs.
I keep only the rank-1 PSMs.
I keep only the PSMs with a phosphorylation.
Some sequences from fragment proteins are identified with an “x” at the N-terminus, which indicates that the peptide does not correspond to the start of the protein. This leads to wrongly assigned phosphorylations (on a non-existent “x” amino acid), which I remove.
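The four filters above can be combined into a single predicate. The sketch below rests on several assumptions that are not in the text: the columns “Confidence” and “Modifications” and their value formats (for instance "S3(Phospho)") are hypothetical, and discarding the whole PSM when a phosphorylation sits on “x” is only one possible reading of the last step. Only “Search.Engine.Rank” is a column name taken from the text.

```python
import re

# Assumed modification format: residue letter, position, "(Phospho)",
# e.g. "S3(Phospho)". This format is an assumption for illustration.
PHOSPHO_SITE = re.compile(r"([A-Za-z])(\d+)\(Phospho\)")


def keep_psm(row):
    """Return True if a PSM passes the four filters described above.

    Assumptions: "Confidence" holds "High" for high-confidence PSMs and
    "Modifications" is a string listing the modifications of the peptide.
    """
    sites = PHOSPHO_SITE.findall(row["Modifications"])
    return (
        row["Confidence"] == "High"                         # high confidence only
        and row["Search.Engine.Rank"] == 1                  # rank 1 only
        and len(sites) > 0                                  # at least one phosphorylation
        and all(residue != "x" for residue, _ in sites)     # none on the artificial "x"
    )


psms = [
    {"Confidence": "High", "Search.Engine.Rank": 1, "Modifications": "S3(Phospho); M5(Oxidation)"},
    {"Confidence": "High", "Search.Engine.Rank": 1, "Modifications": "x0(Phospho)"},
    {"Confidence": "Medium", "Search.Engine.Rank": 1, "Modifications": "T2(Phospho)"},
    {"Confidence": "High", "Search.Engine.Rank": 2, "Modifications": "Y7(Phospho)"},
    {"Confidence": "High", "Search.Engine.Rank": 1, "Modifications": "M5(Oxidation)"},
]
kept = [row for row in psms if keep_psm(row)]
```

In this toy input only the first PSM is kept.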
Distribution of the ptmRS scores:
Distribution of the delta scores:
I duplicate the rows to get, for each PSM, one row with the ptmRS localisation and another one with the delta score.
I apply the threshold on the localisation score: above 0.75. The data are not filtered yet; I only indicate whether the localisation score passes the threshold in the field LocalisationsFilter.
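The duplication and flagging steps can be sketched as a wide-to-long reshaping. The input columns “ptmRS.Score” and “Delta.Score” and the output columns “Scoring” and “LocalisationScore” are hypothetical names; only LocalisationsFilter and the 0.75 threshold come from the text. The sketch also assumes that both scores are on a 0 to 1 scale, that the same threshold applies to both, and that the delta score is the one labelled “SearchEngine”.

```python
THRESHOLD = 0.75  # localisation score must be strictly above this value

# (label of the scoring, hypothetical input column holding that score)
SCORINGS = (("ptmRS", "ptmRS.Score"), ("SearchEngine", "Delta.Score"))


def to_long(rows):
    """Duplicate each PSM into one row per localisation scoring.

    Nothing is filtered out: each output row only records whether its
    score passes the threshold in "LocalisationsFilter".
    """
    score_columns = {column for _, column in SCORINGS}
    long_rows = []
    for row in rows:
        shared = {k: v for k, v in row.items() if k not in score_columns}
        for scoring, column in SCORINGS:
            new_row = dict(shared)
            new_row["Scoring"] = scoring
            new_row["LocalisationScore"] = row[column]
            new_row["LocalisationsFilter"] = row[column] > THRESHOLD
            long_rows.append(new_row)
    return long_rows


wide = [{"Sequence": "AASPK", "ptmRS.Score": 0.9, "Delta.Score": 0.5}]
long_table = to_long(wide)
```

For this single PSM, `long_table` contains two rows: the “ptmRS” row passes the threshold and the “SearchEngine” row does not.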
For all the different inputs, I create IDs of the phospho-peptides:
PhosphopeptideID: concatenation of the sequence and the sorted localisations of the phosphorylations (separated with "_").
PhosphosequenceID: concatenation of the sequence and the number of phosphorylations on the peptide (separated with "_").
When there are several localisation scores for the phosphorylations, I create one ID for each scoring. I define the scoring as “ptmRS” when it comes from the phosphoRS algorithm, or “SearchEngine” when it is the default localisation score of the pipeline.
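The two identifiers can be built as in the sketch below. It assumes that the localisations are available as integer positions within the peptide and that every element is joined with "_"; the exact representation of a localisation in the real tables is not specified in the text, and the function names are hypothetical.

```python
def phosphopeptide_id(sequence, positions):
    """Sequence followed by the sorted phosphorylation positions."""
    return "_".join([sequence] + [str(p) for p in sorted(positions)])


def phosphosequence_id(sequence, positions):
    """Sequence followed by the number of phosphorylations."""
    return f"{sequence}_{len(positions)}"


# Hypothetical peptide phosphorylated at positions 8 and 4.
peptide_id = phosphopeptide_id("PEPTIDESK", [8, 4])
sequence_id = phosphosequence_id("PEPTIDESK", [8, 4])
```

Two PSMs of the same peptide with different site assignments get different PhosphopeptideID values but share the same PhosphosequenceID, provided they carry the same number of phosphorylations.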