The path to my input files is ../../InputData/ThesaurusData/SearchResults/PS.
Parse the method (search engine and validation strategy) from folder names:
I manually add which search engine was used for each search.
I keep only the PSMs with a phosphorylation.
For the localisation scores of the search engines, I use the column “D score”. There is one score per PSM, so if a PSM has several phosphorylations, I assign the same score to all of them. To find the positions of the phosphorylations, I use the field “Annotated sequence”.
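As an illustration of this step, here is a minimal Python sketch of how one PSM-level score could be propagated to every phosphorylation site. The annotation format is an assumption: the sketch supposes that each phosphorylated residue is followed by a `<p>` tag in the annotated sequence, which may not match the actual export. The function name `phospho_sites` is hypothetical.

```python
def phospho_sites(annotated, d_score, tag="<p>"):
    """Return the bare sequence and a list of (1-based position, score).

    Assumption: a phosphorylated residue is immediately followed by `tag`.
    The single PSM-level score is copied to every site.
    """
    bare = []
    sites = []
    i = 0
    while i < len(annotated):
        if annotated.startswith(tag, i):
            # The tag refers to the residue just appended to `bare`.
            sites.append((len(bare), d_score))
            i += len(tag)
        else:
            bare.append(annotated[i])
            i += 1
    return "".join(bare), sites


sequence, sites = phospho_sites("AAS<p>PT<p>K", 0.9)
print(sequence, sites)
```

Both sites of the example peptide receive the same score, which mirrors the choice described above for PSMs with several phosphorylations.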
I also parse the localisation scores from the ptmRS and Ascore algorithms. Both are returned in the column “Probabilistic.PTM.score”.
Comment: Some localisation confidences are “Not Scored” or “Random”. I don’t take these into consideration.
Some entries of the “Probabilistic.PTM.score” field are empty for the Ascore: 8425, 8463 and 7340 for the pipelines PeptideShacker Comet Ascore TargetDecoy, PeptideShacker MSAmanda Ascore TargetDecoy and PeptideShacker X!Tandem Ascore TargetDecoy, respectively. (The totals are 54392, 54392, 48269, 48269, 47734 and 47734 for PeptideShacker Comet Ascore TargetDecoy, PeptideShacker Comet PhosphoRS TargetDecoy, PeptideShacker MSAmanda Ascore TargetDecoy, PeptideShacker MSAmanda PhosphoRS TargetDecoy, PeptideShacker X!Tandem Ascore TargetDecoy and PeptideShacker X!Tandem PhosphoRS TargetDecoy, respectively.)
Distribution of the ptmRS scores:
Distribution of the Ascores:
*The Ascore is the only scoring scheme tested in this study that does not range from 0 to 1 or from 0 to 100.
Distribution of the search engine scores:
I apply the localisation score thresholds:
* above 0.75. The data are not filtered yet; I indicate whether the localisation score passes the threshold in the field LocalisationsFilter.
* above 20 for the Ascore. The data are not filtered yet.
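The flagging step can be sketched in Python as follows. This is an illustration under assumptions, not the code used here: the row layout (`score`, `scoring` keys) and the function names are hypothetical, and the sketch assumes that the 0.75 threshold applies to both the ptmRS and the search engine scores, expressed on a 0 to 1 scale. Rows are flagged, not removed.

```python
# Assumed mapping of scoring scheme to threshold; only the Ascore value (20)
# is explicitly tied to a scheme in the text.
THRESHOLDS = {"ptmRS": 0.75, "SearchEngine": 0.75, "Ascore": 20.0}


def passes_threshold(score, scoring):
    """True when the score is strictly above the threshold of its scheme."""
    return score > THRESHOLDS[scoring]


def add_filter_flag(rows):
    """Return copies of the rows with a LocalisationsFilter field added.

    No row is dropped: the flag only records whether the threshold is passed.
    """
    return [
        dict(row, LocalisationsFilter=passes_threshold(row["score"], row["scoring"]))
        for row in rows
    ]


rows = [
    {"scoring": "ptmRS", "score": 0.80},
    {"scoring": "SearchEngine", "score": 0.40},
    {"scoring": "Ascore", "score": 25.0},
]
for row in add_filter_flag(rows):
    print(row)
```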
For all the different inputs, I create IDs of the phospho-peptides:
* PhosphopeptideID: concatenation of the sequence and the sorted localisations of the phosphorylations (separated with "_").
* PhosphosequenceID: concatenation of the sequence and the number of phosphorylations on the peptide (separated with "_").

When there are several scores for the phosphorylation localisations, I create one ID for each scoring. I define the scoring as “ptmRS” when it comes from the phosphoRS algorithm, or “SearchEngine” when it is the default localisation score of the pipeline.
So in the end, the tables contain two rows per PSM, one with each localisation scoring scheme.
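The ID construction described above can be sketched in Python as follows. The function names and the row layout are hypothetical, and the sketch assumes that several positions are joined with the same "_" separator as the sequence.

```python
def phosphopeptide_id(sequence, positions):
    """Sequence followed by the sorted phosphorylation positions, joined by '_'."""
    return "_".join([sequence] + [str(p) for p in sorted(positions)])


def phosphosequence_id(sequence, positions):
    """Sequence followed by the number of phosphorylations, joined by '_'."""
    return f"{sequence}_{len(positions)}"


def rows_for_psm(sequence, positions_by_scoring):
    """Build one row per scoring scheme for a single PSM."""
    return [
        {
            "Scoring": scoring,
            "PhosphopeptideID": phosphopeptide_id(sequence, positions),
            "PhosphosequenceID": phosphosequence_id(sequence, positions),
        }
        for scoring, positions in positions_by_scoring.items()
    ]


# One PSM where the two scoring schemes place the second phosphorylation differently.
for row in rows_for_psm("AASPTK", {"ptmRS": [5, 3], "SearchEngine": [3, 6]}):
    print(row)
```

In this example the two rows share the same PhosphosequenceID but have different PhosphopeptideIDs, since only the PhosphopeptideID depends on where the phosphorylations are placed.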
I remove the PSMs with empty Ascores.
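A minimal Python sketch of this removal step, assuming each PSM is a dictionary with a `Scoring` key and that the label "Ascore" marks rows scored with the Ascore (both are hypothetical names; only the column “Probabilistic.PTM.score” comes from the data):

```python
def is_empty(value):
    """Treat None and blank strings as empty."""
    return value is None or str(value).strip() == ""


def drop_empty_ascores(rows, field="Probabilistic.PTM.score"):
    """Keep every row except Ascore rows whose score field is empty."""
    return [
        row
        for row in rows
        if not (row.get("Scoring") == "Ascore" and is_empty(row.get(field)))
    ]


rows = [
    {"Scoring": "Ascore", "Probabilistic.PTM.score": ""},
    {"Scoring": "Ascore", "Probabilistic.PTM.score": "35.2"},
    {"Scoring": "ptmRS", "Probabilistic.PTM.score": "0.99"},
]
kept = drop_empty_ascores(rows)
print(len(rows) - len(kept), "PSM removed")
```

Comparing the number of rows before and after the call gives the count of removed PSMs per pipeline.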