Path to the search results parsed with this script: ../../InputData/FerriesEtAl/SearchResults/PD/.
Parse the method (search engine and validation strategy) from folder names:
I homogenize the column names (these can change depending on the search engine):
And combine the tables into one. Note: I do not know why the fields “Search.Engine.Rank” and “MS.Amanda.Rank” don’t have the same values in the PDPSM_MSAmanda table. For the moment I remove the latter.
I manually add the information of which search engine has been used for the search.
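The homogenisation, merge and search engine labelling can be sketched as follows. This is a minimal illustration, not the script itself: the alias table, the `Score` and `SearchEngine` column names and the example rows are all assumptions, and the real export headers may differ.

```python
# Illustrative aliases only (assumption): engine-specific headers -> common name.
COLUMN_ALIASES = {"XCorr": "Score", "Amanda.Score": "Score"}
# Field removed for now because it disagrees with "Search.Engine.Rank".
DROPPED_COLUMNS = {"MS.Amanda.Rank"}


def homogenize(row):
    """Rename engine-specific columns and drop the unwanted ones."""
    return {
        COLUMN_ALIASES.get(name, name): value
        for name, value in row.items()
        if name not in DROPPED_COLUMNS
    }


def combine(tables):
    """Merge per-engine tables (dict: engine label -> list of row dicts)."""
    combined = []
    for engine, rows in tables.items():
        for row in rows:
            clean = homogenize(row)
            clean["SearchEngine"] = engine  # label added manually per table
            combined.append(clean)
    return combined


# Made-up rows, for illustration only.
tables = {
    "SequestHT": [{"Sequence": "AASPK", "XCorr": 2.1, "Search.Engine.Rank": 1}],
    "MSAmanda": [
        {
            "Sequence": "AASPK",
            "Amanda.Score": 150.0,
            "Search.Engine.Rank": 1,
            "MS.Amanda.Rank": 2,
        }
    ],
}
combined = combine(tables)
```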
I keep only the phosphorylated ions.
I keep only the high-confidence PSMs.
I keep only the rank 1 PSMs.
I keep only the files corresponding to the following acquisition methods: HCDOT.
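The four filters above amount to one row-wise predicate. The sketch below assumes hypothetical column names (`Modifications`, `Confidence`, `Rank`, `AcquisitionMethod`) and made-up rows; only the filter values themselves come from the text.

```python
def keep_psm(row):
    """Return True for a phosphorylated, high-confidence, rank 1 HCDOT PSM."""
    return (
        "Phospho" in row["Modifications"]
        and row["Confidence"] == "High"
        and row["Rank"] == 1
        and row["AcquisitionMethod"] == "HCDOT"
    )


# Made-up rows, for illustration only.
psms = [
    {"Modifications": "S3(Phospho)", "Confidence": "High", "Rank": 1,
     "AcquisitionMethod": "HCDOT"},
    {"Modifications": "M2(Oxidation)", "Confidence": "High", "Rank": 1,
     "AcquisitionMethod": "HCDOT"},
    {"Modifications": "T5(Phospho)", "Confidence": "Medium", "Rank": 1,
     "AcquisitionMethod": "HCDOT"},
    {"Modifications": "S3(Phospho)", "Confidence": "High", "Rank": 2,
     "AcquisitionMethod": "HCDOT"},
    {"Modifications": "S3(Phospho)", "Confidence": "High", "Rank": 1,
     "AcquisitionMethod": "CIDIT"},
]
filtered = [row for row in psms if keep_psm(row)]
```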
For the localisation scores of the search engines, I use the column “Delta score”. There is one score per PSM so if the PSM has several phosphorylations, I consider that they all have the same score. To find the positions of the phosphorylations I use the field “Annotated sequence”.
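As an illustration of this step, the sketch below reads the phosphorylation positions from an annotated sequence and gives every site the PSM-level score. It rests on two assumptions that should be checked against the real export: modified residues are written in lowercase, and flanking residues appear in brackets separated by dots (for example `[K].AAsPtK.[R]`). Lowercase s, t or y carrying another modification would be miscounted by this simple version.

```python
import re


def phospho_positions(annotated_sequence):
    """1-based positions of lowercase s/t/y in the peptide (assumed format)."""
    # Strip optional flanking residues such as "[K]." and ".[R]".
    core = re.sub(r"^\[[^\]]*\]\.|\.\[[^\]]*\]$", "", annotated_sequence)
    return [i for i, residue in enumerate(core, start=1) if residue in "sty"]


def site_scores(annotated_sequence, delta_score):
    """Assign the single PSM-level score to every phosphorylated position."""
    return {pos: delta_score for pos in phospho_positions(annotated_sequence)}


# Made-up PSM with two phosphorylations sharing one score.
scores = site_scores("[K].AAsPtK.[R]", 0.8)
```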
I parse the localisation scores from the ptmRS algorithm. In Proteome Discoverer, the search engine localisation scoring is not returned.
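A possible way to parse the ptmRS field is shown below. The string format used here (`S3(Phospho): 99.5; T5(Phospho): 50.2`) is an assumption about how the site probabilities are exported, and the function name is hypothetical; the pattern would need adapting if the real field looks different.

```python
import re

# Assumed layout: residue letter, position, "(Phospho):", then the probability.
SITE_PATTERN = re.compile(r"([A-Z])(\d+)\(Phospho\):\s*([\d.]+)")


def parse_ptmrs(field):
    """Map each phosphosite position to its ptmRS score (assumed format)."""
    return {
        int(position): float(score)
        for _residue, position, score in SITE_PATTERN.findall(field)
    }


# Made-up field value, for illustration only.
ptmrs_scores = parse_ptmrs("S3(Phospho): 99.5; T5(Phospho): 50.2")
```

Unlike the search engine score, this gives one score per site rather than one per PSM.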
Distribution of the ptmRS scores:
Distribution of the search engine scores:
I apply the localisation score threshold: the score must be above 0.75. The data are not filtered yet; I only indicate whether the localisation score passes the threshold in the field LocalisationsFilter.
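Flagging rather than filtering could look like the sketch below. The `LocalisationScore` column name is hypothetical, the comparison is taken as strictly greater than 0.75, and the scores are assumed to be on the same scale as the threshold.

```python
THRESHOLD = 0.75


def flag_localisation(rows, threshold=THRESHOLD):
    """Add a boolean LocalisationsFilter field without removing any row."""
    for row in rows:
        row["LocalisationsFilter"] = row["LocalisationScore"] > threshold
    return rows


# Made-up scores, for illustration only.
flagged = flag_localisation(
    [
        {"LocalisationScore": 0.9},
        {"LocalisationScore": 0.75},
        {"LocalisationScore": 0.4},
    ]
)
```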
For all the different inputs, I create IDs of the phospho-peptides:
PhosphopeptideID: concatenation of pool, sequence and localisation of the phosphorylation (separated with "_").
PhosphosequenceID: concatenation of pool, sequence and number of phosphorylations on the peptide (separated with "_").
When there are several scores for the phosphorylation localisations, I create one ID for each scoring. I define the scoring as “ptmRS” when it comes from the ptmRS (phosphoRS) algorithm, or “SearchEngine” when it is the default localisation score of the pipeline.
So in the end, the table contains two rows per PSM, one for each localisation scoring scheme.
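The ID construction can be sketched as below. The function and field names, the pool label and the use of ";" to join several positions inside one ID are assumptions made for the example; only the "_" separator and the two scoring labels come from the description above.

```python
def make_ids(pool, sequence, positions):
    """Build the two IDs for one PSM under one scoring scheme."""
    # Assumption: several positions are joined with ";" inside the ID.
    localisation = ";".join(str(p) for p in sorted(positions))
    phosphopeptide_id = "_".join([pool, sequence, localisation])
    phosphosequence_id = "_".join([pool, sequence, str(len(positions))])
    return phosphopeptide_id, phosphosequence_id


def expand_psm(pool, sequence, positions_by_scoring):
    """Return one row per scoring scheme ("ptmRS", "SearchEngine")."""
    rows = []
    for scoring, positions in positions_by_scoring.items():
        peptide_id, sequence_id = make_ids(pool, sequence, positions)
        rows.append(
            {
                "Scoring": scoring,
                "PhosphopeptideID": peptide_id,
                "PhosphosequenceID": sequence_id,
            }
        )
    return rows


# Made-up PSM where the two scorings place the phosphate on different sites.
rows = expand_psm("Pool1", "AASPTK", {"ptmRS": [3], "SearchEngine": [5]})
```

In this example the two rows share the same PhosphosequenceID but differ in PhosphopeptideID, because the two scorings disagree on the site.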