How to compute consensus ORF profiles #178
Description
In the ORF dataset, some genes are targeted by multiple ORF reagents. This means that, unlike in the CRISPR dataset, aggregating all samples with the same Metadata_JCP2022 will not yield a single consensus profile per gene. Computing a consensus profile is important whenever we need a gene-level profile.
I tested six scenarios to identify the ones that perform the best and also make the most biological sense.
s1: aggregate profiles by Metadata_JCP2022.
s2: aggregate profiles by Metadata_NCBI_Gene_ID.
s3: aggregate profiles by Metadata_JCP2022 but use Metadata_NCBI_Gene_ID as pos_diffby in copairs.
s4: aggregate profiles by Metadata_JCP2022 but keep the reagent with the highest mean_average_precision for phenotypic activity.
s5: aggregate profiles by Metadata_JCP2022 but keep a random ORF reagent.
s6: aggregate profiles by Metadata_JCP2022 but keep the reagent with the longest insert size and highest protein matching %. If there are still multiple reagents, randomly choose one.
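To make the difference between s1 and s2 concrete, here is a minimal, dependency-free sketch. It assumes the aggregation is a per-feature median (the issue does not say which statistic is used), and the reagent IDs, gene IDs and feature names are hypothetical toy values. In practice the same grouping would likely be done on a dataframe.

```python
from collections import defaultdict
from statistics import median


def consensus_profiles(rows, group_key, feature_keys):
    """Median-aggregate each feature over all rows sharing the same group_key.

    Assumption: median aggregation; the issue does not name the statistic.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        key: {f: median(r[f] for r in members) for f in feature_keys}
        for key, members in groups.items()
    }


# Hypothetical toy data: gene 1 is targeted by two reagents (JCP_A, JCP_B).
rows = [
    {"Metadata_JCP2022": "JCP_A", "Metadata_NCBI_Gene_ID": 1, "feat_1": 1.0},
    {"Metadata_JCP2022": "JCP_A", "Metadata_NCBI_Gene_ID": 1, "feat_1": 3.0},
    {"Metadata_JCP2022": "JCP_B", "Metadata_NCBI_Gene_ID": 1, "feat_1": 6.0},
    {"Metadata_JCP2022": "JCP_C", "Metadata_NCBI_Gene_ID": 2, "feat_1": 10.0},
]

s1 = consensus_profiles(rows, "Metadata_JCP2022", ["feat_1"])       # one profile per reagent
s2 = consensus_profiles(rows, "Metadata_NCBI_Gene_ID", ["feat_1"])  # one profile per gene
```

With s1, gene 1 still has two profiles (one per reagent), whereas s2 collapses them into a single gene-level profile.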
I compared two sets of ORF profiles: one from the ORF pipeline (ORF) and one from the CRISPR pipeline (ORF-CRISPR pipeline). For each, I retrieved two kinds of gene labels: CORUM and HGNC gene group.
[Figure: Number of labels with a p-value < 0.05]
[Figure: Number of labels with a mAP > 0.2]
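The two summary counts named above could be computed along these lines. This is only a sketch: the field names `p_value` and `mean_average_precision` and the toy rows are hypothetical, and the issue does not say whether the p-values are corrected for multiple testing.

```python
def count_labels(results, p_threshold=0.05, map_threshold=0.2):
    """Count labels passing each threshold independently.

    `results` is a list of per-label dicts (hypothetical schema).
    """
    n_significant = sum(1 for r in results if r["p_value"] < p_threshold)
    n_high_map = sum(1 for r in results if r["mean_average_precision"] > map_threshold)
    return n_significant, n_high_map


# Hypothetical per-label retrieval results for one scenario.
results = [
    {"label": "label_a", "p_value": 0.01, "mean_average_precision": 0.35},
    {"label": "label_b", "p_value": 0.20, "mean_average_precision": 0.25},
    {"label": "label_c", "p_value": 0.04, "mean_average_precision": 0.10},
]

n_significant, n_high_map = count_labels(results)
```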
Though s1 works the best, its performance is partially due to reagents targeting the same gene retrieving each other. Matching reagents that target the same gene is useful for checking whether there is biological signal in the data, but it is not useful for retrieving gene labels.
s2, s4, s5 and s6 have similar performances. s4, s5 and s6 are different ways of removing duplicates, but they don't seem to outperform s2, which is the simplest strategy. Based on these results, I am inclined to go with s2 for computing consensus ORF profiles.
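For reference, an s6-style duplicate removal could look roughly like the sketch below. The field names `insert_length` and `protein_match_pct` are hypothetical, and ranking by insert length first and protein match second is an assumption, since the issue does not specify how the two criteria are combined.

```python
import random
from collections import defaultdict


def pick_reagent_per_gene(reagents, seed=0):
    """Keep one reagent per gene: longest insert, then highest protein
    match %, then a random choice among any remaining ties."""
    rng = random.Random(seed)  # seeded so the random tie-break is reproducible
    by_gene = defaultdict(list)
    for r in reagents:
        by_gene[r["Metadata_NCBI_Gene_ID"]].append(r)

    chosen = {}
    for gene, candidates in by_gene.items():
        def score(c):
            return (c["insert_length"], c["protein_match_pct"])
        best = max(score(c) for c in candidates)
        tied = [c for c in candidates if score(c) == best]
        chosen[gene] = rng.choice(tied)["Metadata_JCP2022"]
    return chosen


# Hypothetical reagent annotations.
reagents = [
    {"Metadata_JCP2022": "JCP_A", "Metadata_NCBI_Gene_ID": 1, "insert_length": 1000, "protein_match_pct": 90},
    {"Metadata_JCP2022": "JCP_B", "Metadata_NCBI_Gene_ID": 1, "insert_length": 1200, "protein_match_pct": 80},
    {"Metadata_JCP2022": "JCP_C", "Metadata_NCBI_Gene_ID": 1, "insert_length": 1200, "protein_match_pct": 95},
    {"Metadata_JCP2022": "JCP_D", "Metadata_NCBI_Gene_ID": 2, "insert_length": 500, "protein_match_pct": 100},
    {"Metadata_JCP2022": "JCP_E", "Metadata_NCBI_Gene_ID": 2, "insert_length": 500, "protein_match_pct": 100},
]

chosen = pick_reagent_per_gene(reagents)
```

Dropping the scoring step and calling `rng.choice(candidates)` directly would give the s5 variant (a random reagent per gene).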
s3 is the most complex of the six. It seems to perform poorly for CORUM labels, while for gene group labels its performance is on par with the others. We may want to do a deep dive at a later point to understand what is causing this.
@shntnu your thoughts?
niranjchandrasekaran commented on May 9, 2024
Though there aren't any major differences between s2, s4, s5 and s6, we decided to go with s2 for ORF profiles. The odd behavior of s3 needs to be investigated. For now, I will close this issue and report back after I investigate s3.