
How to compute consensus ORF profiles #178

Closed
@niranjchandrasekaran

Description

Script, Notebook

In the ORF dataset, some genes are targeted by multiple ORF reagents. This means that, unlike in the CRISPR dataset, aggregating all samples with the same Metadata_JCP2022 does not yield a single consensus profile per gene. Computing the consensus profile is important whenever we need a gene-level profile.

I tested six scenarios to identify the ones that perform the best and also make the most biological sense.

  1. s1: aggregate profiles by Metadata_JCP2022.
  2. s2: aggregate profiles by Metadata_NCBI_Gene_ID.
  3. s3: aggregate profiles by Metadata_JCP2022 but use Metadata_NCBI_Gene_ID as pos_diffby in copairs.
  4. s4: aggregate profiles by Metadata_JCP2022 but keep the reagent with the highest mean_average_precision for phenotypic activity.
  5. s5: aggregate profiles by Metadata_JCP2022 but keep a random ORF reagent.
  6. s6: aggregate profiles by Metadata_JCP2022 but keep the reagent with the longest insert size and the highest protein match percentage. If multiple reagents still remain, randomly choose one of them.
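The difference between s1 and s2 can be sketched with pandas. This is a minimal illustration, not the code from the linked script or notebook: it assumes that every column without the `Metadata_` prefix is a feature, and it uses the median as the aggregation function, which is an assumption. The JCP2022 IDs, gene IDs, and feature values are made up.

```python
import pandas as pd


def consensus_profiles(profiles: pd.DataFrame, by: str) -> pd.DataFrame:
    """Aggregate replicate profiles into one row per value of `by`.

    Assumption: every column that does not start with "Metadata_" is a feature,
    and the median is used as the aggregation function.
    """
    feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]
    return profiles.groupby(by, as_index=False)[feature_cols].median()


# Toy data: two reagents (made-up IDs) target gene "1", one targets gene "2".
profiles = pd.DataFrame(
    {
        "Metadata_JCP2022": ["JCP2022_900001", "JCP2022_900001", "JCP2022_900002", "JCP2022_900003"],
        "Metadata_NCBI_Gene_ID": ["1", "1", "1", "2"],
        "feat_a": [1.0, 3.0, 5.0, 10.0],
        "feat_b": [0.0, 2.0, 4.0, 6.0],
    }
)

s1 = consensus_profiles(profiles, by="Metadata_JCP2022")       # one row per reagent
s2 = consensus_profiles(profiles, by="Metadata_NCBI_Gene_ID")  # one row per gene
print(len(s1), len(s2))
```

With s1, gene "1" is still represented by two profiles (one per reagent); with s2, all samples for that gene collapse into a single profile.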

I compared two sets of ORF profiles, one from the ORF pipeline (ORF) and one from the CRISPR pipeline (ORF-CRISPR pipeline), and evaluated retrieval of two kinds of gene labels: CORUM and HGNC gene group.

Number of labels with a p-value < 0.05

[Figure: consensus-profile-comparison-pvalue]

Number of labels with a mAP > 0.2

[Figure: consensus-profile-comparison-map]
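The two counts reported above amount to thresholding a per-label results table. A minimal sketch, assuming a table with one row per label and columns named `mean_average_precision` and `p_value` (the `p_value` column name, the label names, and all values are hypothetical):

```python
import pandas as pd

# Hypothetical per-label retrieval results; values are made up for illustration.
results = pd.DataFrame(
    {
        "label": ["complex_1", "complex_2", "complex_3", "complex_4"],
        "mean_average_precision": [0.35, 0.10, 0.25, 0.05],
        "p_value": [0.01, 0.20, 0.04, 0.03],
    }
)

# Count labels passing each threshold used in the comparison.
n_significant = int((results["p_value"] < 0.05).sum())
n_high_map = int((results["mean_average_precision"] > 0.2).sum())
print(n_significant, n_high_map)
```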

Though s1 performs best, its performance is partly due to reagents that target the same gene retrieving each other. Such matches are useful for showing that there is biological signal in the data, but not for retrieving gene labels.

s2, s4, s5 and s6 have similar performances. s4, s5 and s6 are different ways of removing duplicates, but they don't seem to outperform s2, which is the simplest strategy. Based on these results, I am inclined to go with s2 for computing consensus ORF profiles.
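For comparison, the de-duplication in s5 and s6 could look roughly like the sketch below. The column names `insert_length` and `prot_match` and the reagent IDs are hypothetical, and ranking first by insert length and then by protein match is one possible reading of s6, not necessarily the one used in the notebook.

```python
import pandas as pd

GENE = "Metadata_NCBI_Gene_ID"


def keep_random_reagent(reagents: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """s5-style: keep one randomly chosen reagent per gene."""
    shuffled = reagents.sample(frac=1, random_state=seed)
    return shuffled.drop_duplicates(GENE).sort_values(GENE).reset_index(drop=True)


def keep_best_reagent(reagents: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """s6-style: prefer the longest insert, then the highest protein match.

    Shuffling before sorting is intended to make any remaining ties
    break randomly rather than by input order.
    """
    shuffled = reagents.sample(frac=1, random_state=seed)
    ranked = shuffled.sort_values(["insert_length", "prot_match"], ascending=False)
    return ranked.drop_duplicates(GENE).sort_values(GENE).reset_index(drop=True)


# Toy reagent table (made-up IDs and values): three reagents target gene "1".
reagents = pd.DataFrame(
    {
        "Metadata_JCP2022": ["A", "B", "C", "D"],
        GENE: ["1", "1", "1", "2"],
        "insert_length": [100, 200, 200, 150],
        "prot_match": [90.0, 80.0, 95.0, 99.0],
    }
)

s5 = keep_random_reagent(reagents)
s6 = keep_best_reagent(reagents)
print(s6["Metadata_JCP2022"].tolist())
```

The selected reagents would then be used to filter the Metadata_JCP2022-level consensus profiles. s2 avoids this selection step entirely, which is part of why it is the simplest of the four.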

s3 is the most complex of them all. It seems to perform poorly for CORUM labels, while for gene group labels its performance is on par with the others. We may want to do a deep dive at a later point to understand what's causing this.

@shntnu your thoughts?

Activity

niranjchandrasekaran commented on May 9, 2024

Though there aren't any major differences between s2, s4, s5 and s6, we decided to go with s2 for ORF profiles. The odd behavior of s3 needs to be investigated. For now, I will close this issue and report back after I investigate s3.

moved this from In Progress to Done in Morphmap-ORF on May 9, 2024
afermg

Labels

Experiments (Tracking experimental questions, results, or analysis), good first issue (Good for newcomers)

How to compute consensus ORF profiles · Issue #178 · jump-cellpainting/morphmap