In this article we will cover functions integrated in gnomeR
to analyze the outputed processed data of the binmat()
function. We include in gnomeR
the following functions for analysis:
gen.summary()
which test for association between either categorical or continous outcomes and the set of genomic features from the binmat()
output matrix. Returns a table with the distribution of the genomic features along the outcome specified, using adapted tests and adjustements for multiple testing to report significance. We further summarize the results using adequate plots.
uni.cox()
which performs univariate assocition analysis between a time-dependent outcome and the set of genomic features from the binmat()
output matrix. We perform this using Cox’s proportional hazard models with multiple test adjustements and summary plots.
In this first section we focus on the gen.summary()
function, it takes the following arguments:
gen.dat
a genomic dataframe outputed from binmat()
.outcome
a leveled vector of length equal to the number of rows in gen.dat
.filter
a numeric value between 0 and 1 (1 not included) that is the lower bound for the proportion of patients having a genetic event. All features with an event rate lower than that value will be removed. Default is 0 (all features included).paired
Boolean if the data are paired. Default is FALSE.cont
Should the outcome be treated as a continuous value. Default is FALSE treated as categorical.rank
Should the table returned be ordered by Pvalue. Boolean, default is TRUE.In the following analyses we will be using the following subset of samples:
set.seed(123)
samples <- as.character(unique(mut$Tumor_Sample_Barcode))[sample(1:length(unique(mut$Tumor_Sample_Barcode)), 100, replace=FALSE)]
df <- binmat(patients = samples,maf = mut, fusion = fusion, cna = cna, cna.binary = FALSE)
For this example we will be using the recorded sample type of the samples we randomly selected above. Thus looking for differences between primary and metastatic samples:
outcome <- as.character(clin.sample$Sample.Type[match(samples,clin.sample$Sample.Identifier)])
test <- gen.summary(gen.dat = df,
outcome = outcome,
filter = 0.05,paired = FALSE,
cont = FALSE,rank = TRUE)
Using the Fisher’s exact test the gen.summary()
function returns the following:
fits
a summary table of the distribution of features across outcomes and the test performed
kable(test$fits[1:10,])
Feature | Overall | Metastasis(N=55) | Primary(N=45) | OddsRatio | Pvalue | FDR | Lower | Upper | |
---|---|---|---|---|---|---|---|---|---|
FGFR1.cna | FGFR1.cna | 12% | 21.82% | 0% | 0 | 3.12e-02 | 8.19e-01 | 0 | 0.99 |
MLL2 | MLL2 | 5% | 9.09% | 0% | 0 | 6.24e-02 | 8.19e-01 | 0 | 1.29 |
CDKN2B.cna | CDKN2B.cna | -12% | -3.64% | -22.22% | 0.15 | 8.75e-02 | 8.19e-01 | 0 | 1.42 |
CREBBP | CREBBP | 7% | 10.91% | 2.22% | 0.19 | 1.25e-01 | 8.19e-01 | 0 | 1.64 |
ERBB2.cna | ERBB2.cna | 10% | 3.64% | 17.78% | 5.19 | 1.71e-01 | 8.19e-01 | 0.49 | 263.77 |
MLL3 | MLL3 | 6% | 9.09% | 2.22% | 0.23 | 2.19e-01 | 8.19e-01 | 0 | 2.17 |
CDH1 | CDH1 | 6% | 9.09% | 2.22% | 0.23 | 2.19e-01 | 8.19e-01 | 0 | 2.17 |
MYC.cna | MYC.cna | 12% | 18.18% | 4.44% | 0.23 | 2.19e-01 | 8.19e-01 | 0 | 2.17 |
EGFR | EGFR | 9% | 5.45% | 13.33% | 2.64 | 2.92e-01 | 8.19e-01 | 0.52 | 17.34 |
KEAP1 | KEAP1 | 5% | 7.27% | 2.22% | 0.29 | 3.75e-01 | 8.19e-01 | 0.01 | 3.1 |
forest.plot
a forest plot of the most significant features
test$forest.plot
vPlot
a volcano plot summarising the odds ratios and pvalues of the the features included in the test
test$vPlot
In this section we show to use the gen.summary()
function with a continuous outcome. Given the lack of continuous variables in the IMPACT clinical set we show an example here with a simulate continuous gaussian outcome. We perform this using linear regression.
set.seed(1)
outcome <- rnorm(n = nrow(df))
tab.out <- gen.summary(gen.dat = df,
outcome = outcome,
filter = 0.05,paired = FALSE,
cont = TRUE,rank = TRUE)
This returns the following:
fits
a summary of the univariate fits that were performed
kable(tab.out$fits[1:10,])
Estimate | SD | Pvalue | MutationFreq | FDR | |
---|---|---|---|---|---|
MCL1.cna | -0.34 | 0.19 | 0.0755 | 0.12 | 9.57e-01 |
EPHA5 | 0.71 | 0.41 | 0.0830 | 0.05 | 9.57e-01 |
AR | 0.70 | 0.41 | 0.0916 | 0.05 | 9.57e-01 |
PTPRT | 0.61 | 0.38 | 0.1080 | 0.06 | 9.57e-01 |
RB1 | -0.53 | 0.38 | 0.1590 | 0.06 | 9.57e-01 |
TSC1 | 0.44 | 0.41 | 0.2900 | 0.05 | 9.57e-01 |
TP53 | 0.19 | 0.19 | 0.3080 | 0.38 | 9.57e-01 |
DOT1L | -0.36 | 0.41 | 0.3890 | 0.05 | 9.57e-01 |
STK11 | 0.28 | 0.35 | 0.4240 | 0.07 | 9.57e-01 |
CDKN2B.cna | 0.14 | 0.19 | 0.4540 | -0.12 | 9.57e-01 |
vPlot
a volcano plot summarising the coefficient and pvalues of the the features included in the test
tab.out$vPlot
The univariate survival analysis can be performed using the uni.cox()
function, it takes the following arguments:
X
a matrix/surv.datframe of genomic features, continuous or binary (note cannot handle categorical surv.dat for the moment).surv.dat
a survival dataframe containing the survival information. This can be made of 2 or 3 columns. 1 or 2 for time, and one for status (where 1 is event and 0 is no event).surv.formula
a survival formula with names matching those in surv.dat eg: Surv(time,status)~.filter
a numeric value between 0 and 1 (1 not included) that is the lower bound for the proportion of patients having a genetic event. All features with an event rate lower than that value will be removed. Default is 0 (all features included).genes
a character vector of gene names that will be the only ones to be kept. Default is NULL, all genes are used.Here we show an example using the survival data included in the IMPACT clinical dataset:
surv.dat <- clin.patients %>%
filter(X.Patient.Identifier %in%
abbreviate(samples,strict = TRUE, minlength = 9)) %>%
select(X.Patient.Identifier,Overall.Survival..Months.,
Overall.Survival.Status) %>%
rename(DMPID = X.Patient.Identifier,
time = Overall.Survival..Months.,
status = Overall.Survival.Status) %>%
mutate(time = as.numeric(as.character(time)),
status = ifelse(status == "LIVING",0,1)) %>%
filter(!is.na(time))
X <- df[match(surv.dat$DMPID,
abbreviate(rownames(df),strict = TRUE, minlength = 9)),]
test <- uni.cox(X = X, surv.dat = surv.dat,
surv.formula = Surv(time,status)~.,filter = 0.1)
This returns the following:
tab
a summary of the univariate fits that were performed
kable(test$tab[1:5,])
Feature | Coefficient | HR | Pvalue | FDR | MutationFrequency | |
---|---|---|---|---|---|---|
TERT | TERT | 0.89 | 2.44 | 0.059 | 0.354 | 0.14 |
MLL | MLL | 0.70 | 2.01 | 0.203 | 0.458 | 0.11 |
TP53 | TP53 | 0.49 | 1.63 | 0.229 | 0.458 | 0.36 |
PIK3CA | PIK3CA | 0.38 | 1.46 | 0.447 | 0.671 | 0.14 |
KRAS | KRAS | -0.42 | 0.66 | 0.568 | 0.682 | 0.14 |
p
an interactive volcano plot summarizing the hazard ratio for each ratio as a function of it’s p-value
test$p
KM
a list of Kaplan-Meier plots for the 10 most significant features
test$KM[[1]]