Introduction

This is short primer on the use of medianBootstrap for the analysis of cell-to-cell spread data, using an example from Cheval et al. (2020) under CC BY 4.0.

Currently, most of the literature uses a Mann-Whitney-Wilcox test to analyse GFP cell-to-cell spread data. Here, we propose a medianBootstrap method, which accounts for the varying shape and variance between control and treatment/genotype conditions.

Example Use

## Load functions from medianBootstrap.R

# source(medianBootstrap.R)
# or

mcp_ci <- function(success, trials, alpha){
  ## Copyright (C) 2001 Frank E Harrell Jr
  ## Modified by Matthew G Johnston 2020 from binconf in the Hmisc package
  ## Distributed under GNU General Public License v2 (or later) without any warranty. See the GNU General Public License for more details.
  zcrit <-  - qnorm(alpha/2)
  z2 <- zcrit * zcrit
  mc_p <- success/trials
  cl <- (mc_p + z2/2/trials + c(-1, 1) * zcrit *
           sqrt((mc_p * (1 - mc_p) + z2/4/trials)/trials))/(1 + z2/trials)
  if(success == 1)
    cl[1] <-  - log(1 - alpha)/trials
  if(success == (trials - 1))
    cl[2] <- 1 + log(1 - alpha)/trials
  return(cbind(mc_p, lower_ci=cl[1], upper_ci=cl[2]))
}

medianBootstrap<- function(data1, data2, N=5000, alpha=0.05){
  ## Calculate observed test statistic
  mediandiff<-median(data1)-median(data2)
  ## Generate the null distribution
  boots<-replicate(N, median(sample(data1,length(data1), replace=T))-median(sample(data2,length(data2),  replace=T))-mediandiff)
  ## Count the number of at resampled observations which are at least as extreme
  above <- sum(abs(boots)>=abs(mediandiff))
  ## Calculate p value and confidence intervals
  mcp<-mcp_ci(above+1,N+1, alpha)
  return(mcp)
}

## Load example data from Cheval et al. (2020)
library(readxl)
ChevalCount <- read_excel("RawData.xlsx",
                          sheet = 1, col_names = T)

## Extract the relevant data into two vectors and plot them
Mock <- ChevalCount[ChevalCount$treatment=="Mock",]$movement
Chitin <- ChevalCount[ChevalCount$treatment=="Chitin",]$movement
par(mfrow=c(1,2))
hist(Mock)
hist(Chitin)

## Run the statistical test
medianBootstrap(Mock,Chitin, N=5000, alpha=0.05)
##            mc_p     lower_ci    upper_ci
## [1,] 0.00019996 1.025661e-05 0.001131863

More accurate confidence intervals

We can see here, that the two distributions are remarkably different. By using further bootstraps, we can refine the Monte Carlo p value and reduce the confidence intervals.

## Run the statistical test with more bootstraps
medianBootstrap(Mock,Chitin, N=25000, alpha=0.05)
##             mc_p    lower_ci     upper_ci
## [1,] 3.99984e-05 2.05165e-06 0.0002265524

Alternative confidence intervals

Instead of using Wilson (1927) confidence intervals, a bootstrap 95% confidence interval can be constructed:

## Generate bootstraped 95% confidence intervals

medianBootstrap_once<- function(data1, data2, N=5000, alpha=0.05){
  ## Calculate observed test statistic
  mediandiff<-median(data1)-median(data2)
  ## Generate the null distribution
  boots<-replicate(N, median(sample(data1,length(data1), replace=T))-median(sample(data2,length(data2),  replace=T))-mediandiff)
  ## Count the number of at resampled observations which are at least as extreme
  above <- sum(abs(boots)>=abs(mediandiff))
  ## Calculate p value
  mcp<-(above+1)/(N+1)
  return(mcp)
}

pvalue<-medianBootstrap_once(Mock,Chitin, N=5000, alpha=0.05)
interval<-replicate(100,medianBootstrap_once(Mock,Chitin, N=5000, alpha=0.05))

paste("p =", format(round(pvalue,5),nsmall=5), ", 95% CI = [", format(round(quantile(interval, c(0.025)),5),nsmall=5),",",format(round(quantile(interval, c(0.975)),5),nsmall=5),"]")
## [1] "p = 2e-04 , 95% CI = [ 2e-04 , 4e-04 ]"

Alternatively, Clopper-Pearson or Agresti–Coull could be used. The Hmisc package provides functions for this.

Plotting the null distribution

You may be interested to also view the null distribution generated by the medianBootstrap script. I provide an example function for this here.

library(ggplot2)
medianBootstrap_plot<- function(data1, data2, N=5000, alpha=0.05){
  ## Calculate observed test statistic
  mediandiff<-median(data1)-median(data2)
  ## Generate the null distribution
  boots<-replicate(N, median(sample(data1,length(data1), replace=T))-median(sample(data2,length(data2),  replace=T))-mediandiff)
  ## Count the number of at resampled observations which are at least as extreme
  above <- sum(abs(boots)>=abs(mediandiff))
  ## Calculate p value and confidence intervals
  mcp<-format(round(mcp_ci(above+1,N+1, alpha),3),nsmall=3)
  ## Plot graph
  labeltext1<- substitute(atop(paste(hat(italic("p")))~phantom()==phantom()~AAA,"95% CI"~BBB~CCC), list(AAA=mcp[1], BBB =paste0("[",mcp[2],", "), CCC= paste0(mcp[3],"]")))
  plot<-ggplot(data.frame(median=boots), aes(x=abs(median)))+
    geom_histogram(alpha=.7, binwidth = 1, aes(y=..density..))+
    geom_vline(xintercept = abs(mediandiff), linetype=2, colour = "red4")+
    theme_bw()+
    xlab(expression(atop("|"~hat(paste(theta, "*"))~-~hat(theta)~"|",theta==Delta[median])))+
    ylab("Density")+
    annotate("text",x = Inf, y = Inf, hjust = 1.1, vjust = 1.1,label=labeltext1)+
    theme(text=element_text(size=15))
  return(plot)
}

medianBootstrap_plot(Mock,Chitin, N=5000, alpha=0.05)

Extending the method to multiple comparisons

It is rare that we want to compare a control to only one treatment. Therefore, a method to control family-wise error rate is required. The below function will compare an unlimited number of vectors to the first and then perform a Holm correction (Holm, 1797).

medianBootstraps<- function(..., N=5000, alpha=0.05){
results<-NULL
x<-list(...)
reference<-x[[1]]
  for(i in 2:(length(x))){
  ## Calculate observed test statistic
  mediandiff<-median(reference)-median(x[[i]])
  ## Generate the null distribution
  boots<-replicate(N, median(sample(reference,length(reference), replace=T))-median(sample(x[[i]],length(x[[i]]), replace=T))-mediandiff)
  ## Count the number of at resampled observations which are at least as extreme
  above <- sum(abs(boots)>=abs(mediandiff))
  ## Calculate p value and confidence intervals
  mcp<-mcp_ci(above+1,N+1, alpha)
  results<-rbind(results, mcp)
}
## Calculate observed test statistics
return(cbind(pvaladj=p.adjust(results[,1]),pvaladj_lower=p.adjust(results[,2]),pvaladj_upper=p.adjust(results[,3])))
}

medianBootstraps(Mock,Chitin[1:100],Chitin[101:200],Chitin[201:300], N=5000, alpha=0.05)
##         pvaladj pvaladj_lower pvaladj_upper
## [1,] 0.00799840  5.879505e-03   0.010872566
## [2,] 0.00279944  1.356394e-03   0.005773286
## [3,] 0.00059988  3.076982e-05   0.003395589

Extending the method to the mean

meanBootstrap<- function(data1, data2, N=5000, alpha=0.05){
  ## Calculate observed test statistic
  meandiff<-mean(data1)-mean(data2)
  ## Generate the null distribution
  boots<-replicate(N, mean(sample(data1,length(data1), replace=T))-mean(sample(data2,length(data2),  replace=T))-meandiff)
  ## Count the number of at resampled observations which are at least as extreme
  above <- sum(abs(boots)>=abs(meandiff))
  ## Calculate p value and confidence intervals
  mcp<-mcp_ci(above+1,N+1, alpha)
  return(mcp)
}
meanBootstrap(Mock,Chitin, N=5000, alpha=0.05)
##            mc_p     lower_ci    upper_ci
## [1,] 0.00019996 1.025661e-05 0.001131863

Why do we subtract the observed median from the resamples?

To generate the null distribution, the function subtracts the observed median difference from the resampled median difference. This is done so as to correct the data so that the statistical test is conducted with H0 being true (that there is no difference in medians).

This a long hand way of doing this, where the datasets are recentred first to have their medians made the same (e.g. around 0). This is done below and compared to the method presented in the letter.

medianBootstrap_long_plot<- function(data1, data2, N=5000, alpha=0.05){
  ## Calculate observed test statistic
  mediandiff<-median(data1)-median(data2)
  
  ## Make H0 true (make the median of both vectors 0)
  new_data1 <- data1-median(data1)
  new_data2 <- data2-median(data2)
  
  ## Generate the null distribution
  boots<-replicate(N, median(sample(new_data1,length(new_data1), replace=T))-median(sample(new_data2,length(new_data2), replace=T)))
  ## Count the number of resampled observations which are at least as extreme
  above <- sum(abs(boots)>=abs(mediandiff))
  ## Calculate p value and confidence intervals
  mcp<-format(round(mcp_ci(above+1,N+1, alpha),3),nsmall=3)
  ## Plot the graph
  labeltext1<- substitute(atop(paste(hat(italic("p")))~phantom()==phantom()~AAA,"95% CI"~BBB~CCC), list(AAA=mcp[1], BBB =paste0("[",mcp[2],", "), CCC= paste0(mcp[3],"]")))
  plot<-ggplot(data.frame(median=boots), aes(x=abs(median)))+
    geom_histogram(alpha=.7, binwidth = 1, aes(y=..density..))+
    geom_vline(xintercept = abs(mediandiff), linetype=2, colour = "red4")+
    theme_bw()+
    xlab(expression(atop("|"~hat(paste(theta[centred], "*"))~"|",theta==Delta[median])))+
    ylab("Density")+
    annotate("text",x = Inf, y = Inf, hjust = 1.1, vjust = 1.1,label=labeltext1)+
    theme(text=element_text(size=15))
  return(plot)
}

set.seed(123)
original<-medianBootstrap_plot(Mock,Chitin, N=5000, alpha=0.05)+ggtitle("Subtract observed diff")
set.seed(123)
recentre<-medianBootstrap_long_plot(Mock,Chitin, N=5000, alpha=0.05)+ggtitle("Recentre first")

library(gridExtra)
grid.arrange(original,recentre, ncol=2)

References

C. Cheval et al., Chitin perception in plasmodesmata characterises submembrane immune signalling specificity in plants. PNAS, 117(17):9621–29, 2020.

S. Holm, A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.

E.B. Wilson, Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association 22: 209–12, 1927.

Session Information

devtools::session_info()
## - Session info ---------------------------------------------------------------
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       Windows 10 x64              
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United Kingdom.1252 
##  ctype    English_United Kingdom.1252 
##  tz       Europe/London               
##  date     2020-12-07                  
## 
## - Packages -------------------------------------------------------------------
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
##  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.3)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.0.2)
##  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.3)
##  colorspace    2.0-0   2020-11-11 [1] CRAN (R 4.0.3)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
##  devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
##  dplyr         1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.2)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.2)
##  farver        2.0.3   2020-01-16 [1] CRAN (R 4.0.2)
##  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
##  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.3)
##  ggplot2     * 3.3.2   2020-06-19 [1] CRAN (R 4.0.2)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  gridExtra   * 2.3     2017-09-09 [1] CRAN (R 4.0.2)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.2)
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.2)
##  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
##  labeling      0.4.2   2020-10-20 [1] CRAN (R 4.0.3)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.2)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.3)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.2)
##  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.3)
##  pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
##  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
##  processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.3)
##  ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
##  Rcpp          1.0.5   2020-07-06 [1] CRAN (R 4.0.2)
##  readxl      * 1.3.1   2019-03-13 [1] CRAN (R 4.0.2)
##  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.3)
##  rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
##  rmarkdown     2.5     2020-10-21 [1] CRAN (R 4.0.3)
##  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.3)
##  scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.2)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.3)
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
##  testthat      3.0.0   2020-10-31 [1] CRAN (R 4.0.3)
##  tibble        3.0.3   2020-07-10 [1] CRAN (R 4.0.2)
##  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
##  usethis       1.6.3   2020-09-17 [1] CRAN (R 4.0.3)
##  vctrs         0.3.4   2020-08-29 [1] CRAN (R 4.0.2)
##  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
##  xfun          0.19    2020-10-30 [1] CRAN (R 4.0.3)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
## 
## [1] C:/Users/mattj/Documents/R/win-library/4.0
## [2] C:/Program Files/R/R-4.0.2/library