ComBat Family Overview

Andrew Chen

2024-07-30

The ComBat Family extends the original ComBat methodology to enable flexible covariate modeling, leveraging efficient R implementations of regression models. A method that belongs in the ComBat Family satisfies the following conditions:

  1. Modeling of covariate effects in location and/or scale
  2. Batch effects in location and scale of measurements
  3. Empirical Bayes step for borrowing information across features

Denote data \(y_{ijv}\) and covariates \(\boldsymbol{x}_{ij}\) for sites \(i = 1,2,\ldots,M\), subjects within site \(j = 1,2,\ldots,n_i\), and features \(v = 1,2,\ldots,p\). The ComBat Family assumes

\[y_{ijv} \sim D(\mu_{ijv} + \gamma_{iv}, \delta_{iv}\sigma_{ijv})\] where \(D\) is a specified distribution (typically normal), \(\gamma_{iv}\) is the additive location batch effect and \(\delta_{iv}\) is the multiplicative scale batch effect. The mean and variance of \(D\) are modeled via

\[g_1(\mu_{ijv}) = f_v(\boldsymbol{x}_{ij}) \qquad g_2(\sigma_{ijv}) = h_v(\boldsymbol{x}_{ij})\]

where \(g_1, g_2\) are link functions and \(f_v, h_v\) are specified functions.

ComBat

ComBat Family defaults to the original linear model used in ComBat, using formula to specify covariates included in the model.

suppressPackageStartupMessages({
  library(neuroCombat)
  library(ComBatFamily)
})

# generate toy dataset
set.seed(8888)
n <- 20
p <- 5
bat <- as.factor(c(rep("a", n/2), rep("b", n/2)))
q <- 2
covar <- matrix(rnorm(n*q), n, q)
colnames(covar) <- paste0("x", 1:q)
data <- data.frame(matrix(rnorm(n*p), n, p))

cf <- comfam(data, bat, covar, lm, formula = y ~ x1 + x2)
c <- neuroCombat(t(data), bat, covar, verbose = FALSE)

max(cf$dat.combat - t(c$dat.combat))
#> [1] 1.332268e-15
# no covariates specified
cf <- comfam(data, bat)
c <- neuroCombat(t(data), bat, verbose = FALSE)

max(cf$dat.combat - t(c$dat.combat))
#> [1] 4.440892e-16

ComBat-GAM

ComBat-GAM (Pomponio et al., 2020) was previously only available in Python. ComBat Family enables fitting of a general additive model (GAM) using notation consistent with the gam function.

suppressPackageStartupMessages(
  library(mgcv)
)

cg <- combat_gam(data, bat, covar, formula = y ~ s(x1) + x2)
# # alternate syntax
# cg <- comfam(data, bat, covar, gam, formula = y ~ s(x1) + x2)

ComBatLS

ComBatLS (Gardner et al.) is recent location- and scale-preserving method for multi-site image harmonization. ComBatLS models covariate effects in the location and scale of measurements via GAMLSS. ComBat-LS should be considered in settings with potential covariate effects on scale (e.g. age and sex affecting the variance of imaging features).

suppressPackageStartupMessages(
  library(gamlss)
)

cgl <- combatls(data, bat, covar, y ~ x1 + x2,
                sigma.formula = ~ x1 + x2, 
                control = gamlss.control(trace = FALSE))
# # alternate syntax
# cgl <- comfam(data, bat, covar, gamlss, y ~ x1 + x2,
#               sigma.formula = ~ x1 + x2, 
#               control = gamlss.control(trace = FALSE))

Longitudinal ComBat

The ComBat family includes Longitudinal ComBat (Beer et al. 2020), previously implemented here. Currently, comfam uses the mean squared residual (MSR) estimator for error variance, but the REML estimator will be implemented at a later date.

# devtools::install_github("jcbeer/longCombat")
suppressPackageStartupMessages({
  library(longCombat)
  library(lme4)
})

# generate toy dataset
n <- 20
p <- 5
r <- 2 # repeats
bat <- as.factor(rep(c(rep("a", n/2), rep("b", n/2)), r))
q <- 2
covar <- matrix(rnorm(n*r*q), n*r, q)
colnames(covar) <- paste0("x", 1:q)
covar <- cbind(covar, ID = rep(1:n, r))
data <- data.frame(matrix(rnorm(n*r*p), n*r, p))

cf <- long_combat(data, bat, covar, y ~ x1 + x2 + (1 | ID))
# # alternate syntax
# cf <- comfam(data, bat, covar, lmer, y ~ x1 + x2 + (1 | ID))

c <- longCombat("ID", "x1", "bat", 1:p, "x1 + x2", "(1 | ID)",
                data = cbind(data, covar, bat), method = "MSR", verbose = FALSE)

max(cf$dat.combat - c$data_combat[,-(1:3)])
#> [1] 4.171635e-09

References

Beer, J. C., Tustison, N. J., Cook, P. A., Davatzikos, C., Sheline, Y. I., Shinohara, R. T., & Linn, K. A. (2020). Longitudinal ComBat: A method for harmonizing longitudinal multi-scanner imaging data. NeuroImage, 220, 117129. https://doi.org/10.1016/j.neuroimage.2020.117129

Johnson, W. E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1), 118–127. https://doi.org/10.1093/biostatistics/kxj037

Gardner, M., Shinohara, R. T., Bethlehem, R. A. I., Romero-Garcia, R., Warrier, V., Dorfschmidt, L., Shanmugan, S., Seidlitz, J., Alexander-Bloch, A., & Chen, A. A. (2024). ComBatLS: A location- and scale-preserving method for multi-site image harmonization. bioRxiv, 2024.06.21.599875. https://doi.org/10.1101/2024.06.21.599875

Pomponio, R., Erus, G., Habes, M., Doshi, J., Srinivasan, D., Mamourian, E., Bashyam, V., Nasrallah, I. M., Satterthwaite, T. D., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Wolf, D. H., … Davatzikos, C. (2020). Harmonization of large MRI datasets for the analysis of brain imaging patterns throughout the lifespan. NeuroImage, 208, 116450. https://doi.org/10.1016/j.neuroimage.2019.116450