This document presents some simple benchmarks for various choices of Super Learner implementation, wrapper functions, and parallelization schemes. The purpose of this document is two-fold: to compare the legacy SuperLearner implementation against the newer sl3 implementation, and to compare the parallelization schemes available for each. First, we construct a benchmark task by resampling the cpp_imputed data up to n = 10,000 observations:
n <- 1e4
data(cpp_imputed)
cpp_big <- cpp_imputed[sample(nrow(cpp_imputed), n, replace = TRUE), ]
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
outcome <- "haz"
task <- sl3_Task$new(cpp_big, covariates = covars, outcome = outcome,
                     outcome_type = "continuous")
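The parallel plans below reference cpus_physical and cpus_logical, which are set outside this excerpt. A minimal sketch of how they might be defined, using parallel::detectCores(), is:

```r
library(parallel)

# Number of hardware threads (logical cores) and physical cores on this machine.
cpus_logical <- detectCores(logical = TRUE)
cpus_physical <- detectCores(logical = FALSE)

# detectCores(logical = FALSE) can return NA on some platforms;
# fall back to the logical count in that case.
if (is.na(cpus_physical)) cpus_physical <- cpus_logical
```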
SuperLearner
The legacy SuperLearner package serves as a suitable baseline. We can fit it sequentially (no parallelization):
time_SuperLearner_sequential <- system.time({
  SuperLearner(task$Y, as.data.frame(task$X), newX = NULL, family = gaussian(),
               SL.library = c("SL.glmnet", "SL.randomForest", "SL.speedglm"),
               method = "method.NNLS", id = NULL, verbose = FALSE,
               control = list(), cvControl = list(), obsWeights = NULL,
               env = parent.frame())
})
We can also fit it using multicore parallelization, via the mcSuperLearner function.
options(mc.cores = cpus_physical)
time_SuperLearner_multicore <- system.time({
  mcSuperLearner(task$Y, as.data.frame(task$X), newX = NULL,
                 family = gaussian(),
                 SL.library = c("SL.glmnet", "SL.randomForest", "SL.speedglm"),
                 method = "method.NNLS", id = NULL, verbose = FALSE,
                 control = list(), cvControl = list(), obsWeights = NULL,
                 env = parent.frame())
})
The SuperLearner package supports a number of other parallelization schemes, although these weren't tested here.
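One such scheme is cluster-based parallelization via snowSuperLearner(). The sketch below was not benchmarked here; it assumes the same task object and the cpus_physical setting used in the other tests:

```r
library(parallel)

# A hedged sketch (not run in these benchmarks): snowSuperLearner()
# distributes cross-validation work across a socket cluster.
cl <- makeCluster(cpus_physical)
clusterSetRNGStream(cl, 1)  # reproducible parallel RNG streams
fit_snow <- snowSuperLearner(cl, task$Y, as.data.frame(task$X),
                             family = gaussian(),
                             SL.library = c("SL.glmnet", "SL.randomForest",
                                            "SL.speedglm"),
                             method = "method.NNLS")
stopCluster(cl)
```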
sl3 with Legacy SuperLearner Wrappers
To maximize comparability with the legacy implementation, we can use sl3 with the SuperLearner wrappers, so that the actual computation used to train the learners is identical:
sl_glmnet <- Lrnr_pkg_SuperLearner$new("SL.glmnet")
sl_random_forest <- Lrnr_pkg_SuperLearner$new("SL.randomForest")
sl_speedglm <- Lrnr_pkg_SuperLearner$new("SL.speedglm")
nnls_lrnr <- Lrnr_nnls$new()
sl3_legacy <- Lrnr_sl$new(list(sl_random_forest, sl_glmnet, sl_speedglm),
                          nnls_lrnr)
sl3 with Native Learners
We can also use native sl3 learners, which have been rewritten to perform well at large sample sizes:
lrnr_glmnet <- Lrnr_glmnet$new()
random_forest <- Lrnr_randomForest$new()
glm_fast <- Lrnr_glm_fast$new()
nnls_lrnr <- Lrnr_nnls$new()
sl3_native <- Lrnr_sl$new(list(random_forest, lrnr_glmnet, glm_fast), nnls_lrnr)
sl3 Parallelization Options
sl3 uses the delayed package to parallelize training tasks. Delayed, in turn, uses the future package to support a range of parallel back-ends. We test several of these, for both the legacy wrappers and the native learners.
First, sequential evaluation (no parallelization):
plan(sequential)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_sequential <- system.time({
  sched <- Scheduler$new(test, SequentialJob)
  cv_fit <- sched$compute()
})
test <- delayed_learner_train(sl3_native, task)
time_sl3_native_sequential <- system.time({
  sched <- Scheduler$new(test, SequentialJob)
  cv_fit <- sched$compute()
})
Next, multicore parallelization:
plan(multicore, workers = cpus_physical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multicore <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multicore <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
We also test multicore parallelization with hyper-threading, using a number of workers equal to the number of logical, rather than physical, cores:
plan(multicore, workers = cpus_logical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multicore_ht <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_logical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multicore_ht <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_logical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
Finally, we test parallelization using multisession:
plan(multisession, workers = cpus_physical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multisession <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multisession <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})
We can see that using the native learners results in about a 4x speedup relative to the legacy wrappers. This can be at least partially explained by the fact that the legacy SL.randomForest wrapper uses randomForest.formula for continuous data, which in turn calls the model.matrix function, known to be slow on large datasets. Improvements to the legacy wrappers would probably reduce or eliminate this difference.
We can also see that multicore parallelization for the legacy SuperLearner function results in another 4x speedup on this system. Relative to that, the sl3_legacy_multicore test results in almost an additional 2x speedup. This can be explained by the use of delayed parallelization. While mcSuperLearner parallelizes only across the \(V\) cross-validation folds, delayed allows sl3 to parallelize across all training tasks that comprise the Super Learner, a total of \((V+1) \times n_{learners}\) training tasks, where \(n_{learners}\) is the number of learners in the library (here 4) and \(V+1\) is one more than the number of cross-validation folds, accounting for the re-fit to the full data typically implemented in the SuperLearner algorithm. We don't see a substantial difference between the three parallelization schemes for sl3.
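To make the difference in granularity concrete, the following arithmetic compares the number of parallelizable units under each scheme, assuming the common default of \(V = 10\) folds and the 4 training computations counted above:

```r
# Parallelizable units under each scheme, assuming V = 10 folds.
V <- 10
n_learners <- 4  # as counted in the text above

# mcSuperLearner: one parallel unit per cross-validation fold.
units_mc <- V

# delayed/sl3: one unit per learner for each fold plus the full-data refit.
units_delayed <- (V + 1) * n_learners

units_mc       # 10
units_delayed  # 44
```

With more than four times as many independent units, the delayed scheduler can keep workers busy even when individual learners finish at different times.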
These effects appear multiplicative, resulting in the fastest implementation, sl3_native_multicore_ht (sl3 with native learners and hyper-threaded multicore parallelization), being about 32x faster than the slowest, SuperLearner_sequential (legacy SuperLearner without parallelization). This is a dramatic improvement in the time required to run this Super Learner.
Session Information
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.9 (Santiago)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats  graphics  grDevices  utils  datasets  base

other attached packages:
 [1] speedglm_0.3-2           MASS_7.3-49
 [3] randomForest_4.6-12      glmnet_2.0-13
 [5] foreach_1.4.4            Matrix_1.2-12
 [7] scales_0.5.0.9000        stringr_1.3.0
 [9] data.table_1.10.4-3      ggplot2_2.2.1.9000
[11] future_1.7.0             SuperLearner_2.0-23-9000
[13] nnls_1.4                 delayed_0.2.1
[15] sl3_1.0.0                knitr_1.20
[17] nima_0.4.6               fcuk_0.1.21

loaded via a namespace (and not attached):
 [1] stringdist_0.9.4.7  origami_1.0.0       gtools_3.5.0
 [4] purrr_0.2.4         listenv_0.7.0       lattice_0.20-35
 [7] ggthemes_3.4.0      colorspace_1.3-2    htmltools_0.3.6
[10] yaml_2.1.18         rlang_0.2.0.9000    pillar_1.2.1
[13] withr_2.1.1.9000    uuid_0.1-2          ProjectTemplate_0.8
[16] plyr_1.8.4          munsell_0.4.3       gtable_0.2.0
[19] visNetwork_2.0.3    devtools_1.13.5     htmlwidgets_1.0
[22] codetools_0.2-15    evaluate_0.10.1     memoise_1.1.0
[25] rstackdeque_1.1.1   methods_3.4.4       Rcpp_0.12.16
[28] backports_1.1.2     checkmate_1.8.5     jsonlite_1.5
[31] abind_1.4-5         gridExtra_2.3       digest_0.6.15
[34] stringi_1.1.7       BBmisc_1.11         grid_3.4.4
[37] rprojroot_1.3-2     tools_3.4.4         magrittr_1.5
[40] lazyeval_0.2.1      tibble_1.4.2        future.apply_0.1.0
[43] pkgconfig_2.0.1     iterators_1.0.9     assertthat_0.2.0
[46] rmarkdown_1.9       R6_2.2.2            globals_0.11.0
[49] igraph_1.2.1        compiler_3.4.4