A modern implementation of the Super Learner algorithm for ensemble learning and model stacking
Authors: Jeremy Coyle, Nima Hejazi, Ivana Malenica, Oleg Sofrygin
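sl3 is a modern implementation of the Super Learner algorithm of van der Laan, Polley, and Hubbard (2007). The Super Learner algorithm performs ensemble learning in one of two fashions:
1. The discrete Super Learner selects the best prediction algorithm from a supplied library of machine learning algorithms ("learners" in the sl3 nomenclature), that is, the algorithm that minimizes the cross-validated risk with respect to some appropriate loss function.
2. The ensemble Super Learner assigns weights to the learners in the library so that the weighted combination minimizes the cross-validated risk with respect to an appropriate loss function.
sl3 makes the process of applying screening algorithms, learning algorithms, combining both types of algorithms into a stacked regression model, and cross-validating this whole process essentially trivial. The best way to understand this is to see the sl3 package in action: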
set.seed(49753)
suppressMessages(library(data.table))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(SuperLearner)
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-23-9000
#> Package created on 2017-11-29
library(origami)
#> origami: Generalized Cross-Validation Framework
#> Version: 1.0.0
library(sl3)
# load example data set
data(cpp)
# drop rows with a missing outcome, then set remaining missing values to 0
cpp <- cpp %>%
  dplyr::filter(!is.na(haz)) %>%
  mutate_all(~ replace(., is.na(.), 0))
# use covariates of interest and the outcome to build a task object
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
task <- sl3_Task$new(cpp, covariates = covars, outcome = "haz")
# set up screeners and learners via built-in functions and pipelines
slscreener <- Lrnr_pkg_SuperLearner_screener$new("screen.glmnet")
glm_learner <- Lrnr_glm$new()
screen_and_glm <- Pipeline$new(slscreener, glm_learner)
SL.glmnet_learner <- Lrnr_pkg_SuperLearner$new(SL_wrapper = "SL.glmnet")
# stack learners into a model (including screeners and pipelines)
learner_stack <- Stack$new(SL.glmnet_learner, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-16
preds <- stack_fit$predict()
head(preds)
#>    Lrnr_pkg_SuperLearner_SL.glmnet Lrnr_glm_TRUE
#> 1:                      0.35345519    0.36298498
#> 2:                      0.35345519    0.36298498
#> 3:                      0.24554305    0.25993072
#> 4:                      0.24554305    0.25993072
#> 5:                      0.24554305    0.25993072
#> 6:                      0.02953193    0.05680264
#>    Lrnr_pkg_SuperLearner_screener_screen.glmnet___Lrnr_glm_TRUE
#> 1:                                                   0.36228209
#> 2:                                                   0.36228209
#> 3:                                                   0.25870995
#> 4:                                                   0.25870995
#> 5:                                                   0.25870995
#> 6:                                                   0.05600958
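The stack above returns one column of predictions per learner. The Super Learner goes one step further and uses cross-validated risk to choose among such learners or to weight them. The following is a minimal base-R sketch of that idea on simulated data, not sl3's implementation: every name in it (candidates, cv_preds, cv_risk, softmax) is hypothetical, the learners are plain lm() fits, and the weights come from a softmax-parameterized call to optim(), chosen only to keep the example free of dependencies.

```r
# Base-R sketch of cross-validated selection and weighting of learners.
# All object names are hypothetical and are not part of the sl3 API.
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)  # x2 carries no signal
dat <- data.frame(y, x1, x2)

# candidate "learners": three linear models with different covariates
candidates <- list(
  x2_only   = y ~ x2,
  x1_only   = y ~ x1,
  x1_and_x2 = y ~ x1 + x2
)

# assign each observation to one of V folds
V <- 5
folds <- sample(rep(seq_len(V), length.out = n))

# cross-validated predictions: fit on V - 1 folds, predict the held-out fold
cv_preds <- sapply(candidates, function(f) {
  out <- numeric(n)
  for (v in seq_len(V)) {
    held_out <- folds == v
    fit <- lm(f, data = dat[!held_out, ])
    out[held_out] <- predict(fit, newdata = dat[held_out, ])
  }
  out
})

# cross-validated risk of each learner under squared-error loss
cv_risk <- colMeans((dat$y - cv_preds)^2)

# discrete selection: the learner with the smallest cross-validated risk
best <- names(which.min(cv_risk))

# ensemble weighting: convex weights that minimize cross-validated risk;
# the softmax keeps the weights non-negative and summing to one
softmax <- function(theta) exp(theta) / sum(exp(theta))
objective <- function(theta) {
  mean((dat$y - cv_preds %*% softmax(theta))^2)
}
opt <- optim(rep(0, ncol(cv_preds)), objective)
weights <- setNames(softmax(opt$par), colnames(cv_preds))
ensemble_risk <- objective(opt$par)

print(cv_risk)
print(best)
print(round(weights, 3))
```

Because the weights are fit to held-out predictions rather than in-sample fits, a learner that merely overfits its training folds gains no advantage in either the selection or the weighting step.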
It is our hope that sl3 will grow to be widely used for creating stacked regression models and the cross-validation of pipelines that make up such models, as well as the variety of other applications in which the Super Learner algorithm plays a role. To that end, contributions are very welcome, though we ask that interested contributors consult our contribution guidelines prior to submitting a pull request.
After using the sl3 R package, please cite the following:
@misc{coyle2018sl3,
  author = {Coyle, Jeremy R and Hejazi, Nima S and Malenica, Ivana and
    Sofrygin, Oleg},
  title = {{sl3}: Modern Pipelines for Machine Learning and {Super
    Learning}},
  year = {2018},
  howpublished = {\url{https://github.com/tlverse/sl3}},
  url = {https://doi.org/DOI_HERE},
  doi = {DOI_HERE}
}
© 2017-2018 Jeremy R. Coyle, Nima S. Hejazi, Ivana Malenica, Oleg Sofrygin
The contents of this repository are distributed under the GPL-3 license. See file LICENSE for details.