library(mikropml)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Speed up single runs
By default, preprocess_data(), run_ml(), and compare_models() use only one process in series. If you’d like to parallelize various steps of the pipeline to make them run faster, install foreach, future, future.apply, and doFuture. Then, register a future plan prior to calling these functions:
doFuture::registerDoFuture()
future::plan(future::multicore, workers = 2)
Above, we used the multicore plan to split the work across 2 cores. See the future documentation for more about picking the best plan for your use case. Notably, multicore does not work inside RStudio or on Windows; you will need to use multisession instead in those cases.
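For example, inside RStudio or on Windows you could register a multisession plan instead (a minimal sketch; 2 workers is an arbitrary choice):
# sketch: multisession works where multicore does not (RStudio, Windows)
doFuture::registerDoFuture()
future::plan(future::multisession, workers = 2)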
After registering a future plan, you can call preprocess_data() and run_ml() as usual, and they will run certain tasks in parallel.
otu_data_preproc <- preprocess_data(otu_mini_bin, 'dx')$dat_transformed
#> Using 'dx' as the outcome column.
result1 <- run_ml(otu_data_preproc, 'glmnet')
#> Using 'dx' as the outcome column.
#> Training the model...
#> Loading required package: ggplot2
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:mikropml':
#>
#> compare_models
#> Training complete.
There’s also a parallel version of the rf engine called parRF, which trains the trees in the forest in parallel. See the caret docs for more information.
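For instance, you could swap the engine name in run_ml() (a sketch; since parRF may not be among mikropml’s officially supported methods, we pass an mtry grid explicitly, and the values shown are arbitrary):
# hypothetical example: train the forest with the parallel parRF engine
result_parrf <- run_ml(otu_data_preproc, 'parRF',
                       hyperparameters = list(mtry = c(2, 4, 8)),
                       seed = 2019)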
Call run_ml() multiple times in parallel in R
You can use functions from the future.apply package to call run_ml() multiple times in parallel with different parameters. You will first need to run future::plan() as above if you haven’t already. Then, call run_ml() with multiple seeds using future_lapply():
# NOTE: use more seeds for real-world data
results_multi <- future.apply::future_lapply(seq(100, 102), function(seed) {
run_ml(otu_data_preproc, 'glmnet', seed = seed)
}, future.seed = TRUE)
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
Each call to run_ml() with a different seed uses a different random split of the data into training and testing sets. Since we are using seeds, we must set future.seed to TRUE (see the future.apply documentation and this blog post for details on parallel-safe random seeds). This example uses only a few seeds for speed and simplicity, but for real data we recommend using many more seeds to get a better estimate of model performance.
In these examples, we used functions from the future.apply package to call run_ml() in parallel, but you can accomplish the same thing with parallel versions of the purrr::map() functions using the furrr package (e.g. furrr::future_map_dfr()).
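For example, the future_lapply() call above could be written with furrr roughly like this (a sketch; assumes furrr is installed and a future plan is registered):
# sketch: the furrr equivalent of the future_lapply() call above
results_multi_furrr <- furrr::future_map(seq(100, 102), function(seed) {
  run_ml(otu_data_preproc, 'glmnet', seed = seed)
}, .options = furrr::furrr_options(seed = TRUE))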
Extract the performance results and combine into one dataframe for all seeds:
perf_df <- future.apply::future_lapply(results_multi,
function(result) {
result[['performance']] %>%
select(cv_metric_AUC, AUC, method)
},
future.seed = TRUE) %>%
dplyr::bind_rows()
perf_df
#> # A tibble: 3 × 3
#> cv_metric_AUC AUC method
#> <dbl> <dbl> <chr>
#> 1 0.630 0.634 glmnet
#> 2 0.591 0.608 glmnet
#> 3 0.671 0.471 glmnet
Multiple ML methods
You may also wish to compare performance for different ML methods. mapply() can iterate over multiple lists or vectors, and future_mapply() works the same way:
# NOTE: use more seeds for real-world data
param_grid <- expand.grid(seeds = seq(100, 102),
methods = c('glmnet', 'rf'))
results_mtx <- future.apply::future_mapply(
function(seed, method) {
run_ml(otu_data_preproc, method, seed = seed)
},
param_grid$seeds,
param_grid$methods %>% as.character(),
future.seed = TRUE
)
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
Extract and combine the performance results for all seeds and methods:
perf_df2 <- lapply(results_mtx['performance',],
function(x) {
x %>% select(cv_metric_AUC, AUC, method)
}) %>%
dplyr::bind_rows()
perf_df2
#> # A tibble: 6 × 3
#> cv_metric_AUC AUC method
#> <dbl> <dbl> <chr>
#> 1 0.630 0.634 glmnet
#> 2 0.591 0.608 glmnet
#> 3 0.671 0.471 glmnet
#> 4 0.665 0.708 rf
#> 5 0.651 0.697 rf
#> 6 0.701 0.592 rf
Visualize the performance results (ggplot2 is required):
perf_boxplot <- plot_model_performance(perf_df2)
perf_boxplot
plot_model_performance() returns a ggplot2 object. You can add layers to customize the plot:
perf_boxplot +
theme_classic() +
scale_color_brewer(palette = "Dark2") +
coord_flip()
You can also create your own plots however you like using the performance results.
Live progress updates
preprocess_data() and get_feature_importance() support reporting live progress updates using the progressr package. The format is up to you, but we recommend using a progress bar like this:
# optionally, specify the progress bar format with the `progress` package.
progressr::handlers(progressr::handler_progress(
format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
clear = FALSE,
show_after = 0))
# tell progressr to always report progress in any functions that use it.
# set this to FALSE to turn it back off again.
progressr::handlers(global = TRUE)
# run your code and watch the live progress updates.
dat <- preprocess_data(otu_mini_bin, 'dx')$dat_transformed
#> Using 'dx' as the outcome column.
#> preprocessing ========================>------- 78% | elapsed: 1s | eta: 0s
results <- run_ml(dat, "glmnet", kfold = 2, cv_times = 2,
find_feature_importance = TRUE)
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Feature importance =========================== 100% | elapsed: 37s | eta: 0s
Note that some future backends support “near-live” progress updates, meaning progress may not be reported immediately when running in parallel with futures. Read more on that in the progressr vignette. For more on progressr and how to customize the format of progress updates, see the progressr docs.
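If you’d rather not enable progress reporting globally, progressr also lets you scope it to a single call with with_progress() (a minimal sketch):
# sketch: report progress for just this one call instead of globally
progressr::with_progress({
  dat <- preprocess_data(otu_mini_bin, 'dx')$dat_transformed
})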
Parallelizing with Snakemake
When parallelizing multiple calls to run_ml() in R as in the examples above, all of the results objects are held in memory. This isn’t a big deal for a small dataset run with only a few seeds. However, for large datasets run in parallel with, say, 100 seeds (recommended), you may run into problems trying to store all of those objects in memory at once.
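If you need to stay in R, one workaround is to write each result to disk as it finishes rather than collecting every object in memory (a sketch; the results/ directory and file names are illustrative, not part of mikropml):
# sketch: save each model result to disk and return NULL so nothing accumulates
dir.create('results', showWarnings = FALSE)
invisible(future.apply::future_lapply(seq(100, 102), function(seed) {
  result <- run_ml(otu_data_preproc, 'glmnet', seed = seed)
  saveRDS(result, file.path('results', paste0('result_', seed, '.rds')))
  NULL
}, future.seed = TRUE))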
Using a workflow manager such as Snakemake or Nextflow is highly recommended to maximize the scalability and reproducibility of computational analyses. We created a template Snakemake workflow here which you can use as a starting point for your ML project.