1 Introduction

Model selection is an important step in probabilistic modelling. As with most Bayesian models, the optimisation procedure of MOFA is not guaranteed to find the same optimal solution at every trial, and factors can vary between different model instances. Hence, it is important to assess the consistency of factors across models trained under different random parameter initialisations (different seeds).

Having said that, we observe MOFA to be very robust across trials. In practice, a single trained model is usually sufficient for an exploratory analysis.

2 Load libraries

library(MOFA2)
## Warning: replacing previous import 'DelayedArray::pnorm' by 'stats::pnorm'
## when loading 'MOFA2'

3 Simulate an example data set

To illustrate the MOFA workflow we simulate a small example data set with 3 different views. make_example_data generates the simulated data, which we then pass to create_mofa to obtain an untrained MOFA object. If you work with your own data, use create_mofa in the same way to create the untrained MOFA object (see our vignettes on scRNA (gastrulation) or scMethylation (cortex)).
By default the function make_example_data produces a small data set containing 3 views with 100 features each and 2 groups with 50 samples each. These parameters can be varied using the arguments n_views, n_features, n_groups and n_samples.
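
For example, a larger simulated data set could be generated as follows (a sketch only; the argument names are those listed above and the chosen values are arbitrary):

# Simulate 4 views of 200 features and 3 groups of 100 samples each (not run here)
sim_data_large <- make_example_data(
  n_views = 4, n_features = 200,
  n_groups = 3, n_samples = 100
)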

set.seed(1234)
sim_data <- make_example_data()
MOFAobject <- create_mofa(sim_data$data, groups = sim_data$groups)
## Creating MOFA object from a list of matrices (features as rows, sample as columns)...
## Warning in quality_control(object): There are duplicated features names across different views. We will add the suffix *_view* only for those features 
##             Example: if you have both TP53 in mRNA and mutation data it will be renamed to TP53_mRNA, TP53_mutation
MOFAobject
## Untrained MOFA model with the following characteristics: 
##  Number of views: 3 
##  Views names: view_1 view_2 view_3 
##  Number of features (per view): 100 100 100 
##  Number of groups: 2 
##  Groups names: group_1 group_2 
##  Number of samples (per group): 50 50

4 Define the data, model and training options

Details on the various options can be found in the getting_started vignette. Here, we will simply use the default options, changing only the number of factors:

data_opts <- get_default_data_options(MOFAobject)

model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 5

train_opts <- get_default_training_options(MOFAobject)
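
Each of these getters returns a plain named list; its entries can be inspected with str() and overwritten before training, just as num_factors is changed above (a minimal sketch):

# Inspect which fields each option list exposes before modifying them
str(data_opts)
str(model_opts)
str(train_opts)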

5 Run multiple fitting rounds of the MOFA object

Once the MOFAobject is set up, we can use run_mofa to train the model.
Because the results can differ depending on the random initialisation, we recommend running run_mofa multiple times (e.g. ten times; here we use a smaller number for illustration, as model training can take some time) with different random seeds. This allows you to assess the robustness of the inferred factors across different random initialisations and to select one model for downstream analysis. Below we show how to compare the different fits and select the best model for downstream analyses.

n_inits <- 3
MOFAlist <- lapply(seq_len(n_inits), function(it) {
  
  # change the seed
  train_opts$seed <- 2019 + it
  
  MOFAobject <- prepare_mofa(MOFAobject,
    data_options = data_opts,
    model_options = model_opts,
    training_options = train_opts
  )
  
  run_mofa(MOFAobject)
})
## Warning in run_mofa(MOFAobject): No output filename provided. Using /tmp/mofa_20191113-164639.hdf5 to store the trained model.
## Warning in run_mofa(MOFAobject): No output filename provided. Using /tmp/mofa_20191113-164644.hdf5 to store the trained model.
## Warning in run_mofa(MOFAobject): No output filename provided. Using /tmp/mofa_20191113-164647.hdf5 to store the trained model.
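
The warnings above appear because no output file was given, so each trained model is written to a temporary HDF5 file. If you want to keep the fits, the location can be set explicitly via the outfile argument of run_mofa (a sketch; the path below is only illustrative):

# Prepare and train a single model, storing it in an explicit HDF5 file
outfile <- file.path(tempdir(), "mofa_fit.hdf5")  # illustrative path, adjust to your own directory
MOFAprepared <- prepare_mofa(MOFAobject,
  data_options = data_opts,
  model_options = model_opts,
  training_options = train_opts
)
trained_model <- run_mofa(MOFAprepared, outfile = outfile)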

6 Compare different random inits and select the best model

Having a list of trained models, we can use compare_elbo to get an overview of the optimised ELBO values (a model with a larger ELBO is preferred; since the ELBO is negative, the plot of its absolute value below is read as smaller-is-better).

compare_elbo(MOFAlist)
## Plotting the absolute value of the ELBO for every model (the smaller the better)

With compare_factors we can get an overview of how robust the factors are between different model instances.

compare_factors(MOFAlist)
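
compare_factors plots the correlation between the factors of all pairs of models. The same comparison can be done by hand by extracting the factor matrices with get_factors and correlating them (a minimal sketch, assuming the default return value of get_factors, i.e. one samples-by-factors matrix per group):

# Correlate the factors of the first two fits for group_1
Z1 <- get_factors(MOFAlist[[1]])$group_1
Z2 <- get_factors(MOFAlist[[2]])$group_1
round(abs(cor(Z1, Z2)), 2)  # values near 1 indicate matching factors (factor signs are arbitrary)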

For downstream analyses we recommend choosing the model with the best (largest) ELBO value, which is what select_model does.

MOFAobject <- select_model(MOFAlist)
MOFAobject
## Trained MOFA with the following characteristics: 
##  Number of views: 3 
##  Views names: view_1 view_2 view_3 
##  Number of features (per view): 100 100 100 
##  Number of groups: 2 
##  Groups names: group_1 group_2 
##  Number of samples (per group): 50 50 
##  Number of factors: 5
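
The selected model can then be passed to the usual downstream functions. For example, a first look at the variance explained by each factor in each view (a sketch; see the downstream-analysis vignettes for details):

# Variance decomposition of the selected model, per factor and view
plot_variance_explained(MOFAobject, x = "view", y = "factor")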

7 SessionInfo

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MOFA2_0.99.1     BiocStyle_2.12.0
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.1       Rcpp_1.0.2          lattice_0.20-38    
##  [4] tidyr_1.0.0         assertthat_0.2.1    zeallot_0.1.0      
##  [7] digest_0.6.22       foreach_1.4.7       plyr_1.8.4         
## [10] R6_2.4.0            backports_1.1.5     stats4_3.6.1       
## [13] evaluate_0.14       ggplot2_3.2.1       pillar_1.4.2       
## [16] rlang_0.4.1         lazyeval_0.2.2      S4Vectors_0.22.1   
## [19] Matrix_1.2-17       reticulate_1.13     rmarkdown_1.16     
## [22] labeling_0.3        BiocParallel_1.18.1 Rtsne_0.15         
## [25] stringr_1.4.0       uwot_0.1.4          pheatmap_1.0.12    
## [28] munsell_0.5.0       DelayedArray_0.10.0 HDF5Array_1.12.3   
## [31] compiler_3.6.1      vipor_0.4.5         xfun_0.10          
## [34] pkgconfig_2.0.3     BiocGenerics_0.30.0 ggbeeswarm_0.6.0   
## [37] htmltools_0.4.0     tidyselect_0.2.5    tibble_2.1.3       
## [40] bookdown_0.14       IRanges_2.18.3      codetools_0.2-16   
## [43] matrixStats_0.55.0  ggpubr_0.2.3        crayon_1.3.4       
## [46] dplyr_0.8.3         grid_3.6.1          jsonlite_1.6       
## [49] gtable_0.3.0        lifecycle_0.1.0     magrittr_1.5       
## [52] scales_1.0.0        RcppParallel_4.4.4  stringi_1.4.3      
## [55] reshape2_1.4.3      ggsignif_0.6.0      doParallel_1.0.15  
## [58] vctrs_0.2.0         cowplot_1.0.0       Rhdf5lib_1.6.3     
## [61] RColorBrewer_1.1-2  iterators_1.0.12    tools_3.6.1        
## [64] forcats_0.4.0       glue_1.3.1          beeswarm_0.2.3     
## [67] purrr_0.3.3         parallel_3.6.1      yaml_2.2.0         
## [70] colorspace_1.4-1    rhdf5_2.28.1        BiocManager_1.30.9 
## [73] corrplot_0.84       knitr_1.25