Here, we introduce the R-package POUMM - an implementation of the Phylogenetic Ornstein-Uhlenbeck Mixed Model (POUMM) for univariate continuous traits (Mitov and Stadler 2017b; Mitov and Stadler 2017a). In the sections below, we demonstrate how the package works. To that end, we run a toy-simulation of a trait according to the POUMM model. Then, we execute a maximum likelihood (ML) and a Bayesian (MCMC) POUMM fit to the simulated data. We show how to use plots and some diagnostics to assess the quality of the fit, i.e. the mixing and the convergence of the MCMC, as well as the consistency of the POUMM fit with the true POUMM parameters from the simulation.

But before we start, we install some packages from CRAN:

install.packages("TreeSim")
install.packages("data.table")
install.packages("ggplot2")
install.packages("lmtest")

Installing the POUMM R-package

You can install the most recent version of the package from github:

devtools::install_github(repo="venelin/POUMM")

The above command will install the HEAD version from the master git-branch. This is the package branch, which evolves the fastest and gets the quickest bug-fixes.

A more stable version of the package can be installed from CRAN:

install.packages("POUMM")

Bear in mind that, for CRAN compliance, some of the functionality of the package might not be enabled on some systems, in particular, parallel likelihood calculation on Mac OS X El Capitan using clang compiler (see section Parallel execution below).

Simulating trait evolution under the POUMM

Parameters of the simulation

First, we specify the parameters of the POUMM simulation:

N <- 500
g0 <- 0           
alpha <- .5        
theta <- 2        
sigma <- 0.2     
sigmae <- 0.2 

We briefly explain the above parameters. The first four of them define an OU-process with initial state \(g_0\), a selection strength parameter, \(\alpha\), a long-term mean, \(\theta\), and a stochastic time-unit standard deviation, \(\sigma\). To get an intuition about the OU-parameters, one can consider random OU-trajectories using the function POUMM::rTrajectoryOU. On the figure below, notice that doubling \(\alpha\) speeds up the convergence of the trajectory towards \(\theta\) (magenta line) while doubling \(\sigma\) results in bigger stochastic oscilations (blue line):

Dashed black and magenta lines denote the deterministic trend towards the long-term mean $\theta$, fixing the stochastic parameter $\sigma=0$.

Dashed black and magenta lines denote the deterministic trend towards the long-term mean \(\theta\), fixing the stochastic parameter \(\sigma=0\).

The POUMM models the evolution of a continuous trait, \(z\), along a phylogenetic tree, assuming that \(z\) is the sum of a genetic (heritable) component, \(g\), and an independent non-heritable (environmental) component, \(e\sim N(0,\sigma_e^2)\). At every branching in the tree, the daughter lineages inherit the \(g\)-value of their parent, adding their own environmental component \(e\). The POUMM assumes the genetic component, \(g\), evolves along each lineage according to an OU-process with initial state the \(g\) value inherited from the parent-lineage and global parameters \(\alpha\), \(\theta\) and \(\sigma\).

Simulating the phylogeny

Once the POUMM parameters are specified, we use the TreeSim R-package (Stadler 2015) to generate a random birth-death tree with 500 tips:

# Number of tips
tree <- TreeSim::sim.bdsky.stt(N, lambdasky = 1.6, deathsky = .6, 
                               timesky=c(0, Inf), sampprobsky = 1)[[1]]

Simulating trait evolution on the phylogeny

Starting from the root value \(g_0\), we simulate the genotypic values, \(g\), and the environmental contributions, \(e\), at all internal nodes down to the tips of the phylogeny:

# genotypic (heritable) values
g <- POUMM::rVNodesGivenTreePOUMM(tree, g0, alpha, theta, sigma)

# environmental contributions
e <- rnorm(length(g), 0, sigmae)

# phenotypic values
z <- g + e

Visualizing the data

In most real situations, only the phenotypic value, at the tips, i.e. will be observable. One useful way to visualize the observed trait-values is to cluster the tips in the tree according to their root-tip distance, and to use box-whisker or violin plots to visualize the trait distribution in each group. This allows to visually assess the trend towards uni-modality and normality of the values - an important prerequisite for the POUMM.

# This is easily done using the nodeTimes utility function in combination with
# the cut-function from the base package.
data <- data.table(z = z[1:N], t = POUMM::nodeTimes(tree, tipsOnly = TRUE))
data <- data[, group := cut(t, breaks = 5, include.lowest = TRUE)]

ggplot(data = data, aes(x = t, y = z, group = group)) + 
  geom_violin(aes(col = group)) + geom_point(aes(col = group), size=.5)
Distributions of the trait-values grouped according to their root-tip distances.

Distributions of the trait-values grouped according to their root-tip distances.

Fitting the POUMM

Once all simulated data is available, it is time proceed with a first POUMM fit. This is done easily by calling the POUMM function:

fitPOUMM <- POUMM::POUMM(z[1:N], tree)

The above code runs for about 5 minutes on a MacBook Pro Retina (late 2013) with a 2.3 GHz Intel Core i7 processor. Using default settings, it performs a maximum likelihood (ML) and a Bayesian (MCMC) fit to the data. First the ML-fit is done. Then, three MCMC chains are run as follows: the first MCMC chain samples from the default prior distribution, i.e. assuming a constant POUMM likelihood; the second and the third chains perform adaptive Metropolis sampling from the posterior parameter distribution conditioned on the default prior and the data. By default each chain is run for \(10^5\) iterations. This and other default POUMM settings are described in detail in the help-page for the function specifyPOUMM (see ).

The strategy of executing three MCMC chains instead of one allows to assess:

  • the quality of the MCMC fit: a mismatch between the sampling distributions of the second and third chains suggests that at least one of the chains has not converged to a region of high posterior density (HPD).
  • the presence of signal for the POUMM parameters in the data: a close match between prior and posterior distributions suggests lack of signal in the data.

We plot traces and posterior sample densities from the MCMC fit:

# get a list of plots 
plotList <- plot(fitPOUMM, showUnivarDensityOnDiag = TRUE, doPlot = FALSE)
plotList$traceplot
MCMC traces from a POUMM MCMC-fit.

MCMC traces from a POUMM MCMC-fit.

plotList$densplot
MCMC univariate density plots. Black dots on the x-axis indicate the ML-fit.

MCMC univariate density plots. Black dots on the x-axis indicate the ML-fit.

A mismatch of the posterior sample density plots from chains 2 and 3, in particular for the phylogenetic heritability, \(H_{\bar{t}}^2\), indicates that the chains have not converged. This can be confirmed quantitatively by the Gelman-Rubin statistic (column called G.R.) in the summary of the fit:

summary(fitPOUMM)
##             stat   N       MLE  PostMean             HPD    ESS  G.R.
##  1:        alpha 500   0.47442   0.47823   0.3127,0.6483 103.94 1.119
##  2:        theta 500   2.04246   2.04767     1.921,2.171 134.72 1.041
##  3:        sigma 500   0.20051   0.20421   0.1527,0.2678 106.16 1.068
##  4:       sigmae 500   0.20521   0.20315   0.1691,0.2399 180.00 1.007
##  5:          H2e 500   0.64114   0.64549   0.5178,0.7633 180.00 1.003
##  6:       H2tInf 500   0.50154   0.51244   0.3266,0.7108 110.66 1.008
##  7:       H2tMax 500   0.50069   0.51113   0.3259,0.7097 110.70 1.008
##  8:      H2tMean 500   0.49921   0.50935   0.3248,0.7079 110.60 1.009
##  9:           g0 500        NA        NA           NA,NA   0.00    NA
## 10: sigmaG2tMean 500   0.04198   0.04410 0.02524,0.06528 117.54 1.020
## 11:  sigmaG2tMax 500   0.04223   0.04443 0.02530,0.06562 118.44 1.019
## 12:  sigmaG2tInf 500   0.04237   0.04467 0.02532,0.06582 119.05 1.019
## 13:      logpost 500        NA -55.74341   -59.56,-53.64  68.76 1.089
## 14:       loglik 500 -53.29615        NA           NA,NA   0.00    NA
## 15:          AIC 500 116.59230        NA           NA,NA   0.00    NA
## 16:         AICc 500 116.71376        NA           NA,NA   0.00    NA
## 17:           g0 500   0.03896        NA           NA,NA   0.00    NA

The G.R. diagnostic is used to check whether two random samples originate from the same distribution. Values that are substantially different from 1.00 (in this case greater than 1.01) indicate significant difference between the two samples and possible need to increase the number of MCMC iterations. Therefore, we rerun the fit specifying that each chain should be run for \(4 \times 10^5\) iterations:

fitPOUMM2 <- POUMM::POUMM(z[1:N], tree, spec=list(nSamplesMCMC = 4e5))  

Now, both the density plots and the G.R. values indicate nearly perfect convergence of the second and third chains. The agreement between the ML-estimates (black dots on the density plots) and the posterior density modes (approximate location of the peak in the density curves) shows that the prior does not inflict a bias on the MCMC sample. The mismatch between chain 1 and chains 2 and 3 suggests that the information about the POUMM parameters contained in the data disagrees with or significantly improves our prior knowledge about these parameters. This is the desired outcome of a Bayesian fit, in particular, in the case of a weak (non-informed) prior, such as the default one.

plotList <- plot(fitPOUMM2, doPlot = FALSE)
plotList$densplot

summary(fitPOUMM2)
##             stat   N       MLE  PostMean             HPD   ESS   G.R.
##  1:        alpha 500   0.47442   0.46639   0.3226,0.6197 720.0 0.9998
##  2:        theta 500   2.04246   2.05993     1.944,2.210 720.0 0.9993
##  3:        sigma 500   0.20051   0.19902   0.1375,0.2615 720.0 0.9997
##  4:       sigmae 500   0.20521   0.20537   0.1701,0.2389 849.4 0.9997
##  5:          H2e 500   0.64114   0.63805   0.5214,0.7593 841.5 0.9998
##  6:       H2tInf 500   0.50154   0.50082   0.3126,0.6998 656.0 0.9998
##  7:       H2tMax 500   0.50069   0.49943   0.3000,0.6878 656.6 0.9998
##  8:      H2tMean 500   0.49921   0.49753   0.2965,0.6850 657.2 0.9998
##  9:           g0 500        NA        NA           NA,NA   0.0     NA
## 10: sigmaG2tMean 500   0.04198   0.04287 0.02287,0.06531 642.3 0.9997
## 11:  sigmaG2tMax 500   0.04223   0.04320 0.02283,0.06533 642.3 0.9997
## 12:  sigmaG2tInf 500   0.04237   0.04345 0.02299,0.06541 642.3 0.9997
## 13:      logpost 500        NA -55.54167   -58.38,-53.60 720.0 0.9990
## 14:       loglik 500 -53.29615        NA           NA,NA   0.0     NA
## 15:          AIC 500 116.59230        NA           NA,NA   0.0     NA
## 16:         AICc 500 116.71376        NA           NA,NA   0.0     NA
## 17:           g0 500   0.03896        NA           NA,NA   0.0     NA

Consistency of the fit with the “true” simulation parameters

The 95% high posterior density (HPD) intervals contain the true values for all five POUMM parameters (\(\alpha\), \(\theta\), \(\sigma\), \(\sigma_e\) and \(g_0\)). This is also true for the derived statistics. To check this, we calculate the true derived statistics from the true parameter values and check that these are well within the corresponding HPD intervals:

tMean <- mean(POUMM::nodeTimes(tree, tipsOnly = TRUE))
tMax <- max(POUMM::nodeTimes(tree, tipsOnly = TRUE))

c(# phylogenetic heritability at mean root-tip distance: 
  H2tMean = POUMM::H2(alpha, sigma, sigmae, t = tMean),
  # phylogenetic heritability at long term equilibirium:
  H2tInf = POUMM::H2(alpha, sigma, sigmae, t = Inf),
  # empirical (time-independent) phylogenetic heritability, 
  H2e = POUMM::H2e(z[1:N], sigmae),
  # genotypic variance at mean root-tip distance: 
  sigmaG2tMean = POUMM::varOU(t = tMean, alpha, sigma),
  # genotypic variance at max root-tip distance: 
  sigmaG2tMean = POUMM::varOU(t = tMax, alpha, sigma),
  # genotypic variance at long-term equilibrium:
  sigmaG2tInf = POUMM::varOU(t = Inf, alpha, sigma)
  )
##      H2tMean       H2tInf          H2e sigmaG2tMean sigmaG2tMean 
##      0.49820      0.50000      0.65914      0.03971      0.03990 
##  sigmaG2tInf 
##      0.04000

Finally, we compare the ratio of empirical genotypic to total phenotypic variance with the HPD-interval for the phylogenetic heritability.

c(H2empirical = var(g[1:N])/var(z[1:N]))
## H2empirical 
##      0.6791
summary(fitPOUMM2)["H2e"==stat, unlist(HPD)]
##  lower  upper 
## 0.5214 0.7593

Parallel execution

On computers with multiple core processors, it is possible to speed-up the POUMM-fit by parallelization. The POUMM package supports parallelization on two levels:

  • parallelizing the POUMM likelihood calculation. The POUMM package parallelizes the likelihood calculation through the SPLiTTree library for parallel lineage traversal (Mitov and Stadler 2017a). This is a fine grain parallelization, which can benefit from modern single instruction multiple data (SIMD) processors as well as multiple physical cores. Parallelization on multiple cores becomes beneficial on trees exceeding several hundreds tips. For this parallelization to work, the POUMM package must be compiled from source-code using an OpenMP 4.0-enabled C++ compiler. Open MP 4.0 is supported by several modern C++ compilers including Gnu-g++ and Intel-icpc, but not the current version of the clang compiler. We have tested the package on Linux and MacOS X El Capitan using the Intel compiler v16.0.0. To that end, before installing the package from source, we modify the global Makevars file found in the directory .R under the user’s home directory:
CPP=cpp
CXX=icpc
CC=icc
SHLIB_CXXLD=icpc -qopenmp

Note that if you are installing the POUMM package from CRAN, on Mac OS X, the above modification of the global Makevars file might still be overwritten by the R-package builder. This is because, for CRAN compliance, the package DESCRIPTION file needs a directive:

SystemRequirements: C++11

On my Mac OS X El Capitan, the above directive forces the use of the clang compiler, which does not support Open MP. Therefore, to enbable using the icpc and the Open MP, I recommend installing the package from github.

To control the maximum number of threads (defaults to all physical or virtual cores on the system) by specifying the environment variable OMP_NUM_THREADS, before starting R e.g.:

export OMP_NUM_THREADS=4
  • parallelizing the MCMC-chains - this can be done by creating a cluster using the R-package parallel. With the default settings of the MCMC-fit (executing two MCMC chains sampling from the posterior distribution and one MCMC chain sampling from the prior), this parallelization can result in about two times speed-up of the POUMM fit on a computer with at least two available physical cores.
# set up a parallel cluster on the local computer for parallel MCMC:
cluster <- parallel::makeCluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cluster)

fitPOUMM <- POUMM::POUMM(z[1:N], tree, spec=list(parallelMCMC = TRUE))

# Don't forget to destroy the parallel cluster to avoid leaving zombie worker-processes.
parallel::stopCluster(cluster)

It is possible to use this parallelization in combination with parallelization of the likelihood calculation. This, however, has not been tested and, presumably, would be slower than a parallelization on the likelihood level only. It may be appropriate to parallelize the MCMC chains on small trees, since for small trees, the parallel likelihood calculation is not likely to be much faster than a serial mode calculation. In this case the SPLiTTree library would switch automatically to serial mode, so there will be no parallel CPU at the likelihood level.

Packages used

Apart from base R functionality, the POUMM package uses a number of 3rd party R-packages:

  • For likelihood calculation: Rcpp v0.12.14 (Eddelbuettel et al. 2017), RcppArmadillo v0.8.300.1.0 (Eddelbuettel, Francois, and Bates 2016), Rmpfr v0.6.1 (Maechler 2016);
  • For mcmcSampling: adaptMCMC v1.2 (Scheidegger 2012);
  • For MCMC convergence diagnostics, calculation of MCMC effective sizes and HPD-intervals: coda v0.19.1 (Plummer et al. 2016);
  • For other purposes (parameter transformations and summary statistics): parallel v3.3.3 (Team, n.d.), foreach v1.4.4 (Revolution Analytics and Weston, n.d.), data.table v1.10.4.3 (Dowle and Srinivasan 2016), Matrix v1.2.8 (Bates and Maechler 2017), gsl v1.9.10.3 (Duncan Murdoch and Andrew Clausen 2017));
  • For tree processing: ape v5.0 (Paradis et al. 2016);
  • For reporting: data.table v1.10.4.3 (Dowle and Srinivasan 2016), ggplot2 v2.2.1 (Wickham and Chang 2016), GGally v1.3.2 (Schloerke et al. 2016), lmtest v0.9.35 (Hothorn et al. 2015);
  • For testing: testthat v1.0.2 (Wickham 2016), mvtnorm v1.0.5 (Genz et al. 2016), TreeSim v2.3 (Stadler 2015).

References

Bates, Douglas, and Martin Maechler. 2017. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.

Dowle, Matt, and Arun Srinivasan. 2016. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.

Duncan Murdoch, Robin K. S. Hankin; qrng functions by, and multimin by Andrew Clausen. 2017. Gsl: Wrapper for the Gnu Scientific Library. https://CRAN.R-project.org/package=gsl.

Eddelbuettel, Dirk, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas Bates, and John Chambers. 2017. Rcpp: Seamless R and C++ Integration. https://CRAN.R-project.org/package=Rcpp.

Eddelbuettel, Dirk, Romain Francois, and Doug Bates. 2016. RcppArmadillo: ’Rcpp’ Integration for the ’Armadillo’ Templated Linear Algebra Library. https://CRAN.R-project.org/package=RcppArmadillo.

Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, and Torsten Hothorn. 2016. Mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.

Hothorn, Torsten, Achim Zeileis, Richard W. Farebrother, and Clint Cummins. 2015. Lmtest: Testing Linear Regression Models. https://CRAN.R-project.org/package=lmtest.

Maechler, Martin. 2016. Rmpfr: R Mpfr - Multiple Precision Floating-Point Reliable. https://CRAN.R-project.org/package=Rmpfr.

Mitov, Venelin, and Tanja Stadler. 2017a. “Fast Bayesian Inference of Phylogenetic Models Using Parallel Likelihood Calculation and Adaptive Metropolis Sampling.” bioRxiv. Cold Spring Harbor Laboratory. doi:10.1101/235739.

———. 2017b. “Fast and Robust Inference of Phylogenetic Ornstein-Uhlenbeck Models Using Parallel Likelihood Calculation.” bioRxiv, mai, 115089.

Paradis, Emmanuel, Simon Blomberg, Ben Bolker, Julien Claude, Hoa Sien Cuong, Richard Desper, Gilles Didier, et al. 2016. Ape: Analyses of Phylogenetics and Evolution. https://CRAN.R-project.org/package=ape.

Plummer, Martyn, Nicky Best, Kate Cowles, Karen Vines, Deepayan Sarkar, Douglas Bates, Russell Almond, and Arni Magnusson. 2016. Coda: Output Analysis and Diagnostics for Mcmc. https://CRAN.R-project.org/package=coda.

Revolution Analytics, and Steve Weston. n.d. Foreach: Provides Foreach Looping Construct for R.

Scheidegger, Andreas. 2012. AdaptMCMC: Implementation of a Generic Adaptive Monte Carlo Markov Chain Sampler. https://CRAN.R-project.org/package=adaptMCMC.

Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2016. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.

Stadler, Tanja. 2015. TreeSim: Simulating Phylogenetic Trees. https://CRAN.R-project.org/package=TreeSim.

Team, R Core. n.d. Support for Parallel Computation in R.

Wickham, Hadley. 2016. Testthat: Unit Testing for R. https://CRAN.R-project.org/package=testthat.

Wickham, Hadley, and Winston Chang. 2016. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.