Below are a few examples of common statistical models implemented in greta.
A simple, one-variable Bayesian linear regression model using the attitude data
# variables & priors
int <- normal(0, 10)
coef <- normal(0, 10)
sd <- cauchy(0, 3, truncation = c(0, Inf))
# linear predictor
mu <- int + coef * attitude$complaints
# observation model
distribution(attitude$rating) <- normal(mu, sd)
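Once these greta arrays are defined, the model can be fitted; the following is a minimal sketch of sampling with greta's default settings (the number of samples is illustrative).
# collect the unknowns into a model object and sample from the posterior
m <- model(int, coef, sd)
draws <- mcmc(m, n_samples = 1000)
summary(draws)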
A multi-variable Bayesian linear regression model using the attitude data
# the predictors as a matrix
design <- as.matrix(attitude[, 2:7])
int <- normal(0, 10)
coefs <- normal(0, 10, dim = ncol(design))
sd <- cauchy(0, 3, truncation = c(0, Inf))
# matrix multiplication is more efficient than multiplying the coefficients
# separately
mu <- int + design %*% coefs
distribution(attitude$rating) <- normal(mu, sd)
A hierarchical, Bayesian linear regression model using the iris data, with random intercepts for each of the three species.
# linear model parameters
int <- normal(0, 10)
coef <- normal(0, 10)
sd <- cauchy(0, 3, truncation = c(0, Inf))
# hierarchical model for species effect; use the first species as the baseline
# like in lm()
species_sd <- lognormal(0, 1)
species_offset <- normal(0, species_sd, dim = 2)
species_effect <- rbind(0, species_offset)
species_id <- as.numeric(iris$Species)
# model
mu <- int + coef * iris$Sepal.Width + species_effect[species_id]
distribution(iris$Sepal.Length) <- normal(mu, sd)
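As with the other models, this can be fitted with model() and mcmc(); the derived species effects can then be tracked with calculate(). This is a sketch and assumes a version of greta recent enough for calculate() to accept posterior draws.
# fit the model, then compute posterior draws of the derived species effects
m <- model(int, coef, sd, species_sd)
draws <- mcmc(m, n_samples = 1000)
species_draws <- calculate(species_effect, values = draws)
summary(species_draws)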
The BUGS project provides a number of example models written in the BUGS modelling language. These models will run in WinBUGS and OpenBUGS, and likely also in JAGS. The Stan wiki provides Stan implementations of these models.
The following sections provide greta implementations of some of these example models, alongside the BUGS code from WinBUGS examples volume 2 (pdf), and the Stan code and an R version of the data from the Stan example models wiki.
Air analyses reported respiratory illness versus exposure to nitrogen dioxide in 103 children. The parameters alpha, beta and sigma2 are known in advance, and the data are grouped into three categories.
See WinBUGS examples volume 2 (pdf) for details.
y <- c(21, 20, 15)
n <- c(48, 34, 21)
Z <- c(10, 30, 50)
alpha <- 4.48
beta <- 0.76
sigma2 <- 81.14
sigma <- sqrt(sigma2)
tau <- 1 / sigma2
J <- 3
theta <- normal(0, 32, dim = 2)
mu <- alpha + beta * Z
X <- normal(mu, sigma)
p <- ilogit(theta[1] + theta[2] * X)
distribution(y) <- binomial(n, p)
for(j in 1 : J) {
y[j] ~ dbin(p[j], n[j])
logit(p[j]) <- theta[1] + theta[2] * X[j]
X[j] ~ dnorm(mu[j], tau)
mu[j] <- alpha + beta * Z[j]
}
theta[1] ~ dnorm(0.0, 0.001)
theta[2] ~ dnorm(0.0, 0.001)
data {
real alpha;
real beta;
real<lower=0> sigma2;
int<lower=0> J;
int y[J];
vector[J] Z;
int n[J];
}
transformed data {
real<lower=0> sigma;
sigma <- sqrt(sigma2);
}
parameters {
real theta1;
real theta2;
vector[J] X;
}
model {
real p[J];
theta1 ~ normal(0, 32); // 32^2 = 1024
theta2 ~ normal(0, 32);
X ~ normal(alpha + beta * Z, sigma);
y ~ binomial_logit(n, theta1 + theta2 * X);
}
Beetles considers dose-response data from an experiment exposing groups of beetles to 8 different concentrations of carbon disulphide. The original example compares three different link functions: the logit, probit and complementary log-log. Here, only the code for the logit link is shown. You can implement the other two link functions in greta by changing ilogit to iprobit or icloglog.
See WinBUGS examples volume 2 (pdf) for details.
x <- c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
n <- c(59, 60, 62, 56, 63, 59, 62, 60)
r <- c(6, 13, 18, 28, 52, 53, 61, 60)
N <- 8
alpha_star <- normal(0, 32)
beta <- normal(0, 32)
p <- ilogit(alpha_star + beta * (x - mean(x)))
distribution(r) <- binomial(n, p)
alpha <- alpha_star - beta * mean(x)
rhat <- p * n
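As noted above, swapping the link function only requires changing the inverse-link used for p; for example, a sketch of the probit version of the same line:
# probit version: replace the ilogit line above with iprobit
# (or use icloglog for the complementary log-log link)
p <- iprobit(alpha_star + beta * (x - mean(x)))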
for( i in 1 : N ) {
r[i] ~ dbin(p[i],n[i])
logit(p[i]) <- alpha.star + beta * (x[i] - mean(x[]))
rhat[i] <- n[i] * p[i]
cumulative.r[i] <- cumulative(r[i], r[i])
}
alpha <- alpha.star - beta * mean(x[])
beta ~ dnorm(0.0,0.001)
alpha.star ~ dnorm(0.0,0.001)
data {
int<lower=0> N;
int<lower=0> n[N];
int<lower=0> r[N];
vector[N] x;
}
transformed data {
vector[N] centered_x;
real mean_x;
mean_x <- mean(x);
centered_x <- x - mean_x;
}
parameters {
real alpha_star;
real beta;
}
transformed parameters {
vector[N] m;
m <- alpha_star + beta * centered_x;
}
model {
alpha_star ~ normal(0.0, 1.0E4);
beta ~ normal(0.0, 1.0E4);
r ~ binomial_logit(n, m);
}
generated quantities {
real alpha;
real p[N];
real llike[N];
real rhat[N];
for (i in 1:N) {
p[i] <- inv_logit(m[i]);
llike[i] <- r[i]*log(p[i]) + (n[i]-r[i])*log(1-p[i]);
rhat[i] <- p[i]*n[i]; // fitted values
}
alpha <- alpha_star - beta*mean_x;
}
Here we provide some examples of common ecological models. We begin with a basic logistic regression, often used in species distribution modelling to estimate a species' probability of presence. We then provide increasingly complex species distribution models, beginning with modelling observation error directly, and moving on to models for multiple species: independently but concurrently modelled species, partially pooled coefficients, repeated measures, and sub-models.
One important way greta models differ from BUGS and Stan code is that it is often necessary to set up the linear predictor separately from the logistic transformation. This is particularly the case as the dimensionality of the data increases, such as when there are multiple species being modelled. We keep this convention in the simpler models for ease of understanding.
Logistic regression
This is an example of a simple logistic regression being used to estimate the probability of species presence along a number of environmental gradients. We first simulate some data to model, followed by the greta code.
# make fake data
n_env <- 3
n_sites <- 20
# n_sites x n_env matrix of environmental variables
env <- matrix(rnorm(n_sites * n_env), nrow = n_sites)
# n_sites observations of species presence or absence
occupancy <- rbinom(n_sites, 1, 0.5)
# load greta
library(greta)
# convert the data to greta arrays
X <- as_data(env)
Y <- as_data(occupancy)
# create greta arrays for random variables
alpha <- normal(0, 10)
beta <- normal(0, 10, dim = n_env)
# logit-linear model
linear_predictor <- alpha + X %*% beta
p <- ilogit(linear_predictor)
# distribution (likelihood) over observed values
distribution(Y) <- bernoulli(p)
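We could also use the fitted model to predict occupancy probability at new sites; a minimal sketch, where env_new is a hypothetical matrix of environmental values with the same columns as env, and posterior draws of p_new would be obtained with calculate() after sampling.
# hypothetical environmental data for 10 new sites
env_new <- matrix(rnorm(10 * n_env), nrow = 10)
# occupancy probability at the new sites, built from the same alpha and beta
p_new <- ilogit(alpha + as_data(env_new) %*% beta)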
Poisson regression
An example of a simple Poisson regression being used to estimate the abundance of a species along a number of environmental gradients. We first simulate some data to model, followed by the greta code.
# make fake data
n_env <- 3
n_sites <- 20
# n_sites x n_env matrix of environmental variables
env <- matrix(rnorm(n_sites * n_env), nrow = n_sites)
# n_sites observations of species abundance
occupancy <- rpois(n_sites, 5)
# load greta
library(greta)
# convert the data to greta arrays
X <- as_data(env)
Y <- as_data(occupancy)
# create greta arrays for random variables
alpha <- normal(0, 10)
beta <- normal(0, 10, dim = n_env)
# model
linear_predictor <- alpha + X %*% beta
# exponentiate the linear predictor to get the Poisson rate (log link)
lambda <- exp(linear_predictor)
distribution(Y) <- poisson(lambda)
Logistic regression with observational error term
This is an example of a simple logistic regression with an extra observation-level error term. We first simulate some data to model, followed by the greta code.
# make fake data
n_env <- 3
n_sites <- 20
# n_sites x n_env matrix of environmental variables
env <- matrix(rnorm(n_sites * n_env), nrow = n_sites)
# n_sites observations of species presence or absence
occupancy <- rbinom(n_sites, 1, 0.5)
# load greta
library(greta)
# convert the data to greta arrays
X <- as_data(env)
Y <- as_data(occupancy)
# create greta arrays for random variables
alpha <- normal(0, 10)
beta <- normal(0, 10, dim = n_env)
error <- normal(0, 10, dim = n_sites)
# logit-linear model with extra variation
linear_predictor <- alpha + X %*% beta + error
p <- ilogit(linear_predictor)
# distribution (likelihood) over observed values
distribution(Y) <- bernoulli(p)
An example of a logistic regression being used to estimate the probability of multiple species' presences along a number of environmental gradients. Although modelled concurrently, the random variables for each species are independent. We first simulate some data to model, followed by the greta code.
Where a single observation per species and location would have a bernoulli error distribution, multiple observations for each species and location have a binomial distribution. The small change in the code is commented below.
When modelling multiple species (or another grouping factor), we need an extra step in constructing the linear predictor. In order to add multiple greta arrays together for each species, we can use the sweep() function.
# make fake data
n_species <- 5
n_env <- 3
n_sites <- 20
env <- matrix(rnorm(n_sites * n_env), nrow = n_sites)
occupancy <- matrix(rbinom(n_species * n_sites, 1, 0.5), nrow = n_sites)
# load greta
library(greta)
# convert the data to greta arrays
X <- as_data(env)
Y <- as_data(occupancy)
# variables
alpha <- normal(0, 10, dim = n_species)
beta <- normal(0, 10, dim = c(n_env, n_species))
env_effect <- X %*% beta
# add each species' intercept to its column of environmental effects with sweep()
linear_predictor <- sweep(env_effect, 2, alpha, FUN = '+')
# ilogit of linear predictor
p <- ilogit(linear_predictor)
# a single observation means our data are bernoulli distributed
distribution(Y) <- bernoulli(p)
# for n_obs repeat observations per species/location, we could use the binomial
# distribution:
# distribution(Y) <- binomial(n_obs, p)
An example of a logistic regression being used to estimate the probability of multiple species' presences along a number of environmental gradients. Instead of assuming independence of species regression coefficients, we assume they are drawn from a shared distribution; that is, we partially pool species responses. This gives us not only the regression coefficients for each species but also a global average coefficient and a measure of variation between species' responses to environmental gradients.
We first simulate some data to model, followed by the greta code.
# make fake data
n_species <- 5
n_env <- 1
n_sites <- 50
env <- matrix(rnorm(n_sites * n_env), nrow = n_sites)
occupancy <- matrix(rbinom(n_sites * n_species, 1, 0.5), nrow = n_sites)
# load greta
library(greta)
# convert the data to greta arrays
X <- as_data(env)
Y <- as_data(occupancy)
# variables
global_alpha <- normal(0, 10, dim = 1)
global_alpha_sd <- uniform(0, 10, dim = 1)
alpha <- normal(global_alpha, global_alpha_sd, dim = n_species)
global_betas <- normal(0, 10, dim = n_env)
global_betas_sd <- uniform(0, 10, dim = n_env)
beta <- normal(global_betas, global_betas_sd, dim = c(n_env, n_species))
env_effect <- X %*% beta
# add each species' intercept to its column of environmental effects with sweep()
linear_predictor <- sweep(env_effect, 2, alpha, FUN = '+')
# ilogit of linear predictor
p <- ilogit(linear_predictor)
# a single observation means our data are bernoulli distributed
distribution(Y) <- bernoulli(p)
An example of a logistic regression being used to estimate the probability of multiple species’ presences along a number of environmental gradients. Instead of assuming independence of species regression coefficients, or partial pooling in shared distributions, we use a sub-model to estimate species regression coefficients. In this case, we’re using species traits to estimate their response to different environmental gradients.
Because we're building a sub-model, it's simplest to add a column of ones to the data for both the base model and the sub-model; this keeps the code from becoming too cumbersome. If we didn't want to use our sub-model to estimate the intercept, we would not need to include the column of ones in the environmental data.
We first simulate some data to model, followed by the greta code.
# make fake data
n_species <- 3
n_env <- 1
n_sites <- 5
n_traits <- 1
# n_sites x n_env matrix of environmental variables
env <- matrix(rnorm(n_sites * n_env), nrow = n_sites)
# n_species x n_traits matrix of trait variables
traits <- matrix(rnorm(n_species * n_traits), nrow = n_species)
# n_sites x n_species matrix of observed occupancy
occupancy <- matrix(rbinom(n_sites * n_species, 1, 0.5), nrow = n_sites)
# load greta
library(greta)
# data wrangling
# include a column of 1's for intercept estimation in the sub-model (traits) and base model
traits <- cbind(rep(1, n_species), traits)
env <- cbind(rep(1, n_sites), env)
# redefine n_env and n_traits after adding the columns of ones for the intercepts
n_env <- ncol(env)
n_traits <- ncol(traits)
# convert the data to greta arrays
X <- as_data(env)
Y <- as_data(occupancy)
U <- as_data(traits)
# greta arrays for variables to be estimated
# sub-model parameters have normal prior distributions
g <- normal(0, 10, dim = c(n_env, n_traits))
# parameters of the base model are a function of the parameters of the sub-model
beta <- g %*% t(U)
# use the coefficients to get the model linear predictor
linear_predictor <- X %*% beta
# use the logit link to get probabilities of occupancy
p <- ilogit(linear_predictor)
# data are bernoulli distributed
distribution(Y) <- bernoulli(p)