It is very common in cognitive science and psychology to use experimental tasks that involve making a fast choice among a restricted number of alternatives. While a class of choice models, the so-called sequential-sampling models, can give an accurate account of the relationship between the accuracy of a choice and the time taken to respond, it is fairly common to ignore the tradeoff between accuracy and response times and to analyze them as if they were independent. The reason for this is that sequential-sampling models are mathematically and computationally difficult to apply. In this notebook, I focus on one influential and relatively simple model that belongs to the class of sequential-sampling models: the linear ballistic accumulator with drift rates drawn from a normal distribution restricted to positive values (S. D. Brown and Heathcote 2008; Heathcote and Love 2012). Even though this model has proven to be well suited for tasks that elicit a speeded response (out of any number of possible choices), its hierarchical version is difficult to implement and fit. First, I discuss the motivation for fitting this model, using the Stroop task (Stroop 1935) as a case study. Then, I discuss the challenges in the implementation of the model in (R)Stan (Stan Development Team 2017), which might also apply to other hierarchical models with complex likelihood functions. Finally, I show some results that exemplify how the linear ballistic accumulator can be used for examining individual differences.
Examining fast decisions that do not require long deliberation can uncover automatic cognitive processes, since these decisions are made before more complex strategies have time to kick in. This is one reason for the popularity of tasks that involve making decisions under time pressure in cognitive science and psychology; some examples are the Stroop task (Stroop 1935), the Eriksen flanker task (B. A. Eriksen and Eriksen 1974), lexical decision tasks, speeded acceptability judgments, and so forth. This type of task is characterized by yielding two measures: the accuracy of the decision made and the time it took to respond. These two measures, however, are not independent, since participants can answer faster at the cost of accuracy, or be more cautious, increasing response times but also accuracy.
I illustrate the problems of ignoring the relationship between response times and accuracy using the Stroop task as a case study. I will focus on a subset of the data of the 3337 participants who undertook one variant of the Stroop task as part of the battery of tasks run in Ebersole et al. (2016). There are many variations of the Stroop task; in this particular case, participants were presented with one word at the center of the screen, which was either “red”, “blue”, or “green” (word), written in either red, blue, or green font (color). In one third of the trials the word matched the color of the text (“congruent” condition) and in the rest of the trials it did not match (“incongruent” condition). Participants were instructed to pay attention only to the color, and press 1 if the color of the word was red, 2 if it was blue, and 3 if it was green. The Stroop effect, that is, the difficulty in identifying the color when it mismatches the word in the incongruent condition (e.g., “green” in blue font) in comparison to a baseline condition, here the congruent condition (e.g., “green” in green font), is extremely robust across variations of the task. The most robust finding is that correct responses in the incongruent condition take longer than in the congruent condition (or any baseline condition).
To speed up computation, I subset 25 participants of the original dataset (containing 3337 participants).
set.seed(42)
library(tidyverse)
library(magrittr)
library(stringr)
library(ggplot2)
dstroop <- read_csv("data/StroopCleanSet.csv") %>%
  # Rename unintuitive column names:
  rename(correct = trial_error,
         RT = trial_latency,
         condition = congruent) %>%
  # Create subject, word, and color columns:
  mutate(subject = session_id %>% as.factor %>% as.numeric,
         word = str_match(trial_name, "(.*?)\\|")[, 2] %>%
           str_to_lower,
         color = str_match(block_name, "(.*?)\\|")[, 2] %>%
           str_to_lower,
         cond = if_else(condition == "Congruent", -1, 1))
# To speed things up, I select the first 25 subjects out of
# the 3337 of the original study.
dstroop_25 <- dstroop %>% filter(subject <= 25)
dstroop_25 %>% filter(correct == 1) %>%
  group_by(condition) %>%
  summarize(mean(RT))
I show the estimate of the delay between the incongruent and congruent conditions, calculated with a log-linear mixed model using the brms package (Bürkner 2017):
library(brms)
options(mc.cores = parallel::detectCores())
priors <- c(set_prior("normal(0,10)", class = "Intercept"),
            set_prior("normal(0,2)", class = "b"),
            set_prior("normal(0,1)", class = "sigma"),
            set_prior("lkj(2)", class = "cor"),
            set_prior("normal(0,1)", class = "sd"))
m_rt <- brm(RT ~ cond + (cond | subject), filter(dstroop_25, correct == 1),
            family = lognormal(), prior = priors,
            control = list(adapt_delta = .9))
m_rt
m_rt
## Family: lognormal
## Links: mu = identity; sigma = identity
## Formula: RT ~ cond + (cond | subject)
## Data: filter(dstroop_25, correct == 1) (Number of observations: 1515)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
## ICs: LOO = NA; WAIC = NA; R2 = NA
##
## Group-Level Effects:
## ~subject (Number of levels: 25)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 0.12 0.02 0.09 0.16 861 1.01
## sd(cond) 0.02 0.01 0.00 0.04 1607 1.00
## cor(Intercept,cond) 0.38 0.37 -0.50 0.91 3455 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 6.34 0.02 6.29 6.39 670 1.00
## cond 0.04 0.01 0.02 0.06 4000 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 0.29 0.01 0.28 0.30 4000 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Below I show the effect in milliseconds:
estimates_rt <- posterior_samples(m_rt, pars = c("b")) %>%
  # Stroop effect in milliseconds:
  mutate(stroop_effect_ms = exp(b_Intercept + b_cond) -
           exp(b_Intercept - b_cond))
mean(estimates_rt$stroop_effect_ms)
## [1] 44.99178
quantile(estimates_rt$stroop_effect_ms,c(0.025,.5,.975))
## 2.5% 50% 97.5%
## 25.20368 44.74390 65.74738
library(bayesplot)
mcmc_hist(estimates_rt, pars="stroop_effect_ms") + xlab("Difference in response time in milliseconds.")
Figure 1. The figure shows the estimate of the difference in response times between incongruent and congruent conditions.
Even though accuracy is almost at ceiling in both the congruent and incongruent conditions, it is generally found that accuracy in the incongruent condition is lower than in the congruent condition. This is also the case for the subsetted data, as shown by the following hierarchical logistic regression:
dstroop_25 %>% group_by(condition) %>%
  summarize(mean(correct))
priors <- c(set_prior("normal(0,10)", class = "Intercept"),
            set_prior("normal(0,2)", class = "b"),
            set_prior("lkj(2)", class = "cor"),
            set_prior("normal(0,1)", class = "sd"))
m_acc <- brm(correct ~ cond + (cond | subject), data = dstroop_25,
             family = bernoulli(), prior = priors,
             control = list(adapt_delta = .9))
m_acc
m_acc
## Family: bernoulli
## Links: mu = logit
## Formula: correct ~ cond + (cond | subject)
## Data: dstroop_25 (Number of observations: 1575)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
## ICs: LOO = NA; WAIC = NA; R2 = NA
##
## Group-Level Effects:
## ~subject (Number of levels: 25)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 0.67 0.27 0.10 1.21 1051 1.00
## sd(cond) 0.40 0.25 0.02 0.93 1169 1.00
## cor(Intercept,cond) -0.07 0.42 -0.78 0.75 4000 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 3.72 0.27 3.26 4.32 2257 1.00
## cond -0.47 0.22 -0.94 -0.07 4000 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Below I show the effect in percentage points:
b_acc <- posterior_samples(m_acc, pars = "b") %>%
  # Stroop effect in percentage points:
  mutate(stroop_effect_per = (plogis(b_Intercept + b_cond) -
                              plogis(b_Intercept - b_cond)) * 100)
mean(b_acc$stroop_effect_per)
## [1] -2.192818
quantile(b_acc$stroop_effect_per,c(0.025,.5,.975))
## 2.5% 50% 97.5%
## -4.0638934 -2.1773516 -0.3495766
mcmc_hist(b_acc, pars="stroop_effect_per") + xlab("Difference in accuracy in percentage.")
Figure 2. The figure shows the estimate of the difference in accuracy (in percentage) between incongruent and congruent conditions.
For this specific dataset, it does not matter whether we choose response times or accuracy as our dependent variable, since both yield the same result. However, things get more complicated if we want to look at the individual differences between the participants. One reason to do this is that the Stroop effect is generally linked to cognitive control or executive functions (e.g., Botvinick et al. 2001), and cognitive control seems to play a role in a variety of phenomena, such as cognitive development, age-related decline in cognitive abilities (Friedman et al. 2008), language-related abilities (Novick, Trueswell, and Thompson-Schill 2010), and attention-deficit/hyperactivity disorder (Lansbergen, Kenemans, and Van Engeland 2007), to name a few. However, here, the way we would rank the participants on cognitive control depends on whether we choose response times or accuracy. For example, participant 1 seems to show a larger difference between conditions in response times, but a smaller difference in accuracy, than participant 10.
indiv_CC <- tibble(subject = 1:25,
                   CC_ms = exp(fixef(m_rt)["Intercept", "Estimate"] +
                               fixef(m_rt)["cond", "Estimate"] +
                               ranef(m_rt)$subject[, "Estimate", "Intercept"] +
                               ranef(m_rt)$subject[, "Estimate", "cond"]) -
                           exp(fixef(m_rt)["Intercept", "Estimate"] -
                               fixef(m_rt)["cond", "Estimate"] +
                               ranef(m_rt)$subject[, "Estimate", "Intercept"] -
                               ranef(m_rt)$subject[, "Estimate", "cond"]),
                   `CC_%` = (plogis(fixef(m_acc)["Intercept", "Estimate"] +
                                    fixef(m_acc)["cond", "Estimate"] +
                                    ranef(m_acc)$subject[, "Estimate", "Intercept"] +
                                    ranef(m_acc)$subject[, "Estimate", "cond"]) -
                             plogis(fixef(m_acc)["Intercept", "Estimate"] -
                                    fixef(m_acc)["cond", "Estimate"] +
                                    ranef(m_acc)$subject[, "Estimate", "Intercept"] -
                                    ranef(m_acc)$subject[, "Estimate", "cond"])) * 100)
indiv_CC %>% print(n = 25)
## # A tibble: 25 x 3
## subject CC_ms `CC_%`
## <int> <dbl> <dbl>
## 1 1 79.2 -1.07
## 2 2 73.3 -8.19
## 3 3 25.3 -1.80
## 4 4 48.9 -1.73
## 5 5 46.9 -2.55
## 6 6 48.3 -1.10
## 7 7 65.1 -1.75
## 8 8 39.9 -2.50
## 9 9 58.9 -3.49
## 10 10 26.1 -5.35
## 11 11 32.0 -2.23
## 12 12 41.6 -0.744
## 13 13 34.2 -1.44
## 14 14 37.9 -2.55
## 15 15 37.0 -1.78
## 16 16 49.3 -1.05
## 17 17 39.3 -3.48
## 18 18 79.3 -7.76
## 19 19 38.3 -1.41
## 20 20 32.1 -2.54
## 21 21 36.9 -1.05
## 22 22 27.5 -2.54
## 23 23 48.6 -2.55
## 24 24 61.4 -0.850
## 25 25 46.0 -1.75
Furthermore, the following plot shows very clearly that the ranking according to response times is not correlated with the ranking according to accuracy.
indiv_CC$rank_ms <- rank(indiv_CC$CC_ms)
indiv_CC$`rank_%` <- rank(indiv_CC$`CC_%`)
ggplot(aes(x = rank_ms, y = `rank_%`, label = subject), data = indiv_CC) +
  geom_text()
Figure 3. The figure shows the ranking of the participants according to their response times (x axis) and their accuracy (y axis).
How can we assess the differences in the cognitive control of the participants?
This problem can be resolved by fitting a model that accounts for how accuracy and response times scale relative to each other. There are a number of models, called sequential-sampling models, that can account for these two dependent variables simultaneously. The idea behind these models is that a decision is made based on the accumulation of noisy evidence until it reaches a decision criterion, or threshold, representing a choice. This type of model includes the drift diffusion model (Ratcliff 1978; Ratcliff 1981), the leaky competing accumulator model (Usher and McClelland 2001), the linear ballistic accumulator (LBA: S. D. Brown and Heathcote 2008; Heathcote and Love 2012) and its variations (Terry et al. 2015), and others (see Ratcliff et al. 2016 for a short overview with a focus on the drift diffusion model).
With some exceptions, sequential-sampling models have mainly been used in very simple tasks (such as deciding whether dots move to the right or to the left) with the objective of theorizing about the psychological processes that govern decision making itself or their connection to neurological processes (e.g., Forstmann et al. 2008). Decision-making processes have been the focus of detailed study for over half a century, starting with Stone (1960), and theories of choice have usually been tested on their ability to accommodate empirical patterns of accuracy and response times.
However, while sequential-sampling models are also models of the influence of variables of interest on the decision-making process, and have the added benefit that their parameters have a psychological interpretation, they have rarely been used in this way, with some exceptions (e.g., Tillman et al. 2017).
One of the reasons for this is that there are not always analytical solutions for these models, but even when they exist (as for the drift diffusion model and all the variants of the LBA), parameter estimation is relatively demanding in terms of user effort and the amount of data needed. These models have a relatively large number of parameters that tend to be correlated, and while non-hierarchical models are easier and faster to fit, they require fitting the model to individual subjects, which might entail a prohibitively large number of trials per participant.
While there are Bayesian hierarchical implementations of some sequential-sampling models, these are based on hand-crafted samplers and tailored to specific experiments (e.g., Rouder et al. 2014; Tillman et al. 2017). To my knowledge, two exceptions exist: (i) a Stan implementation of the LBA (Annis, Miller, and Palmeri 2017), and (ii) a brms function (using Stan) that allows for the estimation of a four-parameter diffusion model (Wabersich and Vandekerckhove 2014). One problem with the Stan implementation of the LBA is that it includes strong assumptions that cannot be relaxed in a straightforward way, such as shared parameters for all the accumulators and drifts drawn from a normal distribution (rather than from a truncated normal distribution). One limitation of the diffusion model, in comparison with the LBA, is that it can only fit tasks with two responses. In addition, in my experience, both implementations can show convergence problems in realistic datasets.
The motivation for fitting the LBA (S. D. Brown and Heathcote 2008) with drifts drawn from a truncated normal distribution restricted to positive values (Heathcote and Love 2012) is that, while it is a greatly simplified instance of sequential sampling, it can still account for response time distribution shapes and speed–accuracy tradeoffs, and it has analytic solutions for the predicted distributions and probabilities. In contrast to Ratcliff’s drift diffusion model, the LBA can be fit to any number of responses, and it typically requires fewer parameters.
The LBA model represents a choice between a number of alternatives, \(N\), using \(N\) different evidence accumulators, one for each response. In the case of the Stroop task previously presented, for example, there would be three accumulators, one for the choice of “red”, one for “blue”, and one for “green”. Each accumulator begins each trial, \(i\), with a starting amount of evidence \(p_{i,\{red,blue,green\}}\) that increases at a speed given by its “drift rate”, \(v_{i,\{red,blue,green\}}\). The first accumulator to reach its threshold of evidence, \(b_{i,\{red,blue,green\}}\), determines the response given, and the time taken to reach the threshold determines the response time, together with some extra time for non-decision processes, \(t_0\).
Two sources of between-trial randomness make the process non-deterministic:
\[\begin{align} p_{i,\{\text{red},\text{blue},\text{green}\}} &\sim Uniform(0, A_{i,\{\text{red},\text{blue},\text{green}\}}) \label{eq:k} \\ v_{i,\{\text{red},\text{blue},\text{green}\}} &\sim Normal_+(\nu_{i,\{\text{red},\text{blue},\text{green}\}}, s_{i,\{\text{red},\text{blue},\text{green}\}}) \label{eq:v} \end{align}\]The variant of the LBA presented here assumes that drifts are drawn from a truncated normal distribution with mean \(\nu\) (the Greek letter nu) and standard deviation \(s\). The time to reach the threshold is the distance to be traveled, \(b-p\), divided by the rate of travel, \(v\). The response time is determined by the time of the fastest accumulator together with the extra time \(t_0\). The main motivation for sampling the drifts from a truncated normal distribution is to obtain decision times that are always positive, and to keep mathematical tractability. (Other variants of the LBA that assume Gamma, Lognormal, and Fréchet distributions also exist (Terry et al. 2015), but their cumulative distribution and probability density functions are much more complex.)
Figure 4. The figure depicts the parameters for one linear ballistic accumulator.
It is easy to see how the LBA works by implementing a function that generates outputs from a race between accumulators (rLBA). We use this function in section 2.3.1 to generate synthetic data.
# Load extraDistr for the truncated normal distribution (rtnorm):
library(extraDistr)
rLBA <- function(n, nu, A, b, s, t_0){
  # Number of accumulators
  ncols <- c(ncol(nu), ncol(A), ncol(b), ncol(s), ncol(t_0))
  N_acc <- ifelse(!is.null(ncols), max(ncols),
                  # In case there are no matrices
                  max(length(nu), length(A), length(b),
                      length(s), length(t_0)))
  # Recycle each parameter into an n x N_acc matrix:
  validate <- function(par){
    if(is.matrix(par)) {
      # nothing to do
    } else if(length(par) == 1) {
      par <- matrix(rep(par, n * N_acc), ncol = N_acc, byrow = TRUE)
    } else if(length(par) == N_acc) {
      par <- matrix(rep(par, n), ncol = N_acc, byrow = TRUE)
    } else {
      stop("Error wrong size of a parameter")
    }
    return(par)
  }
  nu <- validate(nu)
  A <- validate(A)
  b <- validate(b)
  s <- validate(s)
  t_0 <- validate(t_0)
  # Matrix of drift rates v, drawn from a truncated normal:
  v <- rtnorm(n * N_acc, nu, s, a = 0) %>%
    matrix(nrow = n, ncol = N_acc)
  # Matrix of initial states of evidence:
  p <- runif(n * N_acc, 0, A) %>%
    matrix(nrow = n, ncol = N_acc)
  # Distance to threshold:
  d <- b - p
  # Matrix of times taken to reach the threshold
  # for each accumulator:
  rt_acc <- d / v + t_0
  # Time taken by the fastest accumulator in each trial:
  RT <- apply(rt_acc, 1, min)
  # The fastest accumulator in each trial determines the response:
  response <- apply(rt_acc, 1, which.min)
  out <- tibble(RT = RT, response = response)
  return(out)
}
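As a quick sanity check of rLBA (a sketch; the parameter values here are arbitrary and only for illustration), we can simulate a handful of trials of a race between two accumulators, where the first has the higher mean drift and should therefore win most of the time:
# Two accumulators; the first has the higher mean drift:
rLBA(5, nu = c(3, 1), A = 1, b = 1.5, s = 1, t_0 = 0.2)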
S. D. Brown and Heathcote (2008) and Heathcote and Love (2012) show that the cumulative distribution function, \(F\), and probability density function, \(f\), for the finishing time, \(t\), of accumulator \(j\), when the drifts are drawn from a truncated normal distribution, are given by the following equations. While all the parameters can in principle differ for each accumulator and each trial, one parameter needs to be fixed. In addition, the design of the experiment and some assumptions reduce the total number of parameters; see sections 2.3.2 and 2.3.3 for details.
\[\begin{align} F(t) &= \frac{1}{\Phi\left(\nu / s \right)} \left[1 + \frac{b - A - t \nu}{A} \Phi\left(\frac{b - A - t \nu}{t s}\right) - \frac{b - t \nu}{A} \Phi\left(\frac{b - t \nu}{t s}\right) + \frac{t s}{A} \phi\left(\frac{b - A - t \nu}{t s}\right) - \frac{t s}{A} \phi\left(\frac{b - t \nu}{t s}\right)\right] \label{eq:Fj} \\ f(t) &= \frac{1}{A \Phi(\nu/s)} \left[ -\nu\Phi\left(\frac{b-A-t\nu}{t s}\right)+ s\phi\left(\frac{b-A-t\nu}{t s}\right)+ \nu\Phi\left(\frac{b - t \nu}{t s}\right)- s \phi\left(\frac{b - t \nu}{t s}\right) \right] \label{eq:fj} \end{align}\]where \(\Phi\) and \(\phi\) are the cumulative distribution function and the probability density function of the standard normal distribution, respectively.
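To make the equations concrete, the following is a direct R transcription of \(F\) and \(f\) (a sketch for checking the math numerically; pLBA_acc and dLBA_acc are names I introduce here, and pnorm and dnorm play the roles of \(\Phi\) and \(\phi\)):
pLBA_acc <- function(t, A, b, nu, s) {
  bAt <- (b - A - t * nu) / (t * s)
  bt  <- (b - t * nu) / (t * s)
  (1 + ((b - A - t * nu) / A) * pnorm(bAt) -
     ((b - t * nu) / A) * pnorm(bt) +
     ((t * s) / A) * dnorm(bAt) -
     ((t * s) / A) * dnorm(bt)) / pnorm(nu / s)
}
dLBA_acc <- function(t, A, b, nu, s) {
  bAt <- (b - A - t * nu) / (t * s)
  bt  <- (b - t * nu) / (t * s)
  (-nu * pnorm(bAt) + s * dnorm(bAt) +
     nu * pnorm(bt) - s * dnorm(bt)) / (A * pnorm(nu / s))
}
# Since the truncated drifts are always positive, each accumulator
# finishes eventually, and the density should integrate to one:
integrate(dLBA_acc, 0, Inf, A = 1, b = 1.5, nu = 3, s = 1)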
In order to calculate the likelihood of a given choice, \(c\), in a given response time, \(t\), we need the probability density function for the accumulator \(c\) being the first to reach the threshold in a certain time. This is given by:
\[\begin{equation} PDF_c(t) = f_c(t)\prod_{c\neq j} (1 - F_j(t)) \label{eq:PDFc} \end{equation}\]The main challenge in the Stan implementation of the log-transformed cumulative distribution function, \(F\), and probability density function, \(f\), given by equations \eqref{eq:Fj} and \eqref{eq:fj}, is finite-precision arithmetic: rounding errors and potential over- and underflows. Some of the terms of the functions include divisions by \(A\), which might make those terms very large or overflow during sampling; other terms include multiplications by \(\phi\) or \(\Phi\), which might put the terms on very different scales. In addition, \(\Phi\) will underflow to \(0\) for values below \(-37.5\) and overflow to \(1\) for values above \(8.25\). While choosing the starting values of the sampler carefully might alleviate these problems, it remains an issue for hierarchical implementations where each parameter is free to vary by condition, accumulator, and participant.
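The underflow problem can be reproduced directly in R (a minimal illustration; R's pnorm stands in for \(\Phi\)):
pnorm(-40)               # underflows to 0 on the raw scale
pnorm(-40, log.p = TRUE) # still finite on the log scale (about -804.6)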
First, I show the approximations used to ameliorate these problems, and then I show the implementation of the log-transformed \(F\), \(f\), and \(1-F\) that I used to calculate the likelihood, \(PDF_c\) (as given in Equation \eqref{eq:PDFc}).
Instead of \(\Phi\), it is possible to use an approximation in log-scale, nlcdf_approx(), which is more robust in the tails (Bowling et al. 2009). For convenience, I also define a log-transformed \(\phi\), nlpdf().
real nlcdf_approx(real x) {
  return log_inv_logit(0.07056 * x^3 + 1.5976 * x);
}
real nlpdf(real x) {
  return normal_lpdf(x | 0, 1);
}
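As a quick check of the approximation (a sketch in R; note that Stan's log_inv_logit corresponds to plogis(..., log.p = TRUE), and nlcdf_approx_r is my name for the R counterpart):
nlcdf_approx_r <- function(x) plogis(0.07056 * x^3 + 1.5976 * x, log.p = TRUE)
x <- c(-5, -2, 0, 2, 5)
# Compare the approximation with the exact Phi on the log scale:
cbind(approx = nlcdf_approx_r(x), exact = pnorm(x, log.p = TRUE))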
Given that the terms of \(F\) and \(f\) are either summed or subtracted depending on the values of the parameters (\(\nu\) can be positive, negative, or zero, and while \(b - A\) must be strictly positive, \(b - A - t \nu\) can have any sign or be zero), the use of the Stan built-in functions log_sum_exp and log_diff_exp is problematic. My solution was to implement log_calc_exp as a stable computation of the logarithm of sums and differences of exponents, and to apply it to \(F\) and \(f\) (using a similar logic to log_sum_exp and log_diff_exp). The following listing shows the function, which takes an array of “terms”, where each term is in turn an array whose first position indicates whether the exponent in the second position should be added or subtracted (or ignored if zero):
real log_calc_exp(real[,] terms) {
  // Each term in the calculation is represented by an array of length two, t:
  // t[1] > 0 indicates that exp(t[2]) should be added
  // t[1] < 0 indicates that exp(t[2]) should be subtracted
  // t[1] == 0 indicates that exp(t[2]) should be ignored
  real out;
  real c = max(terms[, 2]);
  int nterms = dims(terms)[1];
  real expterms[nterms];
  if(dims(terms)[2] > 2)
    print("The extra dimensions in log_calc_exp will be ignored.");
  for(t in 1:nterms){
    real sign = terms[t, 1];
    if(sign == 0)
      expterms[t] = 0;
    else if(sign > 0)
      expterms[t] = exp(terms[t, 2] - c);
    else if(sign < 0)
      expterms[t] = -exp(terms[t, 2] - c);
  }
  out = c + log(sum(expterms));
  return out;
}
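To see why this helps, here is an R counterpart of log_calc_exp (a sketch; log_calc_exp_r is my name for it) applied to a computation that would overflow if done naively:
log_calc_exp_r <- function(terms) {
  # terms: a matrix with signs in column 1 and exponents in column 2
  c_max <- max(terms[, 2])
  c_max + log(sum(sign(terms[, 1]) * exp(terms[, 2] - c_max)))
}
# log(exp(800) - exp(799)) overflows if computed naively, but:
log_calc_exp_r(rbind(c(1, 800), c(-1, 799)))  # = 799 + log(exp(1) - 1)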
The log-likelihood function, \(PDF_c\), is based on the log of \(f\), the log of \(F\), and the log of \(1-F\), which I show in the following listings.
The log-probability density function of accumulator \(j\) for \(t\), given the parameters \(A\), \(b\), \(\nu\), and \(s\), is defined in a (relatively) straightforward way following Equation \eqref{eq:fj} and using log_calc_exp:
real tLBA_lpdf(real t, real A, real b, real nu, real s){
  real bAt = (b - A - t * nu)/(t * s);
  real bt = (b - t * nu)/(t * s);
  real prod_term = - log(A) - nlcdf_approx(nu/s);
  real terms[4, 2] = {{nu, log(fabs(nu)) + nlcdf_approx(bt)},
                      {1, log(s) + nlpdf(bAt)},
                      {-nu, log(fabs(nu)) + nlcdf_approx(bAt)},
                      {-1, log(s) + nlpdf(bt)}};
  real lpdf = log_calc_exp(terms) + prod_term;
  if(is_nan(lpdf)) reject("Negative PDF!!");
  return lpdf;
}
For the implementation of the log of the cumulative distribution function of accumulator \(j\) for \(t\), given the parameters \(A\), \(b\), \(\nu\), and \(s\), I also need to take care of approximation errors that might bring the function to values that are not permitted: the cumulative distribution function is bounded between 0 and 1, and thus its log is bounded between -Inf and 0:
real tLBA_lcdf(real t, real A, real b, real nu, real s){
  real bmte = b - t * nu;
  real bmAte = bmte - A;
  real bt = bmte/(t * s);
  real bAt = bmAte/(t * s);
  real logs = log(s);
  real logt = log(t);
  real logA = log(A);
  real prod_term = -nlcdf_approx(nu/s);
  real lcdf;
  real terms[4, 2] = {{-bmAte, log(fabs(bmAte)) + nlcdf_approx(bAt)},
                      {-1, logt + logs + nlpdf(bAt)},
                      {bmte, log(fabs(bmte)) + nlcdf_approx(bt)},
                      {1, logt + logs + nlpdf(bt)}};
  real calc = log_calc_exp(terms) - logA;
  // exp(calc) should be between 0 and 1, so that
  // lcdf = log1m_exp(calc) + prod_term is <= 0.
  if(is_nan(calc))
    // This happens when calc = log(x) with x < 0,
    // because of approximation errors; in this case
    // lcdf can be set to 0.
    lcdf = 0;
  else if(calc >= 0)
    // If exp(calc) is larger than 1 (also because of approximation
    // errors, since log1m_exp(x) needs x < 0), then the CDF should
    // be 0 and lcdf is -Inf.
    lcdf = negative_infinity();
  else
    lcdf = log1m_exp(calc) + prod_term;
  // lcdf can't be larger than zero:
  lcdf = lcdf > 0 ? 0 : lcdf;
  return lcdf;
}
The log of the complementary cumulative distribution function of accumulator \(j\) for \(t\), given the parameters \(A\), \(b\), \(\nu\), and \(s\), is implemented as follows:
real tLBA_lccdf(real t, real A, real b, real nu, real s){
  real lcdf = tLBA_lcdf(t | A, b, nu, s);
  real lccdf;
  if(lcdf == 0) {
    lccdf = negative_infinity();
  } else if(lcdf == negative_infinity()) {
    lccdf = 0;
  } else {
    lccdf = log1m_exp(lcdf);
  }
  if(is_nan(lccdf))
    reject("lccdf is NaN, with parameters t:", t, ", A:", A, ", b:", b,
           ", nu:", nu, ", s:", s);
  return(lccdf);
}
Finally, in order to calculate the likelihood of the parameters for a given choice, \(c\) (response), at a given response time, \(t\), I implemented the log-transformed probability density function for the accumulator \(c\) being the first to reach the threshold in a certain time, \(PDF_c\), following Equation \eqref{eq:PDFc}. This implementation takes into account the non-decision time \(t_0\), which also needs to be estimated:
real tLBA(int response, real RT, vector A, vector b,
          vector nu, vector s, real t_0){
  real log_lik = 0;
  int N_acc = num_elements(nu);
  for(c in 1:N_acc){
    // Warnings:
    if (s[c] <= 0)
      reject("s[", c, "] <= 0; found s[", c, "] = ", s[c]);
    if (A[c] <= 0)
      reject("A[", c, "] <= 0; found A[", c, "] = ", A[c]);
    if (b[c] <= A[c])
      reject("b[", c, "] <= A[", c, "]; found A[", c, "] = ", A[c],
             ", b[", c, "] = ", b[c]);
    if (RT <= t_0)
      reject("RT <= t_0; found RT = ", RT, ", t_0 = ", t_0);
    if (t_0 <= 0)
      reject("t_0 <= 0; found t_0 = ", t_0);
    if (RT <= 0)
      reject("RT <= 0; found RT = ", RT);
    // end of warnings
    if(c == response)
      log_lik = log_lik + tLBA_lpdf(RT - t_0 | A[c], b[c], nu[c], s[c]);
    else
      log_lik = log_lik + tLBA_lccdf(RT - t_0 | A[c], b[c], nu[c], s[c]);
  }
  return(log_lik);
}
I test this implementation by fitting the synthetic data of a single subject who performs 300 trials of the aforementioned Stroop task. There are several theories used to explain the Stroop effect (see MacLeod 1991 for a review); for simplicity and ease of exposition, I will assume that there is a drift associated with the accumulation of evidence for the color, \(\beta_{color}\), and one associated with the accumulation of evidence for the identity of the written word, \(\beta_{word}\). The color-drift, \(\beta_{color}\), represents the task-relevant accumulation of evidence, and the word-drift, \(\beta_{word}\), represents the task-irrelevant accumulation of evidence. Thus a larger value of the latter indicates lower cognitive control, which will lead to stronger Stroop effects (i.e., larger differences between congruent and incongruent conditions). This entails that in congruent trials both drifts act in an additive way on the same accumulator, while in incongruent trials the two drifts are assigned to different accumulators. (It is also possible to have a more complex implementation where there is a different color-drift for each one of the three colors, and a different word-drift for each one of the three presented words.)
When the word red is presented in color red (congruent trial), for example, the average drift for each accumulator would be the following:
\[\begin{equation} \begin{aligned} \nu_{red} &= \beta_{color} + \beta_{word}\\ \nu_{blue} &= 0\\ \nu_{green} &= 0 \end{aligned} \end{equation}\]When the word red is presented in color green (incongruent trial), for example, the average drift for each accumulator would be the following:
\[\begin{equation} \begin{aligned} \nu_{red} &= \beta_{word}\\ \nu_{blue} &= 0\\ \nu_{green} &= \beta_{color} \end{aligned} \end{equation}\]The following listing shows the code for generating the fake data of one participant:
# Number of trials
ns <- 300
single_subj <- tibble(color = rep(c("red", "blue", "green"), ns / 3),
                      word = rep(c("red", "blue", "green"), each = ns / 3)) %>%
  mutate(condition = if_else(color == word, "Congruent", "Incongruent"),
         cond = if_else(color == word, -1, 1),
         ncolor = as.numeric(as.factor(color)),
         nword = as.numeric(as.factor(word)))
# One accumulator for each response
N_acc <- 3
# True parameters
t_0 <- 0.3
A <- c(1.1, 0.9, 1.3)
b <- c(1.3, 1.25, 1.55)
s <- 1
# Drifts associated with evidence for color and word:
beta_color <- 4
beta_word <- 1.5
beta <- c(beta_color, beta_word)
# Matrix of drifts for each accumulator for each trial
nu <- matrix(NA, ncol = N_acc, nrow = nrow(single_subj))
for(a in 1:N_acc){
  nu[, a] <- beta_color * (single_subj$ncolor == a) +
    beta_word * (single_subj$nword == a)
}
single_subj %<>% bind_cols(rLBA(nrow(single_subj), nu, A, b, s, t_0)) %>%
  mutate(correct = ncolor == response)
single_subj
## # A tibble: 300 x 9
## color word condition cond ncolor nword RT response correct
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <lgl>
## 1 red red Congruent -1. 3. 3. 0.406 3 TRUE
## 2 blue red Incongruent 1. 1. 3. 0.447 3 FALSE
## 3 green red Incongruent 1. 2. 3. 0.524 2 TRUE
## 4 red red Congruent -1. 3. 3. 0.431 3 TRUE
## 5 blue red Incongruent 1. 1. 3. 0.363 1 TRUE
## 6 green red Incongruent 1. 2. 3. 0.469 2 TRUE
## 7 red red Congruent -1. 3. 3. 0.396 3 TRUE
## 8 blue red Incongruent 1. 1. 3. 0.359 1 TRUE
## 9 green red Incongruent 1. 2. 3. 0.656 2 TRUE
## 10 red red Congruent -1. 3. 3. 0.497 3 TRUE
## # ... with 290 more rows
# Proportion of correct responses
mean(single_subj$correct)
## [1] 0.8733333
# Average RT in seconds
mean(single_subj$RT)
## [1] 0.4787673
(meanRTs <- single_subj %>% filter(correct == TRUE) %>%
   group_by(condition) %>%
   summarize(mean = mean(RT) * 1000))
## # A tibble: 2 x 2
## condition mean
## <chr> <dbl>
## 1 Congruent 448.
## 2 Incongruent 490.
single_subj %>% filter(correct == TRUE) %>% mutate(RT = RT * 1000) %>%
  ggplot(aes(RT, fill = condition)) +
  geom_density(alpha = .5, bw = 100) +
  geom_vline(data = meanRTs, aes(xintercept = mean, linetype = condition)) +
  xlab("Response time (ms)")
Figure 5. The figure shows the response times for the correct responses of the synthetic data of one participant in the Stroop task.
In order to implement the full LBA model for the data of one participant, there are some decisions that need to be made.
While the parameter \(s\) (see Equation \eqref{eq:v}) could in principle vary between accumulators (or conditions), I set it to \(1\) for all accumulators, as is generally done. At least one parameter must be fixed to satisfy a scaling property, which allows the model to yield unique estimates of the remaining parameters.
Even though it is fairly common to assign one accumulator to correct responses and one to all the incorrect ones, this ignores the possibility that participants may favor one particular response by requiring less evidence for it (i.e., a response bias). Here, I take this possibility into account by assuming three (potentially) different accumulators, so that each one represents the response to a color, with its own parameters, \(b_j\) and \(A_j\) (with \(j=\{1,2,3\}\)). Following Ratcliff (1978) (among others), however, I assume that changing the response bias is a relatively slow process, and so the threshold parameters (\(b\) and \(A\)) of the accumulators do not depend on the stimulus presented for the current decision.
The model is re-parameterized using auxiliary parameters \(\boldsymbol{k}\) rather than \(\boldsymbol{b}\), so that \(\boldsymbol{b} = \boldsymbol{A} + \boldsymbol{k}\). The reason for this is that the dependency between \(\boldsymbol{b}\) and \(\boldsymbol{A}\) (\(b_j > A_j\)) makes the sampling more difficult. In addition, each \(k_j\) and \(A_j\) is parameterized so that the model does not ignore the common information among accumulators. This achieves a middle ground between completely independent parameters for the accumulators and identical parameters across the accumulators. This is done by having baseline parameters common to all the accumulators (\(A_{baseline}\) and \(k_{baseline}\)) and different adjustments (\(\boldsymbol{A_{diff}}\) and \(\boldsymbol{k_{diff}}\)) in the following way:
\[\begin{equation} \begin{aligned} \boldsymbol{A} &= A_{baseline} + \boldsymbol{H} \cdot \boldsymbol{A_{diff}}\\ \boldsymbol{k} &= k_{baseline} + \boldsymbol{H} \cdot \boldsymbol{k_{diff}} \end{aligned} \end{equation}\]where \(\boldsymbol{A}\) and \(\boldsymbol{k}\) are vectors with as many rows as there are accumulators (three in this case), and \(\boldsymbol{A_{diff}}\) and \(\boldsymbol{k_{diff}}\) are vectors with one row fewer than the number of accumulators (two in this case). \(A_{baseline}\) and \(k_{baseline}\) are constrained to ensure that \(\boldsymbol{A}\) and \(\boldsymbol{k}\) are strictly positive. \(\boldsymbol{H}\) is a Helmert contrast matrix for as many levels as there are accumulators. Since the Helmert matrix is centered and orthogonal, the adjustments and the baseline parameters should be uncorrelated, making the sampling process easier. For this particular case with three accumulators, the Helmert matrix is the following:
\[\begin{equation} \begin{aligned} \boldsymbol{H} = \begin{bmatrix} -1 & -1 \\ 1 & -1 \\ 0 & 2 \end{bmatrix} \end{aligned} \end{equation}\](An R sketch of this construction is shown below, after the priors.) While the parameters associated with the accumulators are relatively independent, the samples of the non-decision time \(t_0\) tend to be correlated with those of \(k_{baseline}\), causing problems in the sampling. This is because the minimum time it takes to make a decision can be increased (or reduced) by increasing (or reducing) either the threshold, \(b\), or the non-decision time, \(t_0\). This problem is partially solved with an informative prior on \(t_0\). Since \(t_0\) represents peripheral aspects of processing, such as encoding or motor execution (Rouder 2005), we can estimate that it will take at least 150 ms and is unlikely to last more than 300 ms. This information can be encoded in the model with a prior such as \(LogNormal(-1.7, .35)\). The priors for the parameters are detailed below:
\[\begin{equation} \begin{aligned} A_{baseline} &\sim Normal(0, 2) \text{ with } A_{baseline} > -min(\boldsymbol{H} \cdot \boldsymbol{A_{diff}})\\ A_{diff} &\sim Normal(0, .5)\\ k_{baseline} &\sim Normal(0, 2) \text{ with } k_{baseline} > -min(\boldsymbol{H} \cdot \boldsymbol{k_{diff}})\\ k_{diff} &\sim Normal(0, .5)\\ t_0 &\sim LogNormal(-1.7, .5)\\ \beta &\sim Normal(0, 10) \end{aligned} \end{equation}\]The complete Stan code, tLBA_s.stan, following the previous specifications can be found here.
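In R, the Helmert matrix above is exactly what contr.helmert(3) returns, so the construction of \(\boldsymbol{A}\) can be sketched as follows (the values of A_baseline and A_diff here are hypothetical, chosen only to illustrate the construction):
H <- contr.helmert(3)
A_baseline <- 1
A_diff <- c(0.1, -0.05)
# One entry per accumulator, uncorrelated baseline and adjustments:
A_acc <- c(A_baseline + H %*% A_diff)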
With 1000 iterations, the model as specified below takes approximately 40 minutes to finish. While there is no need to set initial values for the sampling, the model may reject some of the randomly generated initial values (issuing a warning message). This is because some combinations of unrealistic parameter values can lead to negative values of the probability density function due to approximation errors and over/underflows of terms inside the functions. This happens even when restricting the interval used for generating random initial values (init_r = .5). Restricting this interval is necessary, however, to help the model find suitable initial values and to avoid discontinuities produced by log_calc_exp (which lead to divergent transitions). After verifying that the model converged (e.g., no divergent transitions or warnings besides some rejections of initial values, all Rhats \(< 1.1\)), it turns out to have recovered the underlying parameters fairly well. The fitted model can be downloaded from here.
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# Make a list of the data:
lss <- as.list(single_subj)
lss$N_obs <- nrow(single_subj)
lss$N_pred <- 2
lss$N_acc <- N_acc
lss$X <- array(NA, c(lss$N_acc, lss$N_obs, lss$N_pred))
# One model matrix for each accumulator, with a column for each of
# drift_color and drift_word; the intercept is removed:
for(a in 1:lss$N_acc){
  lss$X[a, , ] <- model.matrix(~ 1 + I(ncolor == a) + I(nword == a),
                               data = single_subj)[, -1]
}
m_ss <- stan(file = "tLBA_s.stan",
             data = lss, chains = 4, iter = 1000,
             control = list(adapt_delta = .9999, max_treedepth = 13),
             init_r = .5)
print(m_ss, pars = c("A", "b", "beta", "t_0"), probs = c(.025, .975))
## Inference for Stan model: tLBA_s.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 97.5% n_eff Rhat
## A[1] 1.15 0.01 0.17 0.85 1.50 733 1.01
## A[2] 0.87 0.01 0.17 0.56 1.22 702 1.00
## A[3] 1.34 0.01 0.20 0.98 1.73 754 1.01
## b[1] 1.38 0.01 0.16 1.09 1.70 786 1.01
## b[2] 1.24 0.01 0.16 0.96 1.57 774 1.00
## b[3] 1.61 0.01 0.19 1.28 2.00 807 1.01
## beta[1] 3.92 0.01 0.38 3.24 4.70 712 1.01
## beta[2] 1.24 0.01 0.24 0.82 1.70 846 1.01
## t_0 0.29 0.00 0.01 0.27 0.31 854 1.00
##
## Samples were drawn using NUTS(diag_e) at Sat Apr 14 15:59:22 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Below I show the scaled difference (the difference divided by the size of the generating value) between the posterior means of the parameters of the model and the values I generated for those parameters. (The code producing the graph is adapted from Daniel C. Furr’s case study and is available in the Rmd file.)
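The core of that computation can be sketched as follows (a sketch only; it assumes the generating values defined in the previous section, and the full plotting code is in the Rmd file):
post <- rstan::extract(m_ss, pars = c("A", "b", "beta", "t_0"))
gen <- list(A = A, b = b, beta = beta, t_0 = t_0)
# Scaled difference: (posterior mean - generating value) / generating value
scaled_diff <- lapply(names(gen), function(p)
  (colMeans(as.matrix(post[[p]])) - gen[[p]]) / gen[[p]])
names(scaled_diff) <- names(gen)
scaled_diff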
Figure 6. Scaled discrepancies between the estimated and generating parameters of the model. Points indicate the scaled difference between the posterior means and generating values for a parameter; the thin and bold horizontal lines indicate 95% and 50% posterior intervals for the difference, respectively.
In order to answer the original question, that is, how to assess the cognitive control of the first 25 participants, I fit a hierarchical version of the previous model. The individual differences between participants are accounted for by including adjustments to the drift rates, \(r_{\beta_{color},subj}\) and \(r_{\beta_{word},subj}\), so that each participant acquires evidence for the color or for the word in a slightly different way. The differences in cognitive control in this model are represented by the adjustment to the task-irrelevant drift, \(r_{\beta_{word},subj}\).
For each trial \(i\), for each accumulator \(j\), and each subject \(subj\):
\[\begin{equation} \nu_{j,i} = x_{color,j,i} \cdot (\beta_{color} + r_{\beta_{color},subj[i]}) + x_{word,j,i} \cdot (\beta_{word} + r_{\beta_{word},subj[i]}) \end{equation}\]where \(x_{color,j,i}\) is \(1\) if the color matches accumulator \(j\) and \(0\) otherwise, and \(x_{word,j,i}\) is \(1\) if the word matches accumulator \(j\) and \(0\) otherwise. The adjustments were assumed to be correlated, reflecting the possibility that faster answers for a given subject might be due to both a higher drift for color and a higher drift for word. (The Stan code uses a non-centered parametrization; Papaspiliopoulos, Roberts, and Sköld 2007.)
\[\begin{equation} \mathbf{r_{\beta}} \sim \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_{r_{\beta}}) \end{equation}\]where \(\boldsymbol\Sigma_{r_{\beta}}\) is built using
\[\begin{equation} \begin{aligned} \boldsymbol{\sigma_{r_\beta}} &\sim Normal_+(0,1) \\ \boldsymbol{\rho_{r_\beta}} &\sim lkj\_corr(2.0) \end{aligned} \end{equation}\]I accounted for possible response biases of the participants by including adjustments to the parameters of the accumulators. These adjustments reflect the possibility that subjects may have different thresholds or different amounts of noise in the decision for different colors. This is achieved by scaling the parameters \(A\) and \(k\) differently for each participant (while ensuring that they remain strictly positive).
\[\begin{align} A_{j,subj} = A_{j} \cdot \exp(r_{A_j,subj})\\ k_{j,subj} = k_{j} \cdot \exp(r_{k_j,subj}) \end{align}\]The adjustments are assumed to be correlated, reflecting the possibility that the accumulators might behave similarly for a given subject. This is defined by the following prior:
\[\begin{equation} \mathbf{r_{A,k}} \sim \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_{r_{A,k}}) \end{equation}\] where \(\boldsymbol\Sigma_{r_{A,k}}\) is built using \[\begin{equation} \begin{aligned} \boldsymbol{\sigma_{r_{A,k}}} &\sim Normal_+(0,1) \\ \boldsymbol{\rho_{r_{A,k}}} &\sim lkj\_corr(2.0) \end{aligned} \end{equation}\] I included a by-participant adjustment to \(t_{0,subj}\) (in a similar way to Nicenboim and Vasishth 2018): \[\begin{equation} t_{0,subj} = \exp(\psi_0 + r_{\psi,subj}) \end{equation}\]with
\[\begin{equation} \mathbf{r_{\psi}} \sim \mathcal{N}(\mathbf{0},\, \boldsymbol\sigma_{r_\psi}) \end{equation}\]The exponential function ensures that \(t_{0,subj}\) is greater than zero. A constraint on the upper bound of the prior distribution of \(\psi_0\) ensures that the non-decision time of each participant is smaller than that participant's minimum response time.
\[\begin{align} \psi_0 <& \min_i \left(\log(RT_{i}) - r_{\psi,subj[i]}\right)\\ \psi_0 \sim& Normal(-1.7, .5) \end{align}\]The complete Stan model can be found here.
To fit the model, I recode the color of the word, the word, and the response as ncolor, nword, and response respectively, with values 1 for “blue”, 2 for “green”, and 3 for “red”.
dstroop_25 %<>%
  mutate(ncolor = as.numeric(as.factor(color)),
         nword = as.numeric(as.factor(word)),
         response = as.numeric(as.factor(trial_response)))
# Sanity check
dstroop_25 %>% group_by(subject) %>%
  summarize(min(RT), mean(RT), max(RT)) %>% print(n = 25)
## # A tibble: 25 x 4
## subject `min(RT)` `mean(RT)` `max(RT)`
## <dbl> <dbl> <dbl> <dbl>
## 1 1. 413. 745. 1484.
## 2 2. 402. 712. 2130.
## 3 3. 276. 512. 1636.
## 4 4. 378. 596. 1334.
## 5 5. 394. 622. 2178.
## 6 6. 338. 593. 1919.
## 7 7. 384. 790. 8936.
## 8 8. 350. 580. 1507.
## 9 9. 368. 713. 2453.
## 10 10. 360. 524. 1320.
## 11 11. 345. 539. 1147.
## 12 12. 365. 609. 1895.
## 13 13. 349. 516. 1161.
## 14 14. 295. 522. 1739.
## 15 15. 329. 579. 1663.
## 16 16. 377. 617. 1324.
## 17 17. 1. 549. 1688.
## 18 18. 372. 748. 2021.
## 19 19. 336. 582. 1799.
## 20 20. 287. 515. 2282.
## 21 21. 349. 572. 1424.
## 22 22. 340. 485. 1178.
## 23 23. 373. 671. 1511.
## 24 24. 368. 707. 2151.
## 25 25. 343. 598. 1728.
I remove the observation with a response time of 1 ms (since it is likely an error in the dataset or a delayed response from the previous trial) and the observation of nearly 9 seconds.
# Exclude the RT of 1 ms and the RT of nearly 9 seconds:
dstroop_25 <- filter(dstroop_25, RT > 1, RT < 8000)
lstroop_25 <- as.list(dstroop_25)
lstroop_25$N_obs <- nrow(dstroop_25)
lstroop_25$N_pred <- 2
lstroop_25$N_acc <- 3 # 3 colors
lstroop_25$RT <- dstroop_25$RT / 1000
lstroop_25$J_subj <- dstroop_25$subject
lstroop_25$N_subj <- max(dstroop_25$subject)
lstroop_25$X <- array(NA,
                      c(lstroop_25$N_acc, lstroop_25$N_obs, lstroop_25$N_pred))
# One model matrix for each accumulator:
for(a in 1:lstroop_25$N_acc){
  lstroop_25$X[a, , ] <- model.matrix(~ 1 + I(ncolor == a) + I(nword == a),
                                      data = dstroop_25)[, -1]
}
The model as specified below, with 6 chains and only 500 iterations, shows signs of convergence (e.g., no divergent transitions or warnings besides some rejections of initial values, all Rhats \(< 1.1\)). However, fitting the model to only 25 subjects took approximately 12 hours with 6 cores on a server. The fitted model can be downloaded from here.
m_h <- stan(file = "tLBA_h.stan",
            data = lstroop_25, chains = 6, iter = 500,
            control = list(adapt_delta = .9999, max_treedepth = 13),
            init_r = .5)
The following listing shows the results for the population-level parameters (or fixed effects):
print(m_h, pars = c("A", "b", "beta", "t_0"), probs = c(.025, .975))
## Inference for Stan model: tLBA_h.
## 6 chains, each with iter=500; warmup=250; thin=1;
## post-warmup draws per chain=250, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 97.5% n_eff Rhat
## A[1] 1.06 0 0.09 0.88 1.24 1500 1
## A[2] 0.86 0 0.11 0.64 1.07 927 1
## A[3] 1.09 0 0.11 0.88 1.28 1500 1
## b[1] 1.79 0 0.09 1.62 1.96 1500 1
## b[2] 1.75 0 0.09 1.57 1.94 1500 1
## b[3] 1.90 0 0.09 1.73 2.09 1500 1
## beta[1] 3.38 0 0.11 3.16 3.60 1150 1
## beta[2] 0.36 0 0.07 0.22 0.50 1500 1
## t_0 0.17 0 0.01 0.15 0.20 1364 1
##
## Samples were drawn using NUTS(diag_e) at Sat Apr 14 11:46:01 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
It is clear that there are no general response biases (the three parameters \(A\) and the three parameters \(b\) are similar to each other), and, as expected, the color of the word (\(\beta_{color}\), or beta[1] in the listing) increases the rate of accumulation of evidence much more than the word (\(\beta_{word}\), or beta[2]). This is expected given that the participants were asked to respond to the color of the word and the overall accuracy was 96%.
As shown below with posterior predictive checks, the model has in general captured the patterns in the data (see the code in the Rmd file for the implementation of the generation of predicted datasets; a simplified sketch also follows the figure). While the model shows a great deal of uncertainty for the incorrect responses in the congruent condition, this is expected, since there are very few observations in this category.
Figure 7. The red line shows the distribution of the real data and the blue lines show the distributions of some of the simulated datasets. The plots are shown for correct and incorrect responses in congruent and incongruent conditions.
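A minimal sketch of how such predicted datasets can be generated with the rLBA function from before (this simplification ignores the by-subject adjustments and uses only the population-level draws; the full implementation is in the Rmd file):
draws <- rstan::extract(m_h, pars = c("A", "b", "beta", "t_0"))
n_sim <- 50
pred_datasets <- lapply(sample(length(draws$t_0), n_sim), function(d) {
  # Rebuild the trial-by-trial drift matrix for posterior draw d:
  nu_d <- matrix(NA, nrow = lstroop_25$N_obs, ncol = lstroop_25$N_acc)
  for (a in 1:lstroop_25$N_acc) {
    nu_d[, a] <- draws$beta[d, 1] * (dstroop_25$ncolor == a) +
      draws$beta[d, 2] * (dstroop_25$nword == a)
  }
  # Simulate one dataset with s fixed to 1, as in the model:
  rLBA(lstroop_25$N_obs, nu_d, draws$A[d, ], draws$b[d, ], 1, draws$t_0[d])
})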
The following listing shows the results for the group-level parameters (or random effects):
print(m_h, pars = c("sd_nu_subj", "sd_Ak_subj", "sd_psi_subj",
                    "Corr_nu[1,2]", "Corr_Ak[1,3]", "Corr_Ak[1,4]", "Corr_Ak[1,5]",
                    "Corr_Ak[1,6]", "Corr_Ak[2,3]", "Corr_Ak[2,4]", "Corr_Ak[2,5]",
                    "Corr_Ak[2,6]", "Corr_Ak[3,4]", "Corr_Ak[3,5]", "Corr_Ak[3,6]",
                    "Corr_Ak[4,5]", "Corr_Ak[4,6]", "Corr_Ak[5,6]"),
      probs = c(.025, .975))
## Inference for Stan model: tLBA_h.
## 6 chains, each with iter=500; warmup=250; thin=1;
## post-warmup draws per chain=250, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 97.5% n_eff Rhat
## sd_nu_subj[1] 0.35 0.00 0.08 0.20 0.53 630 1.01
## sd_nu_subj[2] 0.14 0.00 0.08 0.01 0.31 732 1.00
## sd_Ak_subj[1] 0.09 0.00 0.07 0.00 0.25 810 1.00
## sd_Ak_subj[2] 0.19 0.01 0.13 0.01 0.48 281 1.02
## sd_Ak_subj[3] 0.14 0.00 0.09 0.01 0.35 626 1.01
## sd_Ak_subj[4] 0.20 0.00 0.06 0.06 0.32 204 1.01
## sd_Ak_subj[5] 0.21 0.00 0.06 0.10 0.35 215 1.02
## sd_Ak_subj[6] 0.20 0.00 0.06 0.07 0.32 237 1.02
## sd_psi_subj 0.16 0.00 0.06 0.05 0.28 295 1.01
## Corr_nu[1,2] -0.34 0.01 0.37 -0.88 0.54 944 1.00
## Corr_Ak[1,3] 0.07 0.01 0.34 -0.61 0.68 1500 1.00
## Corr_Ak[1,4] 0.01 0.01 0.32 -0.59 0.63 488 1.01
## Corr_Ak[1,5] 0.08 0.02 0.32 -0.52 0.68 371 1.02
## Corr_Ak[1,6] 0.11 0.01 0.32 -0.53 0.70 516 1.00
## Corr_Ak[2,3] 0.02 0.01 0.33 -0.59 0.64 1500 1.00
## Corr_Ak[2,4] 0.04 0.02 0.32 -0.57 0.65 183 1.04
## Corr_Ak[2,5] -0.23 0.02 0.33 -0.76 0.51 309 1.01
## Corr_Ak[2,6] 0.20 0.02 0.33 -0.47 0.76 288 1.03
## Corr_Ak[3,4] 0.18 0.01 0.31 -0.48 0.71 445 1.01
## Corr_Ak[3,5] 0.21 0.02 0.31 -0.44 0.74 368 1.02
## Corr_Ak[3,6] -0.05 0.01 0.33 -0.64 0.58 517 1.01
## Corr_Ak[4,5] 0.50 0.01 0.24 -0.08 0.85 394 1.01
## Corr_Ak[4,6] 0.50 0.01 0.23 -0.07 0.83 361 1.01
## Corr_Ak[5,6] 0.35 0.01 0.26 -0.26 0.77 439 1.01
##
## Samples were drawn using NUTS(diag_e) at Sat Apr 14 11:46:01 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The group-level parameter estimates reveal that adding correlations might have been justified (see Corr_Ak[4,5], Corr_Ak[4,6], and Corr_Ak[5,6]). Furthermore, it seems that there are some differences between participants (compare sd_nu_subj[2] with beta[2]), but with a great deal of uncertainty.
Plotting the adjustments for the word-drift, \(r_{\beta_{word}}\), could in principle reveal individual differences and a way to “rank” the participants. However, in this case, as the following plot shows, the low number of trials per participant does not allow us to clearly distinguish between them. It is only evident from the plot that participants 2 and 18 show a higher word-drift, \(r_{\beta_{word}}\), indicating, presumably, that their cognitive control was lower than that of the rest.
library(bayesplot)
posteriors <- as.array(m_h)
mcmc_intervals(posteriors, regex_pars = "r_nu_subj\\[2,",
               prob_outer = .95, prob = .8)
Figure 8. The graph depicts the individual differences in word-drift, \(r_{\beta_{word}}\), which is presumably related to cognitive control.
The contribution of this notebook was to show a Stan implementation of a model of choice, namely, the LBA model with drift rates drawn from a normal distribution restricted to positive values (S. D. Brown and Heathcote 2008; Heathcote and Love 2012). Given that tasks that elicit a choice under time pressure are very common in cognitive science and psychology, models of choice, such as sequential-sampling models, should be particularly useful. However, these models are difficult to implement and fit. Using the Stroop task (Stroop 1935) as a case study, I discussed the challenges in the implementation of a sequential-sampling model, the LBA model. Importantly, some of these challenges might also apply to other models that require complex likelihood functions. Finally, I exemplified how a hierarchical version of the LBA can be used for examining individual differences. I showed that this implementation of the LBA successfully recovers parameter estimates and converges with a relatively small number of iterations, even when assuming a strong hierarchical structure. However, some limitations of the current implementation are that, as datasets get larger, the model struggles to find initial values that can start the sampling process, and the fitting time can reach several days (even for datasets of a moderate size, e.g., 100 subjects doing the Stroop task).
Annis, Jeffrey, Brent J Miller, and Thomas J Palmeri. 2017. “Bayesian Inference with Stan: A Tutorial on Adding Custom Distributions.” Behavior Research Methods 49 (3). Springer: 863–86.
Botvinick, Matthew M, Todd S Braver, Deanna M Barch, Cameron S Carter, and Jonathan D Cohen. 2001. “Conflict Monitoring and Cognitive Control.” Psychological Review 108 (3). US: American Psychological Association: 624.
Bowling, Shannon R, Mohammad T Khasawneh, Sittichai Kaewkuekool, and Byung R Cho. 2009. “A Logistic Approximation to the Cumulative Normal Distribution.” Journal of Industrial Engineering and Management 2 (1).
Brown, Scott D, and Andrew Heathcote. 2008. “The Simplest Complete Model of Choice Response Time: Linear Ballistic Accumulation.” Cognitive Psychology 57 (3). Elsevier: 153–78. doi:10.1016/j.cogpsych.2007.12.002.
Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. doi:10.18637/jss.v080.i01.
Ebersole, Charles R., Olivia E. Atherton, Aimee L. Belanger, Hayley M. Skulborstad, Jill M. Allen, Jonathan B. Banks, Erica Baranski, et al. 2016. “Many Labs 3: Evaluating Participant Pool Quality Across the Academic Semester via Replication.” Journal of Experimental Social Psychology 67: 68–82. doi:https://doi.org/10.1016/j.jesp.2015.10.012.
Eriksen, Barbara A., and Charles W. Eriksen. 1974. “Effects of Noise Letters Upon the Identification of a Target Letter in a Nonsearch Task.” Perception & Psychophysics 16 (1): 143–49. doi:10.3758/BF03203267.
Forstmann, Birte U, Gilles Dutilh, Scott D Brown, Jane Neumann, D Yves Von Cramon, K Richard Ridderinkhof, and Eric-Jan Wagenmakers. 2008. “Striatum and Pre-SMA Facilitate Decision-Making Under Time Pressure.” Proceedings of the National Academy of Sciences 105 (45). National Acad Sciences: 17538–42.
Friedman, Naomi P, Akira Miyake, Susan E Young, John C DeFries, Robin P Corley, and John K Hewitt. 2008. “Individual Differences in Executive Functions Are Almost Entirely Genetic in Origin.” Journal of Experimental Psychology: General 137 (2). American Psychological Association: 201.
Heathcote, Andrew, and Jonathon Love. 2012. “Linear Deterministic Accumulator Models of Simple Choice.” Frontiers in Psychology 3. Frontiers Media SA. doi:10.3389/fpsyg.2012.00292.
Lansbergen, Marieke M, J Leon Kenemans, and Herman Van Engeland. 2007. “Stroop Interference and Attention-Deficit/Hyperactivity Disorder: A Review and Meta-Analysis.” Neuropsychology 21 (2). American Psychological Association: 251.
MacLeod, Colin M. 1991. “Half a Century of Research on the Stroop Effect: An Integrative Review.” Psychological Bulletin 109 (2). American Psychological Association: 163.
Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal of Memory and Language 99: 1–34. doi:10.1016/j.jml.2017.08.004.
Novick, Jared M, John C Trueswell, and Sharon L Thompson-Schill. 2010. “Broca’s Area and Language Processing: Evidence for the Cognitive Control Connection.” Language and Linguistics Compass 4 (10). Wiley Online Library: 906–24.
Papaspiliopoulos, Omiros, Gareth O Roberts, and Martin Sköld. 2007. “A General Framework for the Parametrization of Hierarchical Models.” Statistical Science. JSTOR, 59–73.
Ratcliff, Roger. 1978. “A Theory of Memory Retrieval.” Psychological Review 85 (2). American Psychological Association: 59. doi:10.1037/0033-295X.85.2.59.
———. 1981. “A Theory of Order Relations in Perceptual Matching.” Psychological Review 88 (6). American Psychological Association: 552.
Ratcliff, Roger, Philip L Smith, Scott D Brown, and Gail McKoon. 2016. “Diffusion Decision Model: Current Issues and History.” Trends in Cognitive Sciences 20 (4). Elsevier: 260–81.
Rouder, Jeffrey N. 2005. “Are Unshifted Distributional Models Appropriate for Response Time?” Psychometrika 70 (2). Springer Science + Business Media: 377–81. doi:10.1007/s11336-005-1297-7.
Rouder, Jeffrey N., Jordan M. Province, Richard D. Morey, Pablo Gomez, and Andrew Heathcote. 2014. “The Lognormal Race: A Cognitive-Process Model of Choice and Latency with Desirable Psychometric Properties.” Psychometrika 80 (2). Springer Science + Business Media: 491–513. doi:10.1007/s11336-013-9396-3.
Stan Development Team. 2017. “RStan: The R Interface to Stan, Version 2.17.3.” http://mc-stan.org/rstan.html.
Stone, Mervyn. 1960. “Models for Choice-Reaction Time.” Psychometrika 25 (3). Springer: 251–60.
Stroop, J Ridley. 1935. “Studies of Interference in Serial Verbal Reactions.” Journal of Experimental Psychology 18 (6). Psychological Review Company: 643.
Terry, Andrew, AAJ Marley, Avinash Barnwal, E-J Wagenmakers, Andrew Heathcote, and Scott D Brown. 2015. “Generalising the Drift Rate Distribution for Linear Ballistic Accumulators.” Journal of Mathematical Psychology 68. Elsevier: 49–58.
Tillman, Gabriel, Titia Benders, Scott D Brown, and Don van Ravenzwaaij. 2017. “An Evidence Accumulation Model of Acoustic Cue Weighting in Vowel Perception.” Journal of Phonetics 61. Elsevier: 1–12.
Usher, Marius, and James L McClelland. 2001. “The Time Course of Perceptual Choice: The Leaky, Competing Accumulator Model.” Psychological Review 108 (3). American Psychological Association: 550. doi:10.1037/0033-295X.108.3.550.
Wabersich, Dominik, and Joachim Vandekerckhove. 2014. “The Rwiener Package: An R Package Providing Distribution Functions for the Wiener Diffusion Model.” R Journal 6 (1).
Thanks to Henrik Singmann and Andrew Heathcote for their assistance in reverse engineering the R package rtdists, and thanks to Gabriel Tillman for his helpful advice on the parameters of the LBA. Thanks also to Shravan Vasishth for his comments.