This document contains a Bayesian analysis of the hypotheses investigated by Femmer et al.1.

Data Loading

Start by loading the prepared data.

source("../util/data-loading.R")
d <- load.data()

# print the data to ensure that all variables have the correct type
str(d)
## 'data.frame':    105 obs. of  21 variables:
##  $ PID                  : chr  "P1" "P1" "P1" "P1" ...
##  $ RID                  : chr  "R1" "R2" "R3" "R4" ...
##  $ Age                  : Ord.factor w/ 3 levels "19-24"<"25-30"<..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Program              : Ord.factor w/ 4 levels "Unknown"<"Bachelor"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ REQuizPerformance    : int  9 9 9 9 9 9 9 7 7 7 ...
##  $ ExpProgAca           : Ord.factor w/ 4 levels "no experience"<..: 3 3 3 3 3 3 3 2 2 2 ...
##  $ ExpProgInd           : Ord.factor w/ 4 levels "no experience"<..: 4 4 4 4 4 4 4 1 1 1 ...
##  $ ExpSEAca             : Ord.factor w/ 4 levels "no experience"<..: 4 4 4 4 4 4 4 3 3 3 ...
##  $ ExpSEInd             : Ord.factor w/ 4 levels "no experience"<..: 2 2 2 2 2 2 2 1 1 1 ...
##  $ ExpREAca             : Ord.factor w/ 4 levels "no experience"<..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ExpREInd             : Ord.factor w/ 4 levels "no experience"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ passive              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ actors.expected      : int  1 1 2 0 0 0 1 1 1 2 ...
##  $ actors.found         : int  1 1 2 0 0 0 1 1 1 2 ...
##  $ actors.missing       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ associations.expected: int  2 2 2 1 2 3 2 2 2 2 ...
##  $ associations.found   : int  0 0 0 1 0 2 2 2 2 2 ...
##  $ associations.missing : int  2 2 2 0 2 1 0 0 0 0 ...
##  $ objects.expected     : int  3 3 2 2 3 4 3 3 3 2 ...
##  $ objects.found        : int  1 2 1 2 2 4 2 3 3 2 ...
##  $ objects.missing      : int  2 1 1 0 1 0 1 0 0 0 ...

Bayesian Analysis

We perform the final step of the statistical causal inference framework by Siebert2, i.e., the estimation step, based on the two previous modeling and identification steps as documented in (the causal assumptions notebook)[./causal-assumptions.Rmd].

Formula

First, we define a regression model that predicts the number of missing associations based on a set of predictors identified during the causal assumptions. The most eligible probability distribution of the response variable is a Binomial distribution, since the response variable is a whole number bounded by the expected number of associations.

formula <- (associations.missing | trials(associations.expected) ~ 
              1 + # standard success in the task of finding actors
              passive + # impact of the use of passive voice
              (1|RID) + # impact of the difficulty of the individual requirements
              (1|PID) + # impact of the skill of individual participants
              ExpREAca + # impact of the participants academic experience in RE
              ExpREInd + # impact of the participants industrial experience in RE
              actors.missing + # impact of missing actors
              objects.missing # impact of missing domain objects
            )

get_prior(formula, family=binomial, data=d)
##                 prior     class            coef group resp dpar nlpar lb ub
##                (flat)         b                                            
##                (flat)         b  actors.missing                            
##                (flat)         b      ExpREAca.C                            
##                (flat)         b      ExpREAca.L                            
##                (flat)         b      ExpREAca.Q                            
##                (flat)         b      ExpREInd.C                            
##                (flat)         b      ExpREInd.L                            
##                (flat)         b      ExpREInd.Q                            
##                (flat)         b objects.missing                            
##                (flat)         b     passiveTRUE                            
##  student_t(3, 0, 2.5) Intercept                                            
##  student_t(3, 0, 2.5)        sd                                        0   
##  student_t(3, 0, 2.5)        sd                   PID                  0   
##  student_t(3, 0, 2.5)        sd       Intercept   PID                  0   
##  student_t(3, 0, 2.5)        sd                   RID                  0   
##  student_t(3, 0, 2.5)        sd       Intercept   RID                  0   
##        source
##       default
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)
##       default
##       default
##  (vectorized)
##  (vectorized)
##  (vectorized)
##  (vectorized)

Priors

For each of the selected predictors, we select an uninformative prior distribution that encodes our previous knowledge about the causal phenomenon.

priors <- c(prior(normal(-1, 1), class = Intercept),
            prior(normal(0, 1), class = b),
            prior(weibull(2, 1), class = sd, coef = Intercept, group = RID),
            prior(weibull(2, 1), class = sd, coef = Intercept, group = PID),
            prior(exponential(1), class = sd))

To assert that the priors are feasible, we sample from the priors without training the Bayesian model with the data.

m.prior <-
  brm(data = d, family = binomial, formula, prior = priors,
    iter = 4000, warmup = 1000, chains = 4, cores = 4,
    seed = 4, sample_prior="only",
    file = "fits/m.associations.prior"
  )

We visualize the results to graphically check the priors eligibility.

ndraws <- 400
brms::pp_check(m.prior, type="bars", ndraws=ndraws)

The distribution of the predicted values for the response variable (\(y_{rep}\)) overlaps with the actually observed response variable distribution (\(y\)), confirming the feasibility of the priors.

Training

Given the feasibility of the priors, we train the regression model. This process updates the prior distributions of all predictor coefficients.

m <-
  brm(data = d, family = binomial, formula, prior = priors,
    iter = 4000, warmup = 1000, chains = 4, cores = 4,
    seed = 4,
    file = "fits/m.associations"
  )

To assert that the training process worked correctly, we plot the Markov chains.

plot(m)

Further, we sample from the posterior distributions similar to the prior predictive check.

brms::pp_check(m, type="bars", ndraws=ndraws)

The distribution of the predicted values for the response variable (\(y_{rep}\)) still overlaps with the actually observed response variable distribution (\(y\)). Additionally, the distribution shrunk around the actually observed values, which indicates that the model has improved its predictive power through training.

Evaluation

Finally, we evaluate the trained model. An initial glance at the predictor coefficients shows their posterior distribution.

summary(m)
##  Family: binomial 
##   Links: mu = logit 
## Formula: associations.missing | trials(associations.expected) ~ 1 + passive + (1 | RID) + (1 | PID) + ExpREAca + ExpREInd + actors.missing + objects.missing 
##    Data: d (Number of observations: 105) 
##   Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
##          total post-warmup draws = 12000
## 
## Group-Level Effects: 
## ~PID (Number of levels: 15) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     1.13      0.30     0.61     1.79 1.00     7000     8382
## 
## ~RID (Number of levels: 7) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.86      0.30     0.37     1.54 1.00     6377     5620
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept          -1.14      0.60    -2.34     0.04 1.00     8539     8721
## passiveTRUE         0.70      0.62    -0.56     1.90 1.00     9294     8081
## ExpREAca.L         -0.11      0.64    -1.39     1.14 1.00     9185     8460
## ExpREAca.Q         -0.18      0.67    -1.50     1.12 1.00     9597     9347
## ExpREAca.C         -0.66      0.57    -1.75     0.48 1.00     8730     8215
## ExpREInd.L          0.19      0.60    -1.01     1.35 1.00     8783     8672
## ExpREInd.Q          0.40      0.69    -0.97     1.74 1.00    10348     9023
## ExpREInd.C          0.58      0.70    -0.81     1.93 1.00    10320     9010
## actors.missing      0.55      0.57    -0.58     1.67 1.00    12524     9739
## objects.missing     1.12      0.42     0.32     1.96 1.00    13762     9598
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

These coefficient distributions do not show the full uncertainty of the impact of each predictor, though. Therefore, we plot the marginal effect of relevant predictors. The most relevant predictor is the treatment of the experiment, the use of passive voice. The plot shows that the use of passive voice (passive=TRUE) slightly increases the likelihood of missing an association, but since the confidence intervals overlap, this effect is not significant.

marginal.passive <- conditional_effects(m, effects="passive")
## Setting all 'trials' variables to 1 by default if not specified otherwise.
marginal.passive

Export the results to a CSV sheet to enable a figure of all three marginal plots in the end.

marginal.passive.values <- data.frame(marginal.passive[1]) %>% 
  select(passive.passive, passive.estimate__, passive.lower__, passive.upper__) %>% 
  rename(all_of(c(
    passive = "passive.passive",
    estimate = "passive.estimate__", 
    ci.lower = "passive.lower__", 
    ci.upper = "passive.upper__"))) %>% 
  mutate(
    response = "associations"
  ) %>% 
  write_csv(file = "../../data/results/marginal-associations.csv")

The results contradict the conclusions of the original study. If we investigate the marginal effects of missing actors and domain objects, we see that these also have a slight impact on the number of missing associations.

marginal.others <- conditional_effects(m, 
                                       effects=c("actors.missing", "objects.missing"),
                                       conditions=data.frame(passive=c(TRUE)))
## Setting all 'trials' variables to 1 by default if not specified otherwise.

To generate a combined plot, we extract the data from the marginal plot.

marginal.missing.actors.values <- data.frame(marginal.others[1]) %>% 
  select(actors.missing.actors.missing, actors.missing.estimate__, actors.missing.lower__, actors.missing.upper__) %>% 
  rename(all_of(c(
    n.missing = "actors.missing.actors.missing",
    estimate = "actors.missing.estimate__",
    ci.lower = "actors.missing.lower__",
    ci.upper = "actors.missing.upper__"))) %>%
  mutate(missing="actors")

marginal.missing.objects.values <- data.frame(marginal.others[2]) %>% 
  select(objects.missing.objects.missing, objects.missing.estimate__, objects.missing.lower__, objects.missing.upper__) %>% 
  rename(all_of(c(
    n.missing = "objects.missing.objects.missing",
    estimate = "objects.missing.estimate__",
    ci.lower = "objects.missing.lower__",
    ci.upper = "objects.missing.upper__"))) %>%
  mutate(missing="objects")

marginal.missing.values <- rbind(marginal.missing.actors.values, marginal.missing.objects.values)

Then, we visualize the isolated effect of both missing actors and domain objects on the likelihood of missing an association.

marginal.missing.values %>% 
  ggplot(aes(x=n.missing, y=estimate)) +
  geom_line(aes(color=missing)) + 
  geom_ribbon(aes(ymin=ci.lower, ymax=ci.upper, fill=missing), alpha=0.2) +
  labs(x="Number of missing actors/objects", y="Likelihood of missing an association") +
  theme(legend.position="bottom")

The above figure suggests that the number of missing actors/objects is continuous: the line and ribbon assign y values to x values between natural numbers (since those variables are integers). For a more intuitive interpretation, we just focus on the natural numbers for the number of missing actors/objects.

marginal.missing.values %>% 
  mutate(n.missing=ifelse(missing=="objects" & n.missing%%1<0.02 & n.missing-1<0.02 & n.missing-1>-0.02, 1, n.missing)) %>% 
  filter(n.missing%%1==0) %>% 
  ggplot(aes(x=ifelse(missing=="actors", n.missing-0.1, n.missing+0.1), y=estimate, color=missing)) +
  geom_point(size=4) +
  geom_errorbar(aes(ymin=ci.lower, ymax=ci.upper), lwd=1, width=0.2) +
  coord_flip() +
  #scale_x_reverse() +
  scale_x_reverse(breaks=c(0,1,2)) +
  labs(x="No. of missing actors/objects", y="Likelihood of missing an association") +
  theme(legend.position="bottom")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Similarly, we can visualize the isolated effect of industrial experience in requirements engineering. Again, the overlapping confidence intervals do not suggest a significant effect.

conditional_effects(m, effects="ExpREInd")
## Setting all 'trials' variables to 1 by default if not specified otherwise.

Finally, we can evaluate the model numerically. This is similar to the marginal effect, but can be summarized to the amount of times (in percent) that the use of passive voice produces fewer (-1), equal (0), or more (+1) missing actors.

evaluate.model(m)
## 
##        -1         0         1 
## 0.1707500 0.4906667 0.3385833

The evaluation also supports that the effect of passive voice on the number of missing actors is not significant.


  1. Femmer, H., Kučera, J., & Vetrò, A. (2014, September). On the impact of passive voice requirements on domain modelling. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (pp. 1-4).↩︎

  2. Siebert, J. (2023). Applications of statistical causal inference in software engineering. Information and Software Technology, 107198.↩︎