This document contains a reproduction of the original, frequentist analysis performed by Femmer et al. [1].

Data Loading

We start by loading the data, which is available at https://doi.org/10.5281/zenodo.7499290. The original evaluation of the experiment by Femmer et al. only considers data about the responses.

d <- read.csv(file="../../data/raw/responses.csv")

The data consists of the following columns:

Column  Description                                                                                           Data Type
------  ----------------------------------------------------------------------------------------------------  ---------
PID     Participant ID consisting of a group indicator (“A” for active, “P” for passive) and a numeric index  string
RID     Requirement ID                                                                                        string
MAct    Number of missing actors                                                                              int
MEnt    Number of missing domain objects                                                                      int
MAsc    Number of missing associations                                                                        int

str(d)
## 'data.frame':    105 obs. of  5 variables:
##  $ PID : chr  "P1" "P1" "P1" "P1" ...
##  $ RID : chr  "R1" "R2" "R3" "R4" ...
##  $ MAct: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MEnt: int  2 1 1 0 1 0 1 0 0 0 ...
##  $ MAsc: int  2 2 2 0 2 1 0 0 0 0 ...

Frequentist Analysis

We replicate the null-hypothesis significance test of difference from the original study. The null hypothesis is that there is no difference in the number of missing actors, domain objects, or associations when producing a domain model from a requirements specification using active versus passive voice.

The original evaluation aggregates the results of each participant, i.e., Femmer et al. investigate whether the sum of missing actors, domain objects, and associations over the domain models from all requirements specifications differs between the two groups (active and passive).

library(dplyr)  # provides %>%, group_by, summarize, mutate, if_else, and filter

d.pid <- d %>% 
  group_by(PID) %>% 
  summarize(
    MAct = sum(MAct),
    MEnt = sum(MEnt),
    MAsc = sum(MAsc)
  ) %>% mutate(
    passive = if_else(startsWith(PID, 'P'), 1, 0)
  )
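
The aggregation above can be illustrated on a small, made-up data set. The following sketch uses base R's `aggregate` instead of dplyr so that it runs without additional packages; the data frame `toy` and all of its values are hypothetical and are not taken from the study.

```r
# Toy data in the same shape as responses.csv (values are made up for illustration)
toy <- data.frame(
  PID  = c("A1", "A1", "P1", "P1", "P2"),
  RID  = c("R1", "R2", "R1", "R2", "R1"),
  MAct = c(0L, 1L, 0L, 0L, 2L),
  MEnt = c(1L, 0L, 2L, 1L, 0L),
  MAsc = c(2L, 0L, 2L, 2L, 1L)
)

# Sum the three counts per participant (base-R counterpart of group_by + summarize)
toy.pid <- aggregate(cbind(MAct, MEnt, MAsc) ~ PID, data = toy, FUN = sum)

# Derive the group indicator from the PID prefix: 1 = passive, 0 = active
toy.pid$passive <- ifelse(startsWith(as.character(toy.pid$PID), "P"), 1, 0)

print(toy.pid)
```

Each participant ends up with exactly one row, so the subsequent test compares participants rather than individual requirements.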

For the evaluation, we perform a Mann-Whitney U test (i.e., a two-sample Wilcoxon rank-sum test) and calculate the effect size (Cliff's delta). As in the original experiment, we also calculate the mean and median of the sum of missing actors, domain objects, and associations for each group.

# cliffDelta() is assumed here to come from the rcompanion package
library(rcompanion)

# define a data frame with all fields that the original study also reported
results <- data.frame(
  activity = character(),
  mean.a = double(),
  mean.p = double(),
  median.a = double(),
  median.p = double(),
  p = double(),
  conf.int.lower = double(),
  conf.int.upper = double(),
  cliffs.delta = double()
)

variable.name.map <- c("MAct"="actors", "MEnt"="objects", "MAsc"="associations")

for (var in c("MAct", "MEnt", "MAsc")) {
  # extract the per-participant sums of the dependent variable for both groups
  values.a <- filter(d.pid, passive == 0)[[var]]
  values.p <- filter(d.pid, passive == 1)[[var]]

  # calculate the mean and median of the dependent variable for both groups (active and passive)
  mean.a <- mean(values.a, na.rm = TRUE)
  mean.p <- mean(values.p, na.rm = TRUE)
  median.a <- median(values.a, na.rm = TRUE)
  median.p <- median(values.p, na.rm = TRUE)

  # perform the Mann-Whitney U test (passive group as x, active group as y)
  hypo.test <- wilcox.test(x = values.p, y = values.a,
                           conf.int = TRUE, paired = FALSE)
  # calculate the effect size of the test
  cliffs <- cliffDelta(x = values.p, y = values.a)
  
  results <- rbind(results,
                   list(
                     activity = variable.name.map[var],
                     mean.a = mean.a,
                     mean.p = mean.p,
                     median.a = median.a,
                     median.p = median.p,
                     p = hypo.test$p.value,
                     conf.int.lower = hypo.test$conf.int[1],
                     conf.int.upper = hypo.test$conf.int[2],
                     cliffs.delta = cliffs
                   ))
}
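
To make the effect size in the loop above more tangible, the following sketch computes Cliff's delta by hand in base R and relates it to the statistic returned by `wilcox.test`. The function `cliffs.delta.manual` and the vectors `x` and `y` are hypothetical names and made-up values for illustration; this is not the implementation used by `cliffDelta`, only a cross-check based on the standard definition.

```r
# Cliff's delta: share of pairs with x > y minus share of pairs with x < y
cliffs.delta.manual <- function(x, y) {
  diffs <- outer(x, y, "-")  # matrix of all pairwise differences x_i - y_j
  (sum(diffs > 0) - sum(diffs < 0)) / (length(x) * length(y))
}

# Made-up example values (not the study data)
x <- c(3, 5, 8, 8, 10)
y <- c(1, 3, 4, 6)

delta <- cliffs.delta.manual(x, y)

# The statistic W of wilcox.test counts the pairs with x > y (ties count 1/2),
# so delta can also be recovered as 2 * W / (m * n) - 1
w <- unname(suppressWarnings(wilcox.test(x, y))$statistic)
delta.from.w <- 2 * w / (length(x) * length(y)) - 1

print(c(manual = delta, from.w = delta.from.w))
```

Because the loop passes the passive group as `x` and the active group as `y`, a positive delta indicates that participants in the passive group tended to miss more elements.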

The resulting data looks as follows:

knitr::kable(results, "simple")
activity      mean.a     mean.p  median.a  median.p  p          conf.int.lower  conf.int.upper  cliffs.delta
------------  ---------  ------  --------  --------  ---------  --------------  --------------  ------------
actors        0.4285714  1.000   0         1         0.1921990  -0.0000630      1.000017        0.375
objects       1.2857143  2.000   1         1         0.5001306  -1.0000052      2.999991        0.214
associations  4.1428571  7.875   3         8         0.0296890  0.0000225       7.000051        0.679

Comparison

We compare our results to the results of the original paper:

activity      mean.a  mean.p  median.a  median.p  p     conf.int          cliffs.delta
------------  ------  ------  --------  --------  ----  ----------------  ------------
actors        0.43    1.00    0         1         0.10  (0; \(\infty\))   0.39
objects       1.29    2.00    1         1         0.25  (-1; \(\infty\))  0.25
associations  4.14    7.88    3         8         0.02  (1; \(\infty\))   0.75

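Our results are very similar: the means and medians match the original values after rounding, and the values of Cliff's delta differ only slightly. The calculated p-values differ, though their implication (i.e., which null hypothesis to reject at \(\alpha=0.05\)) remains the same: only the difference in missing associations is significant. The upper ends of the confidence intervals are vastly different: ours are finite, whereas the original intervals extend to \(\infty\).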
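
One possible explanation for the unbounded intervals and the smaller p-values in the original paper is that a one-sided test was used there; this is an assumption on our part and is not confirmed by the data at hand. The following sketch, on made-up values that are not the study data, shows how `wilcox.test` behaves when `alternative = "greater"` is set instead of the two-sided default.

```r
# Made-up per-participant sums (not the study data)
passive <- c(3, 5, 8, 8, 10, 12, 7)
active  <- c(1, 3, 4, 6, 2, 5)

# Two-sided test, as used in the reproduction above
two.sided <- suppressWarnings(
  wilcox.test(x = passive, y = active, conf.int = TRUE, paired = FALSE)
)

# One-sided test: the alternative is that passive values tend to be larger
one.sided <- suppressWarnings(
  wilcox.test(x = passive, y = active, alternative = "greater",
              conf.int = TRUE, paired = FALSE)
)

# The one-sided interval is unbounded above and the one-sided p-value is smaller
print(two.sided$conf.int)
print(one.sided$conf.int)
print(c(two.sided = two.sided$p.value, one.sided = one.sided$p.value))
```

If the original analysis was indeed one-sided, this would account for the infinite upper bounds; whether it accounts for the exact p-values cannot be verified from the reported figures alone.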


  1. Femmer, H., Kučera, J., & Vetrò, A. (2014, September). On the impact of passive voice requirements on domain modelling. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (pp. 1-4).