This document reproduces the original frequentist analysis performed by Femmer et al. [1].
We start by loading the data, which is available at https://doi.org/10.5281/zenodo.7499290. The original evaluation of the experiment by Femmer et al. considers only the response data.
d <- read.csv(file="../../data/raw/responses.csv")
The data consists of the following columns:
Column | Description | Data Type |
---|---|---|
PID | Participant ID consisting of a group indicator (“A” for active, “P” for passive) and a numeric index | string |
RID | Requirement ID | string |
MAct | Number of missing actors | int |
MEnt | Number of missing domain objects | int |
MAsc | Number of missing associations | int |
str(d)
## 'data.frame': 105 obs. of 5 variables:
## $ PID : chr "P1" "P1" "P1" "P1" ...
## $ RID : chr "R1" "R2" "R3" "R4" ...
## $ MAct: int 0 0 0 0 0 0 0 0 0 0 ...
## $ MEnt: int 2 1 1 0 1 0 1 0 0 0 ...
## $ MAsc: int 2 2 2 0 2 1 0 0 0 0 ...
We replicate the null-hypothesis significance test of difference from the original study. The null hypothesis is that there is no difference in the number of missing actors, domain objects, or associations when producing a domain model from a requirements specification using active versus passive voice.
The original evaluation aggregates the results of each participant, i.e., Femmer et al. investigate whether the sum of missing actors, domain objects, and associations over the domain models from all requirements specifications differs between the two groups (active and passive).
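# dplyr provides %>%, group_by, summarize, mutate, if_else, and filter
# (assumed to be loaded in a setup step that is not shown in this document)
library(dplyr)

d.pid <- d %>%
  group_by(PID) %>%
  summarize(
    MAct = sum(MAct),
    MEnt = sum(MEnt),
    MAsc = sum(MAsc)
  ) %>%
  mutate(
    # 1 for participants in the passive group ("P" prefix), 0 for the active group
    passive = if_else(startsWith(PID, 'P'), 1, 0)
  )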
For the evaluation, we perform a Mann-Whitney U test (also known as the Wilcoxon rank-sum test) and calculate Cliff's delta as the effect size. Additionally, as in the original experiment, we calculate the mean and median of the per-participant sums of missing actors, domain objects, and associations for each group.
# define a data frame with all fields that the original study also reported
results <- data.frame(
  activity = character(),
  mean.a = double(),
  mean.p = double(),
  median.a = double(),
  median.p = double(),
  p = double(),
  conf.int.lower = double(),
  conf.int.upper = double(),
  cliffs.delta = double()
)
# assumption: cliffDelta() is the function provided by the rcompanion package
library(rcompanion)

variable.name.map <- c("MAct"="actors", "MEnt"="objects", "MAsc"="associations")
for (var in c("MAct", "MEnt", "MAsc")) {
  # extract the dependent variable for both groups (active and passive)
  values.a <- filter(d.pid, passive==0)[[var]]
  values.p <- filter(d.pid, passive==1)[[var]]

  # perform the Mann-Whitney U test (two-sided, the default of wilcox.test)
  hypo.test <- wilcox.test(x = values.p, y = values.a, conf.int = TRUE, paired = FALSE)

  # calculate the effect size of the test
  cliffs <- cliffDelta(x = values.p, y = values.a)

  # append one row with the mean, median, p-value, confidence interval, and effect size
  results <- rbind(results,
    list(
      activity = variable.name.map[var],
      mean.a = mean(values.a, na.rm=TRUE),
      mean.p = mean(values.p, na.rm=TRUE),
      median.a = median(values.a, na.rm=TRUE),
      median.p = median(values.p, na.rm=TRUE),
      p = hypo.test$p.value,
      conf.int.lower = hypo.test$conf.int[1],
      conf.int.upper = hypo.test$conf.int[2],
      cliffs.delta = cliffs
    ))
}
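To make the effect size more tangible, the following minimal sketch computes Cliff's delta by hand and relates it to the Mann-Whitney statistic. The vectors `passive` and `active` are hypothetical toy values, not data from the study, and the manual formula is the textbook definition rather than necessarily the exact routine used by `cliffDelta`.

```r
# Toy data (hypothetical, not taken from the study); no ties, so the test is exact
passive <- c(3, 5, 8, 9)
active  <- c(1, 4, 6, 7)

# Pairwise differences between every passive and every active value
diffs <- outer(passive, active, FUN = "-")

# Cliff's delta: share of pairs with passive > active minus share with passive < active
cliffs.delta.manual <- (sum(diffs > 0) - sum(diffs < 0)) / length(diffs)

# The statistic W of wilcox.test counts the pairs in which x exceeds y,
# so without ties delta = 2 * W / (n.x * n.y) - 1
w <- unname(wilcox.test(x = passive, y = active)$statistic)
cliffs.delta.from.w <- 2 * w / (length(passive) * length(active)) - 1

print(cliffs.delta.manual)
print(cliffs.delta.from.w)
```

A value of 0 would indicate complete overlap between the two groups, while values of 1 or -1 indicate that every value in one group exceeds every value in the other.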
The resulting data looks as follows:
knitr::kable(results, "simple")
activity | mean.a | mean.p | median.a | median.p | p | conf.int.lower | conf.int.upper | cliffs.delta |
---|---|---|---|---|---|---|---|---|
actors | 0.4285714 | 1.000 | 0 | 1 | 0.1921990 | -0.0000630 | 1.000017 | 0.375 |
objects | 1.2857143 | 2.000 | 1 | 1 | 0.5001306 | -1.0000052 | 2.999991 | 0.214 |
associations | 4.1428571 | 7.875 | 3 | 8 | 0.0296890 | 0.0000225 | 7.000051 | 0.679 |
We compare our results to the results of the original paper:
activity | mean.a | mean.p | median.a | median.p | p | conf.int | cliffs.delta |
---|---|---|---|---|---|---|---|
actors | 0.43 | 1.00 | 0 | 1 | 0.10 | (0; \(\infty\)) | 0.39 |
objects | 1.29 | 2.00 | 1 | 1 | 0.25 | (-1; \(\infty\)) | 0.25 |
associations | 4.14 | 7.88 | 3 | 8 | 0.02 | (1; \(\infty\)) | 0.75 |
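Our results are very similar: the means and medians match, and the Cliff's delta values are close. The calculated p-values differ, though their implication (i.e., which null hypothesis to reject at \(\alpha=0.05\)) remains the same. The confidence intervals differ substantially at their upper ends: ours are bounded, whereas the original paper reports unbounded intervals. Our p-values are larger than the original ones by a factor of roughly 1.5 to 2, which, together with the unbounded intervals, is consistent with the original study having used a one-sided test, whereas wilcox.test defaults to a two-sided alternative.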
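One possible explanation for the differing p-values and confidence bounds is a one-sided alternative hypothesis in the original study. This is an assumption, not a confirmed detail of the original study. The sketch below uses hypothetical toy vectors (not data from the study) to show how the `alternative` argument of `wilcox.test` changes the p-value and produces an unbounded interval.

```r
# Toy data (hypothetical, not taken from the study); no ties, so the test is exact
passive <- c(3, 5, 8, 9)
active  <- c(1, 4, 6, 7)

# Two-sided test: the default of wilcox.test, as used in this reproduction
two.sided <- wilcox.test(x = passive, y = active, conf.int = TRUE)

# One-sided test: passive assumed to yield larger values than active
one.sided <- wilcox.test(x = passive, y = active, conf.int = TRUE,
                         alternative = "greater")

print(two.sided$p.value)
print(one.sided$p.value)

# The one-sided confidence interval has an infinite upper bound
print(one.sided$conf.int)
```

In this toy case the two-sided p-value is exactly twice the one-sided one; with ties and the normal approximation, as in the study data, the ratio may deviate somewhat.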
[1] Femmer, H., Kučera, J., & Vetrò, A. (2014, September). On the impact of passive voice requirements on domain modelling. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (pp. 1-4).