In this notebook, we reproduce the analysis presented in Section 7 of the paper Vegas, S., Apa, C., & Juristo, N. (2015). Crossover designs in software engineering experiments: Benefits and perils. IEEE Transactions on Software Engineering, 42(2), 120-135. All quotes are direct quotes from this publication.
Firstly, we load the data provided by the authors. Since the data cannot be made public, it is not contained in this repository and has to be requested from the original authors.
data <- read.spss("../data/tse-2016-vegas/data.sav", to.data.frame = TRUE)
## re-encoding from UTF-8
Next, we visualize the raw data. The visualizations align with the figures from the original paper.
data %>% ggplot(aes(x=Technique, y=Effectiveness)) +
geom_boxplot()
data %>% ggplot(aes(x=Sequence, y=Effectiveness)) +
geom_boxplot()
data %>% ggplot(aes(x=Period_Program, y=Effectiveness)) +
geom_boxplot()
Next, we reproduce the data analyses from Section 7.4 of the paper.
Let us start by wrongly applying the independent samples t-test on the treatments.
data.paired <- data %>%
select(Subject, Technique, Effectiveness) %>%
pivot_wider(names_from = Technique, values_from = Effectiveness)
To find out whether we can run this test, the normality of each treatment sample needs to be checked. The Shapiro-Wilk test shows a significance value of 0.218 (greater than 0.05) for BT and 0.006 (smaller than 0.05) for EP, which means that EP does not conform to a normal distribution.
shapiro.test(data.paired$BT)
##
## Shapiro-Wilk normality test
##
## data: data.paired$BT
## W = 0.94203, p-value = 0.2179
shapiro.test(data.paired$EP)
##
## Shapiro-Wilk normality test
##
## data: data.paired$EP
## W = 0.86236, p-value = 0.00566
The obtained values are very close to the ones reported in the manuscript.
As the independent-samples t-test cannot be applied, we apply the Mann-Whitney test. We find that the null hypothesis cannot be rejected
wilcox.test(x=data.paired$BT, y=data.paired$EP, paired=FALSE)
## Warning in wilcox.test.default(x = data.paired$BT, y = data.paired$EP, paired =
## FALSE): cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: data.paired$BT and data.paired$EP
## W = 287.5, p-value = 0.272
## alternative hypothesis: true location shift is not equal to 0
The authors of the original paper report an asymptotic significance value of 0.267, which is very close to the one we calculate.
Let us now continue wrongly applying the paired-samples t-test on the treatments.
To find out whether we can run this test, the normality of the sample formed by the difference in the scores of each subject per treatment for the response variable needs to be checked.
data.paired.diff <-
data.paired %>%
mutate(delta = BT-EP)
shapiro.test(data.paired.diff$delta)
##
## Shapiro-Wilk normality test
##
## data: data.paired.diff$delta
## W = 0.8941, p-value = 0.02269
The Shapiro-Wilk test shows a significance value of 0.023 (smaller than 0.05), which indicates that the differences do not conform to a normal distribution.
The obtained value is again very close to the reported value.
As the paired-samples t-test cannot be used, we apply the non-parametric Wilcoxon test. We find that the null hypothesis cannot be rejected
wilcox.test(x=data.paired$BT, y=data.paired$EP, paired=TRUE)
## Warning in wilcox.test.default(x = data.paired$BT, y = data.paired$EP, paired =
## TRUE): cannot compute exact p-value with ties
## Warning in wilcox.test.default(x = data.paired$BT, y = data.paired$EP, paired =
## TRUE): cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: data.paired$BT and data.paired$EP
## V = 96.5, p-value = 0.1395
## alternative hypothesis: true location shift is not equal to 0
The authors of the original paper report a asymptotic significance value of 0.133, which is very close to the one we calculate.
Let us now continue (again wrongly) the analysis reported in Section 7.4.2, this time including the additional factors involved in a crossover design — session and sequence — in order to find out whether they are influencing effectiveness.
In order to use the paired-samples t-test to analyse the within-subjects factor period, we need to check the sample formed by the difference in the scores of each subject per treatment for the response variable for normality. The Shapiro-Wilk test shows a significance value of 0.021 (smaller than 0.05), which indicates that the differences do not conform to a normal distribution.
d2 <- data %>%
select(Subject, Period_Program, Effectiveness) %>%
pivot_wider(names_from = Period_Program, values_from = Effectiveness) %>%
mutate(delta = P1-P2)
shapiro.test(d2$delta)
##
## Shapiro-Wilk normality test
##
## data: d2$delta
## W = 0.89231, p-value = 0.02092
As the paired-samples t-test cannot be used, we apply the non-parametric Wilcoxon test. We find that the null hypothesis cannot be rejected, and therefore both periods are equally effective, as shown in Table 15 (sig. 0.086).
wilcox.test(x=d2$P1, y=d2$P2, paired=TRUE)
## Warning in wilcox.test.default(x = d2$P1, y = d2$P2, paired = TRUE): cannot
## compute exact p-value with ties
## Warning in wilcox.test.default(x = d2$P1, y = d2$P2, paired = TRUE): cannot
## compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: d2$P1 and d2$P2
## V = 35.5, p-value = 0.0913
## alternative hypothesis: true location shift is not equal to 0
The value we obtain is reasonably close to the reported value.
In order to apply the independent-samples t-test to the between-subjects factor sequence,17 we need to check each treatment sample for normality. The Shapiro-Wilk test shows a significance value of 0.050 (equal to 0.05) for the sequence BT-EP, and 0.444 (greater than 0.05) for the sequence EP-BT. Therefore the sequence BT-EP does not conform to a normal distribution, while the sequence EP-BT does.
d3 <- data %>%
filter(Sequence == "EP-BT")
shapiro.test(d3$Effectiveness)
##
## Shapiro-Wilk normality test
##
## data: d3$Effectiveness
## W = 0.9547, p-value = 0.4441
d4 <- data %>%
filter(Sequence == "BT-EP")
shapiro.test(d4$Effectiveness)
##
## Shapiro-Wilk normality test
##
## data: d4$Effectiveness
## W = 0.91722, p-value = 0.05072
As the independent-samples t-test cannot be used, we apply the non-parametric Mann-Whitney test. We find that the null hypothesis cannot be rejected, and therefore both sequences are equally effective, as shown in Table 16 (sig. 0.086).
wilcox.test(x=d3$Effectiveness, y=d4$Effectiveness, paired=FALSE)
## Warning in wilcox.test.default(x = d3$Effectiveness, y = d4$Effectiveness, :
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: d3$Effectiveness and d4$Effectiveness
## W = 226, p-value = 0.7407
## alternative hypothesis: true location shift is not equal to 0
the best method for analysing models with random coefficients and data dependency due to repeated measures is the linear mixed model. The model includes the following terms: technique (treatment), period (confounded with program in this experiment) and sequence (confounded with period x technique and carryover in this experiment) as fixed factors, and subject as random factor nested within sequence.
m <- lmer(formula = Effectiveness ~ 1 + Sequence + Period_Program + Technique + (1|Sequence/Subject),
data = data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Hessian is numerically singular: parameters are not uniquely determined
## Warning: Model failed to converge with 1 negative eigenvalue: -4.1e-08
summary(m)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Effectiveness ~ 1 + Sequence + Period_Program + Technique + (1 |
## Sequence/Subject)
## Data: data
##
## REML criterion at convergence: 355.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7296 -0.4336 -0.2007 0.7122 1.5258
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject:Sequence (Intercept) 185.262 13.611
## Sequence (Intercept) 1.993 1.412
## Residual 186.900 13.671
## Number of obs: 44, groups: Subject:Sequence, 22; Sequence, 2
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 52.778 5.745 34.304 9.186 8.99e-11 ***
## SequenceEP-BT -2.083 7.422 23.098 -0.281 0.7814
## Period_ProgramP2 7.639 4.139 20.002 1.846 0.0798 .
## TechniqueEP -9.027 4.139 20.002 -2.181 0.0413 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) SEP-BT Pr_PP2
## SequncEP-BT -0.591
## Prd_PrgrmP2 -0.327 0.000
## TechniqueEP -0.327 0.000 -0.091
## optimizer (nloptwrap) convergence code: 0 (OK)
## unable to evaluate scaled gradient
## Hessian is numerically singular: parameters are not uniquely determined
According to the tests of fixed effects shown in Table 17, the effectiveness of the equivalence partitioning technique is significantly different from branch testing (sig. 0.041). Equivalence partitioning is less effective than branch testing (effectiveness of 46.53 and 55.56 percent, respectively).
The reported value (p=0.041) is fairly close to the obtained value (Pr(>|t|) = 0.0413). The p-value of all other factors is similarly close.