This notebook specifies the causal assumptions we make about the impact that passive voice in requirements specifications has on the domain modeling activity. It constitutes the first two steps of the framework for statistical causal inference by Siebert [1]: modeling and identification.

Modeling

During the modeling step, we make our causal assumptions explicit.

Variables

The selection of variables is constrained by the variables that were recorded in the original experiment by Femmer et al. [2]. The following variables are available to us:

source("../util/data-loading.R")
d <- load.data()

# print the data to ensure that all variables have the correct type
str(d)
## 'data.frame':    105 obs. of  21 variables:
##  $ PID                  : chr  "P1" "P1" "P1" "P1" ...
##  $ RID                  : chr  "R1" "R2" "R3" "R4" ...
##  $ Age                  : Ord.factor w/ 3 levels "19-24"<"25-30"<..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Program              : Ord.factor w/ 4 levels "Unknown"<"Bachelor"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ REQuizPerformance    : int  9 9 9 9 9 9 9 7 7 7 ...
##  $ ExpProgAca           : Ord.factor w/ 4 levels "no experience"<..: 3 3 3 3 3 3 3 2 2 2 ...
##  $ ExpProgInd           : Ord.factor w/ 4 levels "no experience"<..: 4 4 4 4 4 4 4 1 1 1 ...
##  $ ExpSEAca             : Ord.factor w/ 4 levels "no experience"<..: 4 4 4 4 4 4 4 3 3 3 ...
##  $ ExpSEInd             : Ord.factor w/ 4 levels "no experience"<..: 2 2 2 2 2 2 2 1 1 1 ...
##  $ ExpREAca             : Ord.factor w/ 4 levels "no experience"<..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ ExpREInd             : Ord.factor w/ 4 levels "no experience"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ passive              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ actors.expected      : int  1 1 2 0 0 0 1 1 1 2 ...
##  $ actors.found         : int  1 1 2 0 0 0 1 1 1 2 ...
##  $ actors.missing       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ associations.expected: int  2 2 2 1 2 3 2 2 2 2 ...
##  $ associations.found   : int  0 0 0 1 0 2 2 2 2 2 ...
##  $ associations.missing : int  2 2 2 0 2 1 0 0 0 0 ...
##  $ entities.expected    : int  3 3 2 2 3 4 3 3 3 2 ...
##  $ entities.found       : int  1 2 1 2 2 4 2 3 3 2 ...
##  $ entities.missing     : int  2 1 1 0 1 0 1 0 0 0 ...

These variables have the following meaning:

| Variable | Meaning | Values |
|---|---|---|
| PID | Identifier of an experiment participant | {“P1”, …, “P9”, “A1”, …, “A10”} |
| RID | Identifier of a requirements specification | {“R1”, …, “R7”} |
| Age | Age group of the participant | {“19-24”, “25-30”, “31-40”} |
| Program | Study program in which the participant is currently enrolled | {“Unknown”, “Bachelor”, “Master”, “Doctorate”} |
| REQuizPerformance | Number of correct responses in a 10-question single-choice questionnaire about RE | [0, 10] |
| ExpProgAca | Academic experience in programming | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
| ExpProgInd | Industrial experience in programming | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
| ExpSEAca | Academic experience in software engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
| ExpSEInd | Industrial experience in software engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
| ExpREAca | Academic experience in requirements engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
| ExpREInd | Industrial experience in requirements engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
| passive | True if the requirements specification involved in the current experimental task used passive voice | {TRUE, FALSE} |
| actors.expected | Number of expected actors in the sample solution of the domain model | \(\mathbb{N}\) |
| actors.found | Number of relevant actors included in the solution provided by the participant | [0, actors.expected] |
| actors.missing | Number of relevant actors missing from the solution provided by the participant (i.e., actors.expected - actors.found) | [0, actors.expected] |
| entities.expected | Number of expected domain objects in the sample solution of the domain model | \(\mathbb{N}\) |
| entities.found | Number of relevant domain objects included in the solution provided by the participant | [0, entities.expected] |
| entities.missing | Number of relevant domain objects missing from the solution provided by the participant (i.e., entities.expected - entities.found) | [0, entities.expected] |
| associations.expected | Number of expected associations in the sample solution of the domain model | \(\mathbb{N}\) |
| associations.found | Number of relevant associations included in the solution provided by the participant | [0, associations.expected] |
| associations.missing | Number of relevant associations missing from the solution provided by the participant (i.e., associations.expected - associations.found) | [0, associations.expected] |
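The relationship between the expected, found, and missing counts can be checked mechanically. The following is a minimal sketch in base R, assuming only the first ten rows printed by str(d) above; the data frame counts and the helper check_counts are names introduced here for illustration and are not part of the original analysis.

```r
# First ten rows of the count variables, copied from the str(d) output above
counts <- data.frame(
  actors.expected       = c(1, 1, 2, 0, 0, 0, 1, 1, 1, 2),
  actors.found          = c(1, 1, 2, 0, 0, 0, 1, 1, 1, 2),
  actors.missing        = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  associations.expected = c(2, 2, 2, 1, 2, 3, 2, 2, 2, 2),
  associations.found    = c(0, 0, 0, 1, 0, 2, 2, 2, 2, 2),
  associations.missing  = c(2, 2, 2, 0, 2, 1, 0, 0, 0, 0),
  entities.expected     = c(3, 3, 2, 2, 3, 4, 3, 3, 3, 2),
  entities.found        = c(1, 2, 1, 2, 2, 4, 2, 3, 3, 2),
  entities.missing      = c(2, 1, 1, 0, 1, 0, 1, 0, 0, 0)
)

# Hypothetical helper: for each element type, TRUE if
# missing == expected - found and both counts lie in [0, expected]
check_counts <- function(data, types = c("actors", "associations", "entities")) {
  sapply(types, function(type) {
    expected <- data[[paste0(type, ".expected")]]
    found    <- data[[paste0(type, ".found")]]
    missing  <- data[[paste0(type, ".missing")]]
    all(missing == expected - found) &&
      all(found >= 0 & found <= expected) &&
      all(missing >= 0 & missing <= expected)
  })
}

check_counts(counts)
```

Applied to the full data set, the same call would be check_counts(d), since d uses the same column names.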

Causal Relationships

We assume the following causal relationships between variables:

| Relationship | Hypothesis |
|---|---|
| Age \(\rightarrow\) Program | The older a participant, the more likely it is that they have advanced further in their studies |
| Age \(\rightarrow\) Exp(Prog/SE/RE)(Aca/Ind) | The older a participant, the more likely it is that they have gained more (academic or industrial) experience in programming, software engineering, and requirements engineering |
| Program \(\rightarrow\) Exp(Prog/SE/RE)Aca | The further a participant has advanced in their studies, the more likely it is that they have gained more academic experience in programming, software engineering, and requirements engineering |
| ExpSE(Aca/Ind) \(\rightarrow\) Exp(Prog/RE)(Aca/Ind) | The higher the experience in software engineering, the higher the experience in programming and requirements engineering, as those are sub-areas of SE |
| ExpREAca, ExpREInd \(\rightarrow\) REQuizPerformance | The higher the experience in requirements engineering, the better the performance in the RE quiz |
| ExpRE(Aca/Ind) \(\rightarrow\) actors/associations/entities.missing | The higher the (industrial or academic) experience in requirements engineering, the fewer actors, associations, and entities are missing |
| passive \(\rightarrow\) actors/associations/entities.missing | If the requirement is written in passive voice, fewer actors, associations, and entities are found, i.e., more are missing |
| actors/entities.missing \(\rightarrow\) associations.missing | If an actor or entity was missed, then associations between other actors/entities and the missed one are consequently also missing |

Directed Acyclic Graph

We summarize the relevant variables and causal relationships in the following directed acyclic graph (DAG):

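library(dagitty)  # adjustmentSets()
library(ggdag)    # dagify(), ggdag_status(), theme_dag()
library(ggplot2)  # guides()

dag <- dagify(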
  Program ~ Age,
  ExpSEAca ~ Program + Age,
  ExpSEInd ~ Program + Age,
  ExpProgAca ~ Program + Age + ExpSEAca,
  ExpProgInd ~ Program + Age + ExpSEInd,
  ExpREAca ~ Program + Age + ExpSEAca,
  ExpREInd ~ Program + Age + ExpSEInd,
  REQuizPerformance ~ ExpREAca + ExpREInd,
  actors.missing ~ ExpREAca + ExpREInd + passive,
  entities.missing ~ ExpREAca + ExpREInd + passive,
  associations.missing ~ ExpREAca + ExpREInd + passive + actors.missing + entities.missing,
  exposure = "passive", outcome = c("actors.missing", "entities.missing", "associations.missing"),
  labels = c(Age = "Age", Program = "Program", ExpSEAca = "Academic experience in SE", ExpSEInd = "Industrial experience in SE", ExpProgAca = "Academic experience in Programming", ExpProgInd = "Industrial experience in Programming", ExpREAca = "Academic experience in RE", ExpREInd = "Industrial experience in RE", REQuizPerformance = "Performance in RE Quiz", passive = "Passive Voice", actors.missing = "Number of missing actors", entities.missing = "Number of missing domain objects", associations.missing = "Number of missing associations"),
  
  coords = list(
    x=c(Age=0, Program=0.2, ExpProgAca=2.4, ExpProgInd=2.4, ExpSEAca=2, ExpSEInd=2, ExpREAca=2.4, ExpREInd=2.4, REQuizPerformance=2, passive=3, actors.missing=4, associations.missing=4,entities.missing=4),
    y=c(Age=-1, Program=-2, ExpProgAca=0, ExpProgInd=-0.5, ExpSEAca=-1, ExpSEInd=-1.5, ExpREAca=-2, ExpREInd=-2.5, REQuizPerformance=-3, passive=-3.5, actors.missing=-1.5, associations.missing=-2.5,entities.missing=-3.5)
  )
)

dag.plot.full <- ggdag_status(dag, use_labels="label", text=FALSE) + 
  guides(fill = "none", color="none") +
  theme_dag()
dag.plot.full
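Before relying on the graph, it can help to confirm that the encoded edges match the table of causal relationships. The sketch below restates the edges of the dagify call above without labels and coordinates (dag_check is a name introduced here) and uses the dagitty functions isAcyclic and parents to verify that the graph has no cycles and to list the direct causes of associations.missing.

```r
library(ggdag)    # dagify()
library(dagitty)  # isAcyclic(), parents()

# Same edges as the full DAG above; labels and coordinates omitted
dag_check <- dagify(
  Program ~ Age,
  ExpSEAca ~ Program + Age,
  ExpSEInd ~ Program + Age,
  ExpProgAca ~ Program + Age + ExpSEAca,
  ExpProgInd ~ Program + Age + ExpSEInd,
  ExpREAca ~ Program + Age + ExpSEAca,
  ExpREInd ~ Program + Age + ExpSEInd,
  REQuizPerformance ~ ExpREAca + ExpREInd,
  actors.missing ~ ExpREAca + ExpREInd + passive,
  entities.missing ~ ExpREAca + ExpREInd + passive,
  associations.missing ~ ExpREAca + ExpREInd + passive + actors.missing + entities.missing
)

# A causal DAG must not contain directed cycles
dagitty::isAcyclic(dag_check)

# Direct causes of the outcome with the most incoming edges
dagitty::parents(dag_check, "associations.missing")
```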

Identification

During the identification step, we determine which variables to include in our regression model, depending on the hypotheses we want to test.

Adjustment sets

We determine the adjustment sets using the adjustmentSets function, which automatically applies four criteria of causal reasoning to exclude any variable whose inclusion would introduce bias, such as a collider.

adjustmentSets(dag, exposure="passive", outcome="actors.missing", effect="direct")
##  {}
adjustmentSets(dag, exposure="passive", outcome="entities.missing", effect="direct")
##  {}
adjustmentSets(dag, exposure="passive", outcome="associations.missing", effect="direct")
## { ExpREAca, ExpREInd, actors.missing, entities.missing }
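The third adjustment set is larger because of how a direct effect is isolated. The variables actors.missing and entities.missing lie on indirect paths from passive to associations.missing, so they must be conditioned on to block those paths. Both are also colliders between passive and RE experience, so conditioning on them opens new paths that ExpREAca and ExpREInd close again. The sketch below contrasts the direct and the total effect. It restates only the edges into the three outcome variables (dag_outcomes is a name introduced here); because passive has no parents in this graph, the set for the total effect should be empty.

```r
library(ggdag)    # dagify()
library(dagitty)  # adjustmentSets()

# Edges into the three outcome variables, as in the DAG above
dag_outcomes <- dagify(
  actors.missing ~ ExpREAca + ExpREInd + passive,
  entities.missing ~ ExpREAca + ExpREInd + passive,
  associations.missing ~ ExpREAca + ExpREInd + passive + actors.missing + entities.missing
)

# Direct effect: indirect paths through the mediators must be blocked
direct_sets <- adjustmentSets(dag_outcomes, exposure = "passive",
                              outcome = "associations.missing", effect = "direct")

# Total effect: indirect paths through the mediators count as part of the effect
total_sets <- adjustmentSets(dag_outcomes, exposure = "passive",
                             outcome = "associations.missing", effect = "total")

print(direct_sets)
print(total_sets)
```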

Reduced DAG

Based on these adjustment sets, we consider the following subset of the original DAG sufficient for our causal inference:

dag <- dagify(
  actors.missing ~ ExpREAca + ExpREInd + passive,
  entities.missing ~ ExpREAca + ExpREInd + passive,
  associations.missing ~ ExpREAca + ExpREInd + passive + actors.missing + entities.missing,
  exposure = "passive", outcome = c("actors.missing", "entities.missing", "associations.missing"),
  labels = c(ExpREAca = "Academic experience in RE", ExpREInd = "Industrial experience in RE", passive = "Passive Voice", actors.missing = "Number of missing actors", entities.missing = "Number of missing domain objects", associations.missing = "Number of missing associations"),
  
  coords = list(
    x=c(ExpREAca=2.4, ExpREInd=2.4, passive=3, actors.missing=4, associations.missing=4,entities.missing=4),
    y=c(ExpREAca=-2, ExpREInd=-2.5, passive=-3.5, actors.missing=-1.5, associations.missing=-2.5,entities.missing=-3.5)
  )
)

dag.plot.reduced <- ggdag_status(dag, use_labels="label", text=FALSE) + 
  guides(fill = "none", color="none") +
  theme_dag()

dag.plot.reduced

This will be the DAG we use for the final step, the estimation.
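As a rough illustration of how the adjustment sets reported above could carry over into the estimation step, the following base R sketch turns each set into a model formula with the exposure passive as a predictor. The list adjustment and the helper build_formula are names introduced here; the model family and any further terms used in the actual estimation are not determined by this sketch.

```r
# Adjustment sets as reported by adjustmentSets() above (direct effects)
adjustment <- list(
  actors.missing       = character(0),
  entities.missing     = character(0),
  associations.missing = c("ExpREAca", "ExpREInd", "actors.missing", "entities.missing")
)

# Hypothetical helper: exposure plus adjustment set as predictors of one outcome
build_formula <- function(outcome, exposure = "passive") {
  reformulate(c(exposure, adjustment[[outcome]]), response = outcome)
}

formulas <- lapply(names(adjustment), build_formula)
names(formulas) <- names(adjustment)
formulas
```

Each formula could then be passed to a regression function together with the data d.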


1. Siebert, J. (2023). Applications of statistical causal inference in software engineering. Information and Software Technology, 107198.

2. Femmer, H., Kučera, J., & Vetrò, A. (2014, September). On the impact of passive voice requirements on domain modelling. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (pp. 1-4).