This notebook specifies the causal assumptions we make about the impact that passive voice in requirements specifications has an impact on the domain modeling activity. It constitutes the first two steps of the framework for statistical causal inference by Siebert1, modeling and identification.
During the modeling step, we make our causal assumptions explicit.
The selection of variables is constrained by the variables that were recorded in the original experiment by Femmer et al.2. The following variables are available to us:
source("../util/data-loading.R")
d <- load.data()
# print the data to ensure that all variables have the correct type
str(d)
## 'data.frame': 105 obs. of 21 variables:
## $ PID : chr "P1" "P1" "P1" "P1" ...
## $ RID : chr "R1" "R2" "R3" "R4" ...
## $ Age : Ord.factor w/ 3 levels "19-24"<"25-30"<..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Program : Ord.factor w/ 4 levels "Unknown"<"Bachelor"<..: 3 3 3 3 3 3 3 3 3 3 ...
## $ REQuizPerformance : int 9 9 9 9 9 9 9 7 7 7 ...
## $ ExpProgAca : Ord.factor w/ 4 levels "no experience"<..: 3 3 3 3 3 3 3 2 2 2 ...
## $ ExpProgInd : Ord.factor w/ 4 levels "no experience"<..: 4 4 4 4 4 4 4 1 1 1 ...
## $ ExpSEAca : Ord.factor w/ 4 levels "no experience"<..: 4 4 4 4 4 4 4 3 3 3 ...
## $ ExpSEInd : Ord.factor w/ 4 levels "no experience"<..: 2 2 2 2 2 2 2 1 1 1 ...
## $ ExpREAca : Ord.factor w/ 4 levels "no experience"<..: 2 2 2 2 2 2 2 2 2 2 ...
## $ ExpREInd : Ord.factor w/ 4 levels "no experience"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ passive : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ actors.expected : int 1 1 2 0 0 0 1 1 1 2 ...
## $ actors.found : int 1 1 2 0 0 0 1 1 1 2 ...
## $ actors.missing : int 0 0 0 0 0 0 0 0 0 0 ...
## $ associations.expected: int 2 2 2 1 2 3 2 2 2 2 ...
## $ associations.found : int 0 0 0 1 0 2 2 2 2 2 ...
## $ associations.missing : int 2 2 2 0 2 1 0 0 0 0 ...
## $ entities.expected : int 3 3 2 2 3 4 3 3 3 2 ...
## $ entities.found : int 1 2 1 2 2 4 2 3 3 2 ...
## $ entities.missing : int 2 1 1 0 1 0 1 0 0 0 ...
These variables have the following meaning:
Variable | Meaning | Values |
---|---|---|
PID | Identifier of an experiment participant | {“P1”, …, “P9”, “A1”, …, “A10”} |
RID | Identifier of a requirements specification | {“R1”, …, “R7”} |
Age | Age group of the participant | {“19-24”, “25-30”, “31-40”} |
Program | Study program in which the participant is currently enrolled | {“Unknown”, “Bachelor”, “Master”, “Doctorate”} |
REQuizPerformance | Number of correct responses in a 10-question single choice questionnaire about RE | [0; 10] |
ExpProgAca | Academic experience in programming | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
ExpProgInd | Industrial experience in programming | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
ExpSEAca | Academic experience in software engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
ExpSEInd | Industrial experience in software engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
ExpREAca | Academic experience in requirements engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
ExpREInd | Industrial experience in requirements engineering | {“no experience”, “up to 6 months”, “6 to 12 months”, “more than 12 months”} |
passive | True if the requirements specification involved in the current experimental task used passive voice | {TRUE, FALSE} |
actors.expected | Number of expected actors in the sample solution of the domain model | \(\mathbb{N}\) |
actors.found | Number of relevant actors included in the the solution provided by the participant | [0, actors.expected] |
actors.missing | Number of relevant actors missing from the solution provided by the participant (i.e., actors.expected-actors.found) | [0, actors.expected] |
entities.expected | Number of expected domain objects in the sample solution of the domain model | \(\mathbb{N}\) |
entities.found | Number of relevant domain objects included in the the solution provided by the participant | [0, entities.expected] |
entities.missing | Number of relevant domain objects missing from the solution provided by the participant (i.e., entities.expected-entities.found) | [0, entities.expected] |
associations.expected | Number of expected associations in the sample solution of the domain model | \(\mathbb{N}\) |
associations.found | Number of relevant associations included in the the solution provided by the participant | [0, associations.expected] |
associations.missing | Number of relevant associations missing from the solution provided by the participant (i.e., associations.expected-associations.found) | [0, associations.expected] |
We assume the following causal relationships between variables:
Relationship | Hypothesis |
---|---|
Age \(\rightarrow\) Program | The older a participant the more likely it is that they have advanced further in their studies |
Age \(\rightarrow\) Exp(Prog/SE/RE)(Aca/Ind) | The older a participant the more likely it is that they have gained more (academic or industrial) experience in programming, software engineering, and requirements engineering |
Program \(\rightarrow\) Exp(Prog/SE/RE)Aca | The older a participant the more likely it is that they have gained more academic experience in programming, software engineering, and requirements engineering |
ExpSE(Aca/Ind) \(\rightarrow\) Exp(Prog/RE)(Aca/Ind) | The higher the experience in software engineering, the higher the experience in programming and requirements engineering as those are sub-areas of SE |
ExpREAca, ExpREInd \(\rightarrow\) REQuizPerformance | The higher the experience in requirements engineering, the better the performance in the RE quiz |
ExpRE(Aca/Ind) \(\rightarrow\) actors/associations/entities.missing | The higher the (industrial or academic) experience in requirements engineering, the fewer actors, associations, and entities are missing |
Passive \(\rightarrow\) actors/associations/entitiesmissing | If the requirement is written using passive voice, less actors, associations, and entities are found |
actors/entities.missing \(\rightarrow\) associations.missing | If an actor or entity was missed then associations between other actors/entities and the missed one are consequently also missing |
We can summarize our relevant variables and causal relationships in the following directed, acyclic graph (DAG):
dag <- dagify(
Program ~ Age,
ExpSEAca ~ Program + Age,
ExpSEInd ~ Program + Age,
ExpProgAca ~ Program + Age + ExpSEAca,
ExpProgInd ~ Program + Age + ExpSEInd,
ExpREAca ~ Program + Age + ExpSEAca,
ExpREInd ~ Program + Age + ExpSEInd,
REQuizPerformance ~ ExpREAca + ExpREInd,
actors.missing ~ ExpREAca + ExpREInd + passive,
entities.missing ~ ExpREAca + ExpREInd + passive,
associations.missing ~ ExpREAca + ExpREInd + passive + actors.missing + entities.missing,
exposure = "passive", outcome = c("actors.missing", "entities.missing", "associations.missing"),
labels = c(Age = "Age", Program = "Program", ExpSEAca = "Academic experience in SE", ExpSEInd = "Industrial experience in SE", ExpProgAca = "Academic experience in Programming", ExpProgInd = "Industrial experience in Programming", ExpREAca = "Academic experience in RE", ExpREInd = "Industrial experience in RE", REQuizPerformance = "Performance in RE Quiz", passive = "Passive Voice", actors.missing = "Number of missing actors", entities.missing = "Number of missing domain objects", associations.missing = "Number of missing associations"),
coords = list(
x=c(Age=0, Program=0.2, ExpProgAca=2.4, ExpProgInd=2.4, ExpSEAca=2, ExpSEInd=2, ExpREAca=2.4, ExpREInd=2.4, REQuizPerformance=2, passive=3, actors.missing=4, associations.missing=4,entities.missing=4),
y=c(Age=-1, Program=-2, ExpProgAca=0, ExpProgInd=-0.5, ExpSEAca=-1, ExpSEInd=-1.5, ExpREAca=-2, ExpREInd=-2.5, REQuizPerformance=-3, passive=-3.5, actors.missing=-1.5, associations.missing=-2.5,entities.missing=-3.5)
)
)
dag.plot.full <- ggdag_status(dag, use_labels="label", text=FALSE) +
guides(fill = "none", color="none") +
theme_dag()
dag.plot.full
During the identification step, we determine the variables relevant to be included in our regression model depending on the hypotheses we want to answer
We determine the adjustment sets, which automatically applies four criteria of causal reasoning to eliminate all potential variables that would introduce any bias like colliders.
adjustmentSets(dag, exposure="passive", outcome="actors.missing", effect="direct")
## {}
adjustmentSets(dag, exposure="passive", outcome="entities.missing", effect="direct")
## {}
adjustmentSets(dag, exposure="passive", outcome="associations.missing", effect="direct")
## { ExpREAca, ExpREInd, actors.missing, entities.missing }
Based on these adjustment sets, we consider the following subset of the original DAG as complete for our causal inference:
dag <- dagify(
actors.missing ~ ExpREAca + ExpREInd + passive,
entities.missing ~ ExpREAca + ExpREInd + passive,
associations.missing ~ ExpREAca + ExpREInd + passive + actors.missing + entities.missing,
exposure = "passive", outcome = c("actors.missing", "entities.missing", "associations.missing"),
labels = c(Age = "Age", Program = "Program", ExpSEAca = "Academic experience in SE", ExpSEInd = "Industrial experience in SE", ExpProgAca = "Academic experience in Programming", ExpProgInd = "Industrial experience in Programming", ExpREAca = "Academic experience in RE", ExpREInd = "Industrial experience in RE", REQuizPerformance = "Performance in RE Quiz", passive = "Passive Voice", actors.missing = "Number of missing actors", entities.missing = "Number of missing domain objects", associations.missing = "Number of missing associations"),
coords = list(
x=c(ExpREAca=2.4, ExpREInd=2.4, passive=3, actors.missing=4, associations.missing=4,entities.missing=4),
y=c(ExpREAca=-2, ExpREInd=-2.5, passive=-3.5, actors.missing=-1.5, associations.missing=-2.5,entities.missing=-3.5)
)
)
dag.plot.reduced <- ggdag_status(dag, use_labels="label", text=FALSE) +
guides(fill = "none", color="none") +
theme_dag()
dag.plot.reduced
This will be the DAG we use for the final step, the estimation.
Siebert, J. (2023). Applications of statistical causal inference in software engineering. Information and Software Technology, 107198.↩︎
Femmer, H., Kučera, J., & Vetrò, A. (2014, September). On the impact of passive voice requirements on domain modelling. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (pp. 1-4).↩︎