path_data <- "../../data"
path_figures <- "../../figures/demographics"

This file prepares the data for the Bayesian data analysis. It loads all individual data sheets, joins them, cleans them up, and saves the processed file for easier access. Then, it visualizes the distributions in the data to provide a feel for it.

Data Preparation

Loading

We start by loading the data which is disclosed at https://doi.org/10.5281/zenodo.7499290. This Zenodo repository contains a file called CorrectAnswers.pdf with all the data obtained during the experiment (responses as well as demographic data). We manually extracted this data into four separate sheets.

participants <- read.csv(file=file.path(path_data, "raw/participants.csv"))
experience <- read.csv(file=file.path(path_data, "raw/experience.csv"))
requirements <- read.csv(file=file.path(path_data, "raw/requirements.csv"))
responses <- read.csv(file=file.path(path_data, "raw/responses.csv"))

We corrected two mistakes in the data:

  1. The general experience points Gxp of participant P1 should be 11, not 10.
  2. The general experience points Gxp of participant A7 should be 18, not 12.

We identified these mistakes by comparing the values as reported in the CorrectAnswers.pdf with the raw responses contained in the Form Responses-Active.csv and Form Responses-Passive.csv.

Next, we join all four sheets to one data frame.

d <- responses %>% 
  full_join(participants) %>% # add information about the participants
  full_join(experience) %>% # add information about the participants' experience
  full_join(requirements) # get information about the expected actors etc. of the requirement
## Joining with `by = join_by(PID)`
## Joining with `by = join_by(PID)`
## Joining with `by = join_by(RID)`

This joined table contains one row per created domain model. With \(n_p=15\) participants and \(n_r=7\) requirements, we are observing \(15\times7=105\) created domain models.

Cleanup

Next, we clean up the joined data by performing the following operations:

  1. Introducing an explicit passive variable that encodes whether a domain model was created based on a requirements specification using active or passive voice
  2. Cutting off the variables containing the number of missing actors, domain objects, and associations at the number of expected actors, domain objects, and associations.

We performed the second step by counting the number of actors, domain objects, and associations that were expected in the sample solution according to CorrectAnswers.pdf. This is the maximum value for missing actors, domain objects, and associations in a domain model, since a domain model cannot miss more entities than there were entities expected.

d <- d %>%  mutate(
    passive = if_else(startsWith(PID, 'P'), TRUE, FALSE), # introduce a factor variable for passive voice

    actors.missing = if_else(MAct > EAct, EAct, MAct), # cutoff missing actors by expected actors
    actors.found = EAct-actors.missing, # inverse missed to found actors
    actors.expected = EAct, # spell out the variable
  
    associations.missing = if_else(MAsc > EAsc, EAsc, MAsc),
    associations.found = EAsc-associations.missing,
    associations.expected = EAsc,
    
    objects.missing = if_else(MEnt > EEnt, EEnt, MEnt),
    objects.found = EEnt-objects.missing,
    objects.expected = EEnt,
    
    REQuizPerformance = RExp)

Finally, we select only the relevant attributes of the resulting table.

d <- d %>% 
  select(
    PID, RID,
    Age, Program, REQuizPerformance,
    ExpProgAca, ExpProgInd, ExpSEAca, ExpSEInd, ExpREAca, ExpREInd,
    passive,
    actors.expected, actors.found, actors.missing,
    associations.expected, associations.found, associations.missing,
    objects.expected, objects.found, objects.missing)

Data Export

Store the data to make it reusable in the analyses.

write_csv(d, file=file.path(path_data, "ipv-data.csv"))

Data Visualization

Participants

The first figures visualize the distribution of the recorded context factors of the participating experiment subjects.

participants %>% ggplot(aes(x=Age)) +
  geom_histogram(stat="count")
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`

participants %>% ggplot(aes(x=Program)) +
  geom_histogram(stat="count")
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`

cat_exp <- c("no experience", "up to 6 months", "6 to 12 months", "more than 12 months")
experience$ExpProgAca <- factor(experience$ExpProgAca, levels=cat_exp, ordered=TRUE)
experience$ExpProgInd <- factor(experience$ExpProgInd, levels=cat_exp, ordered=TRUE)
experience$ExpSEAca <- factor(experience$ExpSEAca, levels=cat_exp, ordered=TRUE)
experience$ExpSEInd <- factor(experience$ExpSEInd, levels=cat_exp, ordered=TRUE)
experience$ExpREAca <- factor(experience$ExpREAca, levels=cat_exp, ordered=TRUE)
experience$ExpREInd <- factor(experience$ExpREInd, levels=cat_exp, ordered=TRUE)

experience %>% 
  pivot_longer(
    cols = c(ExpProgAca, ExpProgInd, ExpSEAca, ExpSEInd, ExpREAca, ExpREInd),
    names_to = "exp",
    values_to = "value"
    ) %>% 
  select(exp, value) %>% 
  group_by(exp, value) %>% 
  summarize(n=n()) %>% 
  ggplot(aes(x=exp, y=n, fill=fct_rev(value))) +
    geom_bar(position = "stack", stat = "identity") + 
    coord_flip()
## `summarise()` has grouped output by 'exp'. You can override using the `.groups`
## argument.

Requirements

Additionally, we output the number of expected actors, objects (EEnt), and associations.

requirements
##   RID EAct EEnt EAsc
## 1  R1    1    3    2
## 2  R2    1    3    2
## 3  R3    2    2    2
## 4  R4    0    2    1
## 5  R5    0    3    2
## 6  R6    0    4    3
## 7  R7    1    3    2

Finally, we plot the distribution of missing actors, domain objects, and associations from the 105 domain models produced during the experimental task.

d.missing <- d %>% 
  select(passive, actors.missing, objects.missing, associations.missing) %>% 
  pivot_longer(
    cols = c(actors.missing, objects.missing, associations.missing),
    names_to = "entity",
    values_to = "value"
  )

d.missing %>% 
  ggplot(aes(x=value, fill=passive)) +
  geom_boxplot() +
  facet_wrap(~entity, ncol=1)

Note that this is a distribution of the number of missing actors, domain objects, and associations per observation, not per participant like in the original study. The two plots are not comparable. The following figure replicates the figure from the original manuscript by aggregating all participants.

d %>% 
  group_by(PID) %>% 
  summarize(
    actors.missing = sum(actors.missing),
    objects.missing = sum(objects.missing),
    associations.missing = sum(associations.missing)
  ) %>% mutate(
    passive = if_else(startsWith(PID, 'P'), 1, 0)
  ) %>% 
  pivot_longer(
    cols = c(actors.missing, objects.missing, associations.missing),
    names_to = "entity",
    values_to = "value"
  ) %>% 
  mutate(
    entity = factor(entity, levels=c("actors.missing", "objects.missing", "associations.missing"), ordered=TRUE)
  ) %>% 
   ggplot(aes(x=value, fill=as.factor(passive))) +
   geom_boxplot() +
   facet_wrap(~entity) +
    coord_flip()