This notebook calculates the inter-rater agreement between the two researchers who extracted data from eligible primary studies.

Data

The data is located in the data-extraction-ira.xlsx file (a snapshot of data-extraction.xlsx taken before any harmonization based on the raters’ discussions). Each paper that was extracted by both raters is assigned to one of them: that rater contributed their extraction to the “Extraction” sheet of the file, the other rater to the “Overlap” sheet.

library(readxl) # provides read_excel()
library(dplyr)  # provides the pipe and verbs used in the cleaning function below

data <- read_excel("../data/state-of-practice/data-extraction-ira.xlsx", sheet="Extraction", skip=2)
overlap <- read_excel("../data/state-of-practice/data-extraction-ira.xlsx", sheet="Overlap", skip=2)

The following function cleans a given data frame. It ensures two aspects:

  1. Columns of the data frame are renamed to syntactically convenient R labels (e.g., test.type instead of “Test Type”).
  2. Columns of the data frame are cast to the appropriate data type.
cat.subjects <- c("Students", "Student Groups", "Practitioners", "Pracititioner Groups", "Both", "Mixed Groups", "Researchers", "Unknown", "Other")
cat.threats <- c("Period", "Sequence", "Skill", "Carryover")
cat.method <- c("NHST", "GLM", "GLMM", "GEE", "Other", "Unknown")
cat.type <- c("Unpaired T", "Paired T", "Mann-Whitney U", "Wilcoxon signed-rank", "ANOVA", "Kruskal-Wallis", "Other")
cat.addressal <- c("Modeled", "Stratified", "Isolated", "Acknowledged", "Neglected", "Ignored")
cat.availability <- c("Proprietary", "Private", "Unavailable", "Broken", "Upon Request", "Reachable", "Open Source", "Archived")
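
The addressal and availability scales are ordinal, which is why the cleaning function below casts them with ordered=TRUE. As a toy illustration (the comparison below is ours, not part of the extraction), ordered factors support rank comparisons:

factor("Modeled", levels=cat.addressal, ordered=TRUE) < factor("Ignored", levels=cat.addressal, ordered=TRUE) # "Modeled" ranks below "Ignored"
## [1] TRUE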

clean_data <- function(df) {
  df_cleaned <- df %>% 
    # Rename the spreadsheet headers to syntactically convenient R labels.
    rename(all_of(c(
      id = "Paper-ID",
      subject.number = "Subject-Number",
      subject.type = "Subject-Type",
      
      method = "Method",
      test.type = "Test Type",
      
      availability.data = "Data-Availability",
      availability.analysis = "Analysis-Availability"
    ))) %>% 
    # Cast each column to its appropriate data type.
    mutate(
      subject.type = factor(subject.type, levels=cat.subjects, ordered=FALSE),
      subject.number = as.integer(subject.number),
      
      method = factor(method, levels=cat.method, ordered=FALSE),
      test.type = factor(test.type, levels=cat.type, ordered=FALSE),
      
      # The four threats to validity are rated on the ordinal addressal scale.
      Period = factor(Period, levels=cat.addressal, ordered=TRUE),
      Sequence = factor(Sequence, levels=cat.addressal, ordered=TRUE),
      Skill = factor(Skill, levels=cat.addressal, ordered=TRUE),
      Carryover = factor(Carryover, levels=cat.addressal, ordered=TRUE),
      
      # Both availability scales are ordinal as well.
      availability.data = factor(availability.data, levels=cat.availability, ordered=TRUE),
      availability.analysis = factor(availability.analysis, levels=cat.availability, ordered=TRUE)
    )
  return(df_cleaned)
}

When executing the data cleaning function, R raises a warning that some NAs were introduced by coercion. This is expected behavior: the column cast to integer (subject.number) may contain entries that cannot be parsed as integers and are therefore coerced to NA.

data <- clean_data(data)
## Warning: There was 1 warning in `mutate()`.
## i In argument: `subject.number = as.integer(subject.number)`.
## Caused by warning:
## ! NAs introduced by coercion
overlap <- clean_data(overlap)
## Warning: There was 1 warning in `mutate()`.
## i In argument: `subject.number = as.integer(subject.number)`.
## Caused by warning:
## ! NAs introduced by coercion
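
The coercion behavior can be reproduced in isolation; the toy vector below is invented for illustration:

as.integer(c("15", "n/a", "30")) # "n/a" cannot be parsed and becomes NA
## Warning: NAs introduced by coercion
## [1] 15 NA 30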

Finally, we limit the sheet containing the full data (“Extraction”) to those entries that were also extracted by the second rater.

rating1 <- data %>% 
  filter(id %in% overlap$id)

rating1$id
## [1]   6   9  14  48  71  77 116 116 116

Inter-Rater Agreement Calculation

This section calculates the inter-rater agreement between the two raters. We apply different metrics to numerical and categorical data.

Numerical

We assess the one numerical attribute (subject.number) using Pearson’s correlation coefficient (PCC). Since both data frames list the overlapping papers in the same order, the two columns can be compared element-wise.

cor(rating1$subject.number, overlap$subject.number, use='complete.obs', method='pearson')
## [1] 1

A PCC value of 1.0 indicates perfect agreement between the two raters. The subject number (i.e., the number of participants involved in an experiment) is straightforward to extract from any manuscript.
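
Note that use='complete.obs' restricts the calculation to pairs in which both raters provide a value. A minimal illustration on invented vectors:

x <- c(10, 20, NA)
y <- c(10, 20, 30)
cor(x, y, use='complete.obs') # only the first two pairs enter the calculation
## [1] 1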

Categorical

The following columns contain categorical data and therefore require a different agreement measure.

col.cat <- c("subject.type", "method", "test.type", "Period", "Sequence", "Skill", "Carryover", "availability.data", "availability.analysis")

We chose to calculate Bennett’s S-score¹ because it is robust against agreement by chance while not assuming equal marginal distributions between the raters. The following function calculates Bennett’s S-score for two equal-length vectors of ratings.

bennett_s_score <- function(rater1, rater2) {
  if (length(rater1) != length(rater2)) {
    stop("Both rater lists must have the same length.")
  }
  
  # Cross-tabulate the two ratings; for factors, unused levels are retained,
  # so the confusion matrix covers all k possible categories.
  conf_matrix <- table(rater1, rater2)
  
  k <- nrow(conf_matrix)  # number of possible categories
  n <- sum(conf_matrix)   # number of rated items
  
  po <- sum(diag(conf_matrix)) / n  # observed proportion of agreement
  pc <- 1 / k                       # chance agreement, uniform over k categories
  
  s_score <- (po - pc) / (1 - pc)
  return(s_score)
}
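
As a sanity check on invented toy ratings: with three possible categories, two raters agreeing on two out of three items yield po = 2/3 and pc = 1/3, hence an S-score of (2/3 - 1/3) / (1 - 1/3) = 0.5.

r1 <- factor(c("A", "A", "B"), levels=c("A", "B", "C"))
r2 <- factor(c("A", "B", "B"), levels=c("A", "B", "C"))
bennett_s_score(r1, r2)
## [1] 0.5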

We calculate the inter-rater agreement via Bennett’s S-score for each column containing categorical data.

agreement <- c()
for (col in col.cat) {
  r1 <- rating1[[col]]
  r2 <- overlap[[col]]
  score <- bennett_s_score(r1, r2)
  agreement <- append(agreement, score)
}
agreement
## [1] 1.0000000 0.8333333 1.0000000 0.6666667 0.8518519 0.7037037 0.6666667
## [8] 1.0000000 1.0000000
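
The loop above can equivalently be written with sapply, which additionally labels each score with its column name (the variable name agreement.named is ours):

agreement.named <- sapply(col.cat, function(col) bennett_s_score(rating1[[col]], overlap[[col]]))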

Finally, we compute the average S-score over all columns.

mean(agreement)
## [1] 0.8580247

The average S-score of approximately 0.86 provides sufficient confidence in the reliability of the data extraction.


  1. Bennett, E. M., Alpert, R., & Goldstein, A. C. (1954). Communications through limited-response questioning. Public Opinion Quarterly, 18(3), 303–308.