Replication package for EMSE article: "Retrospective Cohort Study In Action: A Pilot Investigation on the SonarQube Issue Impacting the Development Velocity"
Description
This replication package contains all the Python and R source code to conduct the data collection, preprocessing and analysis of this study.
Contents
This repository contains the following:
- INSTALL: Detailed installation instructions for each of the tools used, as well as the required Python dependencies.
- Figures: Figures included in the PDF version of the manuscript. The analysis scripts generate further figures that support the results of the study.
- Continuous cohort outline: The overall design of a cohort study with a continuous outcome.
- Fixed follow-up dates: The start and end of follow-up dates fixed.
- Relative follow-up dates: The start and end of follow-up dates relative to the subject.
- Confounding: Relationships between an exposure, outcome, and confounding factors.
- Study design: Graphical description of the stages addressed in the study design.
- Retrospective cohort study design: The design of the retrospective cohort study and its timeline.
- DAG: Directed acyclic graph demonstrating the relationships between the dependent, independent and confounding variables.
- Velocity calculation: Velocity calculation example diagram.
- Data collection: Data sources and measurement process diagram.
- Data analysis: Flow diagram of the data analysis process undergone and the methods used.
- Subject selection: Workflow of the data filtering and subject selection procedure.
- Exploratory Boxplot analysis: Exploratory visualization of the modelled confounder variables.
- Velocity Normality in studied groups: Probability density function plot and QQ plot of the dependent variable for each studied group.
- VIF results: Barplot identifying confounding variables that might be causing multicollinearity in our model.
- SMD results (Loveplot): Loveplot of the Standardized Mean Differences of the confounding variables.
- Velocity GLM residuals: Residuals plot (left) and Q-Q residuals plot (right): standardized model residuals (y-axis) against fitted values (left x-axis) and theoretical quantiles (right x-axis).
- Jupyter notebooks:
- apacheGithub.ipynb: Downloads project metadata from the selected Apache projects.
- collectionMerge.ipynb: Merges and cleans attributes from the collected data into the final shape of the dataset to be used in the data analysis.
- commitCrawler.ipynb: GitHub API crawler to fetch project data from GitHub repositories.
- issueCrawlerGithub.ipynb: GitHub API crawler to fetch project issue data from GitHub repositories.
- jiraCrawler.ipynb: Jira API crawler to fetch project issue data from Jira repositories.
- sonarQubeCrawler.ipynb: SonarQube API crawler to fetch project data from SonarQube repositories.
- Python scripts:
- aggregateExp.py: Aggregates the overall experience at a project level from all the activity before and during follow-up.
- calculate_velocity.py: Script to calculate the development velocity for projects.
- clone_projects.py: Performs the repository cloning.
- commons.py: Stores the global paths and variables.
- contributors_api.py: Script to extract information from the collaborators involved in the software projects.
- format_data_for_analysis.py: Reformats the time data format to adjust for the data analysis structure required by R functions.
- get_commits.py: Auxiliary script to fetch and store project commits at the repository level (for some repositories the local data differed from the data fetched from GitHub).
- get_confounders_from_repos.py: Script to fetch identified confounders (e.g. Size...) from the cloned repositories.
- get_developer_experience.py: Script to quantify the developer experience.
- mergeAttributes.py: Script to merge attributes into a single data file (some scripts only collected the data without adding it to the single file containing the rest of the attributes).
- repo_experience.py: Aggregates the developer experience into a single metric per project.
- R scripts:
- confounder-matching_EMSE.R: Performs the Matching stage to reach variable balance and optimal overlap.
- crudeanalysis_EMSE.R: Performs the crude (unadjusted) analysis of the study.
- data-transformation_EMSE.R: Performs the multicollinearity check of the study.
- descanalysis_EMSE.R: Performs the descriptive analysis of the study.
- regressionanalysis_EMSE.R: Performs the statistical adjustment of the study.
- Datasets: Contains all the data required to start, follow, and finish the analysis of this study.
Getting Started
These instructions will get you a copy of the project up and running on your local machine. Beforehand, please follow the installation
instructions in the INSTALL documentation.
Prerequisites
Running the code requires Python 3.9 (see the installation instructions in INSTALL).
The dependencies needed to run the code are all listed in the file requirements.txt. They can be installed using pip:
pip install -r requirements.txt
You might also want to consider using a virtual environment (venv).
Running the R code requires installing R and RStudio. Installation instructions can be found on the official webpage of the CRAN project.
Installing the necessary libraries is a two-step process that has to be run for any of the used R scripts:
- For installing a package: install.packages("package")
- For importing a package: library(package)
List of required packages: effsize, dplyr, psych, corrplot, AICcmodavg, xtable.
Running the code
NOTE: Arrange the project folders as they are referenced in the code, and update the path names in each of the Python files accordingly.
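As an illustration of the kind of edits this note refers to, the sketch below shows how path variables in ```commons.py``` might be pointed at your local folders; the variable names used here are hypothetical and may not match those in the actual script.

```python
# Hypothetical sketch of path configuration in commons.py; the real variable
# names in the replication package may differ.
from pathlib import Path

# Root of the replication package on your machine (adjust to your setup).
BASE_DIR = Path("/path/to/EMSE_replication_package")

# Example sub-folders mirroring the figshare folder structure.
DATASETS_DIR = BASE_DIR / "Datasets"
REPOS_DIR = BASE_DIR / "cloned_repositories"
```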
1. DATA-MINING PROCEDURES (All the content is described in the code)
- NOTE: Recreate the folder structure exactly as displayed in figshare so that the code works; otherwise, adjust the file locations in the code yourself.
The CSV files produced by the crawlers will be stored in the mentioned folders until the merge stage.
1.1. Mining project files with initial confounders from the GitHub API
- Use the notebook ```apacheGithub.ipynb```. Remember to create a personal access token first (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token).
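For orientation, the sketch below shows a minimal authenticated GitHub API request using a personal access token; the endpoints and parameters actually used in the notebook may differ.

```python
# Minimal sketch of an authenticated GitHub API request; the notebook's actual
# endpoints, parameters, and pagination may differ.
import requests

TOKEN = "ghp_xxx"  # placeholder personal access token; never commit a real one
headers = {"Authorization": f"token {TOKEN}",
           "Accept": "application/vnd.github+json"}

# List the first page of repositories of the Apache Software Foundation organization.
resp = requests.get("https://api.github.com/orgs/apache/repos",
                    headers=headers,
                    params={"per_page": 100, "page": 1})
resp.raise_for_status()
for repo in resp.json():
    print(repo["full_name"], repo["stargazers_count"])
```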
1.2. Mining registered ASF projects in SonarCloud from its API
- Execute ```sonarQubeCrawler.ipynb```. Remember to create a token first (https://docs.sonarcloud.io/advanced-setup/web-api/).
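As a hedged illustration, the sketch below queries the SonarCloud Web API with a token passed as the basic-auth username; the project key and metric keys are placeholders and may not match what the notebook actually collects.

```python
# Minimal sketch of a SonarCloud Web API request; project key and metrics are
# placeholders, and the notebook's actual calls may differ.
import requests

SONAR_TOKEN = "your_sonarcloud_token"  # placeholder
auth = (SONAR_TOKEN, "")  # token as basic-auth username, empty password

resp = requests.get(
    "https://sonarcloud.io/api/measures/component",
    params={"component": "apache_example-project",  # hypothetical project key
            "metricKeys": "ncloc,complexity,violations"},
    auth=auth,
)
resp.raise_for_status()
for measure in resp.json()["component"]["measures"]:
    print(measure["metric"], measure["value"])
```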
1.3. Mining issues from ASF repositories in Jira and GitHub
- Use the notebook ```issueCrawlerGithub.ipynb``` for issues tracked in GitHub and ```jiraCrawler.ipynb``` for issues tracked in Jira (no token is needed for Atlassian Jira issues).
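The sketch below shows one way to fetch issues anonymously from the public ASF Jira instance; the JQL query, fields, and result limit are illustrative assumptions, not necessarily what ```jiraCrawler.ipynb``` does.

```python
# Minimal sketch of fetching issues from the public ASF Jira REST API; the JQL,
# fields, and pagination are illustrative and may differ from the notebook.
import requests

resp = requests.get(
    "https://issues.apache.org/jira/rest/api/2/search",
    params={"jql": "project = KAFKA AND resolution = Fixed",  # hypothetical query
            "fields": "created,resolutiondate,issuetype",
            "maxResults": 50},
)
resp.raise_for_status()
for issue in resp.json()["issues"]:
    print(issue["key"], issue["fields"]["resolutiondate"])
```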
1.4. Mining commits from ASF projects in GitHub.
- Use ```commitCrawler.ipynb``` to crawl the considered repositories and mine their commits. It also handles the name difference for projects using SQ, since their names in SonarCloud differ from how they appear in GitHub.
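As a rough sketch of the two concerns in this step, paginating over a repository's commits and bridging SonarCloud and GitHub project names, the example below uses a hypothetical name mapping; the notebook's actual logic may differ.

```python
# Minimal sketch of paginating over a repository's commits via the GitHub API and
# of mapping SonarCloud project keys to GitHub repository names; the mapping and
# the endpoint usage are illustrative, not the notebook's exact logic.
import requests

HEADERS = {"Authorization": "token ghp_xxx"}  # placeholder token

# Hypothetical mapping for projects whose SonarCloud key differs from the GitHub name.
SONAR_TO_GITHUB = {"apache_example-project": "apache/example-project"}

def fetch_commits(repo_full_name, per_page=100):
    commits, page = [], 1
    while True:
        resp = requests.get(f"https://api.github.com/repos/{repo_full_name}/commits",
                            headers=HEADERS,
                            params={"per_page": per_page, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        commits.extend(batch)
        page += 1
    return commits

print(len(fetch_commits(SONAR_TO_GITHUB["apache_example-project"])))
```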
COMPLETE SECTION TWO BEFORE CONTINUING TO THE NEXT POINT IN THIS SECTION
1.5. Cloning the selected projects for repository based data collection.
- Execute ```clone_projects.py```.
1.6. Collecting experience from the contributors' commit activity in the project.
- For experience in the repository, execute: ```get_commits.py``` and ```get_developer_experience.py```.
- For experience in GitHub from the projects' contributors, execute: ```contributors_api.py``` and ```repo_experience.py```.
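As one rough proxy for the idea behind this step, the sketch below counts commits per author in a cloned repository; the actual experience metric computed by ```get_developer_experience.py``` and ```repo_experience.py``` may be defined differently.

```python
# Illustrative sketch: count commits per author e-mail in a cloned repository as a
# rough proxy for contributor experience; the scripts' actual metric may differ.
import subprocess
from collections import Counter

def commits_per_author(repo_path):
    # '%ae' prints each commit's author e-mail, one line per commit.
    log = subprocess.run(["git", "-C", repo_path, "log", "--pretty=format:%ae"],
                         capture_output=True, text=True, check=True)
    return Counter(log.stdout.splitlines())

print(commits_per_author("/path/to/cloned/project").most_common(5))
```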
1.7. Collecting size and complexity from the projects' repository.
- Execute ```get_confounders_from_repos.py```.
1.8. Merge experience data from the repository and GitHub level. Execute:
- ```mergeAttributes.py```
- ```aggregateExp.py```
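A minimal sketch of what such a merge can look like with pandas, assuming both attribute files share a project key column; the file and column names are hypothetical and the actual scripts may aggregate differently.

```python
# Hypothetical sketch of merging repository-level and GitHub-level experience data
# on a shared project key; file and column names are placeholders.
import pandas as pd

repo_exp = pd.read_csv("repo_experience.csv")      # hypothetical file name
github_exp = pd.read_csv("github_experience.csv")  # hypothetical file name

merged = repo_exp.merge(github_exp, on="project", how="inner")
merged.to_csv("merged_experience.csv", index=False)
```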
2. PREPROCESSING (Considering all the folders)
2.1. Merge of common issues in Jira and GitHub per project.
- Execute ```collectionMerge.ipynb``` until "VELOCITY CALCULATION DATA ADDITION STAGE".
2.2. Velocity calculation from issue data.
- Execute the ```calculate_velocity.py``` Python file to calculate the development velocity. This can be done either from the command line or by adding its function call to the notebook.
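For intuition only, the sketch below computes one common notion of velocity, resolved issues per day of follow-up; the definition actually implemented in ```calculate_velocity.py``` may differ.

```python
# Illustrative sketch only: resolved issues per day of follow-up as one possible
# velocity measure; calculate_velocity.py may implement a different definition.
import pandas as pd

issues = pd.DataFrame({
    "project": ["example", "example", "example"],
    "resolutiondate": pd.to_datetime(["2020-01-10", "2020-02-01", "2020-03-15"]),
})

follow_up_start = pd.Timestamp("2020-01-01")
follow_up_end = pd.Timestamp("2020-04-01")
follow_up_days = (follow_up_end - follow_up_start).days

resolved = issues[(issues["resolutiondate"] >= follow_up_start) &
                  (issues["resolutiondate"] <= follow_up_end)]
velocity = resolved.groupby("project").size() / follow_up_days
print(velocity)  # resolved issues per day of follow-up
```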
2.3. Pruning for the final dataset.
- Continue executing the rest of the functions in the ```collectionMerge.ipynb``` notebook mentioned before, until the end.
2.4. Variable unit change for some of the columns.
- The Python file ```format_data_for_analysis.py``` modifies the format of the ```SQ_cohort_data_in_days.csv``` dataset before further features are added to it in the next step.
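A hedged sketch of the kind of unit conversion this step refers to, turning a duration column into days; the column name and the exact transformations done by ```format_data_for_analysis.py``` are assumptions.

```python
# Hypothetical sketch of converting a duration column into days; the column name
# and the actual transformations in format_data_for_analysis.py may differ.
import pandas as pd

df = pd.read_csv("SQ_cohort_data_in_days.csv")

# Convert a timedelta-like string column (e.g. "3 days 04:00:00") into float days.
df["follow_up_days"] = pd.to_timedelta(df["follow_up_time"]).dt.total_seconds() / 86400

df.to_csv("SQ_cohort_data_in_days.csv", index=False)
```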
3. DATA ANALYSIS (With specifications on what data changes to perform)
3.1. Crude Analysis.
- Execute all the commands one by one in the R file ```crudeanalysis_EMSE.R```
3.2. Descriptive Analysis.
- Execute all the commands one by one in the R file ```descanalysis_EMSE.R```
3.3. Multicollinearity check.
- Execute all commands one by one in the R file ```data-transformation_EMSE.R```
3.4. Matching.
- Execute all commands one by one in the R file ```confounder-matching_EMSE.R```
3.5. Regression Analysis.
- With the mentioned changes applied to the final dataset, run the regression analysis with the ```regressionanalysis_EMSE.R``` file. It is highly recommended to create a new R project in the working directory and then run the script.
Files
EMSE_replication_package.zip (639.7 MB)
md5:86ac4194b5abd2e9939b96a4661ae7f3