Published July 24, 2025 | Version v1

Replication package for EMSE article: "Retrospective Cohort Study In Action: A Pilot Investigation on the SonarQube Issue Impacting the Development Velocity"

  • 1. University of Oulu
  • 2. University of Luxembourg
  • 3. Tampere University
  • 4. National Research Council

Description

This replication package contains all the Python and R source code to conduct the data collection, preprocessing and analysis of this study.

Contents

This repository contains the following:

  • INSTALL: Detailed installation instructions for each of the tools used, as well as the required Python dependencies.

  • Figures: Figures included in the PDF version of the manuscript. The analysis scripts generate further figures that support the results of the study.

  • Codes: The Python and R scripts and Jupyter notebooks used for the data collection, preprocessing and analysis.

  • Datasets: Contains all the data required to start, follow and finish the analysis of this study.

Getting Started

These instructions will get you a copy of the project up and running on your local machine. Beforehand, please follow the installation
instructions in the INSTALL documentation.

Prerequisites

Running the code requires Python 3.9. See the INSTALL documentation for installation instructions.

The dependencies needed to run the code are all listed in the file requirements.txt. They can be installed using pip:

pip install -r requirements.txt

You might also want to consider using a virtual environment (venv).

Running the R code requires installing R and, optionally, RStudio. Installation instructions for R can be found on the official webpage of the CRAN project.

Installing the necessary libraries is a two-step process that must be completed before running any of the R scripts.

To install a package: install.packages("package")
To load a package: library(package)

List of required packages:
effsize
dplyr
psych
corrplot
AICcmodavg
xtable

All required packages can also be installed in a single call:
install.packages(c("effsize", "dplyr", "psych", "corrplot", "AICcmodavg", "xtable"))

Running the code

NOTE: Arrange the project folders as referenced in the code, and update the path names in each of the Python files to match your local setup.
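As a purely illustrative sketch (none of these variable or folder names come from the package itself), the paths could be centralised with pathlib and reused across the scripts:

```python
from pathlib import Path

# Hypothetical base folder; point this at wherever you unpacked the package.
BASE_DIR = Path.home() / "EMSE_replication_package"

# Hypothetical subfolders mirroring the structure expected by the crawlers.
DATASETS_DIR = BASE_DIR / "Datasets"
FIGURES_DIR = BASE_DIR / "Figures"
```

Centralising the paths in one place means a single edit suffices when moving the package to another machine.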

  1. DATA-MINING PROCEDURES (All the content is described in the code)

    • NOTE: Recreate the folder structure as displayed on figshare so that the code works, or else manage the file locations yourself through the code.
      The CSV files produced by the crawlers will be stored in the mentioned folders until the merge stage.

    1.1. Mining projects files with initial confounders from GitHub API

     - Use notebook ```apacheGitHub.ipynb```
     - Remember to create a personal access token first (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token).
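A minimal sketch of how such a token is typically attached to GitHub API requests — the helper name and the example repository are hypothetical, not taken from the notebook:

```python
def github_headers(token):
    # GitHub accepts a personal access token in the Authorization header.
    return {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }

# Example usage (requires the `requests` package and a valid token):
# import requests
# resp = requests.get("https://api.github.com/repos/apache/kafka",
#                     headers=github_headers("<your-token>"))
```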
    

    1.2. Mining registered ASF projects in SonarCloud from its API

         - Execute ```sonarQubeCrawler.ipynb```
         - Remember to create a token first (https://docs.sonarcloud.io/advanced-setup/web-api/).
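As a hedged sketch of how the SonarCloud Web API is typically queried — the helper, the component key, and the metric keys below are illustrative, not taken from the notebook:

```python
SONAR_API = "https://sonarcloud.io/api"

def measures_request(component, metrics):
    # Build the URL and query parameters for the measures endpoint.
    return (
        f"{SONAR_API}/measures/component",
        {"component": component, "metricKeys": ",".join(metrics)},
    )

# Example usage (requires the `requests` package; the token is passed as the
# basic-auth username with an empty password):
# import requests
# url, params = measures_request("apache_kafka", ["ncloc", "sqale_index"])
# resp = requests.get(url, params=params, auth=("<your-token>", ""))
```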
    

    1.3. Mining issues from ASF repositories in Jira and GitHub

     - Use notebook ```issueCrawlerGithub.ipynb``` for issues tracked in GitHub and ```jiraCrawler.ipynb``` for issues tracked in Jira.
       (No token is needed with Atlassian for Jira issues.)
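For the Jira side, issues on the public ASF tracker can be fetched anonymously through the REST search endpoint; the helper and the project key below are illustrative, not taken from the notebook:

```python
ASF_JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"

def jira_search_params(project, start_at=0, max_results=100):
    # JQL query plus paging parameters for one project's issues.
    return {
        "jql": f"project = {project} ORDER BY created ASC",
        "startAt": start_at,
        "maxResults": max_results,
    }

# Example usage (no token needed for public ASF issues):
# import requests
# resp = requests.get(ASF_JIRA_SEARCH, params=jira_search_params("HADOOP"))
# issues = resp.json()["issues"]
```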
    

    1.4. Mining commits from ASF projects in GitHub.

     - Use ```commitCrawler.ipynb``` to crawl over the considered repositories and mine their commits. In addition, it handles the name differences for projects using SonarQube, since their names in SonarCloud differ from how they are stated in GitHub.
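The name reconciliation can be pictured as a lookup table mapping SonarCloud project keys to GitHub repository names; the entries below are hypothetical, not drawn from the actual dataset:

```python
# Hypothetical mapping; the real correspondence is resolved inside the notebook.
SONAR_TO_GITHUB = {
    "apache_kafka": "apache/kafka",
    "apache_commons-io": "apache/commons-io",
}

def github_repo_for(sonar_key):
    # Returns None for keys without a known GitHub counterpart.
    return SONAR_TO_GITHUB.get(sonar_key)
```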
    

    COMPLETE SECTION TWO BEFORE CONTINUING TO THE NEXT POINT IN THIS SECTION

    1.5. Cloning the selected projects for repository-based data collection.

     - Execute ```clone_projects.py```.
    

    1.6. Collecting experience from the contributors' commit activity in the project.

     - For experience in the repository, execute:
     ```get_commits.py```
     ```get_developer_experience.py```
     
     - For experience in GitHub from the projects' contributors, execute: 
     ```contributors_api.py```
     ```repo_experience.py``` 
    

    1.7. Collecting size and complexity from the projects' repository.
    - ```get_confounders_from_repos.py```

    1.8. Merge experience data from the repository and GitHub levels. Execute:
    - ```mergeAttributes.py```
    - ```aggregateExp.py```

  2. PREPROCESSING (Considering all the folders)

    2.1. Merge of common issues in Jira and GitHub per project.

     - Execute ```collectionMerge.ipynb``` until "VELOCITY CALCULATION DATA ADDITION STAGE".
    

    2.2. Velocity calculation from issue data.

     - Execute the ```calculate_velocity.py``` Python file to calculate the development velocity. This can be done either from the
       command line or by adding its function call in the notebook.
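As one hedged reading of what a velocity calculation over issue data involves (the actual definition lives in ```calculate_velocity.py```), the resolution time of a single issue in days could be computed like this:

```python
from datetime import datetime

def resolution_days(created, resolved, fmt="%Y-%m-%dT%H:%M:%S"):
    # Days elapsed between an issue's creation and resolution timestamps.
    delta = datetime.strptime(resolved, fmt) - datetime.strptime(created, fmt)
    return delta.total_seconds() / 86400
```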
    

    2.3. Pruning for the final dataset.

     - Continue executing the rest of the functions in the ```collectionMerge.ipynb``` file mentioned before until the end.


    2.4. Variable unit change for some of the columns.

     - Through the Python file ```format_data_fot_analysis.py```, the format of the ```SQ_cohort_data_in_days.csv``` dataset is modified before further features are added to the dataset in the next step.
    
  3. DATA ANALYSIS (With specifications on what data changes to perform)

    3.1. Crude Analysis.

     - Execute all the commands one by one in the R file ```crudeanalysis_EMSE.R```
    

    3.2. Descriptive Analysis.

     - Execute all the commands one by one in the R file ```descanalysis_EMSE.R```
    

    3.3. Multicollinearity check.

     - Execute all commands one by one in the R file ```data-transformation_EMSE.R```
    

    3.4. Matching.

     -  Execute all commands one by one in the R file ```confounder-matching_EMSE.R```
    

    3.5. Regression Analysis.

     - With the mentioned changes performed on the final dataset, run the regression analysis with the ```regressionanalysis_EMSE.R``` file. It is highly recommended to create a new R project in the working directory and then run the script.
    

Files

EMSE_replication_package.zip (639.7 MB)
md5:86ac4194b5abd2e9939b96a4661ae7f3

Additional details

Software

Programming language
Python, R, Jupyter Notebook