Published April 15, 2026 | Version v2
Publication Open

Differentiating Refactoring Practices: A Comparative Analysis of ML and Non-ML Frameworks

Description

Research Overview:
This research is a longitudinal empirical study comparing refactoring practices in machine learning (ML) frameworks and non-ML frameworks. We mined over 1 million commits and over 4 million refactorings from 130 ML frameworks and 800 non-ML frameworks (i.e., 65 ML and 400 non-ML Java frameworks; 65 ML and 400 non-ML Python frameworks). We used RefactoringMiner to extract the Java and Python refactorings, and we implemented a custom Python script that uses the GitHub API to extract the commits from the repositories. We further divided the extracted commits and refactorings into development stages (early, middle, and late), following the empirical proposition of a prior study.

Notably, this research is motivated by the need to understand how refactoring differs per development stage in ML versus non-ML frameworks, and how these findings can benefit software engineering practitioners in the area of software quality assurance. This research is primarily quantitative, as we only analyzed the refactorings that were extracted by RefactoringMiner.

Follow the guide below to reproduce our study findings: 

(A) Requirements

  • Build RefactoringMiner version 3.1.2 from GitHub to extract Python and Java refactorings
  • Build Lizard framework version 1.21.3 to evaluate code quality before and after refactoring
  • Install all libraries in the requirements.txt

(B) Dataset Availability 

  • The zip file labeled "Datasets_and_analysis" contains the extracted refactoring and commit JSON files for both Java and Python frameworks.
  • All commit JSON files are located in the folder "./project_commits", while the refactoring JSON files are located in the folder "./raw_json_data".

(C) Implementation Steps

  • First, generate a GitHub access token so that RefactoringMiner and our custom Python scripts can call the GitHub API. Then extract the data inside the "Datasets_and_analysis" zip file.
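
As a rough illustration of how such a token is typically used, the sketch below builds an authenticated request against the GitHub REST API commits endpoint using only the standard library. The helper names, the GITHUB_TOKEN variable mentioned in the comments, and the page size are assumptions made for this example; they are not taken from extract_commit.py.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_commits_request(owner, repo, token, page=1, per_page=100):
    """Build an authenticated request for one page of a repository's commits."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/commits?per_page={per_page}&page={page}"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_commits_page(owner, repo, token, page=1):
    """Download and decode one page of commits (performs a network call)."""
    request = build_commits_request(owner, repo, token, page=page)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical usage, assuming the token was exported as GITHUB_TOKEN:
#   import os
#   commits = fetch_commits_page("owner", "repo", os.environ["GITHUB_TOKEN"])
```

Keeping the token in an environment variable rather than in the script avoids committing it to version control by accident.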

Step 1: RefactoringMiner can be built as a Gradle or Maven project. We provide a custom Python script (refactoringminer.py) that wraps RefactoringMiner's data extraction function to extract the refactorings of multiple frameworks in a loop; see the official RefactoringMiner GitHub page for more implementation details.

  • Run refactoringminer.py to extract the refactorings of the studied subjects
  • Run extract_commit.py to extract the commits of the studied subjects from GitHub
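
For readers who want to see what looping over several repositories can look like, here is a minimal sketch. It assumes the built RefactoringMiner launcher accepts the `-a <git-repo-folder> <branch> -json <path-to-json-file>` form described in the tool's README; the function names and the one-JSON-file-per-repository layout are hypothetical and may differ from what refactoringminer.py does.

```python
import subprocess
from pathlib import Path


def build_rminer_command(rminer_bin, repo_dir, branch, json_out):
    """Return the CLI call that detects all refactorings on one branch."""
    return [str(rminer_bin), "-a", str(repo_dir), branch, "-json", str(json_out)]


def mine_all(rminer_bin, repos, out_dir):
    """Run RefactoringMiner once per (repo_dir, branch) pair; return output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for repo_dir, branch in repos:
        # Assumed naming scheme: one JSON file named after the repository folder.
        json_out = out_dir / f"{Path(repo_dir).name}.json"
        command = build_rminer_command(rminer_bin, repo_dir, branch, json_out)
        subprocess.run(command, check=True)  # raises if the tool exits non-zero
        outputs.append(json_out)
    return outputs
```

Passing the command as a list rather than a shell string avoids quoting problems with repository paths that contain spaces.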

Step 2: Run the Python file "./pull_data_raw_json.py" to extract the required nodes from the commit and refactoring JSON files into CSV files inside the folder named "extracted_refactorings".
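
The kind of flattening this step performs can be sketched as follows. The example assumes the refactoring files follow RefactoringMiner's usual JSON layout (a top-level "commits" list whose entries carry "repository", "sha1", and a "refactorings" list with "type" and "description"); the chosen columns and function names are illustrative and are not necessarily the nodes that pull_data_raw_json.py extracts.

```python
import csv
import json

# Assumed column set for the illustration.
FIELDS = ["repository", "sha1", "type", "description"]


def flatten_refactorings(document):
    """Yield one flat row per refactoring in a RefactoringMiner-style document."""
    for commit in document.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            yield {
                "repository": commit.get("repository", ""),
                "sha1": commit.get("sha1", ""),
                "type": refactoring.get("type", ""),
                "description": refactoring.get("description", ""),
            }


def json_to_csv(json_path, csv_path):
    """Write the flattened rows of one JSON file to CSV; return the row count."""
    with open(json_path, encoding="utf-8") as handle:
        document = json.load(handle)
    rows = list(flatten_refactorings(document))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```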

Step 3: Run "group_into_stages.py" to group the refactorings and commits into their respective development stages (early, middle, and late).
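
To make the idea of stage grouping concrete, the sketch below splits a project's commits into three near-equal groups by chronological order. This equal-thirds rule is an assumption made for illustration only; the actual boundaries follow the prior study referenced in the overview and are implemented in group_into_stages.py. The "date" key and the function name are likewise hypothetical.

```python
def assign_stages(commits, labels=("early", "middle", "late")):
    """Split chronologically sorted commits into len(labels) near-equal groups."""
    # ISO 8601 timestamps in a single timezone sort correctly as plain strings.
    ordered = sorted(commits, key=lambda commit: commit["date"])
    total = len(ordered)
    staged = []
    for index, commit in enumerate(ordered):
        # Integer arithmetic maps each position to a bucket 0..len(labels)-1.
        bucket = index * len(labels) // total
        staged.append({**commit, "stage": labels[bucket]})
    return staged
```

Refactorings can then inherit the stage of the commit they belong to by joining on the commit hash.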

Step 4: Analysis: We provide the implementation script for each RQ as presented below:  

  • Python scripts with the prefixes "rq1_", "rq2_", and "rq3_" are the implementation scripts that analyze and generate figures for RQ1, RQ2, and RQ3, respectively.
  • Other Python scripts without these prefixes are named after their functionality.
  • lizard_refactoring_impact.py contains the Python implementation for the Lizard framework, which we used to estimate the code quality of the subject systems before and after refactoring.
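
A before/after comparison with Lizard might look like the sketch below. It relies on Lizard's documented Python entry point `lizard.analyze_file.analyze_source_code`; matching functions by name and using cyclomatic complexity as the only metric are simplifying assumptions for this example, not a description of lizard_refactoring_impact.py.

```python
def complexity_by_function(filename, source):
    """Map function name to cyclomatic complexity using Lizard (third-party)."""
    import lizard  # imported lazily so the pure helper below works without it

    analysis = lizard.analyze_file.analyze_source_code(filename, source)
    return {fn.name: fn.cyclomatic_complexity for fn in analysis.function_list}


def complexity_delta(before, after):
    """Return after-minus-before complexity for functions present in both versions."""
    # Functions that were renamed, added, or removed are skipped in this sketch.
    return {name: after[name] - before[name] for name in before if name in after}
```

A negative delta for a function suggests that its complexity dropped after the refactoring commit; functions renamed by the refactoring itself would need a separate matching step.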

Step 5: Manual investigation records for RQ1, RQ2, and RQ3 can be found in the folder labeled "manual_verification".

 

Note:

  • All figures generated by the implementation scripts can be found inside the generated "./outputs" folder
  • Table data are generated from the analysis for each RQ

 

 

Files

Datasets_and_analysis.zip

Files (2.4 GB)

md5:55ccd5cbf770871129f44561175cddbd  2.4 GB
md5:3a66d0fba58117b7ed9893ad99b21b21  7.9 kB
md5:731632e7a345c59d5cd624d547c7a39c  9.9 kB
md5:12e734de8d45a2b1ebd3d09cc40fdff0  10.2 kB
md5:bb1796e42ed69127fd5ab3206530bf68  11.0 kB
md5:0d66a602812e2c170c0f4015faf4f053  12.4 kB
md5:e2017acfd2f0bc02c9ddd7ef85f1024f  16.2 kB
md5:36655ffe277a389b0559c504f8ae947d  2.4 MB
md5:c7e40dae61bebc87602d7d0e870084ee  23.3 kB
md5:34fcada8df0fd702075bcf0de2629347  403 Bytes
md5:2484681728f463745093e853c36dca3f  6.0 kB
md5:b45f3edf5a5a763c344d0df2067b65e3  1.5 kB
md5:15f71e311b2509080b1fc78dd354a414  24.7 kB
md5:244be3d5296f654d8f3baab068ab6415  3.8 kB
md5:1e37034cb764d0b8c8a048e4d4523343  24.7 kB
md5:48f459d6290c38087bad6f3126d5c4b6  7.8 kB
md5:a301b075e711872897459caf16345775  11.4 kB
md5:dada6b883316605aadbf70de7199bc42  38.7 kB
md5:4f3b8d50be3398b501a9d6fde14642f5  38.7 kB
md5:2ba8afc02698854d19772d41aa915d4d  38.6 kB
md5:ea0872f3d79e262bd288bb18c5377dce  38.6 kB
md5:5a9df63fee0b101e4b6d1bea2d07d9d0  38.6 kB
md5:b4ce038a9dbf5c3a63791aae4dfa8056  38.7 kB