Differentiating Refactoring Practices: A Comparative Analysis of ML and Non-ML Frameworks

Timilehin, Ogundare; Lamothe, Maxime

doi:10.5281/zenodo.19611353

Published April 15, 2026 | Version v2

Publication Open

Research Overview:
This research is a longitudinal empirical study comparing refactoring practices in machine learning (ML) frameworks and non-ML frameworks. We mined over 1 million commits and over 4 million refactorings from 130 ML frameworks and 800 non-ML frameworks (65 ML and 400 non-ML Java frameworks; 65 ML and 400 non-ML Python frameworks). We used RefactoringMiner to extract both the Java and the Python refactorings, and implemented a custom Python script that uses the GitHub API to extract the commits from the repositories. We further divided the extracted commits and refactorings into development stages (early, middle, and late), following the empirical proposition of a prior study.

This research aims to understand how refactoring differs across development stages in ML versus non-ML frameworks, and how the findings can benefit software engineering practitioners working on software quality assurance. The study is primarily quantitative, as we analyzed only the refactorings that were extracted by RefactoringMiner.

Follow the guide below to reproduce our study findings:

(A) Requirements

Build RefactoringMiner version 3.1.2 from GitHub to extract Python and Java refactorings
Build Lizard framework version 1.21.3 to evaluate code quality before and after refactoring
Install all libraries in the requirements.txt

(B) Dataset Availability

The zip file labeled "Datasets_and_analysis" contains the extracted refactoring and commit JSON files for both Java and Python frameworks.
All commit JSON files are located in the folder "./project_commits", while the refactoring JSON files are located in the folder "./raw_json_data".

(C) Implementation Steps

First, generate a GitHub access token so that RefactoringMiner and our custom Python scripts can call the GitHub API. Then extract the contents of the "Datasets_and_analysis" zip file.
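As an illustration of how a token is typically passed to the GitHub REST API, the sketch below builds an authenticated request for one page of a repository's commit list. The function name, the GITHUB_TOKEN environment variable, and the example owner and repository are assumptions made for this example; the study's own scripts may read the token and page through results differently.

```python
import os
import urllib.request

GITHUB_API = "https://api.github.com"


def build_commits_request(owner, repo, token, page=1, per_page=100):
    """Build an authenticated request for one page of a repository's commits."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/commits?per_page={per_page}&page={page}"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, headers=headers)


# GITHUB_TOKEN is an assumed variable name, not one defined by the study.
token = os.environ.get("GITHUB_TOKEN", "")
# "example-owner" and "example-repo" are placeholders for a studied framework.
request = build_commits_request("example-owner", "example-repo", token)
print(request.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the call. Keeping the token in an environment variable avoids committing it alongside the scripts.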

Step 1: RefactoringMiner can be built as a Gradle or Maven project. We provide a custom Python script (refactoringminer.py, listed among the uploaded files as "refactorinminer.py") that wraps RefactoringMiner's data extraction function so that the refactorings of multiple frameworks can be extracted in a loop. See the official RefactoringMiner GitHub page for more implementation details.

Run refactoringminer.py to extract the refactorings of the studied subjects
Run extract_commits.py to extract the commits of the studied subjects from GitHub
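The loop over frameworks can be pictured with the following minimal sketch. It is not the provided script: it assumes the repositories are already cloned locally, that a built RefactoringMiner executable is available at a path you supply, and that the command-line form "-a <git-repo-folder> <branch> -json <path-to-json-file>" from the RefactoringMiner README applies to the version you built. The function names and default branch are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(executable, repo_dir, branch, output_json):
    """Return the argument list for one RefactoringMiner run over a branch."""
    return [str(executable), "-a", str(repo_dir), branch, "-json", str(output_json)]


def mine_repositories(executable, repo_dirs, output_dir, branch="master"):
    """Run RefactoringMiner once per cloned repository and return the JSON paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for repo_dir in map(Path, repo_dirs):
        output_json = output_dir / f"{repo_dir.name}.json"
        command = build_command(executable, repo_dir, branch, output_json)
        # check=True stops the loop as soon as one run fails.
        subprocess.run(command, check=True)
        results.append(output_json)
    return results
```

A call such as mine_repositories("./RefactoringMiner", ["repos/framework_a"], "raw_json_data") would write one JSON file per repository; the paths shown are placeholders.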

Step 2: Run "./pull_data_raw_json.py" to extract the required nodes from the commit and refactoring JSON files into CSV files inside the folder named "extracted_refactorings".
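To show what flattening the JSON into CSV can look like, here is a minimal sketch. It assumes the usual RefactoringMiner report layout (a top-level "commits" list whose entries carry "sha1" and a "refactorings" list with "type" and "description"); the chosen columns, the function names, and the sample record are invented for illustration and do not reproduce pull_data_raw_json.py.

```python
import csv
import io
import json

FIELDS = ["sha1", "type", "description"]


def refactoring_rows(report):
    """Yield one flat row per refactoring in a parsed RefactoringMiner report."""
    for commit in report.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            yield {
                "sha1": commit.get("sha1", ""),
                "type": refactoring.get("type", ""),
                "description": refactoring.get("description", ""),
            }


def write_csv(rows, handle):
    """Write the flat rows to an open text handle, header first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample record, not taken from the dataset.
sample = json.loads("""
{"commits": [{"sha1": "abc123",
              "refactorings": [{"type": "Rename Method",
                                "description": "Rename Method foo() to bar()"}]}]}
""")
buffer = io.StringIO()
write_csv(refactoring_rows(sample), buffer)
print(buffer.getvalue())
```

Using a generator keeps memory use flat even when a report holds many commits, which matters for reports of the size described in this study.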

Step 3: Run "group_into_stages.py" to group the refactorings and commits into their respective development stages (early, middle, and late).
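The stage boundaries used in the study come from a prior study and are not stated on this page. Purely to illustrate the idea of stage grouping, the sketch below assumes a naive rule that splits a project's commits, ordered by date, into three equal parts. This rule is an assumption of the example and should not be read as the method implemented in group_into_stages.py.

```python
from datetime import datetime

STAGES = ("early", "middle", "late")


def assign_stages(commit_dates):
    """Label each commit date by splitting the ordered history into thirds.

    Assumption: equal thirds by commit order. The study's real boundaries
    follow a prior study and may differ.
    """
    ordered = sorted(commit_dates)
    total = len(ordered)
    labelled = []
    for index, date in enumerate(ordered):
        # index * 3 // total is 0, 1, or 2 for any index below total.
        labelled.append((date, STAGES[index * 3 // total]))
    return labelled


# Six made-up commit dates, one per month.
dates = [datetime(2020, month, 1) for month in range(1, 7)]
for date, stage in assign_stages(dates):
    print(date.date(), stage)
```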

Step 4: Analysis. We provide the implementation scripts for each RQ as follows:

Python scripts with the prefixes "rq1_", "rq2_", and "rq3_" are the implementation scripts that analyze and generate figures for RQ1, RQ2, and RQ3 respectively.
Other Python scripts without these prefixes are named after their functionality.
lizard_refactoring_impact.py contains the Python implementation for the Lizard framework, which we used to estimate the code quality of the subject systems before and after refactoring.
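A before/after comparison with Lizard can be sketched as follows. This is a minimal example, not the contents of lizard_refactoring_impact.py: it assumes cyclomatic complexity and NLOC as the quality indicators, and the helper names are hypothetical. The Lizard call used, lizard.analyze_file.analyze_source_code, is the one shown in the Lizard README.

```python
def function_metrics(filename, source):
    """Return {function name: (cyclomatic complexity, NLOC)} for one source file."""
    # Third-party package; imported lazily so quality_delta works without it.
    import lizard

    info = lizard.analyze_file.analyze_source_code(filename, source)
    return {f.name: (f.cyclomatic_complexity, f.nloc) for f in info.function_list}


def quality_delta(before, after):
    """Return the change in total complexity and NLOC (after minus before)."""
    def totals(metrics):
        ccn = sum(c for c, _ in metrics.values())
        nloc = sum(n for _, n in metrics.values())
        return ccn, nloc

    ccn_before, nloc_before = totals(before)
    ccn_after, nloc_after = totals(after)
    return {"ccn": ccn_after - ccn_before, "nloc": nloc_after - nloc_before}
```

In use, function_metrics would be called on the version of a file before the refactoring commit and again on the version after it, and the two results passed to quality_delta. Negative values indicate that the totals went down after the refactoring.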

Step 5: Manual investigation records for RQ1, RQ2, and RQ3 can be found in the folder labeled "manual_verification".

Note:

All Figures generated by the implementation scripts can be found inside the generated "./outputs" folder
Table data are generated based on analysis for each RQ

Files (2.4 GB)

Name	MD5	Size
Datasets_and_analysis.zip	55ccd5cbf770871129f44561175cddbd	2.4 GB
extract_commits.py	3a66d0fba58117b7ed9893ad99b21b21	7.9 kB
global_utility.py	731632e7a345c59d5cd624d547c7a39c	9.9 kB
group_into_stages.py	12e734de8d45a2b1ebd3d09cc40fdff0	10.2 kB
lizard_refactoring_impact.py	bb1796e42ed69127fd5ab3206530bf68	11.0 kB
manual_investigation_rq1_and rq2.py	0d66a602812e2c170c0f4015faf4f053	12.4 kB
manual_investigation_rq3.py	e2017acfd2f0bc02c9ddd7ef85f1024f	16.2 kB
Manual_verification.zip	36655ffe277a389b0559c504f8ae947d	2.4 MB
pull_data_raw_json.py	c7e40dae61bebc87602d7d0e870084ee	23.3 kB
refactoring_analysis.iml	34fcada8df0fd702075bcf0de2629347	403 Bytes
refactorinminer.py	2484681728f463745093e853c36dca3f	6.0 kB
requirements.txt	b45f3edf5a5a763c344d0df2067b65e3	1.5 kB
rq1_clustering_tendencies.py	15f71e311b2509080b1fc78dd354a414	24.7 kB
rq1_frequency_distribution_analysis.py	244be3d5296f654d8f3baab068ab6415	3.8 kB
rq1_refactoring_distribution.py	1e37034cb764d0b8c8a048e4d4523343	24.7 kB
rq2_analysis_all_stage_refactorings.py	48f459d6290c38087bad6f3126d5c4b6	7.8 kB
rq2_analysis_refactorings_per_stage.py	a301b075e711872897459caf16345775	11.4 kB
rq3_distribution_api_breaking_developers.py	dada6b883316605aadbf70de7199bc42	38.7 kB
rq3_distribution_api_breaking_refactoring.py	4f3b8d50be3398b501a9d6fde14642f5	38.7 kB
rq3_distribution_api_breaking_refactoring_commit.py	2ba8afc02698854d19772d41aa915d4d	38.6 kB
rq3_distribution_fixing_api_breaking_changes.py	ea0872f3d79e262bd288bb18c5377dce	38.6 kB
rq3_top_refactoring_types__fix_api_breaking_changes.py	5a9df63fee0b101e4b6d1bea2d07d9d0	38.6 kB
rq3_top_refactoring_types_api_breaking.py	b4ce038a9dbf5c3a63791aae4dfa8056	38.7 kB
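
The MD5 checksums listed for each file can be used to confirm that a download is intact before running the scripts. The following is a minimal sketch using Python's standard hashlib module; the helper names are hypothetical, and the two expected values are copied from the listing above.

```python
import hashlib

# Checksums copied from the file listing; extend with the other files as needed.
EXPECTED_MD5 = {
    "requirements.txt": "b45f3edf5a5a763c344d0df2067b65e3",
    "Datasets_and_analysis.zip": "55ccd5cbf770871129f44561175cddbd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True when the file's digest matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

Reading in chunks keeps memory use low for the 2.4 GB archive. A call such as verify("requirements.txt", EXPECTED_MD5["requirements.txt"]) returns False if the file was truncated or altered in transit.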
