MaRV Scripts and Dataset
Description
Contacts:
website: https://labsoft-ufmg.github.io/
email: henrique.mg.bh@gmail.com
The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.
The dataset is located at dataset/MaRV.json.
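As a quick first check, the dataset file can be loaded with Python's standard library. The sketch below assumes only that the file is valid JSON; it does not assume any particular schema. The helper names `load_marv` and `summarize` are hypothetical and not part of the replication package.

```python
import json
from pathlib import Path


def load_marv(path="dataset/MaRV.json"):
    """Load the MaRV dataset file and return the parsed JSON value."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(data):
    """Describe the top-level shape without assuming a particular schema."""
    if isinstance(data, list):
        return f"list with {len(data)} entries"
    if isinstance(data, dict):
        return f"object with keys: {sorted(data)}"
    return type(data).__name__
```

Calling `summarize(load_marv())` from the root of the replication package should give a first impression of how the file is organized before writing any schema-specific code.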
The guidelines for replicating the study are provided below:
Requirements
1. Software Dependencies:
- Python 3.10+ with the packages listed in requirements.txt
- Git: Required to clone repositories.
- Java 17: RefactoringMiner requires Java 17 to perform the analysis.
- PHP 8.0: Required to host the Web tool.
- MySQL 8: Required to store the Web tool data.
2. Environment Variables:
- Create a .env file based on .env.example in the src folder and set the variables:
  - CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
  - CLONE_DIR: Directory where repositories will be cloned.
  - JAVA_PATH: Path to the Java executable.
  - REFACTORING_MINER_PATH: Path to RefactoringMiner.
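Before running the scripts, it can help to confirm that all four variables are set. The following is a minimal sketch using only the standard library; the actual scripts may read the file differently (for example with a dotenv package), and the values in `EXAMPLE_ENV` are hypothetical placeholders, not paths from the study.

```python
from pathlib import Path

# The four variables named in the requirements above.
REQUIRED_VARS = ("CSV_PATH", "CLONE_DIR", "JAVA_PATH", "REFACTORING_MINER_PATH")

# Hypothetical values for illustration; adjust them to your machine.
EXAMPLE_ENV = """
# MaRV settings
CSV_PATH=data/repositories.csv
CLONE_DIR=/tmp/marv_clones
JAVA_PATH=/usr/bin/java
"""


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(values):
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


def load_env(path="src/.env"):
    """Read a .env file from disk and parse it."""
    return parse_env(Path(path).read_text(encoding="utf-8"))
```

With the example text above, `missing_vars(parse_env(EXAMPLE_ENV))` reports that REFACTORING_MINER_PATH still needs to be set.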
Refactoring Technique Selection
1. Environment Setup:
- Ensure all dependencies are installed. Install the required Python packages with:
pip install -r requirements.txt
2. Configuring the Repositories CSV:
- The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).
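The expected CSV layout can be checked before a long run. This is a sketch under stated assumptions: the regular expression is only a rough approximation of valid GitHub names, the sample content is hypothetical, and `read_repo_names` is an illustrative helper rather than a function from the replication package.

```python
import csv
import io
import re

# Rough "username/repo" shape: exactly one slash, no whitespace.
REPO_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")

# Hypothetical CSV content; columns other than "name" are ignored.
SAMPLE_CSV = "name,note\nexample-user/example-repo,first\nother-user/other.repo,second\n"


def read_repo_names(handle):
    """Return repository names from the 'name' column of an open CSV file."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "name" not in reader.fieldnames:
        raise ValueError("CSV must contain a column named 'name'")
    names = [row["name"].strip() for row in reader if row["name"]]
    invalid = [n for n in names if not REPO_PATTERN.match(n)]
    if invalid:
        raise ValueError(f"Not in username/repo format: {invalid}")
    return names


# In-memory demonstration; for a real file, pass open(csv_path, newline="").
names = read_repo_names(io.StringIO(SAMPLE_CSV))
```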
3. Executing the Script:
- Configure the environment variables in the .env file and set up the repositories CSV, then run:
python3 src/run_rm.py
- The RefactoringMiner output from the 126 repositories of our study is available at:
https://zenodo.org/records/14395034
4. Script Behavior:
- The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
- Results and Logs:
  - Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
  - Logs for each repository, including error messages, are saved as .log files in the same directory.
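The clone, branch lookup, and analysis steps described above can be sketched as follows. This is not the code of src/run_rm.py. It assumes that REFACTORING_MINER_PATH points to RefactoringMiner's command-line launcher and that the launcher is called in its documented `-a <git-repo-folder> <branch> -json <path-to-json-file>` form. The directory naming (slash replaced by an underscore) and the way JAVA_PATH is passed to the tool are not specified in this README, so the former is a guess and the latter is omitted.

```python
import subprocess
from pathlib import Path


def clone_command(repo_name, clone_dir):
    """Build the git command that clones username/repo into clone_dir."""
    # Assumption: one flat directory per repository, slash replaced by "_".
    target = Path(clone_dir) / repo_name.replace("/", "_")
    url = f"https://github.com/{repo_name}.git"
    return ["git", "clone", url, str(target)], target


def default_branch(repo_path):
    """Ask git which branch is checked out right after cloning."""
    result = subprocess.run(
        ["git", "-C", str(repo_path), "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def miner_command(miner_path, repo_path, branch):
    """Build the RefactoringMiner call: -a <repo> <branch> -json <file>."""
    repo_path = Path(repo_path)
    output = repo_path.parent / (repo_path.name + ".json")
    command = [str(miner_path), "-a", str(repo_path), branch, "-json", str(output)]
    return command, output


def analyze(repo_name, clone_dir, miner_path):
    """Clone one repository, analyze it, and keep a .log file next to the result."""
    clone_cmd, target = clone_command(repo_name, clone_dir)
    log_path = target.parent / (target.name + ".log")
    with log_path.open("w", encoding="utf-8") as log:
        subprocess.run(clone_cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
        branch = default_branch(target)
        miner_cmd, output = miner_command(miner_path, target, branch)
        subprocess.run(miner_cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
    return output
```

Building the commands as argument lists instead of shell strings avoids quoting problems with unusual repository names.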
5. Count Refactorings:
- To count instances for each refactoring technique, run:
python3 src/count_refactorings.py
- The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
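The counting step can be illustrated on a single RefactoringMiner result. The sketch assumes the JSON layout that RefactoringMiner usually emits (a top-level "commits" list whose entries carry a "refactorings" list with a "type" field). The sample data is hypothetical, and the real src/count_refactorings.py may aggregate differently.

```python
from collections import Counter

# Hypothetical, heavily trimmed RefactoringMiner output for one repository.
SAMPLE_OUTPUT = {
    "commits": [
        {"sha1": "abc123", "refactorings": [
            {"type": "Extract Method"}, {"type": "Rename Variable"}]},
        {"sha1": "def456", "refactorings": [{"type": "Extract Method"}]},
    ]
}


def count_by_type(miner_output):
    """Count refactoring instances per technique in one result file."""
    counts = Counter()
    for commit in miner_output.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            counts[refactoring["type"]] += 1
    return counts


def count_rows(repo_name, counts):
    """Turn the counts into (repository, technique, count) rows for a CSV."""
    return [(repo_name, technique, n) for technique, n in sorted(counts.items())]
```

The rows returned by `count_rows` can be written with `csv.writer` to produce one line per repository and technique.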
Data Gathering
- To collect snippets before and after refactoring and their metadata, run:
python3 src/diff.py '[refactoring technique]'
Replace [refactoring technique] with the desired technique name (e.g., Extract Method).
- The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.
- Dataset Availability:
  - The snippets and metadata from the 126 repositories of our study are available in the dataset directory.
- To generate the SQL file for the Web tool, run:
python3 src/generate_refactorings_sql.py
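The per-repository and per-commit layout described above can be sketched as follows. This is an illustration only: it assumes RefactoringMiner's usual JSON fields ("commits", "sha1", "refactorings", "type"), the sample data is hypothetical, and the way src/diff.py names repository directories is a guess (slash replaced by an underscore).

```python
from pathlib import Path

# Hypothetical, heavily trimmed RefactoringMiner output for one repository.
SAMPLE_OUTPUT = {
    "commits": [
        {"sha1": "abc123", "refactorings": [
            {"type": "Extract Method"}, {"type": "Rename Variable"}]},
        {"sha1": "def456", "refactorings": [{"type": "Extract Method"}]},
    ]
}


def refactorings_of(miner_output, technique):
    """Yield (commit_sha, refactoring) pairs whose type matches the technique."""
    for commit in miner_output.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            if refactoring.get("type") == technique:
                yield commit["sha1"], refactoring


def snippet_dir(root, repo_name, commit_sha):
    """One directory per repository, one subdirectory per commit SHA."""
    return Path(root) / repo_name.replace("/", "_") / commit_sha
```

Because a commit may contain several refactorings of the same technique, more than one pair can map to the same commit directory.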
Web Tool for Manual Evaluation
- The Web tool scripts are available in the web directory.
- Populate the data/output/snippets folder with the output of src/diff.py.
- Run the sql/create_database.sql script in your database.
- Import the SQL file generated by src/generate_refactorings_sql.py.
- Run dataset.php to generate the MaRV dataset file.
- The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.
Files
replication_package.zip (1.9 GB)
md5:13759d69fd5f14942834f37485a9754a
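After downloading, the archive can be checked against the MD5 checksum listed above. A minimal sketch, assuming the file was saved as replication_package.zip in the current directory; the helper names are hypothetical.

```python
import hashlib

# Checksum published alongside replication_package.zip.
EXPECTED_MD5 = "13759d69fd5f14942834f37485a9754a"


def md5_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so a 1.9 GB file never sits in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_intact(path="replication_package.zip", expected=EXPECTED_MD5):
    """Return True when the file on disk matches the published checksum."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected
```

A mismatch usually indicates an incomplete or corrupted download, in which case downloading the archive again is the simplest fix.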