MaRV Scripts and Dataset

Nunes, Henrique; Sharma, Tushar; Figueiredo, Eduardo

doi:10.5281/zenodo.14450098

Published December 13, 2024 | Version v2

Dataset Open

Contacts:

website: https://labsoft-ufmg.github.io/

email: henrique.mg.bh@gmail.com

The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

The dataset is located at dataset/MaRV.json in the replication package.

The guidelines for replicating the study are provided below:

Requirements

1. Software Dependencies:

- Python 3.10+: Required to run the scripts, together with the packages listed in requirements.txt.
- Git: Required to clone repositories.
- Java 17: Required by RefactoringMiner to perform the analysis.
- PHP 8.0: Required to host the Web tool.
- MySQL 8: Required to store the Web tool data.

2. Environment Variables:

Create a .env file based on .env.example in the src folder and set the variables:
- CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
- CLONE_DIR: Directory where repositories will be cloned.
- JAVA_PATH: Path to the Java executable.
- REFACTORING_MINER_PATH: Path to RefactoringMiner.
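
Before running the scripts, it can help to check that the .env file defines all four variables. The following is a minimal sketch, not part of the replication package: the helper names `parse_env` and `missing_vars` are hypothetical, and the parser only handles plain KEY=VALUE lines.

```python
# Names of the variables listed above.
REQUIRED_VARS = ("CSV_PATH", "CLONE_DIR", "JAVA_PATH", "REFACTORING_MINER_PATH")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(values: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]
```

Calling `missing_vars(parse_env(open("src/.env").read()))` should return an empty list once the file is complete; the src/.env location follows from the instruction above.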

Refactoring Technique Selection

1. Environment Setup:

Ensure all dependencies are installed. Install the required Python packages with:
```
pip install -r requirements.txt
```

2. Configuring the Repositories CSV:

The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).
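
The expected CSV layout can be checked with a short script. This is a sketch under assumptions: `read_repo_names` is a hypothetical helper, and the regular expression is only a loose approximation of the username/repo format, not GitHub's exact naming rules.

```python
import csv
import io
import re

# Loose approximation of "username/repo" (an assumption, not GitHub's exact rules).
REPO_NAME = re.compile(r"^[\w.-]+/[\w.-]+$")


def read_repo_names(csv_file) -> list[str]:
    """Read the 'name' column and verify each entry looks like username/repo."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "name" not in reader.fieldnames:
        raise ValueError("CSV must contain a column named 'name'")
    names = []
    for row in reader:
        name = (row["name"] or "").strip()
        if not REPO_NAME.match(name):
            raise ValueError(f"Not in username/repo format: {name!r}")
        names.append(name)
    return names


# Hypothetical file content, for illustration only.
sample = io.StringIO("name\nexample-user/example-repo\nanother.user/repo_2\n")
repo_names = read_repo_names(sample)
```

Extra columns are ignored, so a CSV exported with additional repository metadata would still pass this check.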

3. Executing the Script:

Configure the environment variables in the .env file and set up the repositories CSV, then run:
```
python3 src/run_rm.py
```
The RefactoringMiner output from the 126 repositories of our study is available at:
https://zenodo.org/records/14395034

4. Script Behavior:

The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.

Results and Logs:
- Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
- Logs for each repository, including error messages, are saved as .log files in the same directory.
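
The behavior described above can be sketched as three commands per repository. This is an illustration, not the code of src/run_rm.py. It assumes that REFACTORING_MINER_PATH points to the RefactoringMiner command-line launcher, which accepts `-a <repo> <branch>` and `-json <file>`. The function names and the `user_repo` directory naming are hypothetical, and the sketch does not show how JAVA_PATH is used.

```python
from pathlib import Path


def repo_dir(clone_dir: str, repo_name: str) -> Path:
    """Local directory for a repository (the naming scheme is an assumption)."""
    return Path(clone_dir) / repo_name.replace("/", "_")


def clone_command(repo_name: str, clone_dir: str) -> list[str]:
    """Command that clones a GitHub repository into CLONE_DIR."""
    target = repo_dir(clone_dir, repo_name)
    return ["git", "clone", f"https://github.com/{repo_name}.git", str(target)]


def default_branch_command(local_repo: str) -> list[str]:
    """After a fresh clone, HEAD is checked out on the remote's default branch."""
    return ["git", "-C", local_repo, "rev-parse", "--abbrev-ref", "HEAD"]


def miner_command(miner_path: str, local_repo: str, branch: str, json_out: str) -> list[str]:
    """Assumed RefactoringMiner invocation: analyze all commits of one branch."""
    return [miner_path, "-a", local_repo, branch, "-json", json_out]
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`, with the captured output written to the repository's .log file.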

5. Count Refactorings:

To count instances for each refactoring technique, run:
```
python3 src/count_refactorings.py
```
The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
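
The counting step can be sketched for a single result file. The sketch assumes the RefactoringMiner JSON layout of a top-level `commits` list whose entries each carry a `refactorings` list with a `type` field. The function name and sample data are hypothetical, and this is not the code of src/count_refactorings.py.

```python
import json
from collections import Counter


def count_refactoring_types(miner_json: str) -> Counter:
    """Count refactorings per type in one RefactoringMiner JSON document."""
    data = json.loads(miner_json)
    counts = Counter()
    for commit in data.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            counts[refactoring.get("type", "UNKNOWN")] += 1
    return counts


# Hypothetical miniature result file, for illustration only.
sample = json.dumps({
    "commits": [
        {"sha1": "abc123", "refactorings": [
            {"type": "Extract Method"},
            {"type": "Rename Variable"},
        ]},
        {"sha1": "def456", "refactorings": [{"type": "Extract Method"}]},
    ]
})
counts = count_refactoring_types(sample)
```

Running this over every .json file in CLONE_DIR and writing one row per repository and technique would give a table comparable to the one described above.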

Data Gathering

To collect snippets before and after refactoring and their metadata, run:
```
python3 src/diff.py '[refactoring technique]'
```
Replace [refactoring technique] with the desired technique name (e.g., Extract Method).
The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.
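
The resulting layout (one directory per repository, one subdirectory per commit SHA) can be inspected with a short helper. This is a sketch: `list_commit_dirs` is a hypothetical name, and the output root passed to it depends on how src/diff.py is configured.

```python
from pathlib import Path


def list_commit_dirs(output_root: str) -> dict[str, list[str]]:
    """Map each repository directory to its sorted commit-SHA subdirectories."""
    root = Path(output_root)
    return {
        repo.name: sorted(c.name for c in repo.iterdir() if c.is_dir())
        for repo in sorted(root.iterdir())
        if repo.is_dir()
    }
```

The number of SHA subdirectories per repository gives the number of commits with at least one collected refactoring, since a single commit directory may hold several refactorings.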

Dataset Availability:
- The snippets and metadata from the 126 repositories of our study are available in the dataset directory.

To generate the SQL file for the Web tool, run:
```
python3 src/generate_refactorings_sql.py
```

Web Tool for Manual Evaluation

The Web tool scripts are available in the web directory.
Populate the data/output/snippets folder with the output of src/diff.py.
Run the sql/create_database.sql script in your database.
Import the SQL file generated by src/generate_refactorings_sql.py.
Run dataset.php to generate the MaRV dataset file.
The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.

Files

replication_package.zip (1.9 GB), md5:13759d69fd5f14942834f37485a9754a
