Published December 13, 2024 | Version v2
Dataset Open

MaRV Scripts and Dataset

Description

Contacts:

website: https://labsoft-ufmg.github.io/

email: henrique.mg.bh@gmail.com

The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

The dataset is located at dataset/MaRV.json in the replication package.
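Before working with the file, it can help to look at its overall shape. The sketch below is schema-agnostic on purpose: the field names of MaRV.json are not described here, so the helper (a hypothetical name, summarize_json) only reports the top-level type, the number of entries, and the keys of the first record if that record is an object.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a small, schema-agnostic summary of a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {"top_level_type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["size"] = len(data)
    # Peek at the first record to discover field names instead of assuming them.
    first = None
    if isinstance(data, list) and data:
        first = data[0]
    elif isinstance(data, dict) and data:
        first = next(iter(data.values()))
    if isinstance(first, dict):
        summary["first_record_keys"] = sorted(first)
    return summary


if __name__ == "__main__":
    dataset_path = Path("dataset/MaRV.json")  # path stated in the description
    if dataset_path.exists():
        print(summarize_json(dataset_path))
```

Running it from the root of the unpacked replication package prints the summary; nothing is printed if the file is not found at that relative path.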

The guidelines for replicating the study are provided below:

Requirements

1. Software Dependencies:

  • Python 3.10+: Required to run the scripts, together with the packages listed in requirements.txt.
  • Git: Required to clone repositories.
  • Java 17: RefactoringMiner requires Java 17 to perform the analysis.
  • PHP 8.0: Required to host the Web tool.
  • MySQL 8: Required to store the Web tool data.

2. Environment Variables:

  • Create a .env file based on .env.example in the src folder and set the variables:
    • CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
    • CLONE_DIR: Directory where repositories will be cloned.
    • JAVA_PATH: Path to the Java executable.
    • REFACTORING_MINER_PATH: Path to RefactoringMiner.
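A quick check that the .env file defines all four variables can save a failed run. The following is a minimal sketch using only the standard library; it assumes the file uses plain KEY=VALUE lines and lives at src/.env, and it is not how the package's own scripts load their configuration (that is not described here). The helper names parse_env and missing_vars are hypothetical.

```python
from pathlib import Path

# The four variables listed above.
REQUIRED_VARS = ("CSV_PATH", "CLONE_DIR", "JAVA_PATH", "REFACTORING_MINER_PATH")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_vars(values):
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


if __name__ == "__main__":
    env_file = Path("src/.env")  # assumed location, based on the step above
    if env_file.exists():
        missing = missing_vars(parse_env(env_file.read_text(encoding="utf-8")))
        print("Missing variables:", missing or "none")
```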

Refactoring Technique Selection

1. Environment Setup:

  • Ensure all dependencies are installed. Install the required Python packages with:
    pip install -r requirements.txt
    

2. Configuring the Repositories CSV:

  • The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).
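The expected CSV shape can be validated before any cloning starts. This sketch reads the name column and keeps only values that look like username/repo; the regular expression is a loose approximation of GitHub naming rules, and the sample row (octocat/Hello-World, plus an extra stars column) is purely illustrative and not taken from the study's repository list.

```python
import csv
import io
import re

# Loose pattern for "username/repo"; an approximation, not GitHub's exact rules.
REPO_NAME = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def read_repo_names(csv_file):
    """Return the valid username/repo values found in the 'name' column."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "name" not in reader.fieldnames:
        raise ValueError("CSV must contain a column named 'name'")
    names = []
    for row in reader:
        name = (row["name"] or "").strip()
        if REPO_NAME.match(name):
            names.append(name)
    return names


if __name__ == "__main__":
    sample = io.StringIO("name,stars\noctocat/Hello-World,100\nnot-a-repo,1\n")
    print(read_repo_names(sample))
```

Rows whose name does not match the pattern are skipped silently here; a stricter variant could collect and report them instead.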

3. Executing the Script:

  • Configure the environment variables in the .env file and set up the repositories CSV, then run:
    python3 src/run_rm.py
    
  • The RefactoringMiner output from the 126 repositories of our study is available at:
    https://zenodo.org/records/14395034

4. Script Behavior:

  • The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
  • Results and Logs:
    • Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
    • Logs for each repository, including error messages, are saved as .log files in the same directory.
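The behavior described above (clone, find the default branch, analyze) can be sketched as three command builders. This is an illustration, not the contents of src/run_rm.py: the function names and the local folder naming (owner__repo) are hypothetical, and the -a and -json options are assumed from RefactoringMiner's command-line documentation, so check them against the version you have installed.

```python
from pathlib import Path


def clone_command(repo_name, clone_dir):
    """git clone command for a GitHub repository given as username/repo."""
    # Hypothetical folder naming: replace the slash so each repo gets one folder.
    target = Path(clone_dir) / repo_name.replace("/", "__")
    return ["git", "clone", f"https://github.com/{repo_name}.git", str(target)]


def default_branch_command(repo_dir):
    """Ask git which branch origin/HEAD points to (output looks like origin/main)."""
    return ["git", "-C", str(repo_dir), "symbolic-ref", "--short",
            "refs/remotes/origin/HEAD"]


def parse_default_branch(symbolic_ref_output):
    """Strip the remote prefix, e.g. 'origin/main' becomes 'main'."""
    return symbolic_ref_output.strip().split("/", 1)[-1]


def refactoring_miner_command(rm_path, repo_dir, branch, json_path):
    """Analyze all commits of a branch and write JSON (flags are an assumption)."""
    return [str(rm_path), "-a", str(repo_dir), branch, "-json", str(json_path)]


if __name__ == "__main__":
    print(clone_command("octocat/Hello-World", "clones"))
    print(parse_default_branch("origin/main\n"))
```

Each list can be passed to subprocess.run(cmd, check=True, capture_output=True, text=True); writing the captured output and any error to a per-repository .log file would mirror the logging described above.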

5. Count Refactorings:

  • To count instances for each refactoring technique, run:
    python3 src/count_refactorings.py
    
  • The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
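For readers who want to see what such a count involves, here is a minimal sketch over one parsed RefactoringMiner result. It assumes the JSON layout RefactoringMiner commonly emits (a top-level commits list, each commit holding a refactorings list whose items carry a type field); it does not reproduce src/count_refactorings.py, and the sample data is invented for illustration.

```python
import json
from collections import Counter


def count_by_type(rm_output):
    """Count refactorings per type in one parsed RefactoringMiner JSON result."""
    counts = Counter()
    for commit in rm_output.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            counts[refactoring.get("type", "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    # Invented sample mimicking the assumed layout.
    sample = json.loads("""
    {"commits": [
      {"sha1": "abc", "refactorings": [{"type": "Extract Method"},
                                       {"type": "Rename Method"}]},
      {"sha1": "def", "refactorings": [{"type": "Extract Method"}]}
    ]}
    """)
    for technique, total in count_by_type(sample).most_common():
        print(technique, total)
```

Applying the function to each repository's .json file and writing one row per repository and technique would give a table comparable to the CSV described above.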

Data Gathering

  • To collect snippets before and after refactoring and their metadata, run:

    python3 src/diff.py '[refactoring technique]'
    

    Replace [refactoring technique] with the desired technique name (e.g., Extract Method).

  • The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

  • Dataset Availability:

    • The snippets and metadata from the 126 repositories of our study are available in the dataset directory.

  • To generate the SQL file for the Web tool, run:

    python3 src/generate_refactorings_sql.py
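The per-repository, per-commit layout produced during data gathering can be pictured with a short sketch. It groups the refactorings of one technique by commit SHA and derives a directory for each commit. The input layout (commits, sha1, refactorings, type) is assumed from RefactoringMiner's JSON output, and the function names and the owner__repo folder naming are hypothetical rather than taken from src/diff.py.

```python
from collections import defaultdict
from pathlib import Path


def group_by_commit(rm_output, technique):
    """Map commit SHA -> refactorings of the requested technique."""
    grouped = defaultdict(list)
    for commit in rm_output.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            if refactoring.get("type") == technique:
                grouped[commit["sha1"]].append(refactoring)
    return dict(grouped)


def commit_directory(output_root, repo_name, sha):
    """Directory for one commit: <output_root>/<repository>/<commit SHA>."""
    return Path(output_root) / repo_name.replace("/", "__") / sha


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = {"commits": [
        {"sha1": "abc123", "refactorings": [{"type": "Extract Method"},
                                            {"type": "Extract Method"}]},
        {"sha1": "def456", "refactorings": [{"type": "Rename Method"}]},
    ]}
    for sha, items in group_by_commit(sample, "Extract Method").items():
        print(commit_directory("snippets", "octocat/Hello-World", sha), len(items))
```

A commit with several matching refactorings maps to a single directory, which matches the note that each commit may hold one or more refactorings.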
    

Web Tool for Manual Evaluation

  • The Web tool scripts are available in the web directory.
  • Populate the data/output/snippets folder with the output of src/diff.py.
  • Run the sql/create_database.sql script in your database.
  • Import the SQL file generated by src/generate_refactorings_sql.py.
  • Run dataset.php to generate the MaRV dataset file.
  • The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.

Files

replication_package.zip (1.9 GB)
md5:13759d69fd5f14942834f37485a9754a
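Since the archive is large, verifying the download against the md5 checksum listed for replication_package.zip is worthwhile. A minimal sketch, reading the file in chunks so the 1.9 GB archive is never held in memory at once; the local file name is assumed to be unchanged from the listing.

```python
import hashlib
from pathlib import Path

# Checksum listed for replication_package.zip on this record.
EXPECTED_MD5 = "13759d69fd5f14942834f37485a9754a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("replication_package.zip")  # assumed local file name
    if archive.exists():
        actual = md5_of_file(archive)
        print("checksum matches" if actual == EXPECTED_MD5 else f"mismatch: {actual}")
```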