Published May 16, 2026 | Version 1.0.0

Synthetic Java Refactoring Dataset Generated with Large Language Models across Seven Refactoring Types

Authors/Creators

  • 1. Eindhoven University of Technology

Description

This dataset contains 8,568 synthetic Java refactoring instances generated by Large Language Models (LLMs) as part of a Bachelor End Project (2ICS00) at Eindhoven University of Technology. The dataset is the output of a four-phase study evaluating which combination of model, prompting strategy, and code-context level produces the highest-quality synthetic refactorings, measured against a ground-truth corpus of 2,796 real-world Java refactorings.
 
Seven refactoring types are covered: Extract Method, Rename Method, Rename Parameter, Add Parameter, Remove Parameter, Change Return Type, and Change Method Access Modifier.
 
Each instance is a JSON file containing the raw and code-extracted LLM output, per-call token counts and latency, and eight evaluation metrics across three tiers: Tier 1 syntactic validity (JavaParser), Tier 2 textual/structural similarity (exact match, normalized edit distance, BLEU-4, ChrF, CodeBLEU), and Tier 3 refactoring correctness (RefactoringMiner 3.0: refactoring detection and refactoring-type match).
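As an illustration of one Tier 2 metric, the following is a minimal sketch of a normalized edit distance between a generated refactoring and its ground-truth counterpart. The description does not state which normalization the study used; this sketch assumes character-level Levenshtein distance divided by the length of the longer string, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(generated: str, reference: str) -> float:
    """Assumed normalization: distance divided by the longer string's length."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(generated, reference) / longest
```

Under this assumed normalization the score lies in [0, 1], where 0 means the generated code is identical to the reference and 1 means no characters align.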

Files

bep-synthetic-refactoring-llm-v1.0.0.zip

Size: 31.7 MB
md5:9eb27f8ff9314c95f73fc00370e188ad