Published May 16, 2026 | Version 1.0.0
Dataset
Open
Synthetic Java Refactoring Dataset Generated with Large Language Models across Seven Refactoring Types
Description
This dataset contains 8,568 synthetic Java refactoring instances generated by Large Language Models (LLMs) as part of a Bachelor End Project (2ICS00) at Eindhoven University of Technology. The dataset is the output of a four-phase study evaluating which combination of model, prompting strategy, and code-context level produces the highest-quality synthetic refactorings, measured against a ground-truth corpus of 2,796 real-world Java refactorings.
Seven refactoring types are covered: Extract Method, Rename Method, Rename Parameter, Add Parameter, Remove Parameter, Change Return Type, and Change Method Access Modifier.
Each instance is a JSON file containing the raw LLM output, the code extracted from that output, per-call token counts and latency, and eight evaluation metrics across three tiers: Tier 1, syntactic validity (checked with JavaParser); Tier 2, textual and structural similarity (exact match, normalized edit distance, BLEU-4, ChrF, CodeBLEU); and Tier 3, refactoring correctness (RefactoringMiner 3.0: whether a refactoring is detected and whether its type matches the requested one).
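To illustrate two of the Tier 2 metrics, here is a minimal sketch of exact match and normalized edit distance. The description does not state how the dataset normalizes or tokenizes its inputs, so this sketch assumes a character-level Levenshtein distance divided by the length of the longer string, and an exact match taken after trimming surrounding whitespace. The function names are hypothetical and do not come from the dataset's tooling.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(generated: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means identical (assumed normalization)."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are treated as identical
    return levenshtein(generated, reference) / longest


def exact_match(generated: str, reference: str) -> bool:
    """True when the two snippets are identical after trimming (assumed convention)."""
    return generated.strip() == reference.strip()
```

Under these assumptions, a generated refactoring that is identical to the ground truth scores 0.0 on normalized edit distance, and one that shares no characters in any position scores 1.0.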
Files
| Name | Size | Checksum |
|---|---|---|
| bep-synthetic-refactoring-llm-v1.0.0.zip | 31.7 MB | md5:9eb27f8ff9314c95f73fc00370e188ad |