Synthetic Java Refactoring Dataset Generated with Large Language Models across Seven Refactoring Types

Sójka, Bartosz

doi:10.5281/zenodo.20231231

Published May 16, 2026 | Version 1.0.0

Dataset Open

Sójka, Bartosz¹

1. Eindhoven University of Technology

This dataset contains 8,568 synthetic Java refactoring instances generated by Large Language Models (LLMs) as part of a Bachelor End Project (2ICS00) at Eindhoven University of Technology. The dataset is the output of a four-phase study evaluating which combination of model, prompting strategy, and code-context level produces the highest-quality synthetic refactorings, measured against a ground-truth corpus of 2,796 real-world Java refactorings.

Seven refactoring types are covered: Extract Method, Rename Method, Rename Parameter, Add Parameter, Remove Parameter, Change Return Type, and Change Method Access Modifier.

Each instance is a JSON file containing the raw and code-extracted LLM output, per-call token counts and latency, and eight evaluation metrics across three tiers: Tier 1 syntactic validity (JavaParser), Tier 2 textual/structural similarity (exact match, normalized edit distance, BLEU-4, ChrF, CodeBLEU), and Tier 3 refactoring correctness (RefactoringMiner 3.0: refactoring detection and refactoring-type match).
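Two of the Tier 2 metrics named above, exact match and normalized edit distance, are simple enough to sketch directly. The Python below is an illustration only, not the dataset's evaluation code: it assumes character-level Levenshtein distance divided by the length of the longer string, and a plain string comparison with no whitespace normalization. The dataset may use a different convention, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(generated: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical.

    Assumption: normalization by the longer string's length.
    """
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(generated, reference) / longest


def exact_match(generated: str, reference: str) -> bool:
    """Assumption: strict equality, with no whitespace or comment normalization."""
    return generated == reference
```

Under these assumptions a generated refactoring identical to its ground-truth counterpart scores 0.0 on normalized edit distance and passes exact match, while a completely different output approaches 1.0.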

Files

Files (31.7 MB)

Name	Size	MD5
bep-synthetic-refactoring-llm-v1.0.0.zip	31.7 MB	9eb27f8ff9314c95f73fc00370e188ad
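The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# Checksum as published on the record page for version 1.0.0.
EXPECTED_MD5 = "9eb27f8ff9314c95f73fc00370e188ad"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """True when the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_published_checksum("bep-synthetic-refactoring-llm-v1.0.0.zip")` on the downloaded file should return `True` if the transfer completed without corruption. MD5 is suitable here as an integrity check only, not as a security guarantee.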

	All versions	This version
Views	157	157
Downloads	0	0
Data volume	0 Bytes	0 Bytes
