aikenkyu001/semantic_roundtrip_benchmark_2: Semantic Round-trip Benchmark v1.0.0 — Tools and Dataset for Evaluating Iterative Stability in Language Models

Miyata, Fumio

doi:10.5281/zenodo.18181174

Published January 8, 2026 | Version v1.0.0

Software Open

Semantic Round-trip Benchmark v1.0.0

Tools and Dataset for Evaluating Iterative Stability in Language Models

Overview

This release provides version 1.0.0 of the Semantic Round-trip Benchmark, a framework and dataset designed to evaluate Iterative Stability in language models. The benchmark measures a model's ability to maintain semantic and functional consistency across repeated code–specification transformations, revealing failure modes that single-pass evaluations cannot capture.

This version includes the full experimental pipeline, task suite, and the dataset used in the accompanying study.

Contents of the Release

Benchmark Framework

- Implementation of the 10‑cycle semantic round‑trip evaluation
- Deterministic execution pipeline with reproducible configurations
- Code‑to‑spec and spec‑to‑code transformation prompts
- Automated functional validation using unit tests
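
The round-trip loop listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the names run_round_trip, stub_code_to_spec, stub_spec_to_code, and validate_magic_number are hypothetical, the expected value 42 is invented for the example, stopping at the first failing cycle is an assumed policy, and the two stubs stand in for real model calls.

```python
from typing import Callable, List

# Hypothetical seed implementation; the value 42 is an assumption for this sketch.
SEED_CODE = "def get_magic_number():\n    return 42\n"


def run_round_trip(
    initial_code: str,
    code_to_spec: Callable[[str], str],
    spec_to_code: Callable[[str], str],
    validate: Callable[[str], bool],
    cycles: int = 10,
) -> List[bool]:
    """Run code -> spec -> code repeatedly and record one pass/fail flag per cycle.

    Assumption: the run stops at the first cycle whose regenerated code fails.
    """
    results: List[bool] = []
    code = initial_code
    for _ in range(cycles):
        spec = code_to_spec(code)   # model describes the code as a specification
        code = spec_to_code(spec)   # model regenerates code from that specification
        passed = validate(code)     # functional check on the regenerated code
        results.append(passed)
        if not passed:
            break
    return results


def stub_code_to_spec(code: str) -> str:
    """Stand-in for a model call: wraps the code in a fake specification."""
    return "SPEC:\n" + code


def stub_spec_to_code(spec: str) -> str:
    """Stand-in for a model call: recovers the code from the fake specification."""
    return spec[len("SPEC:\n"):]


def validate_magic_number(code: str) -> bool:
    """Execute the candidate code and check the function's return value."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        return namespace["get_magic_number"]() == 42
    except Exception:
        return False


results = run_round_trip(
    SEED_CODE, stub_code_to_spec, stub_spec_to_code, validate_magic_number
)
print(results)
```

Because the stubs are lossless, every cycle passes here. With a real model, each transformation can alter the code or the specification, which is the kind of drift the 10-cycle design is meant to expose.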

Task Suite

- get_magic_number: baseline stability
- fizzbuzz: high‑prevalence memorized task
- separate_vowels_and_consonants: novel generalization task

Dataset

- Over 7,000 trial logs across 24 small language models
- Cycle‑by‑cycle records including:
  - prompts
  - raw model outputs
  - parsed outputs
  - validation results
- Metadata for each experimental run
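
To make the shape of a cycle-by-cycle record concrete, the sketch below models one as a Python dataclass and round-trips it through JSON. The class name CycleRecord, its field names, and the use of JSON are assumptions for illustration only; the release's actual log schema and file format may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CycleRecord:
    """One cycle of one trial. All field names are hypothetical."""
    cycle: int               # position within the round-trip run, starting at 1
    prompt: str              # prompt sent to the model
    raw_output: str          # unmodified model response
    parsed_output: str       # code or specification extracted from the response
    validation_passed: bool  # outcome of the functional validation


record = CycleRecord(
    cycle=1,
    prompt="Describe what this function does.",
    raw_output="Here is the specification: returns a constant.",
    parsed_output="returns a constant",
    validation_passed=True,
)

# Serialize to a single JSON line, then restore it to check nothing is lost.
line = json.dumps(asdict(record))
restored = CycleRecord(**json.loads(line))
print(restored == record)
```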

Key Insights Enabled by This Release

- Identification of a significant generalization gap between familiar and novel tasks
- Evidence that strong performance on common benchmarks often reflects memorization rather than reasoning
- Degradation curves showing how semantic drift accumulates across iterations
- Iterative instability observed even in larger models
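
One way to derive a degradation curve from per-cycle validation results is to compute, for each cycle, the fraction of trials that have passed every cycle so far. The sketch below assumes each trial is a list of pass/fail flags that may end early after a failure; the function name survival_curve and the sample data are hypothetical, and the accompanying study may define its curves differently.

```python
from typing import List


def survival_curve(trials: List[List[bool]], cycles: int = 10) -> List[float]:
    """Fraction of trials that passed every cycle up to and including cycle k."""
    if not trials:
        raise ValueError("at least one trial is required")
    curve: List[float] = []
    for k in range(1, cycles + 1):
        # A trial survives cycle k only if it reached k and never failed before it.
        survived = sum(1 for t in trials if len(t) >= k and all(t[:k]))
        curve.append(survived / len(trials))
    return curve


# Made-up example data: one stable trial, one failing at cycle 3, one failing at cycle 1.
sample_trials = [
    [True] * 10,
    [True, True, False],
    [False],
]
curve = survival_curve(sample_trials)
print([round(value, 2) for value in curve])
```

A curve computed this way never increases, so a flat line indicates stability and any drop marks the cycle at which trials began to fail.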

Purpose

This release supports research on:

- Iterative stability and multi‑step reasoning
- Memorization vs. generalization in language models
- Semantic drift and failure mode analysis
- Evaluation methodologies beyond single‑pass correctness

All materials are versioned and archived on Zenodo to ensure long‑term reproducibility.
This release is archived under DOI 10.5281/zenodo.18181174.

Files

Files (10.7 MB)

Name	Size	MD5
aikenkyu001/semantic_roundtrip_benchmark_2-v1.0.0.zip	10.7 MB	ae8e15f3a19514b613490ba97e156e22

Additional details

Is supplement to (software): https://github.com/aikenkyu001/semantic_roundtrip_benchmark_2/tree/v1.0.0

Repository URL: https://github.com/aikenkyu001/semantic_roundtrip_benchmark_2

	All versions	This version
Views	97	97
Downloads	2	2
Data volume	21.3 MB	21.3 MB
