Published January 8, 2026 | Version v1.0.0
Software | Open access

aikenkyu001/semantic_roundtrip_benchmark_2: Semantic Round-trip Benchmark v1.0.0 — Tools and Dataset for Evaluating Iterative Stability in Language Models

Description

Overview

This release provides version 1.0.0 of the Semantic Round-trip Benchmark, a framework and dataset designed to evaluate Iterative Stability in language models. The benchmark measures a model's ability to maintain semantic and functional consistency across repeated code–specification transformations, revealing failure modes that single-pass evaluations cannot capture.

This version includes the full experimental pipeline, task suite, and the dataset used in the accompanying study.

Contents of the Release

Benchmark Framework

  • Implementation of the 10‑cycle semantic round‑trip evaluation
  • Deterministic execution pipeline with reproducible configurations
  • Code‑to‑spec and spec‑to‑code transformation prompts
  • Automated functional validation using unit tests
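
The components listed above can be pictured as a single loop. The following is a minimal sketch, not the repository's implementation: the names run_round_trip, identity_model, and magic_number_test are hypothetical, the model calls are replaced by a stub that returns its input unchanged, and the expected value 42 is an illustrative placeholder. It also assumes that a trial continues through all cycles after a failure, which the release notes do not state.

```python
from typing import Callable, List


def run_round_trip(
    seed_code: str,
    code_to_spec: Callable[[str], str],
    spec_to_code: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    cycles: int = 10,
) -> List[bool]:
    """Run code -> spec -> code repeatedly and record one pass/fail flag per cycle."""
    results: List[bool] = []
    code = seed_code
    for _ in range(cycles):
        spec = code_to_spec(code)          # describe the current code in prose
        code = spec_to_code(spec)          # regenerate code from that description
        results.append(passes_tests(code)) # validate the regenerated code
    return results


def identity_model(text: str) -> str:
    # Stand-in for a real model call; returns its input unchanged.
    return text


def magic_number_test(code: str) -> bool:
    # Hypothetical unit test: execute the candidate and check its return value.
    namespace: dict = {}
    try:
        exec(code, namespace)
        return namespace["get_magic_number"]() == 42
    except Exception:
        return False


seed = "def get_magic_number():\n    return 42\n"
flags = run_round_trip(seed, identity_model, identity_model, magic_number_test)
print(flags)
```

With a real model in place of the stub, a False entry marks a cycle whose regenerated code no longer passed the tests.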

Task Suite

  • get_magic_number — baseline stability
  • fizzbuzz — high‑prevalence memorized task
  • separate_vowels_and_consonants — novel generalization task

Dataset

  • Over 7,000 trial logs across 24 small language models
  • Cycle‑by‑cycle records including:
    • prompts
    • raw model outputs
    • parsed outputs
    • validation results
  • Metadata for each experimental run
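
As a rough illustration of what a cycle-by-cycle record might look like, the sketch below models one trial as a pair of dataclasses and serializes it to JSON. The class names, field names, and the model name are assumptions made for this example; the actual log schema in the archive may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CycleRecord:
    cycle: int          # 1-based cycle index
    prompt: str         # prompt sent to the model
    raw_output: str     # unprocessed model response
    parsed_output: str  # code or spec extracted from the response
    passed: bool        # validation result for this cycle


@dataclass
class TrialLog:
    model: str          # hypothetical metadata field
    task: str           # hypothetical metadata field
    cycles: List[CycleRecord] = field(default_factory=list)


def first_failed_cycle(log: TrialLog) -> Optional[int]:
    """Return the index of the first failing cycle, or None if every cycle passed."""
    for record in log.cycles:
        if not record.passed:
            return record.cycle
    return None


log = TrialLog(
    model="example-model",  # placeholder name, not one of the 24 evaluated models
    task="fizzbuzz",
    cycles=[
        CycleRecord(1, "Describe this code.", "raw text 1", "parsed text 1", True),
        CycleRecord(2, "Describe this code.", "raw text 2", "parsed text 2", False),
    ],
)

serialized = json.dumps(asdict(log))  # asdict also converts the nested records
restored = json.loads(serialized)
print(first_failed_cycle(log))
```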

Key Insights Enabled by This Release

  • Identification of a significant generalization gap between familiar and novel tasks
  • Evidence that strong performance on common benchmarks often reflects memorization rather than reasoning
  • Degradation curves showing how semantic drift accumulates across iterations
  • Iterative instability observed even in larger models
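
One simple way to obtain a degradation curve is to compute the fraction of trials that still pass at each cycle. The sketch below assumes each trial has been reduced to a list of per-cycle pass/fail flags; the function name pass_rate_by_cycle and the sample data are illustrative only and are not taken from the dataset.

```python
from typing import List, Sequence


def pass_rate_by_cycle(trials: Sequence[Sequence[bool]]) -> List[float]:
    """Return the fraction of trials that pass at each cycle index."""
    if not trials:
        return []
    n_cycles = len(trials[0])
    if any(len(trial) != n_cycles for trial in trials):
        raise ValueError("all trials must cover the same number of cycles")
    return [
        sum(trial[i] for trial in trials) / len(trials)
        for i in range(n_cycles)
    ]


# Made-up flags for four trials over three cycles.
sample_trials = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
    [True, False, False],
]
curve = pass_rate_by_cycle(sample_trials)
print(curve)  # [1.0, 0.5, 0.25]
```

A curve that falls with the cycle index indicates accumulating drift; a flat curve near 1.0 indicates stable round trips.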

Purpose

This release supports research on:

  • Iterative stability and multi‑step reasoning
  • Memorization vs. generalization in language models
  • Semantic drift and failure mode analysis
  • Evaluation methodologies beyond single‑pass correctness

All materials are versioned and archived to ensure long‑term reproducibility.
A DOI will be assigned via Zenodo upon publication of this release.

Files

aikenkyu001/semantic_roundtrip_benchmark_2-v1.0.0.zip

Files (10.7 MB)
