aikenkyu001/semantic_roundtrip_benchmark_2: Semantic Round-trip Benchmark v1.0.0 — Tools and Dataset for Evaluating Iterative Stability in Language Models
Description
Semantic Round-trip Benchmark v1.0.0
Tools and Dataset for Evaluating Iterative Stability in Language Models
Overview
This release provides version 1.0.0 of the Semantic Round-trip Benchmark, a framework and dataset designed to evaluate Iterative Stability in language models. The benchmark measures a model's ability to maintain semantic and functional consistency across repeated code–specification transformations, revealing failure modes that single-pass evaluations cannot capture.
This version includes the full experimental pipeline, task suite, and the dataset used in the accompanying study.
Contents of the Release
Benchmark Framework
- Implementation of the 10‑cycle semantic round‑trip evaluation
- Deterministic execution pipeline with reproducible configurations
- Code‑to‑spec and spec‑to‑code transformation prompts
- Automated functional validation using unit tests
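The loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: `run_round_trip`, `make_validator`, the echo "models", and the `double` task are all hypothetical names introduced here, and real model calls would replace the two stand-in transformation functions.

```python
from typing import Callable, List, Sequence, Tuple


def make_validator(
    func_name: str, cases: Sequence[Tuple[tuple, object]]
) -> Callable[[str], bool]:
    """Build a unit-test style check for generated code (hypothetical helper)."""

    def validate(code: str) -> bool:
        namespace: dict = {}
        try:
            # Assumption: generated code is trusted or sandboxed elsewhere.
            exec(code, namespace)
            func = namespace[func_name]
            return all(func(*args) == expected for args, expected in cases)
        except Exception:
            # Syntax errors, missing functions, and runtime errors all count as failures.
            return False

    return validate


def run_round_trip(
    seed_code: str,
    code_to_spec: Callable[[str], str],
    spec_to_code: Callable[[str], str],
    validate: Callable[[str], bool],
    cycles: int = 10,
) -> List[bool]:
    """Alternate code -> spec -> code and record one validation result per cycle."""
    results: List[bool] = []
    code = seed_code
    for _ in range(cycles):
        spec = code_to_spec(code)   # model call in a real run
        code = spec_to_code(spec)   # model call in a real run
        results.append(validate(code))
    return results


# Stand-in "models" that simply wrap and unwrap the code unchanged.
def echo_code_to_spec(code: str) -> str:
    return "SPEC\n" + code


def echo_spec_to_code(spec: str) -> str:
    return spec[len("SPEC\n"):]


seed = "def double(x):\n    return 2 * x\n"
validate_double = make_validator("double", [((2,), 4), ((0,), 0)])
stable = run_round_trip(seed, echo_code_to_spec, echo_spec_to_code, validate_double)
```

With the lossless echo functions every cycle passes; a model that alters the semantics at any step would show up as `False` entries from that cycle onward.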
Task Suite
- `get_magic_number`: baseline stability
- `fizzbuzz`: high‑prevalence memorized task
- `separate_vowels_and_consonants`: novel generalization task
Dataset
- Over 7,000 trial logs across 24 small language models
- Cycle‑by‑cycle records including:
  - prompts
  - raw model outputs
  - parsed outputs
  - validation results
- Metadata for each experimental run
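To make the shape of a cycle-by-cycle record concrete, here is one possible way such a log entry could be modeled and serialized. The field names and the JSON Lines layout are assumptions chosen for illustration; the archive's actual file format and keys may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CycleRecord:
    """One cycle of one trial (hypothetical schema, not the dataset's real keys)."""

    model: str          # run metadata: which model produced the output
    task: str           # run metadata: task name, e.g. "fizzbuzz"
    trial: int          # run metadata: trial index
    cycle: int          # 1-based cycle number within the trial
    prompt: str         # prompt sent to the model
    raw_output: str     # unprocessed model response
    parsed_output: str  # code or spec extracted from the response
    passed: bool        # validation result for this cycle


def to_jsonl(records: List[CycleRecord]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


def from_jsonl(text: str) -> List[CycleRecord]:
    """Parse JSON Lines text back into records, skipping blank lines."""
    return [CycleRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


example = CycleRecord(
    model="example-model",
    task="fizzbuzz",
    trial=0,
    cycle=1,
    prompt="Describe this code as a specification.",
    raw_output="```python\nprint('hi')\n```",
    parsed_output="print('hi')",
    passed=True,
)
restored = from_jsonl(to_jsonl([example]))
```

Keeping the raw and parsed outputs side by side makes it possible to tell parsing failures apart from genuine semantic failures when inspecting a trial.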
Key Insights Enabled by This Release
- Identification of a significant generalization gap between familiar and novel tasks
- Evidence that strong performance on common benchmarks often reflects memorization rather than reasoning
- Degradation curves showing how semantic drift accumulates across iterations
- Iterative instability observed even in larger models
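A degradation curve of the kind mentioned above could be derived from per-cycle validation results along these lines. This is a sketch under stated assumptions: `survival_curve` is a hypothetical helper, and "stable through cycle k" is defined here as passing every cycle up to and including k, which may not match the metric used in the accompanying study.

```python
from typing import List, Sequence


def survival_curve(trials: Sequence[Sequence[bool]], cycles: int = 10) -> List[float]:
    """Fraction of trials that passed every cycle from 1 through k, for k = 1..cycles."""
    if not trials:
        raise ValueError("at least one trial is required")
    curve: List[float] = []
    for k in range(1, cycles + 1):
        # A trial counts as surviving cycle k only if it has k results and all passed.
        alive = sum(1 for t in trials if len(t) >= k and all(t[:k]))
        curve.append(alive / len(trials))
    return curve


# Three made-up trials: one fully stable, one failing at cycle 3, one failing at once.
toy_trials = [
    [True, True, True],
    [True, True, False],
    [False, False, False],
]
toy_curve = survival_curve(toy_trials, cycles=3)
```

Because a trial is dropped from the count at its first failure, the curve is non-increasing, which makes accumulated drift easy to compare across tasks and models.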
Purpose
This release supports research on:
- Iterative stability and multi‑step reasoning
- Memorization vs. generalization in language models
- Semantic drift and failure mode analysis
- Evaluation methodologies beyond single‑pass correctness
All materials are versioned and archived to ensure long‑term reproducibility.
A DOI will be assigned via Zenodo upon publication of this release.
Files
| Name | Size |
|---|---|
| aikenkyu001/semantic_roundtrip_benchmark_2-v1.0.0.zip (md5:ae8e15f3a19514b613490ba97e156e22) | 10.7 MB |
Additional details
Related works
- Is supplement to
  - Software: https://github.com/aikenkyu001/semantic_roundtrip_benchmark_2/tree/v1.0.0