Published April 25, 2026 | Version v1
Data paper | Open access

Neomundi ControlTower™ — Cross-Provider Thermodynamic Stability Benchmark (v1, April 2026)

Description

Empirical measurement of token-level generative stability across five Large Language Model providers, observed in real time by the Neomundi ControlTower™ governance layer.

This dataset reports 3,905 governance decisions produced by ControlTower™ over 782 unique factual prompts, evaluated against ground-truth labels by an independent GPT-4o judge. The objective is to quantify the dispersion of thermodynamic stability — measured by the G-Score and its variation ΔG — across heterogeneous LLM providers when answering the same prompt set.

METHODOLOGY

- Prompts: 782 factual questions derived from the HaluEval benchmark.
- Providers: five LLM providers covering U.S. and European stacks, anonymised as P-001 to P-005 to preserve commercial neutrality.
- Governance layer: ControlTower™ (Law E™ framework) processes every response token by token, without reading semantic content, and produces a stability score (G-Score in [0,1]), a regime classification, a ΔG profile (FLAT / DROP), and an ALLOW/FLAG decision.
- Ground truth: an independent GPT-4o judge (LLM-as-judge) labels each response as CORRECT or INCORRECT against reference answers; the judge_verdict column in the schema also lists an ERROR value.
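
One way to relate the governance output to the ground truth is to cross-tabulate the two labels for each response. The sketch below is only an illustration: it assumes each row is a dict keyed by the decision and judge_verdict column names listed in the schema, and the sample rows are invented, not taken from measurements.csv.

```python
from collections import Counter


def cross_tabulate(rows):
    """Count (decision, judge_verdict) pairs, e.g. ("FLAG", "INCORRECT")."""
    return Counter((r["decision"], r["judge_verdict"]) for r in rows)


# Invented rows for illustration only; not taken from measurements.csv.
sample = [
    {"decision": "ALLOW", "judge_verdict": "CORRECT"},
    {"decision": "ALLOW", "judge_verdict": "INCORRECT"},
    {"decision": "FLAG", "judge_verdict": "INCORRECT"},
    {"decision": "FLAG", "judge_verdict": "INCORRECT"},
]

table = cross_tabulate(sample)
for (decision, verdict), count in sorted(table.items()):
    print(f"{decision:5s} {verdict:9s} {count}")
```

With the real file, the same function could be fed rows from csv.DictReader.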

SCHEMA

One row per (provider, question) pair. Twelve columns:

- provider_id    — anonymised provider (P-001 to P-005)
- question_id    — prompt identifier (TQ-XXXX)
- decision       — ControlTower™ output: ALLOW or FLAG
- g_score        — stability score in [0,1]
- regime         — qualitative regime (STABLE in this run)
- dg_profile     — ΔG profile: FLAT (no perturbation) or DROP (instability detected)
- dg_flagged     — boolean flag derived from ΔG analysis
- dg_variation   — magnitude of ΔG variation
- judge_verdict  — independent GPT-4o judge: CORRECT / INCORRECT / ERROR
- is_correct     — boolean form of the judge verdict
- response_hash  — SHA-256 prefix of the raw model response (auditability)
- cost_usd       — observed inference cost in USD
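
The schema above can be checked mechanically before any analysis. The following is a minimal sketch using only the Python standard library. It assumes the CSV header uses exactly the twelve column names listed above (column order is not assumed), and it does not check the boolean columns because their encoding is not stated here. The two sample rows are invented, and the second one deliberately carries an out-of-range g_score.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "provider_id", "question_id", "decision", "g_score", "regime",
    "dg_profile", "dg_flagged", "dg_variation", "judge_verdict",
    "is_correct", "response_hash", "cost_usd",
]


def check_rows(handle):
    """Return a list of problems found; an empty list means all checks passed."""
    reader = csv.DictReader(handle)
    if set(reader.fieldnames or []) != set(EXPECTED_COLUMNS):
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (row["provider_id"], row["question_id"])
        if key in seen:  # the schema promises one row per pair
            problems.append(f"line {line_no}: duplicate pair {key}")
        seen.add(key)
        if row["decision"] not in {"ALLOW", "FLAG"}:
            problems.append(f"line {line_no}: bad decision {row['decision']!r}")
        if row["dg_profile"] not in {"FLAT", "DROP"}:
            problems.append(f"line {line_no}: bad dg_profile {row['dg_profile']!r}")
        if row["judge_verdict"] not in {"CORRECT", "INCORRECT", "ERROR"}:
            problems.append(f"line {line_no}: bad verdict {row['judge_verdict']!r}")
        try:
            g = float(row["g_score"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: g_score is not a number")
        else:
            if not 0.0 <= g <= 1.0:
                problems.append(f"line {line_no}: g_score {g} outside [0, 1]")
    return problems


# Invented sample data; the second row has g_score = 1.20 on purpose.
SAMPLE = "\n".join([
    ",".join(EXPECTED_COLUMNS),
    "P-001,TQ-0001,ALLOW,0.91,STABLE,FLAT,False,0.01,CORRECT,True,abc123,0.0004",
    "P-002,TQ-0001,FLAG,1.20,STABLE,DROP,True,0.30,INCORRECT,False,def456,0.0005",
])

problems = check_rows(io.StringIO(SAMPLE))
print(problems)
```

For the published file, the handle would come from open("measurements.csv", newline="") instead of io.StringIO.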

AGGREGATE FINDINGS

FLAG rates across the five providers range from 3.72 % (most stable) to 21.48 % (least stable), a ~5.8× spread on identical prompts. Provider-level accuracy (judge verdict) ranges from 35.5 % to 59.9 %. The most thermodynamically stable provider is also the most factually accurate, which is consistent with the working hypothesis: "the gap is not the model, it is the governance."
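
The per-provider figures can be recomputed from the rows with a simple group-by. The sketch below is one possible way to do it, not necessarily the procedure used to produce the numbers above. It assumes rows are dicts keyed by the schema column names and that every row, including any with an ERROR verdict, counts in the denominator; the record does not say how ERROR rows were treated. The sample rows are invented.

```python
from collections import defaultdict


def provider_rates(rows):
    """Return {provider_id: (flag_rate, accuracy)} as fractions in [0, 1]."""
    totals = defaultdict(lambda: [0, 0, 0])  # [rows, flags, correct]
    for r in rows:
        t = totals[r["provider_id"]]
        t[0] += 1
        t[1] += r["decision"] == "FLAG"
        t[2] += r["judge_verdict"] == "CORRECT"
    return {p: (flags / n, correct / n) for p, (n, flags, correct) in totals.items()}


# Invented rows for illustration only; not taken from measurements.csv.
sample = [
    {"provider_id": "P-001", "decision": "ALLOW", "judge_verdict": "CORRECT"},
    {"provider_id": "P-001", "decision": "ALLOW", "judge_verdict": "CORRECT"},
    {"provider_id": "P-001", "decision": "ALLOW", "judge_verdict": "CORRECT"},
    {"provider_id": "P-001", "decision": "FLAG", "judge_verdict": "INCORRECT"},
    {"provider_id": "P-002", "decision": "FLAG", "judge_verdict": "INCORRECT"},
    {"provider_id": "P-002", "decision": "ALLOW", "judge_verdict": "CORRECT"},
]
rates = provider_rates(sample)

# Spread between the highest and lowest reported FLAG rates (in percent).
spread = 21.48 / 3.72
print(rates, round(spread, 1))
```

The last two lines reproduce the reported spread: 21.48 / 3.72 rounds to 5.8.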

INTEGRITY

SHA-256 of measurements.csv:
b432985de9ba3faf8fc8d610b514ce8ba5a457e4eef2f25ab1f3da735053b71b
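
A downloaded copy can be checked against this digest with the standard library. This is a minimal sketch: the function names are illustrative, and the local path passed to verify is an assumption about where the file was saved. The self-check at the end hashes a throwaway file containing b"abc", whose SHA-256 is a well-known test vector.

```python
import hashlib
import os
import tempfile

# Digest published in this record for measurements.csv.
EXPECTED_SHA256 = "b432985de9ba3faf8fc8d610b514ce8ba5a457e4eef2f25ab1f3da735053b71b"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


# Self-check on a temporary file, independent of the dataset itself.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

Usage, assuming the file sits in the current directory: verify("measurements.csv") should return True for an unmodified download.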

RELATED WORK

- Favre-Lecca, S. (2026). Thermodynamic Governance of AI Systems v2.0 — The Four Conditions of Artificial Coherence. Zenodo. doi:10.5281/zenodo.17705536
- Law E™ founding act (2025), OpenTimestamps-anchored on Bitcoin block 917643. Zenodo. doi:10.5281/zenodo.19385052
- Source repository: https://github.com/neomundi-io/neomundi-benchmarks

LICENSE & CITATION

Released under CC BY 4.0. Please cite this dataset when reusing the data or replicating the methodology.

Files

measurements.csv (404.4 kB)
md5:9b108d963f397faa0ed9c84018e8f029

Additional details

Dates

Copyrighted: 2026-04-25