Published June 2, 2026 | Version v1.0
Preprint Open

OPERATE-R Freshness Routing Track v0.3.6: Route-First Evaluation for Temporal Volatility, Stale-Knowledge Control, and Core-500 Candidate Validation

Authors/Creators

  • 1. MOBIUS LLC

Description

This preprint introduces and reports the OPERATE-R Freshness Routing Track (OPERATE-FR), a route-first evaluation framework for temporal volatility, stale-knowledge control, and answer-entitlement behavior in AI assistants.

Unlike conventional answer-accuracy benchmarks, OPERATE-FR evaluates whether a system selects an appropriate epistemic route before answering: direct answer, verification, clarification, date-bounded answer, re-anchoring of stale premises, or abstention. The paper reports Smoke-100 Raw-vs-MMV evidence and integrates a later Core-500 candidate stress check across Small, Medium, and Large governed profiles.
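
As a rough illustration of what "route-first" scoring could look like, the sketch below enumerates the six routes named above and computes a simple route-match rate over labeled items. The `Route` enum, the `route_match_rate` helper, and the toy pairs are hypothetical and are not taken from the OPERATE-FR scorer, which may weight or classify routes differently.

```python
from enum import Enum


class Route(Enum):
    # The six epistemic routes listed in the description; the identifiers
    # themselves are illustrative, not the manuscript's label names.
    DIRECT_ANSWER = "direct_answer"
    VERIFICATION = "verification"
    CLARIFICATION = "clarification"
    DATE_BOUNDED_ANSWER = "date_bounded_answer"
    REANCHOR_STALE_PREMISE = "reanchor_stale_premise"
    ABSTENTION = "abstention"


def route_match_rate(pairs):
    """Return the fraction of (expected, chosen) pairs whose routes agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no items to score")
    hits = sum(1 for expected, chosen in pairs if expected is chosen)
    return hits / len(pairs)


# Toy data invented for this example only (not benchmark results).
toy_pairs = [
    (Route.VERIFICATION, Route.VERIFICATION),
    (Route.DATE_BOUNDED_ANSWER, Route.DIRECT_ANSWER),  # answered without a date bound
    (Route.ABSTENTION, Route.ABSTENTION),
    (Route.CLARIFICATION, Route.CLARIFICATION),
]
rate = route_match_rate(toy_pairs)  # 3 of the 4 toy pairs agree
```

The point of the sketch is that the unit of evaluation is the chosen route, not the correctness of the final answer text.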

The central claim is intentionally bounded. Smoke-100 supports a Raw-vs-MMV improvement-delta claim for route governance. Core-500 does not include a matched Raw control arm, so it serves as governed-profile-level evidence, robustness stress evidence, family-level heterogeneity evidence, and cost-side analysis, not as large-N proof of governance improvement. Core-500 is a controlled 5x expansion of Smoke-100 built from neutral prompt-frame variants; it should not be treated as 500 independent task families or as an independently validated public benchmark standard.

This v0.3.6 data-verified final manuscript incorporates post-audit verification of the Core-500 failure-side metrics. The equality between stale_commitment_rate and unsupported_current_claim_rate is confirmed to reflect the underlying data and is not a manuscript copy error. The row-output JSONL files were re-read after Drive synchronization, and the row sets derived for the two metrics are identical, with zero symmetric difference, across the Small, Medium, and Large lines. The two labels remain conceptually distinguishable, but in the current Core-500 scorer they are structurally paired: both fire under the same observed condition, a direct current claim made without a date boundary or tool use.
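
The symmetric-difference check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-row field names (`row_id`, `stale_commitment`, `unsupported_current_claim`) and the toy rows are hypothetical, and the actual row-output schema may differ.

```python
import json


def flagged_ids(jsonl_text, flag, id_key="row_id"):
    """Collect ids of rows whose `flag` field is truthy (field names are assumed)."""
    ids = set()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL text
        row = json.loads(line)
        if row.get(flag):
            ids.add(row[id_key])
    return ids


def paired_label_check(jsonl_text,
                       flag_a="stale_commitment",
                       flag_b="unsupported_current_claim"):
    """Compare the two flagged row sets and report their symmetric difference."""
    a = flagged_ids(jsonl_text, flag_a)
    b = flagged_ids(jsonl_text, flag_b)
    return {
        "n_a": len(a),
        "n_b": len(b),
        "symmetric_difference": sorted(a ^ b),  # rows flagged by exactly one label
    }


# Toy rows invented for this example only (not Core-500 data).
SAMPLE = "\n".join([
    json.dumps({"row_id": "r1", "stale_commitment": True, "unsupported_current_claim": True}),
    json.dumps({"row_id": "r2", "stale_commitment": False, "unsupported_current_claim": False}),
    json.dumps({"row_id": "r3", "stale_commitment": True, "unsupported_current_claim": True}),
])

result = paired_label_check(SAMPLE)
```

An empty `symmetric_difference` means the two rates are equal because the same rows carry both labels, which is a stronger statement than the two counts merely coinciding.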

This record should be read as a working paper and candidate benchmark report. It does not claim an official leaderboard, a universal model-quality score, deployment-wide validation, or external benchmark standard status. Future work includes matched Core-500 Raw arms, route-classifier validation, independent labels, external baselines, clustered or hierarchical uncertainty estimates, and improved handling of volatile_current prompts.

Author of record and concept originator: Taiko Toeda.
Rights holder and licensing authority: MOBIUS LLC.

Files

OPERATE_FR_v0_3_6_Final_Manuscript_DataVerified.pdf

Files (563.8 kB)
