An Un-Leaked, Multi-Modal Benchmark and the Effective Value Metric: Measuring the Real-World Efficacy of Frontier Language Models

Davis, Jason

doi:10.5281/zenodo.20586608

Published June 8, 2026 | Version v1

Preprint Open

Davis, Jason¹

1. The meo-benchmark Project, meoadvisors.com

Public large-language-model (LLM) leaderboards are increasingly compromised by contamination: once a benchmark is published, its questions leak into the training corpora of the very models it is meant to measure, and headline accuracy inflates without any corresponding gain in capability. Crowd-sourced preference arenas avoid static contamination but introduce a different distortion: they are gameable and reward style over substance. We present meo-benchmark, a proprietary, un-leaked, multi-domain, multi-modal (text, visual, agentic-style) evaluation suite that runs a pinned roster of frontier models over a private holdout, grades objectively wherever a ground truth exists, and reserves a bias-controlled multi-lab jury for genuinely open-ended items only. Eleven domains span perceptual illusions, logic/math/CS, framework-application-under-bias, critical-thinking inference, theory-of-mind, multi-step state tracking, and four generator-as-oracle domains whose answers are computed by an embedded solver, yielding guaranteed-correct ground truth and an effectively infinite supply of un-leakable items. Beyond raw accuracy, we introduce Effective Value (V), a single metric that fuses intelligence, speed, financial cost, and the exponential error cascade of deep agentic work. V encodes the thesis that error is penalized twice (chain-success probability falls and debugging friction rises) and that, for autonomous workflows, the cost of time dominates the cost of money. Evaluating 22 models on a 251-item holdout, we find that gpt-5.5 (73.2%) and claude-opus-4.8 (70.8%) lead on accuracy, while deepseek-v4-flash offers the best intelligence-per-dollar (56.2% at $0.0037/correct). Critically, the V ranking inverts with task depth: fast models win one-shot tasks, but accurate models dominate deep chains, with one cheap-but-accurate model rising 14 ranks and one fast-but-flawed model falling 14 ranks as depth grows. A cross-model statistical analysis (n=22) refutes the intuition that more reasoning tokens predict higher accuracy: the two anti-correlate (Spearman rho = -0.54), because the strongest models are the most concise. The same analysis shows that V is accuracy-anchored (rho = 0.91) yet regime-dependent in its cost sensitivity. We further report an epistemic-integrity (sycophancy) track that cleanly separates principled resistance from mere stubbornness, a negative result on cheap-ensemble fusion (with a denominator-artifact methodology lesson), and a license-aware data hub that unifies our first-party scores, V, and aggregated third-party benchmarks behind one sync-friendly API. The private holdout is never released.
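
The depth-dependent rank inversion described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's V formula, which the abstract does not state. It assumes only that steps in a chain succeed independently, so chain success is per-step accuracy raised to the depth, and that a failed chain is retried from scratch. The names ModelProfile, chain_success, expected_seconds_per_success, and fastest_to_success, and both model profiles, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-step profile; none of these numbers come from the paper."""
    name: str
    step_accuracy: float     # probability that one step of a chain is correct
    seconds_per_step: float  # wall-clock latency of one step


def chain_success(step_accuracy: float, depth: int) -> float:
    """Probability that `depth` independent steps all succeed."""
    return step_accuracy ** depth


def expected_seconds_per_success(model: ModelProfile, depth: int) -> float:
    """Expected time to obtain one fully correct chain, assuming a failed
    chain is detected only at the end and retried from scratch."""
    one_attempt = depth * model.seconds_per_step
    return one_attempt / chain_success(model.step_accuracy, depth)


def fastest_to_success(models, depth):
    """Model with the lowest expected time per correct chain at this depth."""
    return min(models, key=lambda m: expected_seconds_per_success(m, depth))


# Two made-up profiles: one quick but error-prone, one slower but accurate.
MODELS = [
    ModelProfile("fast_flawed", step_accuracy=0.80, seconds_per_step=1.0),
    ModelProfile("slow_accurate", step_accuracy=0.95, seconds_per_step=3.0),
]

if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        best = fastest_to_success(MODELS, depth)
        print(depth, best.name)
```

With these illustrative profiles, the fast model is preferable for a one-step task, while the accurate model is preferable for a ten-step chain, because the fast model's chain-success probability decays much more quickly with depth. Money cost and debugging friction, which the abstract says V also accounts for, are deliberately left out of this sketch.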

Files

main.pdf (133.4 kB, md5:ffdbd6468b2c12a05fcb1eba0d1138d5)

Additional details

Is supplemented by: https://www.meoadvisors.com (URL)
