Published March 31, 2026 | Version v1.0
Technical note | Open

MMLU-Pro Under Admissible Interface Perturbations: A Three-Family Stress Test with Prediction-Space Canonicalization

Authors/Creators

Description

This technical note reports a controlled stress test of MMLU-Pro under three admissible answer-interface conditions: baseline, choice_shuffle, and label_remap. Three model families are evaluated on a locked subset of 140 items spanning 14 categories. The note identifies a methodological caveat that affects prediction comparability across conditions and corrects it through full prediction-space canonicalization with exact decoder recovery. After canonicalization, the family-level perturbation signatures are unchanged, while the raw prediction-level instability is partly reduced but not eliminated. The result is diagnostic and local in scope: under the tested setup, MMLU-Pro remains locally usable but exhibits interface-sensitive evaluative closure and limited global neutrality.
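To make the canonicalization idea concrete, the sketch below shows one way predictions from different interface conditions could be mapped back to a shared index space before they are compared. It is an illustration under stated assumptions, not the note's own implementation: it assumes choice_shuffle reorders the options while keeping the displayed labels in place, and label_remap swaps the label symbols while keeping the option order. All function names and the example label string are hypothetical.

```python
import random
from typing import Dict, List, Optional

LABELS = "ABCDEFGHIJ"  # MMLU-Pro items carry up to ten options


def baseline_decoder(n_options: int) -> Dict[str, int]:
    """Displayed label -> original option index, with no perturbation."""
    return {LABELS[i]: i for i in range(n_options)}


def choice_shuffle_decoder(n_options: int, seed: int) -> Dict[str, int]:
    """Assumed choice_shuffle: options are permuted, labels stay in place."""
    order = list(range(n_options))
    random.Random(seed).shuffle(order)  # order[pos] = original index shown at pos
    return {LABELS[pos]: orig for pos, orig in enumerate(order)}


def label_remap_decoder(n_options: int, new_labels: str) -> Dict[str, int]:
    """Assumed label_remap: option order is fixed, label symbols are replaced."""
    if len(set(new_labels[:n_options])) != n_options:
        raise ValueError("need one distinct label per option")
    return {new_labels[i]: i for i in range(n_options)}


def canonicalize(raw_label: str, decoder: Dict[str, int]) -> Optional[int]:
    """Map a predicted label to its original option index (None if invalid)."""
    return decoder.get(raw_label.strip())


def disagreement_rate(a: List[Optional[int]], b: List[Optional[int]]) -> float:
    """Share of items whose canonical predictions differ between two conditions."""
    if len(a) != len(b) or not a:
        raise ValueError("prediction lists must be non-empty and aligned")
    return sum(x != y for x, y in zip(a, b)) / len(a)
```

Because every decoder in this sketch is a bijection between displayed labels and original indices, the inverse mapping is exact. Comparing raw labels across conditions would count a model that picks the same option under a different label as a changed prediction, whereas comparing canonical indices would not. A call such as `canonicalize("Q", label_remap_decoder(4, "PQRS"))` returns the original index of the option shown as `Q`.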

Files

MMLU-Pro Under Admissible Interface Perturbations.pdf (347.3 kB)

Additional details

Dates

Issued
2026-03-31