Reproducibility Meta-Analysis of Divergent Qwen3 MATH Benchmarks: Evaluating Protocol Factors Behind a 75-Point Performance Spread
Description
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturat
Research goal: Reproducibility meta-analysis. Two independent publications report divergent Qwen3 performance on MATH, with a spread of 75.0 percentage points (range 0.0%–75.0%).
Source papers: "SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Mult…" (2025, 0.0%); "DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs" (2026, 75.0%).
Preliminary analysis: The extreme discrepancy likely stems from SPIRAL evaluating a base pre-training checkpoint without mathematical instruction tuning or specific chain-of-thought prompting, whereas DiffCoT reports results on a model fine-tuned with its specialized diffusion-style reasoning framework. Additionally, the 0.0% score suggest…
Objectives: (1) systematically evaluate which evaluation-protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; (2) identify the highest-confidence explanation supported by each paper's stated methodology; and (3) assess whether the highest reported score is reproducible under the conditions described by the lowest-reporting paper.
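The comparison described in the research goal can be sketched as a small bookkeeping script. This is a minimal illustration, not the report's actual method: the `Report` class and the helper names `spread` and `differing_factors` are hypothetical, and the per-paper protocol values are deliberately left empty because this record does not state them. Only the two reported scores and the list of factors come from the text above.

```python
from dataclasses import dataclass, field

# Protocol factors named in the research goal.
FACTORS = [
    "model configuration",
    "inference setup",
    "quantization",
    "tokenization",
    "few-shot count",
    "metric interpretation",
    "data-split selection",
]


@dataclass
class Report:
    """One paper's reported result (hypothetical container for illustration)."""
    paper: str
    year: int
    score: float  # reported MATH accuracy, in percent
    # factor name -> value stated in the paper; empty here (assumption:
    # to be filled in from each paper's methodology section)
    protocol: dict = field(default_factory=dict)


def spread(reports):
    """Difference between the highest and lowest reported score, in percentage points."""
    scores = [r.score for r in reports]
    return max(scores) - min(scores)


def differing_factors(a, b):
    """Split FACTORS into those stated by both papers with different values
    and those left unstated by at least one paper."""
    differ, unstated = [], []
    for factor in FACTORS:
        va, vb = a.protocol.get(factor), b.protocol.get(factor)
        if va is None or vb is None:
            unstated.append(factor)
        elif va != vb:
            differ.append(factor)
    return differ, unstated


reports = [
    Report("SPIRAL", 2025, 0.0),
    Report("DiffCoT", 2026, 75.0),
]

print(spread(reports))  # 75.0
differ, unstated = differing_factors(*reports)
print(len(unstated))  # 7: no protocol values have been entered yet
```

Filling each `protocol` dictionary from the papers' stated methodology would move factors from `unstated` into either agreement or `differ`, which narrows down the candidate explanations for the spread.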
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.5/10.
Notes
Files
| Name | Size | Checksum |
|---|---|---|
| paper.pdf | 83.6 kB | md5:39eddbe8d89305001b3d9c1a5f0db698 |