Reproducibility Meta-Analysis of Divergent Qwen3 MATH Benchmarks: Evaluating Protocol Factors Behind a 75-Point Performance Spread

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20636354

Published June 11, 2026 | Version v1

Report Open

Reproducibility Meta-Analysis of Divergent Qwen3 MATH Benchmarks: Evaluating Protocol Factors Behind a 75-Point Performance Spread

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5\% and 17.0\%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturat

Research goal: Reproducibility meta-analysis: 2 independent publications report divergent Qwen3 performance on MATH with a 75.0 percentage-point spread (range 0.0%–75.0%). Source papers: "SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Mult…" (2025, 0.0%); "DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs" (2026, 75.0%). Preliminary analysis suggests: The extreme discrepancy likely stems from SPIRAL evaluating a base pre-training checkpoint without mathematical instruction tuning or specific chain-of-thought prompting, whereas DiffCoT reports results on a model fine-tuned with its specialized diffusion-style reasoning framework. Additionally, the 0.0% score suggest… Systematically evaluate which evaluation protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; identify the highest-confidence explanation supported by each paper's stated methodology; and assess whether the highest-reported score is reproducible under the conditions described by the lowest-reporting paper.

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 9.5/10.

Files

paper.pdf

Files (83.6 kB)

Name	Size	Download all
paper.pdf md5:39eddbe8d89305001b3d9c1a5f0db698	83.6 kB	Preview Download

	All versions	This version
Views	1	1
Downloads	0	0
Data volume	0 Bytes	0 Bytes

Reproducibility Meta-Analysis of Divergent Qwen3 MATH Benchmarks: Evaluating Protocol Factors Behind a 75-Point Performance Spread

Authors/Creators

Description

Notes

Files

paper.pdf

Files (83.6 kB)