Systematic Evaluation of Evaluation Protocol Factors Driving Extreme Qwen2.5 Performance Discrepancies on DocVQA

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20636371

Published June 11, 2026 | Version v1

Report Open

Systematic Evaluation of Evaluation Protocol Factors Driving Extreme Qwen2.5 Performance Discrepancies on DocVQA

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innova

Research goal: Reproducibility meta-analysis: 3 independent publications report divergent Qwen2.5 performance on Docvqa with a 80.3 percentage-point spread (range 14.1%–94.3%). Source papers: "DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections" (2025, 14.1%); "VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Mul…" (2025, 94.3%); "VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Mul…" (2025, 94.3%). Preliminary analysis suggests: The extreme discrepancy likely stems from DocHop-QA evaluating Qwen2.5 in a strict zero-shot setting on complex multi-hop reasoning tasks without fine-tuning, whereas VisionSelector reports scores from a model checkpoint that has been fine-tuned or augmented with their specific visual token compression module. Additio… Systematically evaluate which evaluation protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; identify the highest-confidence explanation supported by each paper's stated methodology; and assess whether the highest-reported score is reproducible under the conditions described by the lowest-reporting paper.

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.2/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 9.2/10.

Files

paper.pdf

Files (83.4 kB)

Name	Size	Download all
paper.pdf md5:228d9ffbd07ba1dd222fd120a988ce32	83.4 kB	Preview Download

	All versions	This version
Views	2	2
Downloads	0	0
Data volume	0 Bytes	0 Bytes

Systematic Evaluation of Evaluation Protocol Factors Driving Extreme Qwen2.5 Performance Discrepancies on DocVQA

Authors/Creators

Description

Notes

Files

paper.pdf

Files (83.4 kB)