Systematic Evaluation of Evaluation Protocol Factors Driving Extreme Qwen2.5 Performance Discrepancies on DocVQA
Description
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innova
Research goal: Reproducibility meta-analysis. 3 independent publications report divergent Qwen2.5 performance on DocVQA, with an 80.2 percentage-point spread (range 14.1%–94.3%).
Source papers: "DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections" (2025, 14.1%); "VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Mul…" (2025, 94.3%); "VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Mul…" (2025, 94.3%).
Preliminary analysis suggests that the extreme discrepancy likely stems from DocHop-QA evaluating Qwen2.5 in a strict zero-shot setting on complex multi-hop reasoning tasks without fine-tuning, whereas VisionSelector reports scores from a model checkpoint that has been fine-tuned or augmented with its specific visual token compression module. Additio…
Objectives: (1) systematically evaluate which evaluation protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; (2) identify the highest-confidence explanation supported by each paper's stated methodology; and (3) assess whether the highest-reported score is reproducible under the conditions described by the lowest-reporting paper.
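The spread quoted in the research goal is simple arithmetic over the two reported endpoints, and the candidate protocol factors form a fixed checklist. The following is a minimal sketch of both, assuming only the scores and factor names stated above; the names `ReportedScore`, `REPORTED`, `PROTOCOL_FACTORS`, and `spread_pp` are hypothetical and introduced purely for illustration, not taken from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedScore:
    """One published score (hypothetical container for illustration)."""
    paper: str
    score_pct: float  # score as reported, in percent


# Scores copied from the research goal; nothing here is newly measured.
REPORTED = [
    ReportedScore("DocHop-QA", 14.1),
    ReportedScore("VisionSelector", 94.3),
]

# Candidate protocol factors, exactly as enumerated in the research goal.
PROTOCOL_FACTORS = [
    "model configuration",
    "inference setup",
    "quantization",
    "tokenization",
    "few-shot count",
    "metric interpretation",
    "data-split selection",
]


def spread_pp(scores):
    """Return max minus min reported score, in percentage points.

    Rounded to one decimal to match the precision of the reported scores
    and to avoid floating-point noise in the subtraction.
    """
    values = [s.score_pct for s in scores]
    return round(max(values) - min(values), 1)


if __name__ == "__main__":
    print(f"Spread: {spread_pp(REPORTED)} percentage points")
    for factor in PROTOCOL_FACTORS:
        print(f"- to check: {factor}")
```

Keeping the scores and factors as plain data makes it easy to add further reported results later and recompute the spread without editing the logic.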
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.2/10.
Notes
Files
paper.pdf (83.4 kB)
md5:228d9ffbd07ba1dd222fd120a988ce32