A Survey of Document Understanding and Question Answering
Authors/Creators
Description
Documents (paper media, images, or electronic files carrying textual and graphical information) are ubiquitous in daily office work, online dissemination, and governmental and enterprise workflows; automatically parsing them, retrieving from them, and supporting decisions based on their content is the central goal of Document Intelligence. In real-world settings, a large portion of documents are first converted into images by scanning or photographing, so document processing typically starts from visual input: on the one hand, the system must accurately localize layout elements such as text blocks, tables, figures, and headings, and perform text detection and recognition; on the other hand, it must carry out higher-level semantic understanding and reasoning on top of the resulting structured inputs in order to answer queries, extract key information, and generate verifiable results. With the advances of deep learning and large language models, the research focus has gradually expanded from "character-level recognition" to "document-level understanding and question answering" across regions, pages, and modalities, and industry has correspondingly developed evaluation demands that emphasize end-to-end capability and robustness.
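The two-stage flow described above (localize layout elements, recognize text, then reason over the structured result) can be sketched as a minimal, purely illustrative pipeline. All class and function names here are hypothetical stand-ins, not any system from the survey; a real pipeline would use trained layout detectors and OCR engines in place of the stubs.

```python
from dataclasses import dataclass

# Hypothetical toy type standing in for a layout-detection / OCR result.
@dataclass
class Region:
    kind: str    # e.g. "text", "table", "figure", "heading"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str    # recognized text (empty for figures)

def detect_layout(page_image):
    """Stub for a layout detector; a real system would run a trained
    detection/segmentation model over the page image."""
    # Canned regions, just to show the data flow between stages.
    return [
        Region("heading", (0, 0, 100, 10), "Quarterly Report"),
        Region("text", (0, 12, 100, 60), "Revenue grew 12% year over year."),
    ]

def recognize_text(regions):
    """Stub for the OCR stage: in practice each region crop is passed to a
    text recognizer; errors made here propagate to every downstream step."""
    return [r for r in regions if r.text]

def answer(regions, question):
    """Toy 'reasoning' step: keyword lookup over the recognized text."""
    for r in regions:
        if any(w in r.text.lower() for w in question.lower().split()):
            return r.text
    return None

page = object()  # placeholder for a page image
print(answer(recognize_text(detect_layout(page)), "revenue growth"))
```

The point of the sketch is the dependency structure: the question-answering step only ever sees what the perception stages emitted, which is why the survey stresses OCR error propagation.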
Following this evolution, this survey is organized in the order "perception first, then reasoning, and finally end-to-end unified modeling", and aligns evaluation suites with methodological lineages. Part I (Document Layout Analysis + OCR) focuses on layout parsing starting from page geometry, covering detection/segmentation-based layout element recognition, layout-aware models that fold layout information into pretrained representation learning, and robustness and cross-domain generalization under real-world document distributions; it then reviews the key technical points of OCR from traditional pipelines to deep learning and lightweight deployment, emphasizing error propagation and engineering constraints when OCR serves as the entry point for downstream understanding tasks. Part II (Document Understanding and Question Answering) systematically summarizes three mainstream scenarios built on structured inputs: structure-aware representations, modular/executable reasoning, and verifiable question answering for tables; retrieval-augmented generation (RAG) and reasoning-driven retrieval for long texts; and multimodal pretraining and instruction alignment for visually rich documents, the OCR-free evolution, and multi-page long-context modeling. Finally, we summarize the datasets and benchmarks associated with these two stages, covering task settings from single-page to multi-page documents and from closed sets to real-world enterprise documents, providing reproducible comparative baselines and a systematic research roadmap for future studies.
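As a minimal illustration of the retrieval-augmented pattern surveyed for long texts, the sketch below retrieves the most relevant document chunks and assembles them into a prompt. The word-overlap scoring and all names here are simplifications introduced for illustration; real RAG systems use dense embeddings or learned retrievers, and pass the prompt to a generator model.

```python
def retrieve(chunks, query, k=2):
    """Rank document chunks by naive word overlap with the query
    (a stand-in for embedding-based retrieval) and keep the top k."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))
    return scored[:k]

def build_prompt(chunks, query):
    """Assemble the retrieved context plus the question; in a full RAG
    pipeline this prompt would be sent to a generator LLM."""
    context = "\n".join(retrieve(chunks, query))
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The warranty period is 24 months from delivery.",
    "Payments are due within 30 days of invoice.",
    "The supplier ships from Rotterdam.",
]
print(build_prompt(chunks, "What is the warranty period?"))
```

Only the retrieved chunks reach the generator, which is what lets the pattern scale to documents far longer than a model's context window.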
Files

| Name | Size | MD5 |
|---|---|---|
| survey.pdf | 1.9 MB | 0e95edb275ab7f85be436f8b129d9d0a |