A Survey of Document Understanding and Question Answering
Authors/Creators
Description
Documents (paper media, images, or electronic files carrying textual and graphical information) are ubiquitous in daily office work, online dissemination, and governmental and enterprise workflows; automatically parsing them, retrieving from them, and supporting decisions based on their content is the central goal of Document Intelligence. In real-world settings, a large portion of documents are first converted into images by scanning or photographing, so document processing typically starts from visual input: on the one hand, the system must accurately localize layout elements such as text blocks, tables, figures, and headings, and perform text detection and recognition; on the other hand, it must carry out higher-level semantic understanding and reasoning on top of the resulting structured inputs in order to answer queries, extract key information, and generate verifiable results. With the advances of deep learning and large language models, the research focus has gradually expanded from "character-level recognition" to "document-level understanding and question answering" across regions, pages, and modalities, and industry has correspondingly developed evaluation demands that emphasize end-to-end capability and robustness.
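The two-stage flow described above (localize layout elements, recognize text, then reason over the structured result) can be sketched as a minimal, purely illustrative pipeline. All class and function names here are hypothetical stand-ins, not any system from the survey; a real pipeline would use trained layout detectors and OCR engines in place of the stubs.

```python
from dataclasses import dataclass

# Hypothetical toy type standing in for a layout-detection / OCR result.
@dataclass
class Region:
    kind: str    # e.g. "text", "table", "figure", "heading"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str    # recognized text (empty for figures)

def detect_layout(page_image):
    """Stub for a layout detector; a real system would run a trained
    detection/segmentation model over the page image."""
    # Canned regions, just to show the data flow between stages.
    return [
        Region("heading", (0, 0, 100, 10), "Quarterly Report"),
        Region("text", (0, 12, 100, 60), "Revenue grew 12% year over year."),
    ]

def recognize_text(regions):
    """Stub for the OCR stage: in practice each region crop is passed to a
    text recognizer; errors made here propagate to every downstream step."""
    return [r for r in regions if r.text]

def answer(regions, question):
    """Toy 'reasoning' step: keyword lookup over the recognized text."""
    for r in regions:
        if any(w in r.text.lower() for w in question.lower().split()):
            return r.text
    return None

page = object()  # placeholder for a page image
print(answer(recognize_text(detect_layout(page)), "revenue growth"))
```

The point of the sketch is the dependency structure: the question-answering step only ever sees what the perception stages emitted, which is why the survey stresses OCR error propagation.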
Following this evolution, this survey is organized in the order "perception first, then reasoning, and finally end-to-end unified modeling", and aligns evaluation suites with methodological lineages. Part I (Document Layout Analysis + OCR) focuses on layout parsing starting from page geometry, covering detection/segmentation-based layout element recognition, layout-aware models that fold layout information into pretrained representation learning, and robustness and cross-domain generalization under real-world document distributions; it then reviews the key technical points of OCR from traditional pipelines to deep learning and lightweight deployment, emphasizing error propagation and engineering constraints when OCR serves as the entry point for downstream understanding tasks. Part II (Document Understanding and Question Answering) systematically summarizes three mainstream scenarios built on structured inputs: structure-aware representations, modular/executable reasoning, and verifiable question answering for tables; retrieval-augmented generation (RAG) and reasoning-driven retrieval for long texts; and multimodal pretraining and instruction alignment for visually rich documents, the OCR-free evolution, and multi-page long-context modeling. Finally, we summarize the datasets and benchmarks associated with these two stages, covering task settings from single-page to multi-page documents and from closed sets to real-world enterprise documents, providing reproducible comparative baselines and a systematic research roadmap for future studies.
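As a minimal illustration of the retrieval-augmented pattern surveyed for long texts, the sketch below retrieves the most relevant document chunks and assembles them into a prompt. The word-overlap scoring and all names here are simplifications introduced for illustration; real RAG systems use dense embeddings or learned retrievers, and pass the prompt to a generator model.

```python
def retrieve(chunks, query, k=2):
    """Rank document chunks by naive word overlap with the query
    (a stand-in for embedding-based retrieval) and keep the top k."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))
    return scored[:k]

def build_prompt(chunks, query):
    """Assemble the retrieved context plus the question; in a full RAG
    pipeline this prompt would be sent to a generator LLM."""
    context = "\n".join(retrieve(chunks, query))
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The warranty period is 24 months from delivery.",
    "Payments are due within 30 days of invoice.",
    "The supplier ships from Rotterdam.",
]
print(build_prompt(chunks, "What is the warranty period?"))
```

Only the retrieved chunks reach the generator, which is what lets the pattern scale to documents far longer than a model's context window.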
Files

| Name | Size | MD5 |
|---|---|---|
| survey.pdf | 1.9 MB | 0e95edb275ab7f85be436f8b129d9d0a |