Published March 9, 2026 | Version v1
Preprint Open

Constellation: Lossless Document Structuring via Control-Data Flow Decoupling

Authors/Creators

  • 1. Southwest University

Description

Accurately transforming unstructured documents into structured intermediate representations with hierarchical semantics (such as Markdown or JSON) is the foundation of Retrieval-Augmented Generation (RAG), automated knowledge graph construction, and large language model (LLM) context engineering. Traditional purely rule-based parsing engines, however, lose most of the heading hierarchy when they encounter non-standard formatting. End-to-end LLM extraction, in turn, has two inherent flaws. First, autoregressive generation produces hallucinations that cannot be fully controlled and that alter the source text, which rules it out in strict domains requiring absolute fidelity. Second, rewriting the full text makes the output token cost prohibitively high. To address these issues, we propose the Constellation document parsing architecture, which strictly decouples the semantic control flow from the data transfer flow at the system level. The LLM acts solely as a "control center" that outputs low-dimensional spatial anchors and hierarchical judgments (control flow), while an underlying deterministic finite state machine (FSM) extracts the original characters directly and assembles them losslessly based on these anchors (data flow). The parsed result is a native JSON document tree that can be serialized into any downstream format at no additional cost. On benchmark datasets ranging from 1.9M to 10.07M characters, the architecture achieves zero character-level loss at the physical extraction layer and also effectively resolves hierarchical misalignment in long texts. In head-to-head comparisons with IBM Docling and Microsoft MarkItDown, Constellation achieved a hierarchy accuracy of 1.0000 in extreme non-standard scenarios lacking explicit heading styles, whereas the rule-based schemes degraded to 0.0000. This research provides a new paradigm for precise document parsing in the LLM era, combining low computational overhead with absolute fidelity.
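
The control/data split described above can be illustrated with a minimal Python sketch. It is not taken from the Constellation repository: the names `Anchor`, `Node`, `assemble`, and `flatten` are hypothetical, the anchors are hard-coded where a real system would obtain them from an LLM, and a simple stack stands in for the FSM. The point it shows is that the tree is built only from verbatim slices of the source, so concatenating the nodes reproduces the input exactly.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """Hypothetical control-flow record (stand-in for what the LLM would emit)."""
    offset: int  # character offset in the source where a section starts
    level: int   # 1 = top-level heading, 2 = subsection, ...


@dataclass
class Node:
    level: int
    text: str  # verbatim slice of the source; never regenerated
    children: list = field(default_factory=list)

    def to_dict(self) -> dict:
        return {"level": self.level, "text": self.text,
                "children": [c.to_dict() for c in self.children]}


def assemble(source: str, anchors: list[Anchor]) -> Node:
    """Deterministically build a document tree by slicing `source` at anchors."""
    anchors = sorted(anchors, key=lambda a: a.offset)
    for a in anchors:
        if a.level < 1 or not 0 <= a.offset <= len(source):
            raise ValueError(f"invalid anchor: {a}")
    first = anchors[0].offset if anchors else len(source)
    root = Node(level=0, text=source[:first])  # text before the first anchor
    stack = [root]
    for i, a in enumerate(anchors):
        end = anchors[i + 1].offset if i + 1 < len(anchors) else len(source)
        node = Node(level=a.level, text=source[a.offset:end])
        # Pop until the top of the stack is a strictly shallower ancestor.
        while stack[-1].level >= a.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def flatten(node: Node) -> str:
    """Pre-order concatenation; equals the source if extraction is lossless."""
    return node.text + "".join(flatten(c) for c in node.children)


# Usage: the anchors below are written by hand purely for illustration.
source = ("Preface\nChapter One\nbody a\nSection 1.1\nbody b\n"
          "Chapter Two\nbody c\n")
anchors = [
    Anchor(source.index("Chapter One"), 1),
    Anchor(source.index("Section 1.1"), 2),
    Anchor(source.index("Chapter Two"), 1),
]
tree = assemble(source, anchors)
serialized = json.dumps(tree.to_dict(), indent=2)  # JSON document tree
```

Because the model in this sketch contributes only offsets and levels, a wrong anchor can misplace a section in the hierarchy but cannot change any character of the text, and the number of control tokens grows with the number of sections rather than with document length.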

Files

main (5).pdf

md5:942e13943eaa48903b9fbd96314d9026
2.9 MB

Additional details

Software

Repository URL
https://github.com/1911342723/Constellation
Programming language
Python
Development Status
Active