Published March 9, 2026 | Version v1
Preprint
Open
Constellation: Lossless Document Structuring via Control-Data Flow Decoupling
Description
Accurately transforming unstructured documents into structured intermediate representations with hierarchical semantics (such as Markdown or JSON) is a foundation for Retrieval-Augmented Generation (RAG), automated knowledge graph construction, and large language model (LLM) context engineering. Traditional rule-based parsing engines, however, suffer severe hierarchy loss when handling non-standard formatting. End-to-end LLM extraction has two inherent flaws of its own: first, autoregressive generation can hallucinate and alter the text, which makes it unusable in strict domains that require absolute fidelity; second, the output token cost of rewriting the full text is prohibitively high. To address these issues, we propose Constellation, a document parsing architecture that strictly decouples the semantic control flow from the data transfer flow at the system level. The LLM acts solely as a "control center" that outputs low-dimensional spatial anchors and hierarchical judgments (control flow), while an underlying deterministic finite state machine (FSM) extracts the original characters directly and assembles them losslessly based on these anchors (data flow). The parsed result is a native JSON document tree that can be serialized into any downstream format at no additional cost. On benchmark datasets ranging from 1.9M to 10.07M characters, the architecture achieves zero character-level loss at the physical extraction layer and effectively resolves hierarchical misalignment in long texts. In a side-by-side comparison with IBM Docling and Microsoft MarkItDown, Constellation achieved a hierarchy accuracy of 1.0000 in extreme non-standard scenarios lacking explicit heading styles, whereas the rule-based schemes degraded to 0.0000. This research offers a low-overhead, absolute-fidelity paradigm for precise document parsing in the LLM era.
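The control/data split described above can be illustrated with a minimal sketch. It is not the Constellation implementation: the `Anchor` and `Node` types, the `assemble`, `flatten`, and `to_markdown` functions, and the hard-coded anchors standing in for LLM output are all hypothetical, and a simple stack-based assembler stands in for the deterministic FSM. The point it shows is that every character in the tree is sliced from the source by offset, so concatenating the tree in document order reproduces the input exactly.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Anchor:
    """Hypothetical control-flow record emitted by the LLM for one heading."""
    start: int  # offset of the heading's first character in the source
    end: int    # offset just past the heading's last character
    level: int  # 1 = top-level heading, 2 = subsection, ...


@dataclass
class Node:
    title: str
    level: int
    text: str = ""
    children: list[Node] = field(default_factory=list)


def assemble(source: str, anchors: list[Anchor]) -> Node:
    """Data flow: build the tree by slicing `source`; nothing is regenerated."""
    root = Node(title="", level=0)
    ordered = sorted(anchors, key=lambda a: a.start)
    prev_end = 0
    for a in ordered:  # reject anchors that would drop or duplicate characters
        if a.level < 1 or a.start < prev_end or not a.start <= a.end <= len(source):
            raise ValueError(f"invalid anchor: {a}")
        prev_end = a.end
    # Text before the first heading belongs to the root.
    root.text = source[: ordered[0].start] if ordered else source
    stack = [root]
    for i, a in enumerate(ordered):
        body_end = ordered[i + 1].start if i + 1 < len(ordered) else len(source)
        node = Node(source[a.start:a.end], a.level, source[a.end:body_end])
        # Pop until the top of the stack is a strict ancestor of the new node.
        while stack[-1].level >= a.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def flatten(node: Node) -> str:
    """Concatenate the tree in document order (used to check fidelity)."""
    return node.title + node.text + "".join(flatten(c) for c in node.children)


def to_markdown(node: Node) -> str:
    """Serialize the tree to Markdown by prefixing headings with '#' marks."""
    head = f"{'#' * node.level} {node.title}" if node.level else ""
    return head + node.text + "".join(to_markdown(c) for c in node.children)


source = (
    "Intro line\n"
    "Chapter One\nbody a\n"
    "Section 1.1\nbody b\n"
    "Chapter Two\nbody c\n"
)


def anchor_for(heading: str, level: int) -> Anchor:
    """Stand-in for the LLM: locate a heading and assign it a level."""
    start = source.index(heading)
    return Anchor(start, start + len(heading), level)


anchors = [anchor_for("Chapter One", 1), anchor_for("Section 1.1", 2),
           anchor_for("Chapter Two", 1)]
tree = assemble(source, anchors)
assert flatten(tree) == source  # character-level fidelity check
print(json.dumps(asdict(tree), indent=2))
print(to_markdown(tree))
```

In this sketch the model's output is a few integers per heading, however long the document is, and a wrong anchor can misplace a boundary or a level but cannot alter the text, since the assembler only copies slices of the source.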
Files
| Name | Size |
|---|---|
| main (5).pdf (md5:942e13943eaa48903b9fbd96314d9026) | 2.9 MB |
Additional details
Software
- Repository URL: https://github.com/1911342723/Constellation
- Programming language: Python
- Development Status: Active