Published February 9, 2026 | Version 1.0.0
Preprint Open

Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages

Description

AI agents and large language models (LLMs) typically consume web pages by fetching HTML and repeating boilerplate removal, chunking, and semantic extraction at query time. We present the Structured Data Format (SDF), an open, schema-validated JSON protocol for publishing pre-extracted, agent-oriented semantic representations of web content. A production crawler processed 6,206 URLs and generated 2,335+ schema-valid SDF documents across 10 parent types and 74 observed type combinations. A pipeline of fine-tuned 1.5B and 3B models achieves a 4.1x latency reduction versus a 14B baseline, with 90% exact extraction accuracy. In a downstream consumption experiment at 7B, the SDF path achieved 0.739 mean accuracy versus 0.352 for raw HTML (t(29) = 11.890, p < 0.05), with a 58.5% latency reduction and a 99.2% token reduction relative to raw HTML.
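The "convert once, consume many" idea in the description can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the whitepaper: the field names in `REQUIRED_FIELDS`, the `validate_sdf` helper, the `SdfCache` class, and the `fake_convert` stand-in for the extraction pipeline are all hypothetical. The actual SDF schema and type system are defined in the whitepaper itself.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical required fields and their types. The real SDF schema is
# specified in the whitepaper; these names are placeholders only.
REQUIRED_FIELDS = {"url": str, "parent_type": str, "summary": str}


def validate_sdf(doc: Dict[str, Any]) -> None:
    """Raise ValueError if a (hypothetical) required field is missing or mistyped."""
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(doc.get(name), expected):
            raise ValueError(f"field {name!r} must be of type {expected.__name__}")


class SdfCache:
    """Convert each URL at most once; later consumers reuse the stored JSON."""

    def __init__(self, convert: Callable[[str], Dict[str, Any]]) -> None:
        self._convert = convert
        self._store: Dict[str, str] = {}
        self.conversions = 0  # counts how often the expensive step actually ran

    def get(self, url: str) -> Dict[str, Any]:
        if url not in self._store:
            doc = self._convert(url)   # expensive extraction, done once per URL
            validate_sdf(doc)          # reject documents that fail the check
            self._store[url] = json.dumps(doc)
            self.conversions += 1
        return json.loads(self._store[url])


def fake_convert(url: str) -> Dict[str, Any]:
    # Stand-in for the extraction pipeline; returns placeholder content.
    return {"url": url, "parent_type": "article", "summary": "placeholder"}


cache = SdfCache(fake_convert)
first = cache.get("https://example.com/a")
second = cache.get("https://example.com/a")  # served from the cache
```

In this sketch the second `get` call does not invoke the converter again, which is the property the latency and token comparisons in the description rely on: the extraction cost is paid at publication time rather than at every query.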

Files

sdf-whitepaper.pdf (185.5 kB)
md5:2f6a8bc07c8e229692728a8f289694b0

Additional details

Dates

Submitted
2026-02-09
Submission date