Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages
Authors/Creators
Description
AI agents and large language models (LLMs) typically consume web pages by fetching HTML and then repeating boilerplate removal, chunking, and semantic extraction at query time. We present the Structured Data Format (SDF), an open, schema-validated JSON protocol for publishing pre-extracted, agent-oriented semantic representations of web content. A production crawler processed 6,206 URLs and generated 2,335+ schema-valid SDF documents across 10 parent types and 74 observed type combinations. A fine-tuned 1.5B + 3B pipeline achieves a 4.1x latency reduction versus a 14B baseline, with 90% exact extraction accuracy. In a downstream consumption experiment, the SDF path achieved 0.739 mean accuracy versus 0.352 for the raw path at the 7B model scale (t(29) = 11.890, p < 0.05), with a 58.5% latency reduction and a 99.2% token reduction relative to raw HTML.
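The "convert once, consume many" idea in the description can be illustrated with a minimal sketch. It assumes a content-hash cache in front of an extractor, so the expensive extraction step runs once per distinct page and every later consumer reads the validated JSON. The field names (`url`, `parent_type`, `summary`), the `ConvertOnceCache` class, and `toy_extractor` are hypothetical stand-ins chosen for illustration; they are not taken from the SDF schema or the authors' pipeline.

```python
import hashlib
import json

# Hypothetical required fields; the real SDF schema may differ.
REQUIRED_FIELDS = {"url": str, "parent_type": str, "summary": str}


def validate(doc: dict) -> None:
    """Reject documents that lack a required field or carry the wrong type."""
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(doc.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")


class ConvertOnceCache:
    """Run the extractor once per distinct page content, then serve from cache."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._store = {}          # content hash -> serialized document
        self.extractions = 0      # how many times the extractor actually ran

    def get(self, url: str, html: str) -> dict:
        key = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if key not in self._store:
            doc = self._extractor(url, html)
            validate(doc)         # only validated documents are cached
            self._store[key] = json.dumps(doc)
            self.extractions += 1
        return json.loads(self._store[key])


def toy_extractor(url: str, html: str) -> dict:
    # Stand-in for a model-based extraction pipeline (assumption).
    return {"url": url, "parent_type": "article", "summary": html[:40]}


cache = ConvertOnceCache(toy_extractor)
first = cache.get("https://example.com/post", "<p>Hello, agents.</p>")
second = cache.get("https://example.com/post", "<p>Hello, agents.</p>")
print(first == second, cache.extractions)
```

Keying on a hash of the page content means a changed page triggers a fresh extraction, while repeated requests for unchanged content never touch the extractor again.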
Files
| Name | Size | Checksum |
|---|---|---|
| sdf-whitepaper.pdf | 185.5 kB | md5:2f6a8bc07c8e229692728a8f289694b0 |
Additional details
Dates
- Submitted: 2026-02-09
Software
- Repository URL: https://github.com/sdfprotocol/sdfprotocol.github.io