Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages

Sarkar, Pranab

doi:10.5281/zenodo.18559223

Published February 9, 2026 | Version 1.0.0

Preprint Open

Sarkar, Pranab (Researcher)

AI agents and large language models (LLMs) typically consume web pages by fetching HTML and repeatedly performing boilerplate removal, chunking, and semantic extraction at query time. We present Structured Data Format (SDF), an open, schema-validated JSON protocol for publishing pre-extracted, agent-oriented semantic representations of web content. A production crawler processed 6,206 URLs and generated 2,335+ schema-valid SDF documents across 10 parent types and 74 observed type combinations. A fine-tuned 1.5B + 3B pipeline achieves a 4.1x latency reduction versus a 14B baseline with 90% exact extraction accuracy. In a downstream consumption experiment, the SDF path achieved 0.739 mean accuracy versus 0.352 for the raw path at 7B (t(29) = 11.890, p < 0.05), with a 58.5% latency reduction and a 99.2% token reduction relative to raw HTML.
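The "convert once, consume many" idea in the abstract can be illustrated with a minimal sketch: extract a typed document from a page a single time, validate it, cache it, and serve the cached copy to every later consumer. Everything named here is an assumption made for illustration. The field names (`source_url`, `parent_type`, `summary`, `entities`), the `ConvertOnceCache` class, and `toy_extractor` are hypothetical and are not taken from the SDF schema or its reference implementation; the real protocol defines its own schema and uses model-based extraction.

```python
import hashlib
import json

# Hypothetical minimal shape for an SDF-like document. The real SDF schema
# is defined by the protocol and is not reproduced here.
REQUIRED_FIELDS = {
    "source_url": str,
    "parent_type": str,
    "summary": str,
    "entities": list,
}


def validate(doc: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the doc passes."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in doc:
            errors.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors


class ConvertOnceCache:
    """Run the (expensive) extractor at most once per URL, then reuse the result."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._store = {}      # cache key -> serialized JSON document
        self.conversions = 0  # how many times the extractor actually ran

    def get(self, url: str, html: str) -> dict:
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        if key not in self._store:
            doc = self._extractor(url, html)
            errors = validate(doc)
            if errors:
                raise ValueError("; ".join(errors))
            self._store[key] = json.dumps(doc, sort_keys=True)
            self.conversions += 1
        return json.loads(self._store[key])


def toy_extractor(url: str, html: str) -> dict:
    """Stand-in for model-based extraction; it only truncates the input."""
    return {
        "source_url": url,
        "parent_type": "article",
        "summary": html[:40],
        "entities": [],
    }


if __name__ == "__main__":
    cache = ConvertOnceCache(toy_extractor)
    first = cache.get("https://example.com/a", "<html>Hello</html>")
    second = cache.get("https://example.com/a", "<html>Hello</html>")
    print(first == second, cache.conversions)
```

In this sketch, the second `get` call for the same URL never touches the extractor, which is where the latency and token savings of a pre-extracted format would come from. A real deployment would also need cache invalidation when the page changes, which is omitted here.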

Files

Name	Size
sdf-whitepaper.pdf (md5:2f6a8bc07c8e229692728a8f289694b0)	185.5 kB

Additional details

Submitted: 2026-02-09

Repository URL: https://github.com/sdfprotocol/sdfprotocol.github.io

	All versions	This version
Views	475	475
Downloads	265	265
Data volume	58.8 MB	58.8 MB
