Published February 22, 2026 | Version v2
Preprint Open

XPathGenie: LLM-Driven Automated XPath Generation with Multi-Page Validation and Two-Tier Refinement

Authors/Creators

  • 1. Independent Researcher

Description

We present XPathGenie, a system that automates XPath mapping generation from raw URLs
using HTML structural compression (~97% reduction), LLM-based inference, multi-page
validation, and two-tier refinement. Unlike per-page LLM extraction systems, XPathGenie invokes
AI once to generate reusable XPath expressions, ensuring zero marginal AI cost per page.
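
To make the "generate once, reuse on every page" idea concrete, the following is a minimal sketch of how a fixed XPath mapping could be applied to several pages and checked for coverage. It is not the system's actual implementation: the `MAPPING` dictionary, the sample pages, and the `extract` and `validate` helpers are all hypothetical, and the check used here (a field counts as covered on a page if its XPath returns non-empty text) is an assumption about what multi-page validation might measure. Python's standard-library ElementTree supports only a limited XPath subset and requires well-formed markup, so real HTML would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping, standing in for what a single LLM call might return.
MAPPING = {
    "title": ".//h1",
    "salary": ".//span[@class='salary']",
    "location": ".//span[@class='loc']",
}

# Tiny well-formed stand-ins for fetched pages (assumed data, not from the paper).
PAGES = [
    "<html><body><h1>Nurse</h1><span class='salary'>300,000 yen</span>"
    "<span class='loc'>Tokyo</span></body></html>",
    "<html><body><h1>Pharmacist</h1><span class='salary'>350,000 yen</span>"
    "</body></html>",
    "<html><body><h1>Caregiver</h1><span class='salary'></span>"
    "<span class='loc'>Osaka</span></body></html>",
]


def extract(page_xml, mapping):
    """Apply every XPath in the mapping to one page; no model call is involved."""
    root = ET.fromstring(page_xml)
    record = {}
    for field, xpath in mapping.items():
        node = root.find(xpath)
        text = "".join(node.itertext()).strip() if node is not None else ""
        record[field] = text or None  # None marks a missing or empty field
    return record


def validate(pages, mapping):
    """Return, per field, the fraction of pages where the XPath yielded text."""
    records = [extract(page, mapping) for page in pages]
    return {
        field: sum(r[field] is not None for r in records) / len(records)
        for field in mapping
    }


coverage = validate(PAGES, MAPPING)
for field, rate in coverage.items():
    print(f"{field}: {rate:.2f}")
```

In this sketch, fields whose coverage falls below some threshold would be the natural candidates for a refinement pass, although the threshold and the refinement strategy are not specified in the description above.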

Evaluation across 23 medical job-listing websites achieved a field-level hit rate of 85.1–87.3%,
with 11 sites at 100%. The hit rate deliberately measures structural extraction stability rather
than semantic accuracy, because the goal is reusable XPath generation rather than value
extraction benchmarking. A supplementary cross-domain evaluation on 10 sites across 5
non-medical domains (e-commerce, real estate, recipe, restaurant reviews, news) achieved a
macro-average hit rate of 79.4%, confirming domain generalizability. An additional
English-language evaluation on 10 sites (3 pages each) across 10 domains achieved a
macro-average hit rate of 78.7% among successful sites (7/10), with GitHub and Quotes to Scrape
reaching 100%, providing preliminary evidence of cross-linguistic applicability. Core-field
analysis reveals that schema-guided extraction primarily expands coverage (+13.1 pp) over
open-ended discovery. A zero-shot evaluation on a subset of the SWDE benchmark (22 sites, 8
verticals, 220 pages) achieved F1 = 0.689 on fields where XPaths were successfully generated,
with 60% of detected fields at perfect F1 = 1.0; automated semantic classification on 400 SWDE
field-value pairs yielded 78.0% semantic accuracy, complementing the 95.0% from manual
evaluation on production sites. The primary bottleneck is field discovery coverage (46%) rather
than extraction accuracy, contrasting with supervised systems like AXE (F1 88.1%) that benefit
from labeled training data. We identify the compression-generation gap, a mismatch between
compressed and raw HTML whitespace, which is resolved via normalize-space() predicates.
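
The whitespace mismatch can be illustrated with a small sketch. An expression written against whitespace-collapsed markup, such as `//dt[text()='Salary']/following-sibling::dd[1]`, can fail on the raw page when the label is surrounded by indentation and newlines, whereas `//dt[normalize-space()='Salary']/following-sibling::dd[1]` still matches. Both expressions and the sample markup are illustrative assumptions, not taken from the paper. Because Python's standard-library ElementTree does not implement XPath functions, the code below emulates XPath 1.0 `normalize-space()` with a hypothetical `normalize_space` helper rather than evaluating the expressions directly.

```python
import re
import xml.etree.ElementTree as ET

# XPath 1.0 whitespace is limited to space, tab, carriage return, and line feed.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Emulate XPath normalize-space(): collapse whitespace runs, trim the ends."""
    return _XPATH_WS.sub(" ", text or "").strip(" ")


# Raw markup as it might arrive from the server (assumed sample, not from the paper).
RAW_HTML = """<dl>
  <dt>
      Salary
  </dt>
  <dd>  300,000 yen  </dd>
</dl>"""


def find_value(root, label, normalize):
    """Return the text of the <dd> that directly follows the <dt> matching label."""
    children = list(root)
    for dt, dd in zip(children, children[1:]):
        if dt.tag != "dt" or dd.tag != "dd":
            continue
        text = dt.text or ""
        if normalize:
            text = normalize_space(text)
        if text == label:
            return normalize_space(dd.text)
    return None


root = ET.fromstring(RAW_HTML)
exact = find_value(root, "Salary", normalize=False)   # exact text comparison
robust = find_value(root, "Salary", normalize=True)   # normalize-space() style

print(exact)   # None: the raw label carries newlines and indentation
print(robust)  # 300,000 yen
```

The exact comparison fails only because of surrounding whitespace, which is the kind of discrepancy a compressed view of the HTML would hide from the model that writes the expression.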

Files

xpathgenie_whitepaper.pdf (686.3 kB)
md5:38172bf1fb8ba7454a2fd6aaed575cb9

Additional details