XPathGenie: LLM-Driven Automated XPath Generation with Multi-Page Validation and Two-Tier Refinement
Description
We present XPathGenie, a system that automates XPath mapping generation from raw URLs
using HTML structural compression (~97% reduction), LLM-based inference, multi-page
validation, and two-tier refinement. Unlike per-page LLM extraction systems, XPathGenie invokes
the LLM once to generate reusable XPath expressions, so the marginal AI cost per page is zero.
Evaluation across 23 medical job-listing websites achieved a field-level hit rate of 85.1–87.3%,
with 11 sites at 100%. The hit rate deliberately measures structural extraction stability rather
than semantic accuracy, because the goal is reusable XPath generation rather than value
extraction benchmarking. A supplementary cross-domain evaluation on 10 sites across 5
non-medical domains (e-commerce, real estate, recipe, restaurant reviews, news) achieved a
macro-average hit rate of 79.4%, confirming domain generalizability. An additional
English-language evaluation on 10 sites (3 pages each) across 10 domains achieved a
macro-average hit rate of 78.7% among successful sites (7/10), with GitHub and Quotes to Scrape
reaching 100%, providing preliminary evidence of cross-linguistic applicability. Core-field
analysis reveals that schema-guided extraction primarily expands coverage (+13.1 pp) over
open-ended discovery. A zero-shot evaluation on a subset of the SWDE benchmark (22 sites, 8
verticals, 220 pages) achieved F1 = 0.689 on fields where XPaths were successfully generated,
with 60% of detected fields at perfect F1 = 1.0. Automated semantic classification of 400 SWDE
field-value pairs yielded 78.0% semantic accuracy, complementing the 95.0% from manual
evaluation on production sites. The primary bottleneck is field discovery coverage (46%) rather
than extraction accuracy, in contrast to supervised systems like AXE (F1 88.1%) that benefit
from labeled training data. We identify the compression-generation gap, a mismatch between
compressed and raw HTML whitespace, and resolve it with normalize-space() predicates.
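The whitespace mismatch behind the compression-generation gap can be illustrated with a minimal sketch. It is a pure-Python emulation of what the XPath function normalize-space() does to a text node, not XPathGenie's own code. The label strings, the variable names, and the reading that an exact-match predicate written against compressed HTML fails on the raw page are assumptions made for illustration.

```python
import re

# XPath 1.0 treats only space, tab, carriage return and line feed as whitespace.
_XPATH_WS = " \t\r\n"


def normalize_space(text: str) -> str:
    """Emulate XPath normalize-space(): collapse whitespace runs, then trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(_XPATH_WS)


# Hypothetical label text: the raw page keeps indentation and newlines,
# while the compressed view shown to the LLM has them stripped.
raw_label = "\n      Salary\n    "
compressed_label = "Salary"

# A predicate such as [text()='Salary'] compares the raw string exactly.
exact_match = raw_label == compressed_label

# A predicate such as [normalize-space()='Salary'] compares after normalization.
normalized_match = normalize_space(raw_label) == compressed_label

print(exact_match, normalized_match)  # False True
```

Under these assumptions, an XPath of the form `//dt[normalize-space()='Salary']` would match the raw page where `//dt[text()='Salary']` would not.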
Files
xpathgenie_whitepaper.pdf (686.3 kB)
md5:38172bf1fb8ba7454a2fd6aaed575cb9
Additional details
Software
- Repository URL: https://github.com/goodsun/XPathGenie