Evaluation Dataset for WCI: The Web Context Interface for Agentic Web Automation
Description
Context & Purpose
This dataset is the official evaluation benchmark for the paper “WCI: The Web Context Interface for Agentic Web Automation” (CIKM 2026 Resource Track). It measures how well Large Language Model (LLM) web agents ground natural-language goals to the correct interactive control across diverse synthetic web interfaces, and how token-efficient structured context is compared with standard DOM-based agent inputs.
The benchmark contrasts unannotated DOM baselines (raw HTML, shallow outlines, scraped candidate lists) with the Web Context Interface (WCI), a cooperative layer in which site operators expose bounded, semantically typed controls via data-wci-* annotations and optional site policy files (wci.txt, wci.json, wci.md). The published manuscript assesses the benefit of good annotations; it is not a head-to-head comparison with equal inputs.
Dataset Composition
The benchmark comprises 50 element-grounding scenarios in scenarios.zip, each packaged as:
- raw.html: noisy, human-oriented DOM (no WCI attributes)
- annotated.html: same DOM tree with a data-wci-* overlay
- meta.json: goals, ground truth, decoys, and task definitions (single-shot + multi-step catalog)
Scenarios fall into two tiers:
Supporting files include manifest.json, Playwright-verified ground-truth.generated.json, and multi-step.generated.json (extended task variants per scenario).
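The per-scenario packaging described above can be sanity-checked after download. The following is a minimal sketch, assuming each scenario sits in its own folder inside scenarios.zip; that folder layout and the helper name group_scenarios are assumptions for illustration, not something the record specifies.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# The three files each scenario is described as shipping with.
REQUIRED = {"raw.html", "annotated.html", "meta.json"}


def group_scenarios(member_names):
    """Group archive member names by parent folder and list missing files.

    Assumes one folder per scenario (an assumption about the archive layout).
    Returns {folder: [missing file names]}; an empty list means complete.
    """
    by_dir = defaultdict(set)
    for name in member_names:
        path = PurePosixPath(name)
        if path.name in REQUIRED:
            by_dir[str(path.parent)].add(path.name)
    return {folder: sorted(REQUIRED - files) for folder, files in by_dir.items()}
```

To apply it to the real archive, pass in the member list, for example `group_scenarios(zipfile.ZipFile("scenarios.zip").namelist())`, and look for folders whose list of missing files is not empty.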
Evaluation Task (Primary: Multi-Step Grounding)
The published results in the WCI manuscript use multi-step task grounding on each scenario’s primary task (*.multi-step.primary in meta.json).
Given a natural-language goal, the model must return:
- A short JSON action plan (observe/reason/act/verify steps), and
- A scored final_action: either a WCI node ID (wciNodeId) or a CSS selector/candidate index for non-WCI baselines.
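To make the expected output shape concrete, here is a hedged sketch of a response parser. Only final_action, wciNodeId, and the four step kinds come from the description above; the keys plan, kind, selector, and candidateIndex are hypothetical stand-ins, and the harness's real schema may differ.

```python
import json

# Step kinds named in the task description.
STEP_KINDS = ("observe", "reason", "act", "verify")
# "wciNodeId" is from the description; the other two keys are hypothetical.
TARGET_KEYS = ("wciNodeId", "selector", "candidateIndex")


def parse_response(text):
    """Parse a model response and return (step kinds, target key, target value)."""
    data = json.loads(text)
    kinds = [step["kind"] for step in data["plan"]]  # "plan"/"kind" are assumed names
    unknown = [kind for kind in kinds if kind not in STEP_KINDS]
    if unknown:
        raise ValueError(f"unknown step kinds: {unknown}")
    final = data["final_action"]
    present = [key for key in TARGET_KEYS if key in final]
    if len(present) != 1:
        raise ValueError("final_action must name exactly one target")
    return kinds, present[0], final[present[0]]
```

Rejecting responses that name zero or several targets keeps the scored choice unambiguous, which is the property the grounding check depends on.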
Unified pass rule (all five input formats): a run passes only if (a) final_action resolves to the ground-truth element in headless Chromium, (b) the choice is not a predefined decoy or WCI competitor trap, and (c) semantic flow coverage against the expected task trajectory is ≥ 0.6 (configurable via --min-coverage).
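The three conditions of the pass rule combine as a simple conjunction. The sketch below is illustrative only: it takes the browser resolution result and the coverage score as precomputed inputs, since how the harness computes them is not described here, and the function and parameter names are hypothetical.

```python
def run_passes(resolves_to_ground_truth, chosen_id, trap_ids, coverage, min_coverage=0.6):
    """Return True only if all three pass conditions hold.

    resolves_to_ground_truth: result of the headless-browser check (condition a).
    chosen_id / trap_ids: the selected element and the decoy or trap identifiers (condition b).
    coverage: semantic flow coverage in [0, 1], computed elsewhere (condition c).
    """
    if not resolves_to_ground_truth:
        return False
    if chosen_id in trap_ids:
        return False
    return coverage >= min_coverage
```

The default of 0.6 mirrors the stated threshold; passing a different min_coverage corresponds to changing --min-coverage.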
This isolates affordance grounding and plan coherence on static pages. It does not measure full closed-loop browsing (dynamic DOM updates, network latency, or live agent toolchains).
A separate single-shot harness in the repository (eval:benchmark) supports one-control picking on meta.task.goal for ablations; it is not the source of the public leaderboard tables in the paper.
Provided Representations
Each scenario supports five contextual representations for the same underlying page and goal:
WCI paths assume a completed annotation pass. Baselines simulate agents without that layer; higher WCI accuracy and lower token usage reflect the value of annotations, not parity in the observation space.
Reported Metrics & Models
Researchers can reproduce manuscript tables using the open eval harness (eval:multistep, eval:merge-leaderboard) and archived reports in the companion repository.
Published multi-step runs (50 scenarios × 5 approaches) include:
- Frontier: GPT-5, Gemini 3.5 Flash
- Mid/efficient: GPT-5 Nano, GPT-OSS 20B
- Compact open models: Qwen 2.5 7B, Llama 3.1 8B
Typical findings on this suite (see paper for full tables):
- WCI grounding: ~82–96% pass rate depending on model
- Raw HTML baseline: ~0–2% on the same multi-step task
- Token reduction: ~5–8× fewer tokens per call for WCI grounding vs raw HTML (~540–810 vs ~4,100–5,900 tokens per scenario)
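As a rough arithmetic check on the token figures, the quoted ranges can be turned into the widest possible ratio bounds. This sketch assumes nothing about which WCI figure pairs with which raw HTML figure; the per-model pairing is given only in the paper's tables.

```python
# Approximate per-scenario token ranges quoted above.
WCI_TOKENS = (540, 810)
RAW_TOKENS = (4100, 5900)


def ratio_bounds(small, large):
    """Smallest and largest ratio obtainable from two (low, high) ranges."""
    lowest = large[0] / small[1]   # smallest raw count over largest WCI count
    highest = large[1] / small[0]  # largest raw count over smallest WCI count
    return lowest, highest


lowest, highest = ratio_bounds(WCI_TOKENS, RAW_TOKENS)
print(f"{lowest:.1f}x to {highest:.1f}x")  # prints "5.1x to 10.9x"
```

The reported ~5–8× reduction falls inside these outer bounds; the exact per-model ratios depend on how the figures pair up.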
Usage & Reproducibility
- Download scenarios.zip and verify selectors: npm run eval:verify
- Run multi-step eval (requires OPENROUTER_API_KEY): npm run eval:multistep
- Merge leaderboard snapshots: npm run eval:merge-leaderboard
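The three steps above can be chained with a small preflight check so that a missing API key is reported before any model calls are made. This is a sketch only: the npm script names come from the list above, while plan_commands and its structure are hypothetical and not part of the harness.

```python
# npm scripts in the order listed above, with the environment variables each needs.
STEPS = [
    ("eval:verify", []),
    ("eval:multistep", ["OPENROUTER_API_KEY"]),
    ("eval:merge-leaderboard", []),
]


def plan_commands(env):
    """Return the npm commands in order, or raise if a required variable is unset."""
    commands = []
    for script, required in STEPS:
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise RuntimeError(f"{script} needs: {', '.join(missing)}")
        commands.append(["npm", "run", script])
    return commands
```

Each returned command can then be passed to `subprocess.run(cmd, check=True)` from the repository root, with `os.environ` supplied as `env`.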
Full methodology, limitations (static fixtures, annotation assumptions, asymmetric inputs), and approach definitions: companion repository evals/README.md and paper §Preliminary Evaluation.
License: CC BY 4.0 (see Zenodo record).
Citation: use the Zenodo DOI 10.5281/zenodo.20434088 and the WCI software citation from the manuscript.
Scope Boundaries (read before citing)
- Synthetic offline fixtures, not production web traffic.
- Annotation quality is assumed; mislabeling and drift are not scored.
- WCI vs baselines use different context budgets and reference flows by design—the gap measures annotation benefit, not “same bits in, same bits out.”
- Results are point-in-time OpenRouter model snapshots; compare models on the same approach, not only blended “standard” vs WCI columns.
Files
| Name | Size | Checksum |
|---|---|---|
| scenarios.zip | 1.2 MB | md5:b6d2616672537b7597faf7fb53de5a66 |
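The MD5 checksum listed for the archive can be checked after download. Below is a minimal sketch using Python's standard hashlib; the local file path is whatever the download was saved as, and the helper names are illustrative.

```python
import hashlib

# Checksum as listed on the record for scenarios.zip.
EXPECTED_MD5 = "b6d2616672537b7597faf7fb53de5a66"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the listed checksum."""
    return md5_of(path) == expected
```

For example, `matches_record("scenarios.zip")` should return True for an intact download. MD5 is used here only because it is the checksum the record publishes; it detects corruption, not tampering.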
Additional details
Software
- Repository URL: https://github.com/amirrezaalasti/webcontextinterface
- Programming language: HTML, JSON
- Development status: Active