Published May 28, 2026 | Version v1
Dataset Open

Evaluation Dataset for WCI: The Web Context Interface for Agentic Web Automation

  • 1. Leibniz University Hannover

Description


Context & Purpose

This dataset is the official evaluation benchmark for the paper “WCI: The Web Context Interface for Agentic Web Automation” (CIKM 2026 Resource Track). It measures how well Large Language Model (LLM) web agents ground natural-language goals to the correct interactive control across diverse synthetic web interfaces, and how much more token-efficient structured context is than standard DOM-based agent inputs.

The benchmark contrasts unannotated DOM baselines (raw HTML, shallow outlines, scraped candidate lists) with the Web Context Interface (WCI), a cooperative layer in which site operators expose bounded, semantically typed controls via data-wci-* annotations and optional site policy files (wci.txt, wci.json, wci.md). The published manuscript assesses the benefit of good annotations; it is not a head-to-head comparison with equal inputs.
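As a rough illustration of what reading the annotation overlay might involve, the sketch below collects every attribute whose name starts with data-wci- from an HTML string using Python's standard html.parser module. The attribute names data-wci-id and data-wci-role in the sample markup are hypothetical placeholders chosen for the example; the actual annotation vocabulary is defined in the companion repository, and this is not the harness's own extraction code.

```python
from html.parser import HTMLParser


class WciAttributeCollector(HTMLParser):
    """Collects data-wci-* attributes per element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.nodes = []  # one dict per annotated element, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        wci = {name: value for name, value in attrs if name.startswith("data-wci-")}
        if wci:
            self.nodes.append({"tag": tag, **wci})


def collect_wci_nodes(html):
    """Return the annotated elements found in an HTML string."""
    parser = WciAttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical markup: the attribute names below are assumptions for this example.
SAMPLE = """
<form>
  <button data-wci-id="pay" data-wci-role="submit">Continue</button>
  <button>Continue</button>
</form>
"""

print(collect_wci_nodes(SAMPLE))
```

Only the first button is returned, because the second carries no data-wci-* attribute. An unannotated page yields an empty list.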

Dataset Composition

The benchmark comprises 50 element-grounding scenarios in scenarios.zip, each packaged as:

  • raw.html — noisy, human-oriented DOM (no WCI attributes)
  • annotated.html — same DOM tree with data-wci-* overlay
  • meta.json — goals, ground truth, decoys, and task definitions (single-shot + multi-step catalog)
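A quick completeness check over an extracted scenario folder could look like the sketch below. It assumes each scenario lives in its own directory containing the three files listed above; the actual layout inside scenarios.zip may differ, so the directory structure, the folder name, and the function name are all assumptions.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_FILES = ("raw.html", "annotated.html", "meta.json")


def missing_scenario_files(scenario_dir):
    """Return the expected file names that are absent from scenario_dir."""
    root = Path(scenario_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory rather than the real archive.
with tempfile.TemporaryDirectory() as tmp:
    scenario = Path(tmp) / "example-scenario"  # hypothetical folder name
    scenario.mkdir()
    (scenario / "raw.html").write_text("<html></html>", encoding="utf-8")
    (scenario / "meta.json").write_text(json.dumps({}), encoding="utf-8")
    demo_missing = missing_scenario_files(scenario)

print(demo_missing)  # annotated.html was never written above
```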

Scenarios fall into two tiers:

Tier | Count | Description
Handmade layouts | 5 | Large, hand-authored pages: flight booking, banking, checkout, admin dashboard, and social feed.
Synthetic layouts | 45 | Domain-specific templates with structural noise, keyword-trap decoys, generic button labels, and constraint-based goals that stress semantic reasoning rather than label matching.

Supporting files include manifest.json, Playwright-verified ground-truth.generated.json, and multi-step.generated.json (extended task variants per scenario).

Evaluation Task (Primary: Multi-Step Grounding)

The published results in the WCI manuscript use multi-step task grounding on each scenario’s primary task (*.multi-step.primary in meta.json).

Given a natural-language goal, the model must return:

  1. A short JSON action plan (observe / reason / act / verify steps), and
  2. A scored final_action — either a WCI node ID (wciNodeId) or a CSS selector/candidate index for non-WCI baselines.

Unified pass rule (all five input formats): a run passes only if (a) final_action resolves to the ground-truth element in headless Chromium, (b) the choice is not a predefined decoy or WCI competitor trap, and (c) semantic flow coverage against the expected task trajectory is ≥ 0.6 (configurable via --min-coverage).
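The three-part rule can be read as a single conjunction. The sketch below is one minimal way to express it, assuming a per-run record with the hypothetical fields shown; the field names and the RunResult type are invented for illustration and are not the harness's schema. Only the 0.6 default comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical per-run record; field names are not the harness schema."""
    resolves_to_ground_truth: bool  # condition (a)
    hit_decoy_or_trap: bool         # condition (b); True means a decoy or trap was chosen
    flow_coverage: float            # condition (c); fraction in [0, 1]


def passes(run, min_coverage=0.6):
    """Apply the three-part pass rule; every condition must hold."""
    return (
        run.resolves_to_ground_truth
        and not run.hit_decoy_or_trap
        and run.flow_coverage >= min_coverage
    )


print(passes(RunResult(True, False, 0.6)))   # coverage exactly at the threshold
print(passes(RunResult(True, False, 0.59)))  # correct element, coverage too low
print(passes(RunResult(True, True, 0.9)))    # flagged as a decoy or trap
```

Because the conditions are combined with a logical AND, a run that selects the right element can still fail on plan coherence alone, which is why the threshold is exposed as a parameter here.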

This isolates affordance grounding and plan coherence on static pages. It does not measure full closed-loop browsing (dynamic DOM updates, network latency, or live agent toolchains).

A separate single-shot harness in the repository (eval:benchmark) supports one-control picking on meta.task.goal for ablations; it is not the source of the public leaderboard tables in the paper.

Provided Representations

Each scenario supports five contextual representations for the same underlying page and goal:

Representation | Description
Raw HTML | Complete, unannotated document (truncated in the eval harness for very large pages).
DOM outline | Shallow structural tree with interactive nodes marked.
Interactive candidates | Numbered list of scraped controls (Mind2Web-style).
WCI full | Full semantic graph from annotated.html (all annotated nodes).
WCI grounding | Distilled, actionable-only view with compact encoding, priority filtering, competitor-trap markers, and eval state patches on selected handmade flows so the scored control is reachable.
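To make the interactive-candidates baseline more concrete, the sketch below builds a numbered list of controls from an HTML string with Python's standard html.parser module. It is an approximation under stated assumptions: the set of tags treated as interactive, the output format, and the zero-based numbering are all choices made for this example, not the harness's actual scraper.

```python
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}  # assumed set


class CandidateScraper(HTMLParser):
    """Records interactive elements and their visible text (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # [tag, text] pairs in document order
        self._open = None     # candidate currently collecting text, if any

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            entry = [tag, ""]
            self.candidates.append(entry)
            # input is a void element, so it never collects text.
            self._open = None if tag == "input" else entry

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data


def numbered_candidates(html):
    """Return one line per control, e.g. '[0] <button> Continue'."""
    scraper = CandidateScraper()
    scraper.feed(html)
    scraper.close()
    return [
        f"[{i}] <{tag}> {' '.join(text.split())}".rstrip()
        for i, (tag, text) in enumerate(scraper.candidates)
    ]


PAGE = '<div><button>Continue</button><a href="/help">Help  center</a><button>Continue</button><input type="text"></div>'
for line in numbered_candidates(PAGE):
    print(line)
```

The two identical "Continue" entries show why generic button labels are hard for this representation: the list alone gives the model nothing to distinguish them.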

WCI paths assume a completed annotation pass. Baselines simulate agents without that layer; higher WCI accuracy and lower token usage reflect the value of annotations, not parity in the observation space.

Reported Metrics & Models

Researchers can reproduce manuscript tables using the open eval harness (eval:multistep, eval:merge-leaderboard) and archived reports in the companion repository.

Published multi-step runs (50 scenarios × 5 approaches) include:

  • Frontier: GPT-5, Gemini 3.5 Flash
  • Mid/efficient: GPT-5 Nano, GPT-OSS 20B
  • Compact open models: Qwen 2.5 7B, Llama 3.1 8B

Typical findings on this suite (see paper for full tables):

  • WCI grounding: ~82–96% pass rate depending on model
  • Raw HTML baseline: ~0–2% on the same multi-step task
  • Token reduction: ~5–8× fewer tokens per call for WCI grounding vs raw HTML (~540–810 vs ~4,100–5,900 tokens per scenario)
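The reduction factor follows from dividing the raw HTML token count by the WCI grounding token count. The short calculation below applies that to the approximate range endpoints quoted above. How the endpoints pair up per model is not stated here, so the pairings are assumptions; the per-model values are in the paper's tables.

```python
# Approximate token ranges quoted in the findings above.
WCI_TOKENS = (540, 810)
RAW_TOKENS = (4100, 5900)


def reduction_factor(raw_tokens, wci_tokens):
    """How many times fewer tokens the WCI grounding view uses."""
    return raw_tokens / wci_tokens


low_pair = reduction_factor(RAW_TOKENS[0], WCI_TOKENS[0])   # smallest raw vs smallest WCI
high_pair = reduction_factor(RAW_TOKENS[1], WCI_TOKENS[1])  # largest raw vs largest WCI
narrowest = reduction_factor(RAW_TOKENS[0], WCI_TOKENS[1])  # smallest raw vs largest WCI

print(round(low_pair, 1), round(high_pair, 1), round(narrowest, 1))  # 7.6 7.3 5.1
```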

Usage & Reproducibility

  1. Download scenarios.zip and verify selectors: npm run eval:verify
  2. Run multi-step eval (requires OPENROUTER_API_KEY): npm run eval:multistep
  3. Merge leaderboard snapshots: npm run eval:merge-leaderboard
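Before running the steps above, the downloaded archive can be checked against the md5 checksum published in the Files section of this record. The sketch below uses Python's standard hashlib module; the function names and the local path scenarios.zip are assumptions about where the file was saved.

```python
import hashlib

EXPECTED_MD5 = "b6d2616672537b7597faf7fb53de5a66"  # from the Files listing of this record


def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large archives need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True when the file's md5 matches the published checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, archive_is_intact("scenarios.zip") returns True only when the local copy matches the published checksum. md5 is suitable here for detecting corrupted downloads, not for security guarantees.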

Full methodology, limitations (static fixtures, annotation assumptions, asymmetric inputs), and approach definitions: companion repository evals/README.md and paper §Preliminary Evaluation.

License: CC BY 4.0 (see Zenodo record).
Citation: use the Zenodo DOI 10.5281/zenodo.20434088 and the WCI software citation from the manuscript.

Scope Boundaries (read before citing)

  • Synthetic offline fixtures, not production web traffic.
  • Annotation quality is assumed; mislabeling and drift are not scored.
  • WCI vs baselines use different context budgets and reference flows by design—the gap measures annotation benefit, not “same bits in, same bits out.”
  • Results are point-in-time OpenRouter model snapshots; compare models on the same approach, not only blended “standard” vs WCI columns.

Files

Name | Size | Checksum
scenarios.zip | 1.2 MB | md5:b6d2616672537b7597faf7fb53de5a66

Additional details

Software

Repository URL: https://github.com/amirrezaalasti/webcontextinterface
Programming language: HTML, JSON
Development Status: Active