Evaluation Dataset for WCI: The Web Context Interface for Agentic Web Automation

Alasti, Amirreza

doi:10.5281/zenodo.20434088

Published May 28, 2026 | Version v1

Dataset Open

Alasti, Amirreza (Researcher)¹

1. Leibniz University Hannover

Context & Purpose

This dataset is the official evaluation benchmark for the paper “WCI: The Web Context Interface for Agentic Web Automation” (CIKM 2026 Resource Track). It measures how well Large Language Model (LLM) web agents ground natural-language goals to the correct interactive control across diverse synthetic web interfaces, and how much more token-efficient structured context is than standard DOM-based agent inputs.

The benchmark contrasts unannotated DOM baselines (raw HTML, shallow outlines, scraped candidate lists) with the Web Context Interface (WCI), a cooperative layer in which site operators expose bounded, semantically typed controls via data-wci-* annotations and optional site policy files (wci.txt, wci.json, wci.md). The published manuscript assesses the benefit of good annotations; it is not a head-to-head comparison with equal inputs.
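To make the annotation layer concrete, the following minimal sketch lists every element in a page that carries at least one data-wci-* attribute, using only Python's standard html.parser module. Only the data-wci- prefix comes from this record; the specific attribute names in the sample markup (data-wci-id, data-wci-role) are invented for illustration and may not match the real WCI vocabulary.

```python
from html.parser import HTMLParser


class WciAttributeCollector(HTMLParser):
    """Collect every element that carries at least one data-wci-* attribute."""

    def __init__(self):
        super().__init__()
        self.annotated = []  # list of (tag, {attribute_name: value})

    def handle_starttag(self, tag, attrs):
        # Keep only attributes whose name starts with the data-wci- prefix.
        wci_attrs = {name: value for name, value in attrs
                     if name.startswith("data-wci-")}
        if wci_attrs:
            self.annotated.append((tag, wci_attrs))


def collect_wci_nodes(html_text):
    """Return the annotated elements found in an HTML string."""
    collector = WciAttributeCollector()
    collector.feed(html_text)
    collector.close()
    return collector.annotated


# Hypothetical markup: the attribute names below are assumptions,
# not taken from the WCI specification.
SAMPLE = """
<div class="toolbar">
  <button class="btn">Continue</button>
  <button class="btn" data-wci-id="pay-now" data-wci-role="action">Continue</button>
</div>
"""

nodes = collect_wci_nodes(SAMPLE)
for tag, wci_attrs in nodes:
    print(tag, wci_attrs)
```

In the sample, two buttons share the same generic label, and only the annotated one is collected. A baseline that sees only the unannotated markup has no such signal to separate them.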

Dataset Composition

The benchmark comprises 50 element-grounding scenarios in scenarios.zip, each packaged as:

raw.html — noisy, human-oriented DOM (no WCI attributes)
annotated.html — same DOM tree with data-wci-* overlay
meta.json — goals, ground truth, decoys, and task definitions (single-shot + multi-step catalog)
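A quick sanity check after unpacking the archive is to confirm that a scenario folder contains the three files listed above. The sketch below assumes each scenario lives in its own directory; the actual layout inside scenarios.zip is not described in this record, and the directory name used in the demo is hypothetical.

```python
import tempfile
from pathlib import Path

# File names taken from the package description above.
EXPECTED_FILES = ("raw.html", "annotated.html", "meta.json")


def missing_scenario_files(scenario_dir):
    """Return the expected file names that are absent from scenario_dir."""
    root = Path(scenario_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo on a throwaway directory; "example-scenario" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    scenario = Path(tmp) / "example-scenario"
    scenario.mkdir()
    (scenario / "raw.html").write_text("<html></html>", encoding="utf-8")
    (scenario / "meta.json").write_text("{}", encoding="utf-8")
    demo_missing = missing_scenario_files(scenario)
    print(demo_missing)
```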

Scenarios fall into two tiers:

Tier	Count	Description
Handmade layouts	5	Large, hand-authored pages: flight booking, banking, checkout, admin dashboard, and social feed.
Synthetic layouts	45	Domain-specific templates with structural noise, keyword-trap decoys, generic button labels, and constraint-based goals that stress semantic reasoning rather than label matching.

Supporting files include manifest.json, Playwright-verified ground-truth.generated.json, and multi-step.generated.json (extended task variants per scenario).

Evaluation Task (Primary: Multi-Step Grounding)

The published results in the WCI manuscript use multi-step task grounding on each scenario’s primary task (*.multi-step.primary in meta.json).

Given a natural-language goal, the model must return:

A short JSON action plan (observe / reason / act / verify steps), and
A scored final_action — either a WCI node ID (wciNodeId) or a CSS selector/candidate index for non-WCI baselines.
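The exact JSON schema is defined by the eval harness and is not reproduced in this record. As a rough illustration only, the sketch below checks a model reply for the two parts named above. The final_action and wciNodeId names come from this description; the plan, kind, and note keys, and the node ID value, are assumptions made for the example.

```python
import json

# Step kinds named in the task description.
STEP_KINDS = {"observe", "reason", "act", "verify"}


def check_model_output(raw_text):
    """Return a list of shape problems; an empty list means none were found."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]

    problems = []
    steps = data.get("plan")  # "plan" is an assumed key name
    if not isinstance(steps, list) or not steps:
        problems.append("missing or empty 'plan' list")
    else:
        for index, step in enumerate(steps):
            kind = step.get("kind") if isinstance(step, dict) else None
            if kind not in STEP_KINDS:
                problems.append(f"step {index} has unknown kind {kind!r}")
    if "final_action" not in data:
        problems.append("missing 'final_action'")
    return problems


# Hypothetical reply for a WCI input format.
EXAMPLE = json.dumps({
    "plan": [
        {"kind": "observe", "note": "two buttons are labelled Continue"},
        {"kind": "reason", "note": "only one belongs to the payment form"},
        {"kind": "act", "note": "select the payment control"},
        {"kind": "verify", "note": "the control matches the stated goal"},
    ],
    "final_action": {"wciNodeId": "pay-now"},
})

print(check_model_output(EXAMPLE))
```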

Unified pass rule (all five input formats): a run passes only if (a) final_action resolves to the ground-truth element in headless Chromium, (b) the choice is not a predefined decoy or WCI competitor trap, and (c) semantic flow coverage against the expected task trajectory is ≥ 0.6 (configurable via --min-coverage).
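The three conditions combine as a plain conjunction. The following sketch expresses that rule in Python, assuming the selector resolution, decoy check, and coverage score have already been computed elsewhere; the class and field names are hypothetical and do not come from the harness.

```python
from dataclasses import dataclass

DEFAULT_MIN_COVERAGE = 0.6  # threshold stated in the pass rule


@dataclass
class RunOutcome:
    """Pre-computed facts about one run (hypothetical field names)."""
    resolves_to_ground_truth: bool  # condition (a)
    is_decoy_or_trap: bool          # condition (b), negated below
    flow_coverage: float            # condition (c), expected range 0.0 to 1.0


def run_passes(outcome, min_coverage=DEFAULT_MIN_COVERAGE):
    """A run passes only if all three conditions hold at once."""
    return (
        outcome.resolves_to_ground_truth
        and not outcome.is_decoy_or_trap
        and outcome.flow_coverage >= min_coverage
    )


# Correct element with a coherent plan: passes.
print(run_passes(RunOutcome(True, False, 0.75)))
# Correct element, but the plan covers too little of the expected flow: fails.
print(run_passes(RunOutcome(True, False, 0.5)))
```

Because the rule is a conjunction, a model that picks the right control with an incoherent plan still fails, which is how the benchmark scores plan coherence alongside grounding.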

This isolates affordance grounding and plan coherence on static pages. It does not measure full closed-loop browsing (dynamic DOM updates, network latency, or live agent toolchains).

A separate single-shot harness in the repository (eval:benchmark) supports one-control picking on meta.task.goal for ablations; it is not the source of the public leaderboard tables in the paper.

Provided Representations

Each scenario supports five contextual representations for the same underlying page and goal:

Representation	Description
Raw HTML	Complete, unannotated document (truncated in the eval harness for very large pages).
DOM outline	Shallow structural tree with interactive nodes marked.
Interactive candidates	Numbered list of scraped controls (Mind2Web-style).
WCI full	Full semantic graph from `annotated.html` (all annotated nodes).
WCI grounding	Distilled, actionable-only view with compact encoding, priority filtering, competitor-trap markers, and eval state patches on selected handmade flows so the scored control is reachable.

WCI paths assume a completed annotation pass. Baselines simulate agents without that layer; higher WCI accuracy and lower token usage reflect the value of annotations, not parity in the observation space.

Reported Metrics & Models

Researchers can reproduce manuscript tables using the open eval harness (eval:multistep, eval:merge-leaderboard) and archived reports in the companion repository.

Published multi-step runs (50 scenarios × 5 approaches) include:

Frontier: GPT-5, Gemini 3.5 Flash
Mid/efficient: GPT-5 Nano, GPT-OSS 20B
Compact open models: Qwen 2.5 7B, Llama 3.1 8B

Typical findings on this suite (see paper for full tables):

WCI grounding: ~82–96% pass rate depending on model
Raw HTML baseline: ~0–2% on the same multi-step task
Token reduction: ~5–8× fewer tokens per call for WCI grounding vs raw HTML (~540–810 vs ~4,100–5,900 tokens per scenario)

Usage & Reproducibility

Download scenarios.zip and verify selectors: npm run eval:verify
Run multi-step eval (requires OPENROUTER_API_KEY): npm run eval:multistep
Merge leaderboard snapshots: npm run eval:merge-leaderboard
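Before running the steps above, it can be worth confirming that the downloaded archive matches the MD5 checksum given in the Files listing of this record. The sketch below uses Python's standard hashlib module; the local file path is an assumption about where the download was saved.

```python
import hashlib

# Checksum as published in the Files listing of this record.
EXPECTED_MD5 = "b6d2616672537b7597faf7fb53de5a66"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("scenarios.zip")` would be called from the directory holding the download. MD5 is suitable here only as a transfer-integrity check, not as protection against deliberate tampering.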

Full methodology, limitations (static fixtures, annotation assumptions, asymmetric inputs), and approach definitions: companion repository evals/README.md and paper §Preliminary Evaluation.

License: CC BY 4.0 (see Zenodo record).
Citation: use the Zenodo DOI 10.5281/zenodo.20434088 and the WCI software citation from the manuscript.

Scope Boundaries (read before citing)

Synthetic offline fixtures, not production web traffic.
Annotation quality is assumed; mislabeling and drift are not scored.
WCI vs baselines use different context budgets and reference flows by design—the gap measures annotation benefit, not “same bits in, same bits out.”
Results are point-in-time OpenRouter model snapshots; compare models on the same approach, not only blended “standard” vs WCI columns.

Files

Files (1.2 MB)

Name	Size
scenarios.zip (md5:b6d2616672537b7597faf7fb53de5a66)	1.2 MB

Additional details

Repository URL: https://github.com/amirrezaalasti/webcontextinterface
Programming language: HTML, JSON
Development Status: Active
