Published March 29, 2026 | Version v1.0

Murrough-Foley/web-content-extraction-benchmark: WCXB v1.0: Web Content Extraction Benchmark

Description

WCXB v1.0

The first public release of the Web Content Extraction Benchmark.

Contents

  • 2,008 web pages from 1,613 domains across 7 page types
  • 1,497-page development set + 511-page held-out test set
  • Ground truth annotations (title, author, date, main content, with/without snippets)
  • Page type labels: article, forum, product, collection, listing, documentation, service
  • Gzipped HTML source files
  • Standalone evaluation script
  • Metadata with page types, domains, and split assignments
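Since the HTML sources ship gzipped, a page has to be decompressed before it can be passed to an extractor. The following is a minimal sketch using only the Python standard library. The function name `read_gzipped_html` and the example path are hypothetical; the release notes do not describe the archive's directory layout or file naming, so adjust the path to whatever the unpacked zip actually contains.

```python
import gzip
from pathlib import Path


def read_gzipped_html(path, encoding="utf-8"):
    """Return the decompressed HTML text stored in a gzip file.

    Hypothetical helper: not part of the benchmark's own tooling.
    """
    with gzip.open(Path(path), "rb") as fh:
        raw = fh.read()
    # Assumption: pages are UTF-8. errors="replace" avoids a crash on
    # pages containing bytes that are not valid in the chosen encoding.
    return raw.decode(encoding, errors="replace")


# Example call (the path is a placeholder, not a real file in the archive):
# html = read_gzipped_html("html/example_page.html.gz")
```

Reading the bytes first and decoding afterwards keeps the encoding choice explicit, which is useful if some pages turn out not to be UTF-8.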

Page Types

| Type | Dev | Test |
|------|----:|-----:|
| Article | 793 | 257 |
| Service | 165 | 59 |
| Product | 119 | 28 |
| Collection | 117 | 34 |
| Forum | 113 | 51 |
| Listing | 99 | 40 |
| Documentation | 91 | 42 |
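As a quick consistency check, the per-type counts in the table above can be summed and compared with the split sizes listed under Contents. This is only an illustrative sketch: the `COUNTS` dictionary is typed in by hand from the table, and the `share` helper is a hypothetical name, not something provided by the evaluation script.

```python
# Per-type page counts copied from the table: (dev, test)
COUNTS = {
    "Article": (793, 257),
    "Service": (165, 59),
    "Product": (119, 28),
    "Collection": (117, 34),
    "Forum": (113, 51),
    "Listing": (99, 40),
    "Documentation": (91, 42),
}

dev_total = sum(dev for dev, _ in COUNTS.values())
test_total = sum(test for _, test in COUNTS.values())
grand_total = dev_total + test_total


def share(page_type):
    """Fraction of all pages (dev + test) that belong to one page type."""
    dev, test = COUNTS[page_type]
    return (dev + test) / grand_total


# The totals should match the figures in the Contents list.
assert dev_total == 1497
assert test_total == 511
assert grand_total == 2008

for name in COUNTS:
    print(f"{name}: {share(name):.1%} of all pages")
```

The shares make the class imbalance visible: articles account for roughly half of the pages, so per-type scores are likely more informative than a single pooled average.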

License

CC-BY-4.0

Files

Murrough-Foley/web-content-extraction-benchmark-v1.0.zip (84.3 MB)