Published March 29, 2026 | Version v1.0
Software
Open
Murrough-Foley/web-content-extraction-benchmark: WCXB v1.0: Web Content Extraction Benchmark
Authors/Creators
Description
WCXB v1.0
The first public release of the Web Content Extraction Benchmark.
Contents
- 2,008 web pages from 1,613 domains across 7 page types
- 1,497-page development set + 511-page held-out test set
- Ground truth annotations (title, author, date, main content, with/without snippets)
- Page type labels: article, forum, product, collection, listing, documentation, service
- Gzipped HTML source files
- Standalone evaluation script
- Metadata with page types, domains, and split assignments
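Since the HTML sources ship gzipped, a page has to be decompressed before it can be passed to an extractor. The following is a minimal sketch of that step using only the Python standard library. The function name `read_gzipped_html`, the example path, and the UTF-8 default are assumptions for illustration; the archive's actual directory layout and file naming are not described here.

```python
import gzip


def read_gzipped_html(path, encoding="utf-8"):
    """Return the decompressed HTML text stored in a gzip file.

    Undecodable bytes are replaced rather than raising, because the
    encoding of each page is an assumption here.
    """
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
        return handle.read()


# Hypothetical usage; the path below is a placeholder, not a real file name:
# html = read_gzipped_html("html/example_page.html.gz")
```

Opening in text mode (`"rt"`) keeps the decoding in one place; pass a different `encoding` if a page is known not to be UTF-8.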
Page Types
| Type | Dev | Test |
|------|----:|-----:|
| Article | 793 | 257 |
| Service | 165 | 59 |
| Product | 119 | 28 |
| Collection | 117 | 34 |
| Forum | 113 | 51 |
| Listing | 99 | 40 |
| Documentation | 91 | 42 |
License
CC-BY-4.0
Files
| Name | Size | Checksum |
|---|---|---|
| Murrough-Foley/web-content-extraction-benchmark-v1.0.zip | 84.3 MB | md5:806decb4547860a8dc93653ecf2e794b |
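The MD5 checksum listed for the archive can be used to confirm that a download completed intact. Below is a minimal sketch using Python's `hashlib`; the helper names and the local file name in the usage comment are assumptions, and only the expected digest comes from this record.

```python
import hashlib

# Digest taken from the file listing in this record.
EXPECTED_MD5 = "806decb4547860a8dc93653ecf2e794b"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


# Hypothetical usage; the local file name is an assumption:
# ok = verify_archive("web-content-extraction-benchmark-v1.0.zip")
```

Reading in chunks keeps memory use flat for the 84.3 MB archive. MD5 here guards against corrupted or truncated downloads, not against deliberate tampering.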
Additional details
Related works
- Is supplement to
  - Software: https://github.com/Murrough-Foley/web-content-extraction-benchmark/tree/v1.0 (URL)