A Benchmark Dataset for Handwritten Document Layout Analysis in the Wild
Authors/Creators
Description
Handwritten document layout analysis remains a challenging task due to the high variability in writing styles, page structures, and document degradations. Existing datasets often lack sufficient layout diversity, as they are typically sourced from homogeneous collections with similar structural patterns. This limitation hinders the development of robust models capable of generalizing to real-world scenarios. To address this issue, we introduce a new dataset of handwritten documents collected from Wikimedia Commons, representing a broad spectrum of historical and modern documents with varying layouts, languages, and writing conditions. Each document is annotated for layout analysis, with identified page segments and corresponding labels. While the dataset is not intended for large-scale model training, it serves as a valuable benchmark for evaluating layout analysis methods and identifying generalization challenges. By prioritizing layout diversity, this dataset provides a realistic testbed for advancing handwritten document segmentation and structural analysis, ultimately contributing to the development of more adaptable and reliable document processing systems.
Files
| Name | Size | Checksum |
|---|---|---|
| HDLA-in-the-wild.zip | 18.3 MB | md5:2baa62c43d4a681d377b828e73d50e1e |
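The MD5 checksum listed for the archive can be used to confirm that a download arrived intact. The following is a minimal sketch, assuming the archive has been saved locally under its listed name `HDLA-in-the-wild.zip`; the helper names `md5_of_file` and `verify_download` are hypothetical and are not part of the dataset or of any tooling published with it.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the record page for HDLA-in-the-wild.zip.
EXPECTED_MD5 = "2baa62c43d4a681d377b828e73d50e1e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    archive = Path("HDLA-in-the-wild.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.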
Additional details
Dates
- Accepted: 2025-04