Published May 5, 2025 | Version v1
Dataset Open

A Benchmark Dataset for Handwritten Document Layout Analysis in the Wild

Authors/Creators

Description

Handwritten document layout analysis remains a challenging task due to the high variability in writing styles, page structures, and document degradation. Existing datasets often lack sufficient layout diversity, as they are typically sourced from homogeneous collections with similar structural patterns. This limitation hinders the development of robust models that generalize to real-world scenarios. To address this issue, we introduce a new dataset of handwritten documents collected from Wikimedia Commons, representing a broad spectrum of historical and modern documents with varying layouts, languages, and writing conditions. Each document is annotated for layout analysis: page segments are identified and assigned corresponding labels. While the dataset is not intended for large-scale model training, it serves as a benchmark for evaluating layout analysis methods and identifying generalization challenges. By prioritizing layout diversity, this dataset provides a realistic testbed for advancing handwritten document segmentation and structural analysis, ultimately contributing to the development of more adaptable and reliable document processing systems.

 

Files

HDLA-in-the-wild.zip

Files (18.3 MB)

md5:2baa62c43d4a681d377b828e73d50e1e (18.3 MB)
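
The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch, assuming the archive has been saved locally under its listed name, HDLA-in-the-wild.zip. The function names md5_of_file and verify_archive are illustrative and are not part of the dataset.

```python
import hashlib

# Checksum published on the record page for HDLA-in-the-wild.zip.
EXPECTED_MD5 = "2baa62c43d4a681d377b828e73d50e1e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns empty bytes at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling verify_archive("HDLA-in-the-wild.zip") returns True when the local file matches the published checksum. The local path is an assumption about where the download was saved. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.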

Additional details

Dates

Accepted
2025-04