A Benchmark Dataset for Handwritten Document Layout Analysis in the Wild

Gardella, Marina

doi:10.5281/zenodo.15313600

Published May 5, 2025 | Version v1

Dataset | Open access

Handwritten document layout analysis remains a challenging task due to the high variability in writing styles, page structures, and document degradations. Existing datasets often lack sufficient layout diversity, as they are typically sourced from homogeneous collections with similar structural patterns. This limitation hinders the development of robust models capable of generalizing to real-world scenarios. To address this issue, we introduce a new dataset of handwritten documents collected from Wikimedia Commons, representing a broad spectrum of historical and modern documents with varying layouts, languages, and writing conditions. Each document is annotated for layout analysis, with identified page segments and corresponding labels. While the dataset is not intended for large-scale model training, it serves as a valuable benchmark for evaluating layout analysis methods and identifying generalization challenges. By prioritizing layout diversity, this dataset provides a realistic testbed for advancing handwritten document segmentation and structural analysis, ultimately contributing to the development of more adaptable and reliable document processing systems.

Files

Name	Size	MD5
HDLA-in-the-wild.zip	18.3 MB	2baa62c43d4a681d377b828e73d50e1e
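The MD5 checksum listed for HDLA-in-the-wild.zip can be used to confirm that a downloaded copy is intact. The following is a minimal sketch using only the Python standard library; the local file path and the helper names `md5_of_file` and `verify_archive` are assumptions made for illustration and are not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum copied from the file listing for HDLA-in-the-wild.zip.
EXPECTED_MD5 = "2baa62c43d4a681d377b828e73d50e1e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location of the downloaded archive (hypothetical path).
    archive = Path("HDLA-in-the-wild.zip")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of archive size. MD5 is adequate here for detecting a corrupted or truncated download, not as a security guarantee.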

Additional details

Accepted: 2025-04

Metric	All versions	This version
Views	167	167
Downloads	25	25
Data volume	475.0 MB	475.0 MB
