TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

Kostelník, Martin; Hradiš, Michal; Beneš, Karel

doi:10.5281/zenodo.15746283

Published June 26, 2025 | Version 1.0.2

Dataset Open

1. Brno University of Technology

TextBite is a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. It is aimed mainly at logical page segmentation but can be used for other tasks as well. Part of the dataset consists of handwritten documents, primarily records from schools and public organizations, which pose extra segmentation challenges because their layouts are more loosely structured.

In total, the dataset contains 8,449 annotated pages, of which 7,346 are printed and 1,103 are handwritten. The pages contain a total of 78,863 segments. The test subset contains 964 pages, of which 185 are handwritten. The annotations are provided in an extended COCO format. Each segment is represented by a set of axis-aligned bounding boxes connected by directed relationships that encode the reading order. To include these relationships in the COCO format, a new top-level key, relations, is added. Each relation entry specifies a source and a target bounding box.
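As a rough illustration of how the relations could be used, the sketch below groups bounding boxes into segments by following the directed relations from each box that has no predecessor. It is not taken from the dataset's own tooling: the key names "source" and "target", and the assumption that they hold annotation ids, are guesses based on the description above and should be checked against the actual annotation files.

```python
from collections import defaultdict


def segments_in_reading_order(coco):
    """Group annotation ids into segments by following directed relations.

    Assumption: each entry in coco["relations"] is a dict with "source" and
    "target" keys holding annotation ids. These key names are hypothetical.
    """
    box_ids = [ann["id"] for ann in coco["annotations"]]
    successors = defaultdict(list)
    has_predecessor = set()
    for rel in coco.get("relations", []):
        successors[rel["source"]].append(rel["target"])
        has_predecessor.add(rel["target"])

    segments = []
    visited = set()
    # A segment starts at every box that no relation points to.
    for start in box_ids:
        if start in has_predecessor:
            continue
        segment, stack = [], [start]
        while stack:
            current = stack.pop()
            if current in visited:
                continue
            visited.add(current)
            segment.append(current)
            # Reversed so that the first listed successor is visited first.
            stack.extend(reversed(successors[current]))
        segments.append(segment)
    return segments


# Toy input, not real dataset content: boxes 1 -> 2 -> 3 form one segment,
# box 4 is a segment on its own.
toy = {
    "annotations": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}],
    "relations": [{"source": 1, "target": 2}, {"source": 2, "target": 3}],
}
print(segments_in_reading_order(toy))  # [[1, 2, 3], [4]]
```

Boxes that lie only on a cycle of relations have no starting point and are skipped by this sketch; reading-order annotations would not normally contain cycles, but that is also an assumption.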

In addition to the layout annotations, we provide a textual representation of the pages produced by the Optical Character Recognition (OCR) tool PERO-OCR. These come as XML files in the PAGE-XML format, which include an enclosing polygon for each text line along with its transcription and confidence. We also provide the OCR results in the ALTO format, which includes polygons for the individual words in the page image.
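A minimal sketch of reading the text lines from a PAGE-XML file with Python's standard library follows. It relies on the general PAGE-XML layout (TextLine elements with a Coords child carrying a points attribute and a TextEquiv/Unicode child carrying the transcription). Reading the confidence from a conf attribute on TextEquiv is an assumption about how the files in this dataset store it, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(page_xml):
    """Return (polygon, text, confidence) for every TextLine in a PAGE-XML string."""
    root = ET.fromstring(page_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        polygon, text, confidence = [], "", None
        for child in element:
            name = local_name(child.tag)
            if name == "Coords":
                # "x1,y1 x2,y2 ..." -> [(x1, y1), (x2, y2), ...]
                polygon = [
                    tuple(int(float(v)) for v in point.split(","))
                    for point in child.get("points", "").split()
                ]
            elif name == "TextEquiv":
                # Assumption: line confidence is stored in the "conf" attribute.
                conf = child.get("conf")
                confidence = float(conf) if conf is not None else None
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        lines.append((polygon, text, confidence))
    return lines


# Made-up sample document, not taken from the dataset.
sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="50">
    <TextRegion id="r0">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="r0-l0">
        <Coords points="5,5 95,5 95,20 5,20"/>
        <TextEquiv conf="0.97"><Unicode>Noviny</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for polygon, text, confidence in read_text_lines(sample):
    print(text, confidence, polygon)
```

For a file on disk, the same function can be called with the file's contents, for example read_text_lines(open(path, encoding="utf-8").read()).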

Files

Files (12.0 GB)

Name	MD5	Size
models.zip	1db7273cde93395123498d6f58e30a58	77.3 MB
textbite-dataset.zip	883ffea9f869c21644b95db4820c340d	11.7 GB
textbite-test-labels.zip	66d1953fa385e2b2d1994c77d5fa151e	218.3 MB
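Since the largest archive is 11.7 GB, it may be worth checking a download against the listed MD5 checksums before unpacking. The following is a small sketch using Python's hashlib; the helper names md5_of_file and verify are illustrative, and the local paths are whatever location the archives were saved to.

```python
import hashlib

# MD5 checksums as listed in the file table above.
EXPECTED_MD5 = {
    "models.zip": "1db7273cde93395123498d6f58e30a58",
    "textbite-dataset.zip": "883ffea9f869c21644b95db4820c340d",
    "textbite-test-labels.zip": "66d1953fa385e2b2d1994c77d5fa151e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

A call such as verify("textbite-dataset.zip", "textbite-dataset.zip") returns False for a truncated or corrupted download, in which case the archive should be fetched again.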

Additional details

Is supplement to: Publication: arXiv:2503.16664 (arXiv)

Ministry of Culture
NAKI III project semANT - Semantic Document Exploration DH23P03OVV060

Repository URL: https://github.com/DCGM/textbite-dataset
Programming language: Python
