Published June 26, 2025 | Version 1.0.2
Dataset Open

TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

  • 1. Brno University of Technology

Description

TextBite is a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. It is aimed primarily at logical page segmentation, but can be used for other tasks as well. The handwritten part of the dataset consists primarily of records from schools and public organizations, which introduce extra segmentation challenges because of their more loosely structured layouts.

In total, the dataset contains 8,449 annotated pages, of which 7,346 are printed and 1,103 are handwritten. The pages contain a total of 78,863 segments. The test subset contains 964 pages, of which 185 are handwritten. The annotations are provided in an extended COCO format. Each segment is represented by a set of axis-aligned bounding boxes connected by directed relationships that encode reading order. To include these relationships in the COCO format, a new top-level key "relations" is added. Each relation entry specifies a source and a target bounding box.
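The relationship structure described above can be turned back into segments by following the directed links from box to box. The following is a minimal sketch, not the dataset's official loader. It assumes that each entry under "relations" is a dictionary with "source" and "target" fields holding annotation ids (these field names are hypothetical; check the actual files), and that every box has at most one successor. The function name group_segments and the example data are illustrative only.

```python
def group_segments(coco: dict) -> list[list[int]]:
    """Group annotation ids into segments, each listed in reading order.

    Assumption: coco["relations"] is a list of dicts with hypothetical
    "source" and "target" keys that hold annotation ids, and each box
    has at most one outgoing relation.
    """
    ids = [ann["id"] for ann in coco.get("annotations", [])]

    successor = {}           # box id -> id of the next box in reading order
    has_predecessor = set()  # ids that are the target of some relation
    for rel in coco.get("relations", []):
        successor[rel["source"]] = rel["target"]
        has_predecessor.add(rel["target"])

    segments = []
    for start in ids:
        if start in has_predecessor:
            continue  # not the first box of a segment
        chain, seen, current = [], set(), start
        # Follow the links; the `seen` set guards against malformed cycles.
        while current is not None and current not in seen:
            chain.append(current)
            seen.add(current)
            current = successor.get(current)
        segments.append(chain)
    return segments


# Toy example: boxes 1 and 2 form one segment, box 3 stands alone.
example = {
    "annotations": [
        {"id": 1, "image_id": 7, "bbox": [10, 10, 200, 50]},
        {"id": 2, "image_id": 7, "bbox": [10, 70, 200, 120]},
        {"id": 3, "image_id": 7, "bbox": [250, 10, 200, 300]},
    ],
    "relations": [{"source": 1, "target": 2}],
}
print(group_segments(example))  # [[1, 2], [3]]
```

With a real annotation file, the dictionary would come from json.load on the file instead of the toy example. Boxes that take part in no relation come out as single-box segments.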

In addition to the layout annotations, we provide a textual representation of the pages produced by the optical character recognition (OCR) tool PERO-OCR. These come as XML files in the PAGE-XML format, which include an enclosing polygon for each individual text line along with the transcriptions and their confidences. We also provide the OCR results in the ALTO format, which includes polygons for individual words in the page image.
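As an illustration of how the line-level OCR output might be read, here is a minimal sketch that extracts the polygon, transcription, and confidence of each text line from a PAGE-XML document using only the Python standard library. It assumes the usual PAGE layout (TextLine elements with a Coords child carrying a "points" attribute and a TextEquiv child wrapping a Unicode element) and that the confidence sits in the "conf" attribute of TextEquiv. Whether the files in this dataset store confidences exactly that way is an assumption. The function names and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace, so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(xml_text: str) -> list[dict]:
    """Return one record per TextLine: id, polygon, text, confidence."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        record = {"id": elem.get("id"), "polygon": [], "text": "", "conf": None}
        for child in elem:
            name = local(child.tag)
            if name == "Coords":
                # "points" is a space-separated list of "x,y" pairs.
                points = child.get("points", "")
                record["polygon"] = [
                    tuple(int(float(v)) for v in pair.split(","))
                    for pair in points.split()
                ]
            elif name == "TextEquiv":
                conf = child.get("conf")  # assumed location of the confidence
                record["conf"] = float(conf) if conf is not None else None
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        record["text"] = sub.text or ""
        lines.append(record)
    return lines


# Hand-written sample document, not taken from the dataset.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r0">
      <Coords points="0,0 500,0 500,100 0,100"/>
      <TextLine id="r0-l0">
        <Coords points="10,10 400,10 400,40 10,40"/>
        <TextEquiv conf="0.97"><Unicode>Ukázkový řádek</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for line in read_text_lines(SAMPLE):
    print(line["id"], line["conf"], line["text"], line["polygon"][0])
```

Matching on local tag names rather than a fixed namespace URI is a deliberate choice here, since the schema version used by the dataset's files is not stated in this description.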

Files

models.zip

Files (12.0 GB)

Checksum Size
md5:1db7273cde93395123498d6f58e30a58 77.3 MB
md5:883ffea9f869c21644b95db4820c340d 11.7 GB
md5:66d1953fa385e2b2d1994c77d5fa151e 218.3 MB

Additional details

Related works

Is supplement to
Publication: arXiv:2503.16664 (arXiv)

Funding

Ministry of Culture
NAKI III project semANT - Semantic Document Exploration DH23P03OVV060

Software

Repository URL
https://github.com/DCGM/textbite-dataset
Programming language
Python