Published February 1, 2018 | Version 1.2.0
Dataset Open

HTR Dataset ICFHR 2016

  • 1. Pattern Recognition and Human Language Technologies Research Center

Contributors

Researcher:

  • 1. PRHLT

Description

This dataset arises from the READ project (Horizon 2020).

The dataset consists of a subset of documents from the Ratsprotokolle collection, which comprises the minutes of council meetings held from 1470 to 1805 (about 30,000 pages) and will be used in the READ project. The documents are written in Early Modern German. The number of writers is unknown. The handwriting in this collection is complex enough to challenge HTR software.

The training set comprises 400 pages; most pages consist of a single text block that poses many difficulties for line detection and extraction. The ground truth for this set is provided in PAGE format, annotated at line level.
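As an illustration of what line-level PAGE ground truth looks like in practice, the following is a minimal sketch that reads the line polygons and transcriptions from a PAGE XML document using only the Python standard library. The sample document, the function name read_lines and the identifiers in it are hypothetical; the element names (TextLine, Coords, TextEquiv, Unicode) follow the PAGE schema, and the namespace is ignored because it differs between schema versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature PAGE document, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" type="paragraph">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="r1l1">
        <Coords points="0,0 100,0 100,20 0,20"/>
        <TextEquiv><Unicode>example transcription</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_lines(page_xml):
    """Return one dict per TextLine with its id, polygon and transcription."""
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        points = None
        text = None
        # Only direct children are inspected, so nested word-level
        # transcriptions (if any) do not overwrite the line-level one.
        for child in elem:
            name = local_name(child.tag)
            if name == "Coords":
                points = child.get("points")
            elif name == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        lines.append({"id": elem.get("id"), "points": points, "text": text})
    return lines


for line in read_lines(SAMPLE):
    print(line["id"], line["points"], line["text"])
```

For a file on disk, the same traversal can start from ET.parse(path).getroot() instead of ET.fromstring.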

The earlier dataset is the same as the one available at https://zenodo.org/record/218236#.WnLhaCHhBGF

The new file includes the test set used in the HTR competition held at ICFHR 2016.

Notes

Main updates in Version 1.2.0 (Author: Lorenzo Quirós):

1) TextRegions have been labeled with four different structural types (page-number, marginalia, paragraph and heading).
2) The surrounding polygons of some TextRegions have been modified to avoid overlaps between regions, as well as oversized and undersized regions.
3) Spurious regions have been deleted.
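To inspect the structural labels, the following is a minimal sketch that counts TextRegion elements by type. It assumes the label is stored in the type attribute that the PAGE schema defines for TextRegion; this record does not state where the labels are stored, so that is an assumption, and the sample document and the function name region_type_counts are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature PAGE document, for illustration only.
REGION_SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" type="page-number"><Coords points="0,0 10,0 10,5 0,5"/></TextRegion>
    <TextRegion id="r2" type="paragraph"><Coords points="0,10 90,10 90,60 0,60"/></TextRegion>
    <TextRegion id="r3" type="paragraph"><Coords points="0,60 90,60 90,95 0,95"/></TextRegion>
    <TextRegion id="r4"><Coords points="90,10 100,10 100,60 90,60"/></TextRegion>
  </Page>
</PcGts>"""


def region_type_counts(page_xml):
    """Count TextRegion elements per value of their 'type' attribute.

    Assumption: the structural label lives in the 'type' attribute.
    Regions without that attribute are counted as 'unlabeled'.
    """
    root = ET.fromstring(page_xml)
    counts = Counter()
    for elem in root.iter():
        # Compare the tag name without its '{namespace}' prefix.
        if elem.tag.rsplit("}", 1)[-1] == "TextRegion":
            counts[elem.get("type", "unlabeled")] += 1
    return counts


print(dict(region_type_counts(REGION_SAMPLE)))
```

If the files store the label elsewhere (for example in a custom attribute), only the elem.get call would need to change.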

Files

Files (551.9 MB)

Checksum Size
md5:894063b8abbe2f671dbb07a144738300 58.6 MB
md5:654f2d2c62055f1847f65ba83bd5d744 493.2 MB
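
The md5 values listed above can be used to check that a download is intact. The following is a minimal sketch using Python's hashlib; the function names and the local path in the usage comment are hypothetical, since the file names are not given in this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local path):
# matches_checksum("downloaded_file", "md5:894063b8abbe2f671dbb07a144738300")
```

Reading in chunks keeps memory use constant, which matters for the larger of the two files.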

Additional details

Funding

READ – Recognition and Enrichment of Archival Documents 674943
European Commission