Dataset for ICDAR2017 Competition on Handwritten Text Recognition on the READ Dataset (ICDAR2017 HTR)

Sánchez, Joan Andreu; Romero, Verónica; Toselli, Alejandro H.; Villegas, Mauricio; Vidal, Enrique

doi:10.5281/zenodo.835489

Published July 27, 2017 | Version v1

Dataset Open

1. PRHLT, Universitat Politècnica de València, Spain

Train-A: Pages with manually revised baselines and their associated transcripts. This batch is small: 50 pages. Please keep in mind that only the baselines have been manually corrected; the polygons associated with each line have not been manually reviewed.

Train-B: Pages without any layout or text-line information. The corresponding transcripts are provided at page level, with line breaks. It has 10k pages, split for convenience into two batches of 5k pages each. This information is provided in PAGE format.
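As an illustration of how line-level annotations in a PAGE file could be read, here is a minimal Python sketch. It assumes the usual PAGE XML element names (TextLine, Baseline, TextEquiv/Unicode); the inline sample, its namespace version, and the helper names parse_points and read_lines are hypothetical and not taken from the dataset. Wildcard namespace matching requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical PAGE-like sample; real files in the archives may use a
# different schema version and will contain many more regions and lines.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 990,10 990,400 10,400"/>
      <TextLine id="r1l1">
        <Coords points="20,40 900,40 900,90 20,90"/>
        <Baseline points="20,80 450,82 900,85"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn a PAGE 'points' attribute ("x1,y1 x2,y2 ...") into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(xml_text):
    """Return one dict per TextLine with its id, baseline points, and transcript."""
    root = ET.fromstring(xml_text)
    lines = []
    # "{*}" matches any namespace, so the schema version does not matter.
    for line in root.findall(".//{*}TextLine"):
        baseline = line.find("{*}Baseline")
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        lines.append({
            "id": line.get("id"),
            "baseline": parse_points(baseline.get("points")) if baseline is not None else [],
            "text": (unicode_el.text or "") if unicode_el is not None else "",
        })
    return lines


if __name__ == "__main__":
    for entry in read_lines(SAMPLE):
        print(entry["id"], entry["baseline"], entry["text"])
```

For Train-B, which carries no text-line information, read_lines would return an empty list; how the page-level transcripts are stored there should be checked against the actual files.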

Test-A: Pages with manually revised baselines. This batch has 65 pages. The polygons associated with each line have not been manually reviewed.

Test-B1: The same pages as Test-A, but annotated only with the geometry of regions. Text-line information is not provided.

Test-B2: Page images annotated with the geometry of the regions in which text lines are to be detected and recognized. It has 57 pages.

Baseline.tgz: Baseline system trained on the first 40 pages of Train-A. The system is based on Laia, a deep learning toolkit for transcribing handwritten text images.
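The 40-page training subset could be reproduced along the following lines. This is only a sketch: the record does not say how "first" is defined, so sorting page identifiers lexicographically is an assumption, as are the function name split_train_a, the placeholder page names, and the idea of holding out the remaining 10 pages.

```python
def split_train_a(page_ids, n_train=40):
    """Split page identifiers into the first n_train pages and the remainder.

    Assumption: "first" means first in sorted (lexicographic) order.
    """
    ordered = sorted(page_ids)
    return ordered[:n_train], ordered[n_train:]


if __name__ == "__main__":
    # Placeholder identifiers; the real Train-A file names will differ.
    pages = [f"page_{i:04d}" for i in range(1, 51)]
    train, held_out = split_train_a(pages)
    print(len(train), len(held_out))  # 40 10
```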

More information at:

https://scriptnet.iit.demokritos.gr/competitions/~icdar2017htr/

Files (4.0 GB)

Name	MD5	Size
Baseline.tgz	5ef6d6d9a1be6785686559d6f8c9b67a	22.1 MB
Test-A.tgz	f989a3f056d1b830564594a576b4dc75	70.9 MB
Test-B1.tgz	6bea580c2fdcae850041738bc03d8c1c	70.8 MB
Test-B2.tgz	0bea41d3beab30431fdb3ad01f5929ab	48.0 MB
Train-A.tbz2	e46c7019f8ac639b796ecb8d872fd481	21.4 MB
Train-B_batch1.tbz2	e11b9d0cb97169d64069268a23e90ef2	1.9 GB
Train-B_batch2.tbz2	93ea0b7285f65c8438155e9490c691ed	1.9 GB
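
The MD5 checksums listed above can be used to check downloaded archives for corruption. A minimal Python sketch follows; the checksum values are copied from the file list, while the function names md5sum and verify and the assumption that the archives sit in the current directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 values as listed in this record.
EXPECTED_MD5 = {
    "Baseline.tgz": "5ef6d6d9a1be6785686559d6f8c9b67a",
    "Test-A.tgz": "f989a3f056d1b830564594a576b4dc75",
    "Test-B1.tgz": "6bea580c2fdcae850041738bc03d8c1c",
    "Test-B2.tgz": "0bea41d3beab30431fdb3ad01f5929ab",
    "Train-A.tbz2": "e46c7019f8ac639b796ecb8d872fd481",
    "Train-B_batch1.tbz2": "e11b9d0cb97169d64069268a23e90ef2",
    "Train-B_batch2.tbz2": "93ea0b7285f65c8438155e9490c691ed",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5sum(path) == want
    return results


if __name__ == "__main__":
    # Assumption: the archives were downloaded into the current directory.
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Reading in chunks keeps memory use flat, which matters for the two 1.9 GB Train-B archives.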

Additional details

European Commission
READ - Recognition and Enrichment of Archival Documents (grant 674943)

