Published March 11, 2021 | Version 1
Dataset Open

NewsEye / READ OCR training dataset from Finnish Newspapers (18th, 19th, early 20th C.)

  • 1. University of Innsbruck
  • 2. READ-COOP

Description

The dataset comprises Finnish newspaper pages from the late 18th to the early 20th century with carefully corrected text. The page images were provided by the National Library of Finland (NLF) and comprise 526 pages (training set) and 8 pages (validation set). The data follow the PAGE format (cf. https://github.com/PRImA-Research-Lab/PAGE-XML/) and were produced with the Transkribus platform with support from the NewsEye and READ projects.
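As an illustration of how the corrected text can be read from a PAGE file, the following is a minimal sketch using only the Python standard library. It assumes the usual PAGE layout in which each TextLine element carries its transcription in a direct TextEquiv/Unicode child. The sample document, its ids and its text are hypothetical and simplified (for instance, Coords elements are omitted); they are not taken from the dataset. Element names are matched without their namespace because the PAGE namespace URI differs between schema versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PAGE document used only for illustration.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Word id="w1"><TextEquiv><Unicode>Suomen</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Suomen sanomat</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <TextEquiv><Unicode>Helsingissä</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_text_lines(page_xml):
    """Return (line id, text) pairs for every TextLine in a PAGE document.

    Only the TextEquiv that is a direct child of the TextLine is used, so
    word-level and region-level transcriptions are ignored.
    """
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        text = ""
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    text = sub.text or ""
        lines.append((elem.get("id"), text))
    return lines


if __name__ == "__main__":
    for line_id, text in extract_text_lines(SAMPLE_PAGE_XML):
        print(line_id, text)
```

For files from the archive, the same function could be applied to the contents of each PAGE XML file; how the files are arranged inside the zip is not described here and would need to be checked after unpacking.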

Files

ATR_TrainingSet_NLF_Newseye_GT_FI_M2+.zip

Files (5.5 GB)

Name Size
md5:34094df0d9695e239e7d72b2268fcdb3
5.4 GB
md5:ef9d3823c1a175b2cb501e07f2e943c2
77.2 MB
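Because the larger download is several gigabytes, it can be worth checking a downloaded file against the published MD5 checksums before unpacking it. The sketch below is one way to do that with Python's hashlib, reading the file in chunks so it does not need to fit in memory. The function names and the chunk size are illustrative choices, and the path passed to is_published_file is whatever local name the download was saved under.

```python
import hashlib

# Checksums as published in the file listing above.
PUBLISHED_MD5 = (
    "34094df0d9695e239e7d72b2268fcdb3",  # 5.4 GB file
    "ef9d3823c1a175b2cb501e07f2e943c2",  # 77.2 MB file
)


def md5_of_chunks(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_chunks(path, chunk_size=1 << 20):
    """Yield the contents of a file in chunks of at most chunk_size bytes."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def is_published_file(path):
    """Check whether a local file matches one of the published checksums."""
    return md5_of_chunks(read_chunks(path)) in PUBLISHED_MD5
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.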

Additional details

Funding

European Commission
NewsEye - NewsEye: A Digital Investigator for Historical Newspapers (grant no. 770299)
European Commission
READ - Recognition and Enrichment of Archival Documents (grant no. 674943)