Published September 14, 2020 | Version 1.0
Dataset Open

Dataset of ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset

  • 1. Institute of Mathematics, CITlab, University of Rostock

Description

This is the data for the ICPR 2020 paper "ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset".

The data is taken from the NewsEye project and consists of historical newspaper pages (partially binarized) from the 19th and 20th centuries, provided by the Austrian National Library, i.e., mainly German-language newspapers. The newspapers made available for this competition comprise the titles "Arbeiter Zeitung", "Illustrierte Kronen Zeitung", "Innsbrucker Nachrichten" and "Neue Freie Presse".

The data is split into two tracks: a simple track with newspaper pages containing only continuous text (40 pages training data, 10 pages test data) and a complex track with pages that additionally include tables, images or advertisements (40 pages training data, 10 pages test data).

The training data (simple_pages_train.zip, complex_pages_train.zip) contains a set of scanned pages. Furthermore, for every image we provide the coordinates of the baselines, the corresponding text of the lines and the text regions marking the text blocks in the well-established PAGE XML format. Additionally, all baselines lying within the same block share a block ID that is unique to that block and is stored in the so-called "custom tag".
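As a rough illustration of how the baselines and line text might be read from such a file, here is a minimal Python sketch using only the standard library. It assumes the usual PAGE XML layout (a `Baseline` element with a `points` attribute and the text in `TextEquiv/Unicode`) and integer coordinates. The miniature page, its namespace version and the helper names `local_name`, `parse_points` and `read_lines` are hypothetical and not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn a PAGE 'points' string such as '10,20 110,22' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(xml_text):
    """Return (line id, baseline points, text) for every TextLine element."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        baseline, text = [], ""
        for child in elem:
            name = local_name(child.tag)
            if name == "Baseline":
                baseline = parse_points(child.get("points", ""))
            elif name == "TextEquiv":
                # The line text sits in a nested <Unicode> element.
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        result.append((elem.get("id"), baseline, text))
    return result


# Hypothetical miniature page; the schema version in xmlns is an assumption.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="5,5 300,5 300,60 5,60"/>
      <TextLine id="tl_1">
        <Baseline points="10,20 110,22"/>
        <TextEquiv><Unicode>Wien, 1. Mai.</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

lines = read_lines(SAMPLE)
```

Matching on the local tag name keeps the sketch independent of the exact PAGE schema version declared in a given file.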

Please note that a text block captures a whole paragraph and that the block outlines enclose the text very closely. Headlines are marked separately and blocks do not span columns. Furthermore, images can be ignored since they (usually) do not contain baselines, and tables and framed advertisements are handled as single text blocks.

The following is a snippet of a PAGE XML file in which the baseline with ID "tl_223" forms a block together with all other lines with the block ID "a7":

    <TextLine id="tl_223" primaryLanguage="German" custom="readingOrder {index:5;} structure {id:a7; type:article;}">

The type description "article" in the custom tag is a result of the NewsEye project. In the context of this competition, an article simply means a text block.
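The block ID can be pulled out of the custom tag with a small amount of string handling. The following Python sketch groups line IDs by block ID. The regular expression is one possible reading of the custom-tag syntax shown in the snippet, and the lines "tl_222" and "tl_224" as well as the helper names `block_id` and `group_lines_by_block` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Matches the "id:" entry inside the "structure {...}" part of a custom tag.
STRUCTURE_ID = re.compile(r"structure\s*\{[^}]*?\bid:\s*([^;}\s]+)")


def block_id(custom):
    """Return the block ID stored in a custom tag, or None if there is none."""
    match = STRUCTURE_ID.search(custom or "")
    return match.group(1) if match else None


def group_lines_by_block(xml_text):
    """Map each block ID to the IDs of the text lines that belong to it."""
    root = ET.fromstring(xml_text)
    blocks = defaultdict(list)
    for elem in root.iter():
        # Compare local names so that any PAGE namespace is ignored.
        if elem.tag.rsplit("}", 1)[-1] == "TextLine":
            blocks[block_id(elem.get("custom"))].append(elem.get("id"))
    return dict(blocks)


# Hypothetical fragment built around the "tl_223" line from the description.
SAMPLE = """<Page>
  <TextRegion id="r1">
    <TextLine id="tl_222" custom="readingOrder {index:4;} structure {id:a7; type:article;}"/>
    <TextLine id="tl_223" primaryLanguage="German" custom="readingOrder {index:5;} structure {id:a7; type:article;}"/>
    <TextLine id="tl_224" custom="readingOrder {index:0;} structure {id:a8; type:article;}"/>
  </TextRegion>
</Page>"""

groups = group_lines_by_block(SAMPLE)
```

Lines without a block ID, as in the test data without ground truth, end up under the key `None`.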

The test data comes in two versions: one with the corresponding ground truth (simple_pages_test_gt.zip, complex_pages_test_gt.zip) and one without it (simple_pages_test.zip, complex_pages_test.zip). In our context, ground truth means the ideal output of a system, generated by humans.

For each sample in the test data there is an image of the scanned newspaper page with its corresponding PAGE XML file.
In the case without ground truth, the PAGE XML files contain the baselines (without any block IDs), the text and only a single text region surrounding the whole page. The single region should be ignored but is necessary because the PAGE XML format requires that every line is assigned to a region.
In the case with ground truth, the PAGE XML files again contain the text regions marking the text blocks, and the baselines of a block again share the same block ID. The passwords for extracting the ground truth test data are "icpr2020!tb_simple" for the simple track and "icpr2020!tb_complex" for the complex track.
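One way to unpack the protected archives from a script is Python's `zipfile` module, sketched below. Note the assumption: `zipfile` can only decrypt the legacy ZIP encryption scheme, so if the archives turn out to use a different method (for example AES), an external tool would be needed instead. The helper name `extract_ground_truth` is hypothetical.

```python
import zipfile
from pathlib import Path

# Passwords as stated in the dataset description, keyed by archive name.
PASSWORDS = {
    "simple_pages_test_gt.zip": b"icpr2020!tb_simple",
    "complex_pages_test_gt.zip": b"icpr2020!tb_complex",
}


def extract_ground_truth(archive_path, target_dir):
    """Extract a ground truth archive and return the sorted member names."""
    archive_path = Path(archive_path)
    password = PASSWORDS[archive_path.name]  # KeyError for unknown archives
    with zipfile.ZipFile(archive_path) as archive:
        # Assumption: the archive uses an encryption zipfile can decrypt.
        archive.extractall(path=target_dir, pwd=password)
        return sorted(archive.namelist())
```

A call such as `extract_ground_truth("simple_pages_test_gt.zip", "simple_gt")` would then place the files in the `simple_gt` directory, which is created if it does not exist.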

Files

Files (211.0 MB)

Checksum                                Size
md5:358b73bb8de5ddb40a6bd38e256761b3    17.7 MB
md5:86303488837c1267c9439f6827689c4c    18.1 MB
md5:27aeccb6d25d20548dbe4ae172e13dad    80.4 MB
md5:abd1d9c93a7293371362047fa9548e3e    17.4 MB
md5:b9e9f53cb93bc955d4d6204db2a73a28    18.0 MB
md5:bda093ea671bd628478d6ead16ebeffb    59.4 MB
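A downloaded archive can be checked against its MD5 checksum with a few lines of Python, sketched here with the standard `hashlib` module. The listing above does not state which checksum belongs to which file, so the pairing of file and checksum has to be taken from the download page. The helper names `md5_of_file` and `matches_listed_checksum` are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use small even for the larger archives.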

Additional details

Funding

NewsEye: A Digital Investigator for Historical Newspapers (grant no. 770299)
European Commission