Dataset of ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset
- 1. Institute of Mathematics, CITlab, University of Rostock
Description
This is the data for the ICPR 2020 paper "ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset".
The data is taken from the NewsEye project and consists of historical newspaper pages (partially binarized) from the 19th and 20th centuries, provided by the Austrian National Library; the newspapers are predominantly in German. The newspapers made available for this competition comprise the titles "Arbeiter Zeitung", "Illustrierte Kronen Zeitung", "Innsbrucker Nachrichten" and "Neue Freie Presse".
The data is split into two tracks: a simple track containing newspaper pages with continuous text only (40 pages training data, 10 pages test data) and a complex track containing pages that additionally include tables, images or advertisements (40 pages training data, 10 pages test data).
The training data (simple_pages_train.zip, complex_pages_train.zip) contains a set of scanned pages. For every image, we also provide the coordinates of the baselines, the text of the lines and the text regions marking the text blocks, all in the well-established PAGE XML format. Additionally, all baselines lying within the same block share a block ID, which is stored in the so-called "custom tag".
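As a rough illustration of how the per-line information could be read, the following sketch collects the ID, custom attribute, baseline and text of every line in a PAGE XML file. It assumes the usual PAGE layout (`TextLine` elements with a `Baseline` child carrying a `points` attribute and a `TextEquiv`/`Unicode` child holding the text) and integer coordinates; the function names are hypothetical and not part of the dataset.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn a 'points' string such as '10,20 30,40' into [(10, 20), (30, 40)].

    Assumes integer pixel coordinates.
    """
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_text_lines(xml_source):
    """Return one dict per TextLine with its id, custom attribute, baseline and text.

    xml_source may be a file path or a binary file object.
    """
    root = ET.parse(xml_source).getroot()
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        baseline, text = [], ""
        for child in elem:
            name = local_name(child.tag)
            if name == "Baseline":
                baseline = parse_points(child.get("points", ""))
            elif name == "TextEquiv":
                # The line transcription sits in a nested Unicode element.
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        lines.append({
            "id": elem.get("id"),
            "custom": elem.get("custom", ""),
            "baseline": baseline,
            "text": text,
        })
    return lines
```

Matching on the local tag name rather than a fixed namespace is a deliberate simplification, since the exact PAGE schema version used in the archives is not stated here.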
Please note that a text block captures a whole paragraph and that the block outlines enclose the text very closely. Headlines are marked separately and blocks do not span columns. Furthermore, images can be ignored since they (usually) do not contain baselines, and tables and framed advertisements are handled as single text blocks.
The following is a snippet of a PAGE XML file in which the baseline with ID "tl_223" forms a block together with all other lines with the block ID "a7":
<TextLine id="tl_223" primaryLanguage="German" custom="readingOrder {index:5;} structure {id:a7; type:article;}">
The type description "article" in the custom tag is a result of the NewsEye project. In the context of this competition, an article simply means a text block.
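The block ID can be pulled out of the custom attribute with a small amount of string handling. The sketch below is one possible way to do this and to group line IDs by block; it assumes the custom attribute always follows the `structure {id:...; type:...;}` pattern shown in the snippet, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches the body of the "structure {...}" entry in a custom attribute.
STRUCTURE_RE = re.compile(r"structure\s*\{([^}]*)\}")


def block_id(custom):
    """Return the block ID stored in a custom attribute, or None if there is none."""
    match = STRUCTURE_RE.search(custom or "")
    if match is None:
        return None
    for field in match.group(1).split(";"):
        key, sep, value = field.partition(":")
        if sep and key.strip() == "id":
            return value.strip()
    return None


def group_by_block(lines):
    """Group (line_id, custom) pairs into {block_id: [line_id, ...]}."""
    blocks = defaultdict(list)
    for line_id, custom in lines:
        blocks[block_id(custom)].append(line_id)
    return dict(blocks)


custom = "readingOrder {index:5;} structure {id:a7; type:article;}"
print(block_id(custom))  # a7
```

Lines without a structure entry, as in the test data without ground truth, end up under the key `None`.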
The test data comes in two versions: one with (simple_pages_test_gt.zip, complex_pages_test_gt.zip) and one without (simple_pages_test.zip, complex_pages_test.zip) the corresponding ground truth. In our context, ground truth means the ideal system output as produced by human annotators.
For each sample in the test data there is an image of the scanned newspaper page with its corresponding PAGE XML file.
In the version without ground truth, the PAGE XML files contain the baselines (without any block IDs), the text and only a single text region surrounding the whole page. This single region should be ignored; it is present only because the PAGE XML format requires every line to be assigned to a region.
In the version with ground truth, the PAGE XML files again contain the text regions marking the text blocks, and baselines within the same block again share a block ID. The passwords for extracting the ground truth test data are "icpr2020!tb_simple" for the simple track and "icpr2020!tb_complex" for the complex track.
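For scripted extraction, a minimal sketch using Python's standard `zipfile` module might look as follows. Note that `zipfile` only supports the legacy ZIP encryption scheme; whether the ground truth archives use that scheme is an assumption, and if they are AES-encrypted an external tool would be needed instead. The function name and the `track` argument are hypothetical.

```python
import zipfile

# Passwords as stated in the dataset description.
PASSWORDS = {
    "simple": "icpr2020!tb_simple",
    "complex": "icpr2020!tb_complex",
}


def extract_ground_truth(archive_path, track, target_dir):
    """Extract a ground truth archive for the given track and return its member names."""
    password = PASSWORDS[track].encode("utf-8")
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(path=target_dir, pwd=password)
        return archive.namelist()
```

A call such as `extract_ground_truth("simple_pages_test_gt.zip", "simple", "gt_simple")` would then unpack the simple-track ground truth into the directory `gt_simple`.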
Files
(211.0 MB)
| Checksum | Size |
|---|---|
| md5:358b73bb8de5ddb40a6bd38e256761b3 | 17.7 MB |
| md5:86303488837c1267c9439f6827689c4c | 18.1 MB |
| md5:27aeccb6d25d20548dbe4ae172e13dad | 80.4 MB |
| md5:abd1d9c93a7293371362047fa9548e3e | 17.4 MB |
| md5:b9e9f53cb93bc955d4d6204db2a73a28 | 18.0 MB |
| md5:bda093ea671bd628478d6ead16ebeffb | 59.4 MB |
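The listed MD5 checksums can be used to check that a download is intact. The following is a minimal sketch using Python's `hashlib`; the function names are hypothetical, and the expected value has to be taken from the listing entry that belongs to the downloaded file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use low even for the larger archives.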