There is a newer version of the record available.

Published March 12, 2021 | Version 1
Dataset Open

NewsEye / READ AS training dataset from French Newspapers (19th, early 20th C.)

  • 1. University of Innsbruck
  • 2. READ-COOP

Description

The dataset comprises French newspaper pages from the 19th and early 20th century with annotated text. The page images were provided by the French National Library and comprise 184 pages (training set). The data are formatted according to the PAGE format (cf. https://github.com/PRImA-Research-Lab/PAGE-XML/) and were produced with the Transkribus platform with support from the NewsEye and READ projects. The guidelines for creating the AS GT were added to the 'Additional notes'.

Notes

Article GT guidelines for NewsEye (as of March 2020)

Article or 'news item'

- An article or news item is defined as a piece of content that can be clearly separated from other, similar pieces by its content. It therefore comprises not only "articles" but also advertisements, classified advertisements and other contributions within a newspaper.
- The PAGE XML contains, for each line, the custom tag 'structure' with type 'article' and the id of the individual article. All lines with the same id belong to the same article.

There are five different types of regions: TextRegion, Graphic/ImageRegion, TableRegion, AdvertRegion/ClassifiedAdvertRegion, SeparatorRegion. In detail:

- The TextRegions are located at the text block level. The individual TextRegions do not overlap.
- If text blocks also appear within a Graphic/ImageRegion, a TextRegion was created for them.
- For tables, two regions were created: a TableRegion and a TextRegion of the same size (within a table, no text blocks need to be marked).
- An AdvertRegion covers not only advertising but also general/classified advertisements (e.g. death ads).
- Only the visible separators (both horizontal and vertical) are captured by a SeparatorRegion.

Additionally, structure tags for TextRegions are defined as paragraph, heading, caption, enumeration:

- TextRegions that mark ordinary blocks of text are tagged with the 'paragraph' tag.
- The definition of 'heading' should be self-explanatory. Subheadings are also marked as 'heading'; a single tag is sufficient, so no additional structure tag was introduced.
- 'Captions' are found beneath images and graphics.
- Enumerations: the individual text blocks are marked, and in addition a text region with the structure type 'enumeration' was drawn around the entire enumeration.
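The line-level article tagging described in these guidelines can be read back with a few lines of Python. The following is only a minimal sketch under stated assumptions: it assumes the 'structure' tag is serialized in each TextLine's `custom` attribute in the form `structure {id:a1; type:article;}`, and the embedded sample XML (namespace version, ids, file name) is hypothetical and not taken from the dataset. The helper names `parse_structure` and `group_lines_by_article` are likewise illustrative.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample page; real files in the dataset may use a different
# PAGE namespace version and different ids or attribute details.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1" custom="structure {type:heading;}">
      <TextLine id="l1" custom="readingOrder {index:0;} structure {id:a1; type:article;}"/>
    </TextRegion>
    <TextRegion id="r2" custom="structure {type:paragraph;}">
      <TextLine id="l2" custom="readingOrder {index:0;} structure {id:a1; type:article;}"/>
      <TextLine id="l3" custom="readingOrder {index:1;} structure {id:a2; type:article;}"/>
    </TextRegion>
    <SeparatorRegion id="s1"/>
  </Page>
</PcGts>
""".strip()

# Assumed serialization of the custom tag: structure {key:value; key:value;}
STRUCTURE_RE = re.compile(r"structure\s*\{([^}]*)\}")


def parse_structure(custom):
    """Return the key/value pairs of the 'structure' tag, or {} if absent."""
    match = STRUCTURE_RE.search(custom or "")
    if match is None:
        return {}
    fields = {}
    for part in match.group(1).split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def local_name(tag):
    """Strip the XML namespace so the code does not depend on its version."""
    return tag.rsplit("}", 1)[-1]


def group_lines_by_article(xml_text):
    """Map each article id to the ids of its TextLines, in document order."""
    root = ET.fromstring(xml_text)
    articles = defaultdict(list)
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        fields = parse_structure(elem.get("custom"))
        if fields.get("type") == "article" and "id" in fields:
            articles[fields["id"]].append(elem.get("id"))
    return dict(articles)


if __name__ == "__main__":
    for article_id, line_ids in group_lines_by_article(SAMPLE_PAGE_XML).items():
        print(article_id, line_ids)
```

Matching on local element names rather than a fixed namespace is a deliberate choice here, since the PAGE schema version used in the files is not stated in this record.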

Files

AS_TrainingSet_BnF_NewsEye.zip

Files (2.3 GB)

AS_TrainingSet_BnF_NewsEye.zip (2.3 GB)
md5:531d5a0b8da1b6a050705e20b12c9d90
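The MD5 checksum listed for the archive can be used to check a download for corruption. The following is a minimal sketch, assuming the archive has been saved locally under its listed file name; the function names `md5_of_file` and `matches_expected` are illustrative. The file is read in chunks so that the 2.3 GB archive does not have to fit in memory.

```python
import hashlib

# Checksum as listed in this record for AS_TrainingSet_BnF_NewsEye.zip.
EXPECTED_MD5 = "531d5a0b8da1b6a050705e20b12c9d90"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

A call such as `matches_expected("AS_TrainingSet_BnF_NewsEye.zip")` should return True for an intact download; the local path is an assumption about where the file was saved.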

Additional details

Funding

European Commission
NewsEye: A Digital Investigator for Historical Newspapers (grant no. 770299)
European Commission
READ: Recognition and Enrichment of Archival Documents (grant no. 674943)