Published October 11, 2021 | Version 1.0
Dataset Open

Dataset for Logical-layout analysis on French historical newspapers

  • 1. Centre Tesnière - CRIT, Université de Franche-Comté

Contributors

Data manager:

  • 1. CRIT, Université de Franche-Comté

Description


This dataset is intended for training and testing Logical Layout Analysis and recognition systems on French historical documents published between 1900 and 1950. The original data is part of the "Fond régional: Franche-Comté", which is curated by Gallica, the digital portal of the Bibliothèque Nationale de France (BnF). This dataset has the following structure:

├── train
  ├── 1c
    ├── cb32836282t
      ├── cb32836282t.xml
      ├── bpt6k112325g
        ├── bpt6k112325g.xml
        ├── truelabels_block.csv
        ├── truelabels_line.csv
      ├── …
    ├── …
  ├── 2c
  ├── 3c+
└── test
  ├── 1c
  ├── 2c
  └── 3c+

The dataset is divided into a train set and a test set. Both sets were designed to cover as many as possible of the layouts that exist in the "Fond régional: Franche-Comté" dataset. To do so, we have divided them into three layout types:
  • 1c: documents where the text is displayed in one column, as in books;
  • 2c: documents where the text is displayed in two columns;
  • 3c+: documents where there are at least 3 columns of text, as in newspapers.

Each of the 1c, 2c, and 3c+ folders contains subfolders prefixed by « cb », each of which contains a collection of documents. For instance, « cb32836282t » is the identifier used in Gallica for « Le Petit écho du 21e Régiment d'infanterie », a French military periodical published during WWI. An XML file with the same name, for instance « cb32836282t.xml », contains metadata about the collection, such as its title, publisher, creator, number of issues, etc. This XML file serves only to describe the collection, and is not to be used for Logical-Layout analysis.
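The folder layout shown in the tree above can be walked with a few lines of Python. This is a minimal sketch, assuming the archive has been extracted to a local folder (the root path passed to the function is hypothetical) and that collection and issue folders follow the « cb » and « bpt » prefixes shown in the tree. The function name iter_issues is illustrative and not part of the dataset.

```python
from pathlib import Path


def iter_issues(root):
    """Yield (split, layout, collection_id, issue_dir) for every issue folder.

    `root` is the folder that contains `train` and `test` (assumed location
    of the extracted archive).
    """
    root = Path(root)
    for split in ("train", "test"):
        for layout in ("1c", "2c", "3c+"):
            layout_dir = root / split / layout
            if not layout_dir.is_dir():
                continue
            # Collections are the subfolders prefixed with "cb".
            for collection_dir in sorted(layout_dir.glob("cb*")):
                if not collection_dir.is_dir():
                    continue
                # Issues are the subfolders prefixed with "bpt".
                for issue_dir in sorted(collection_dir.glob("bpt*")):
                    if issue_dir.is_dir():
                        yield split, layout, collection_dir.name, issue_dir
```

Each yielded issue folder is expected to hold the issue's XML file and its two label files, so a training loop could iterate over iter_issues(root) and filter on the split value.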

The issues in each collection can be found in the subfolders prefixed with « bpt ». For instance, « bpt6k112325g » is the identifier used in Gallica for an issue of « Le Petit écho du 21e Régiment d'infanterie » published in September 1917. The information about each issue is given in three files, which are described below:

1-bptXXXXXXXXXX.xml
The original data, as collected from Gallica. The most important tags of this document and their values are described below:
  • oai: metadata about the document, such as its author, title, publisher, original publication date, number of issues, …
  • image_url: the URL of the document’s scan (in high resolution)
  • pagination: a description of each page in the document (size of the page, whether it contains a table of contents or not, …)
  • num_pages: the total number of pages in the document
  • ocr: the OCR representation of the document in the XML ALTO format

The XML ALTO format provides the text content and physical layout of documents in the following manner. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags. Sometimes, TextBlock tags are also grouped into ComposedBlock tags. TextBlock and TextLine tags have the following attributes:
  • Id : the tag’s identifier
  • Height, Width : the text height and width
  • Vpos : the vertical position of the text on the page. The higher the value, the lower the text is on the page
  • Hpos : the horizontal position of the text on the page. The higher the value, the further to the right the text is on the page
  • Language : the language of the text (only for TextBlock tags).

In addition to the attributes listed above, some TextBlock tags also have a Type attribute. This attribute contains the logical labels of the lines in the block. In this dataset it appears most often for tables or advertisements. Overall, TextBlock tags that have a Type attribute are rare in this dataset (only about 4 %).
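The nesting described above (TextBlock, then TextLine, then String) can be read with the Python standard library. The sketch below rests on several assumptions that should be checked against the actual files: that the word text is stored in a CONTENT attribute of each String tag, as in the ALTO standard; that attribute names may differ in case from the spelling used above, hence the case-insensitive lookup; and that the OCR is available as parsed XML elements. The sample document and its identifiers are invented for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def attr(elem, name):
    """Case-insensitive attribute lookup (handles 'Vpos' as well as 'VPOS')."""
    for key, value in elem.attrib.items():
        if key.lower() == name.lower():
            return value
    return None


def iter_text_lines(alto_root):
    """Yield one dict per TextLine, with its parent block id and position."""
    for block in alto_root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        for line in block:
            if local_name(line.tag) != "TextLine":
                continue
            # Words are String children; SP children only mark spaces.
            words = [attr(s, "content") or ""
                     for s in line if local_name(s.tag) == "String"]
            yield {
                "block_id": attr(block, "id"),
                "block_type": attr(block, "type"),  # None for most blocks
                "line_id": attr(line, "id"),
                "hpos": attr(line, "hpos"),
                "vpos": attr(line, "vpos"),
                "text": " ".join(words),
            }


# Invented sample; identifiers and coordinates are hypothetical.
sample = """<alto><Layout><Page><PrintSpace>
<TextBlock ID="TB1" HPOS="10" VPOS="20" WIDTH="300" HEIGHT="40">
<TextLine ID="TL1" HPOS="10" VPOS="20" WIDTH="300" HEIGHT="40">
<String CONTENT="Le"/><SP/><String CONTENT="Petit"/><SP/><String CONTENT="écho"/>
</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"""

lines = list(iter_text_lines(ET.fromstring(sample)))
```

Positions are returned as raw strings so that missing attributes do not raise an error; they can be converted to numbers when sorting lines by Vpos and Hpos.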

Note: The original scan of every document is accessible on the Gallica website, using the URL https://gallica.bnf.fr/ark:/12148/<IDENTIFIER>, where <IDENTIFIER> should be replaced by the id of the document (e.g.: bpt6k112325g) or the collection (e.g.: cb32836282t).
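The URL pattern in the note can be applied programmatically. This is a small sketch; the function name gallica_url is illustrative, and the code only builds the address without checking that the page exists.

```python
GALLICA_ARK_BASE = "https://gallica.bnf.fr/ark:/12148/"


def gallica_url(identifier):
    """Build the Gallica URL for an issue id (bpt...) or a collection id (cb...)."""
    return GALLICA_ARK_BASE + identifier.strip()


issue_url = gallica_url("bpt6k112325g")
collection_url = gallica_url("cb32836282t")
```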

2-truelabels_block.csv
A CSV file where each line corresponds to a TextBlock tag from the file bptXXXXXXXXXX.xml. This CSV file contains the following columns:
  • page: the page on which the TextBlock tag is located
  • block_id: the id of the TextBlock tag
  • first_last_line: the text content of the first and last TextLine tags inside this TextBlock tag
  • classes: the logical label(s) associated with this TextBlock tag

The possible values in the column classes are: Text, Title, Header and Other.
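A block-level label file can be loaded with the standard csv module, for example to inspect the class distribution of an issue. The sketch assumes a comma-separated, UTF-8 encoded file whose header row uses the column names listed above; the delimiter and encoding are assumptions to verify on the real files. Cells that hold several labels are counted as a single value, since the way multiple labels are encoded is not specified here.

```python
import csv
from collections import Counter


def read_labels(path):
    """Read a truelabels CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def class_distribution(rows):
    """Count how many rows carry each value of the `classes` column."""
    return Counter(row["classes"] for row in rows)
```

For instance, class_distribution(read_labels(path)) applied to the truelabels_block.csv file of an issue returns a Counter mapping each class value to its number of blocks.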

3-truelabels_line.csv
A CSV file where each line corresponds to a TextLine tag from the file bptXXXXXXXXXX.xml. This CSV file contains the following columns:
  • page: the page where the TextLine tag is located
  • block_id: the id of the TextBlock tag that contains this TextLine tag
  • line_id: the id of this TextLine tag
  • text_line: the text content of this TextLine tag
  • classes: the logical label(s) associated with this TextLine tag

The possible values in the column classes are: Text, Firstline, Title, Header and Other. Firstline indicates the « first line » of a paragraph.
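Because Firstline marks the start of a paragraph, line-level labels can be used to rebuild paragraphs. The sketch below assumes that rows are given in reading order and that a paragraph consists of one Firstline row followed by the consecutive Text rows after it; this grouping rule is an assumption, not a rule stated by the dataset. Text rows with no preceding Firstline (for example a paragraph continued from a previous page) are skipped, and words hyphenated across lines are not rejoined. The sample rows are invented.

```python
def group_paragraphs(rows):
    """Group line rows into paragraphs: a Firstline row plus the Text rows after it."""
    paragraphs = []
    current = None
    for row in rows:
        label = row["classes"]
        if label == "Firstline":
            # A new paragraph starts here.
            current = [row["text_line"]]
            paragraphs.append(current)
        elif label == "Text" and current is not None:
            # Continuation of the paragraph opened by the last Firstline.
            current.append(row["text_line"])
        else:
            # Title, Header, Other, or Text without a Firstline: close the paragraph.
            current = None
    return [" ".join(lines) for lines in paragraphs]


# Invented rows, shaped like those of truelabels_line.csv.
sample_rows = [
    {"text_line": "NOUVELLES", "classes": "Title"},
    {"text_line": "Le régiment a quitté", "classes": "Firstline"},
    {"text_line": "ses cantonnements.", "classes": "Text"},
    {"text_line": "Le lendemain,", "classes": "Firstline"},
    {"text_line": "il pleuvait.", "classes": "Text"},
]
paragraphs = group_paragraphs(sample_rows)
```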
 

Files

Logical-Layout-Analysis-Dataset.zip (16.5 MB)
md5:23dc9ef3833cb9bfc394160d33bd8068