There is a newer version of the record available.

Published October 11, 2021 | Version 1.0
Dataset Open

Dataset for Logical-layout analysis on French historical newspapers

  • 1. Centre Tesnière - CRIT, Université de Franche-Comté

Contributors

Data manager:

  • 1. CRIT, Université de Franche-Comté

Description


This is a dataset for training and testing logical-layout analysis and recognition systems on French historical documents published between 1900 and 1950. The original data is part of the "Fond régional: Franche Comté", which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).

This dataset is divided into a train set and a test set. Both sets were designed to cover as many as possible of the layouts that exist in the "Fond régional: Franche Comté" dataset. To that end, the documents are grouped into three layout types:

* 1c: documents where the text is displayed in one column, as in books;
* 2c: documents where the text is displayed in two columns;
* 3c+: documents where there are at least 3 columns of text, as in newspapers.

Each of these folders contains subfolders prefixed by ‘cb’. Each such name is the identifier of a newspaper collection such as « Le Semeur ». Each of these folders also contains an XML file describing the collection; this file is not needed for logical-layout analysis. The folders also contain subfolders prefixed by ‘bpt’, with the following files:

* XXX.xml: the original XML file as gathered from Gallica;
* truelabels_block: a CSV file giving the true label of each TextBlock tag. Each line contains the page, the block_id, the first and the last line of text of the block, and its label;
* truelabels_line: a CSV file giving the true label of each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label;
* XXX_docbook.xml: the document as processed by a Logical Layout recognition system.
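The folder layout described above can be explored with a short script. The following is a minimal sketch, not part of the dataset: the function name `find_documents` is hypothetical, and it assumes that the layout-type folders are named exactly `1c`, `2c` and `3c+` and that every document folder sits somewhere below a ‘cb’ folder. It makes no assumption about how the train and test sets are arranged at the top level.

```python
from pathlib import Path

# Assumed folder names for the three layout types described above.
LAYOUT_TYPES = {"1c", "2c", "3c+"}


def find_documents(root):
    """Yield one dict per 'bpt' document folder found anywhere under root."""
    root = Path(root)
    for doc_dir in sorted(root.rglob("bpt*")):
        if not doc_dir.is_dir():
            continue  # skip files such as bptXXX.xml
        parts = doc_dir.relative_to(root).parts
        yield {
            "document": doc_dir.name,
            # First path component that looks like a collection identifier.
            "collection": next((p for p in parts if p.startswith("cb")), None),
            # First path component that names a layout type, if any.
            "layout": next((p for p in parts if p in LAYOUT_TYPES), None),
            "files": sorted(f.name for f in doc_dir.iterdir() if f.is_file()),
        }
```

Calling `list(find_documents("path/to/unzipped/dataset"))` returns one entry per document, which is convenient for checking that each folder holds the four expected files.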

The XXX.xml file, which is the original file as stored on Gallica, provides several kinds of information about the document, such as:

* Metadata, which follows the Dublin Core format;
* Pagination;
* OCR, which follows the XML ALTO format.

The OCR output for the whole document is available in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags.
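As an illustration of this hierarchy, the sketch below reads the text of each TextBlock with Python's standard `xml.etree.ElementTree` module. It is not the tooling used to build the dataset. The function names are hypothetical, and the code assumes the standard ALTO attributes `ID` (on TextBlock and TextLine) and `CONTENT` (on String). Because the XML namespace used in the files is not stated here, tags are matched on their local name only, and words are joined with a single space instead of interpreting SP tags.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_blocks(xml_source):
    """Yield (block_id, [(line_id, text), ...]) for each TextBlock."""
    root = ET.parse(xml_source).getroot()
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Assumption: each word is stored in the CONTENT attribute.
            words = [
                child.get("CONTENT", "")
                for child in line
                if local_name(child.tag) == "String"
            ]
            lines.append((line.get("ID"), " ".join(words)))
        yield block.get("ID"), lines
```

`xml_source` may be a file path or an open binary file object; the identifiers returned are the ones that the label CSV files are expected to refer to.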

The truelabel_block.csv file indicates the true logical label for each TextBlock tag in the document. The possible labels are Text, Title, Header and Other. Similarly, the truelabel_lines.csv file indicates the true logical label for each TextLine tag in the document. The possible labels are Text, Firstline (to indicate the first line of a paragraph), Title, Header or Other. Each line in these files contains an id, for a block or a line of text respectively, which is found in the OCR section of the XXX.xml file.
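A quick way to inspect the label files is to count how often each label occurs and to flag any value outside the documented label set. The sketch below is only an illustration under assumptions: the delimiter, the presence of a header row and the exact file names are not specified here, so they are left as parameters, and the label is assumed to be the last field of each row, following the field order given above. The function names are hypothetical.

```python
import csv
from collections import Counter

# Label sets as listed in the description above.
BLOCK_LABELS = {"Text", "Title", "Header", "Other"}
LINE_LABELS = BLOCK_LABELS | {"Firstline"}


def count_labels(csv_path, delimiter=",", has_header=False):
    """Count labels, assuming the label is the last field of each row."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if row:  # ignore empty lines
                counts[row[-1].strip()] += 1
    return counts


def unexpected_labels(counts, allowed):
    """Return the labels in counts that are not in the allowed set."""
    return sorted(label for label in counts if label not in allowed)
```

If `unexpected_labels(count_labels(path), BLOCK_LABELS)` is not empty, the most likely cause is a wrong `delimiter` or `has_header` setting rather than a problem in the data.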

The XXX_docbook.xml file was produced by a rule-based Logical Layout recognition system. It contains the text from the OCR section of the XXX.xml file, surrounded by tags that correspond to logical labels. The possible tags are Header, Title, Para (for paragraph), Sent (for sentences) and Other. Because these files were generated automatically, their labels may differ from the ones given in the CSV files.
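To get a first impression of what the rule-based system produced for a document, the logical tags can simply be counted. This is a minimal sketch with a hypothetical function name. The exact capitalisation and namespace of the tags in the files are not stated here, so tag names are compared in lower case and without any namespace prefix.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tag names from the description above, lower-cased for comparison.
LOGICAL_TAGS = {"header", "title", "para", "sent", "other"}


def count_logical_tags(xml_source):
    """Count occurrences of the logical tags in a docbook file."""
    counts = Counter()
    for elem in ET.parse(xml_source).iter():
        # Strip any '{namespace}' prefix and ignore letter case.
        name = elem.tag.rsplit("}", 1)[-1].lower()
        if name in LOGICAL_TAGS:
            counts[name] += 1
    return counts
```

These counts are not directly comparable with the CSV label counts, since Para and Sent have no one-to-one counterpart among the block and line labels.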

You can access the original scan of every document on the Gallica website. To do so, use the following URL, replacing <IDENTIFIER> with the id of the document (e.g. bpt6k76208717): https://gallica.bnf.fr/ark:/12148/<IDENTIFIER>
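The URL pattern above can be wrapped in a small helper. The function name `gallica_url` is hypothetical; the prefix is the one given in the pattern.

```python
GALLICA_ARK_PREFIX = "https://gallica.bnf.fr/ark:/12148/"


def gallica_url(identifier):
    """Build the Gallica URL of a document from its identifier."""
    identifier = identifier.strip()
    if not identifier:
        raise ValueError("identifier must not be empty")
    return GALLICA_ARK_PREFIX + identifier


print(gallica_url("bpt6k76208717"))
# prints https://gallica.bnf.fr/ark:/12148/bpt6k76208717
```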

Files

Logical-Layout-Analysis-Dataset.zip (22.0 MB)
md5:ad381308b35ef0461ee6ae28cc2c57d5