Tesseract OCR models for the Alsatian dialects

Bernhard, Delphine

doi:10.5281/zenodo.1404914

Published August 28, 2018 | Version 1

Dataset Open

Bernhard, Delphine¹

1. LiLPa, Université de Strasbourg

This dataset provides trained Tesseract (https://github.com/tesseract-ocr/tesseract) OCR models for the Alsatian dialects. These models were developed in the context of the RESTAURE project, funded by the French ANR.

Two models are provided:

The first model, ISKO_2015, was presented in the following article: https://hal.archives-ouvertes.fr/hal-01252241. It was trained with the jTessBoxEditor tool (http://vietocr.sourceforge.net/training.html), version 1.4 (2 May 2015), on images automatically generated from the training texts (excerpts from 7 different printed works, totalling about 9,000 words). The images were generated at a 36pt font size in two fonts (Arial and Times New Roman), each in its normal and italic variants.
The resulting Tesseract model (gsw.traineddata) can be used with Tesseract 3.0x.

The second model, 2018, has been trained for Tesseract 4.0x, using jTessBoxEditor version 2.0.1 (28 July 2018). Again, images were automatically generated from the training text. The training text is different from the one used for the ISKO_2015 model and is "artificial", in the sense that it has been built by appending word n-grams extracted from a large variety of published texts in Alsatian, for a time period spanning 2 centuries and for different text genres. The images corresponding to this training text have been automatically generated with the Tesseract text2image tool, using the following parameters: --ptsize=36 --leading=20. The fonts used are listed in the gsw.font_properties file.
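The image generation step described above can be sketched as follows. Only `--ptsize=36` and `--leading=20` come from the description; the training text file name, output base name, font name and fonts directory are hypothetical placeholders, and the actual fonts are those listed in gsw.font_properties. The script only builds and prints the command line; running it requires the text2image tool from the Tesseract training tools.

```python
import shlex


def build_text2image_command(text_path, output_base, font, fonts_dir):
    """Build a text2image call with the point size and leading stated above."""
    return [
        "text2image",
        f"--text={text_path}",          # hypothetical training text file
        f"--outputbase={output_base}",  # hypothetical output base name
        f"--font={font}",               # one font per call
        f"--fonts_dir={fonts_dir}",     # hypothetical fonts directory
        "--ptsize=36",
        "--leading=20",
    ]


command = build_text2image_command(
    "gsw.training_text", "gsw.Arial.exp0", "Arial", "/usr/share/fonts"
)
print(shlex.join(command))
```

To actually generate the images, the list can be passed to `subprocess.run(command, check=True)`, once per font.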

Dictionary data has also been used for training. We conflated Alsatian words found in several lexicons and corpora:

Lexicons produced by the OLCA (Office pour la Langue et les Cultures d'Alsace et de Moselle): http://www.olcalsace.org/fr/lexiques
Lexicon from a Wiktionary user page: https://fr.wiktionary.org/wiki/Utilisateur:Laurent_Bouvier/alsacien-fran%C3%A7ais
Lexicon from the ACPA association: http://web.archive.org/web/20160302234127/http:/culture.alsace.pagesperso-orange.fr/dictionnaire_alsacien.htm
Chronicles published by Raymond Matzen in the local newspaper "Les Dernières Nouvelles d'Alsace"
Transcriptions of television shows found in Erhart, P. (2012). Les dialectes dans les médias: quelle image de l’Alsace véhiculent-ils dans les émissions de la télévision régionale?, Université de Strasbourg, http://www.theses.fr/167563386
French-Alsatian parallel corpus provided by the OLCA
Excerpts from Adolf, P. (2006). Dictionnaire comparatif multilingue: français-allemand-alsacien-anglais., Strasbourg, France, Midgard, 2006, 373 p.
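The description does not detail how the word lists were conflated. As a minimal sketch of one possible approach, assuming each source has already been reduced to a list of word forms, the lists can be merged into a single deduplicated, sorted list with one entry per word. The sample words and the function name are illustrative only, and case is deliberately left untouched.

```python
import unicodedata


def merge_word_lists(*word_lists):
    """Merge several word lists into one sorted list of unique words."""
    merged = set()
    for words in word_lists:
        for word in words:
            # Normalise to NFC so that the same accented word coming from
            # two sources is not counted twice.
            cleaned = unicodedata.normalize("NFC", word.strip())
            if cleaned:
                merged.add(cleaned)
    return sorted(merged)


# Illustrative sample entries, not taken from the actual lexicons.
lexicon_a = ["Hüs", "Büech ", "Wasser"]
lexicon_b = ["Wasser", "Mànn", ""]
merged = merge_word_lists(lexicon_a, lexicon_b)
print("\n".join(merged))
```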

The Tesseract models can be used, for instance, with the gImageReader tool (https://github.com/manisandro/gImageReader), which provides a graphical user interface for Tesseract.

When evaluated against the same test corpus (prose by Marie Hart, theater and poetry by Gustave Stoskopf, and prose by Charles Zumstein, totalling about 4,900 words), both models achieve roughly the same performance levels. Usually, even better performance levels can be achieved by combining the Alsatian-specific model with the French and German models for Tesseract (available from https://github.com/tesseract-ocr/tessdata).
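Combining the models on the command line can be sketched as follows, assuming that gsw.traineddata and the French and German models (fra.traineddata, deu.traineddata) for the matching Tesseract version sit in the same directory. The image name, output base and directory are hypothetical; the script only builds and prints the command.

```python
import shlex


def build_tesseract_command(image_path, output_base, tessdata_dir,
                            languages=("gsw", "fra", "deu")):
    """Build a tesseract call that loads several language models at once."""
    return [
        "tesseract",
        image_path,                      # hypothetical input image
        output_base,                     # output is written to <output_base>.txt
        "--tessdata-dir", tessdata_dir,  # directory holding the .traineddata files
        "-l", "+".join(languages),       # models are combined with '+'
    ]


ocr_command = build_tesseract_command("page.png", "page", "./tessdata")
print(shlex.join(ocr_command))
```

Passing `languages=("gsw",)` gives the Alsatian-only run, which makes it easy to compare both configurations on the same page.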

Files

Files (2.0 MB)

Name	MD5	Size
2018.zip	bc1b8259e3ad9d8f4c5203268b321a8d	1.1 MB
ISKO_2015.zip	4d7a702dae4f586268e8aeeff58f0b29	849.9 kB
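The MD5 checksums listed above can be used to check the downloaded archives. The following is a minimal sketch that assumes both files were saved under their original names in the current directory; the function names are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums as listed on the record page.
EXPECTED_MD5 = {
    "2018.zip": "bc1b8259e3ad9d8f4c5203268b321a8d",
    "ISKO_2015.zip": "4d7a702dae4f586268e8aeeff58f0b29",
}


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Return {file name: True/False}; a missing file counts as False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.exists() and md5_of_file(path) == expected
    return results


print(check_downloads())
```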

Additional details

Is referenced by: https://hal.archives-ouvertes.fr/hal-01252241 (URL)

	All versions	This version
Views	947	947
Downloads	68	68
Data volume	83.2 MB	83.2 MB
