Published August 28, 2018 | Version 1
Dataset Open

Tesseract OCR models for the Alsatian dialects

  • 1. LiLPa, Université de Strasbourg

Description

This dataset provides trained Tesseract (https://github.com/tesseract-ocr/tesseract) OCR models for the Alsatian dialects. These models were developed as part of the RESTAURE project, funded by the French National Research Agency (ANR).

Two models are provided:

The first model, ISKO_2015, was presented in the following article: https://hal.archives-ouvertes.fr/hal-01252241. It was trained with the jTessBoxEditor tool (http://vietocr.sourceforge.net/training.html), version 1.4 (2 May 2015), on images automatically generated from the training texts (excerpts from 7 different printed works, totalling about 9,000 words). The images were generated at a 36pt font size in two fonts (Arial and Times New Roman), each in its normal and italic variants.
The resulting Tesseract model (gsw.traineddata) can be used with Tesseract 3.0x.

The second model, 2018, was trained for Tesseract 4.0x, using jTessBoxEditor version 2.0.1 (28 July 2018). Again, images were automatically generated from the training text. This training text differs from the one used for the ISKO_2015 model and is "artificial", in the sense that it was built by appending word n-grams extracted from a large variety of published texts in Alsatian, spanning two centuries and several text genres. The corresponding images were generated with the Tesseract text2image tool, using the following parameters: --ptsize=36 --leading=20. The fonts used are listed in the gsw.font_properties file.
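As an illustration of the image-generation step, here is a minimal Python sketch that assembles a text2image command line with the two parameters given above (--ptsize=36 and --leading=20). The training-text file name, output base, font name and font directory are hypothetical placeholders, not values taken from this dataset; the fonts actually used are those listed in gsw.font_properties. The sketch only builds the argument list and does not claim to reproduce the exact invocation used for the 2018 model.

```python
from pathlib import Path
from typing import List


def build_text2image_command(
    training_text: Path,
    output_base: Path,
    font_name: str,
    fonts_dir: Path,
    ptsize: int = 36,   # value stated in the description
    leading: int = 20,  # value stated in the description
) -> List[str]:
    """Return the argument list for one text2image run (one font per run)."""
    return [
        "text2image",
        f"--text={training_text}",
        f"--outputbase={output_base}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
        f"--ptsize={ptsize}",
        f"--leading={leading}",
    ]


if __name__ == "__main__":
    # All four arguments below are hypothetical placeholders.
    command = build_text2image_command(
        training_text=Path("gsw.training_text"),
        output_base=Path("gsw.SomeFont.exp0"),
        font_name="Some Font",
        fonts_dir=Path("fonts"),
    )
    print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where the Tesseract training tools are installed; one run per font would be needed.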

Dictionary data has also been used for training. We conflated Alsatian words found in several lexicons and corpora:

The Tesseract models can be used, for instance, with the gImageReader tool (https://github.com/manisandro/gImageReader), which provides a graphical user interface for Tesseract.

When evaluated against the same test corpus (prose by Marie Hart, theater and poetry by Gustave Stoskopf, and prose by Charles Zumstein, totalling about 4,900 words), both models achieve roughly the same performance levels. Performance can usually be improved further by combining the Alsatian-specific model with the French and German models for Tesseract (available from https://github.com/tesseract-ocr/tessdata).
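One way to combine models is Tesseract's `-l` option, which accepts several language codes joined with `+`. The following Python sketch builds such a command line. It assumes that gsw.traineddata has been placed in the same tessdata directory as the French (fra) and German (deu) models; the image name, output base and directory are hypothetical placeholders, and the language order shown is only one possible choice.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    image_path: str,
    output_base: str,
    languages: Sequence[str] = ("gsw", "fra", "deu"),
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Return the argument list for a multi-language Tesseract run."""
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Directory assumed to contain gsw, fra and deu traineddata files.
        command += ["--tessdata-dir", tessdata_dir]
    return command


if __name__ == "__main__":
    # "page.png" and "page" are hypothetical input and output names.
    print(" ".join(build_tesseract_command("page.png", "page")))
```

The traineddata files must match the installed Tesseract version (the ISKO_2015 model targets 3.0x, the 2018 model targets 4.0x), so the French and German models should be taken from the corresponding release of the tessdata repository.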

Files

2018.zip

Files (2.0 MB)

Checksum Size
md5:bc1b8259e3ad9d8f4c5203268b321a8d 1.1 MB
md5:4d7a702dae4f586268e8aeeff58f0b29 849.9 kB
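
The md5 checksums listed above can be used to check that a download is intact. Below is a minimal Python sketch using the standard hashlib module; the function names are illustrative. The listing here does not pair file names with checksums, so this example takes the expected value as a parameter.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A downloaded archive can then be compared against each listed value in turn, for example with `matches_checksum(Path("2018.zip"), "md5:bc1b8259e3ad9d8f4c5203268b321a8d")`.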

Additional details

Related works