Published October 8, 2018 | Version v1
Poster Open

Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification

  • 1. University of Würzburg
  • 2. Berlin‐Brandenburg Academy of Sciences and Humanities

Description

When digitizing historical lexica, the goal is not only a high-quality OCR result but also precise automatic recognition of typographical attributes. We therefore present a method that enables fine-grained typography classification by training an open-source OCR engine, and we show how to map the resulting typography information onto the OCR output. As a test case we use Sanders’ dictionary (1859–1865), in which typography carries a particularly complex semantic function. Analogously to the OCR ground truth, we produce line-based typography ground truth by assigning a distinct label to each of the five typography classes. Separate models are trained on the respective ground truth; each model then recognizes the line images, and the two outputs are aligned at word level. Despite the very challenging material, we achieved a character error rate of only 0.4%. In addition, the typography recognition assigned the correct label to close to 99% of the words.
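The word-level alignment and the character error rate mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, hypothetically, that the typography model emits one label character per glyph (here `a` to `e` for the five classes) with spaces preserved, and that each word receives the majority label of its characters. The function names `tag_words` and `character_error_rate` are invented for this illustration.

```python
from collections import Counter


def majority_label(labels: str) -> str:
    """Return the most frequent non-space label, or '?' if there is none."""
    counts = Counter(ch for ch in labels if not ch.isspace())
    return counts.most_common(1)[0][0] if counts else "?"


def tag_words(text_line: str, label_line: str) -> list[tuple[str, str]]:
    """Pair each OCR word with one typography label (assumed encoding)."""
    text_words = text_line.split()
    label_words = label_line.split()
    # Easy case: both models segmented the line into the same number of words.
    if len(text_words) == len(label_words):
        return [(w, majority_label(l)) for w, l in zip(text_words, label_words)]
    # Fallback (an assumption of this sketch): project each word's character
    # span proportionally onto the label line and vote within that span.
    scale = len(label_line) / max(len(text_line), 1)
    tagged, pos = [], 0
    for word in text_words:
        start = text_line.index(word, pos)
        end = start + len(word)
        pos = end
        lo = int(start * scale)
        hi = max(int(end * scale), lo + 1)
        tagged.append((word, majority_label(label_line[lo:hi])))
    return tagged


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the ground truth."""
    return edit_distance(ground_truth, ocr_output) / max(len(ground_truth), 1)
```

For example, `tag_words("Haus n. das", "aaba bb ccc")` pairs `Haus` with `a` even though one character was mislabeled, because the vote is taken per word. The proportional fallback is only a rough heuristic for lines where the two models disagree on word boundaries.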

Files

Reul et al. - Automatic Semantic Text Tagging.pdf

Files (4.7 MB)