Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification

Reul, Christian; Göttel, Sebastian; Springmann, Uwe; Wick, Christoph; Würzner, Kay-Michael; Puppe, Frank

doi:10.5281/zenodo.1451482

Published October 8, 2018 | Version v1

Poster Open


1. University of Würzburg
2. Berlin‐Brandenburg Academy of Sciences and Humanities

When capturing the content of historical lexica, the goal is not only a high-quality OCR result but also precise automatic recognition of typographical attributes. We therefore present a method that enables fine-grained typography classification by training an open-source OCR engine, and we show how to map the resulting typography information onto the OCR output. As a test case, we use Sanders’ dictionary (1859-1865), in which typography serves a particularly complex semantic function. Analogously to the OCR ground truth, we produce line-based typography ground truth by assigning a distinct label to each of the five typography classes. After separate models have been trained on the respective ground truth, each model recognizes the line images and the outputs are aligned at the word level. Despite the very challenging material, we achieved a character error rate of only 0.4%. In addition, the typography recognition assigned the correct label to close to 99% of the words.
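The word-level alignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the typography model emits one label character per glyph and preserves spaces, so that both outputs split into the same number of words. The function name `tag_words` and the single-letter labels are hypothetical; the record does not name the five classes.

```python
from collections import Counter
from typing import List, Tuple


def tag_words(ocr_line: str, typo_line: str) -> List[Tuple[str, str]]:
    """Pair each OCR word with the majority typography label of its counterpart.

    Assumption (not from the source): typo_line holds one label character
    per glyph, with spaces at the same word boundaries as ocr_line.
    """
    ocr_words = ocr_line.split()
    typo_words = typo_line.split()
    if len(ocr_words) != len(typo_words):
        # A real pipeline would need a more tolerant alignment here.
        raise ValueError("OCR and typography outputs differ in word count")

    tagged = []
    for word, labels in zip(ocr_words, typo_words):
        # Majority vote smooths over isolated per-character label errors.
        majority_label = Counter(labels).most_common(1)[0][0]
        tagged.append((word, majority_label))
    return tagged


# Hypothetical labels: "b" for one typography class, "a" for another.
result = tag_words("Haus, das", "bbabb aaa")
```

Here the single stray `a` inside the first label word is outvoted, so the whole word receives the label `b`. Voting per word is one simple way to obtain a single label for each word from per-character predictions; the poster may resolve disagreements differently.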

Files

Reul et al. - Automatic Semantic Text Tagging.pdf (4.7 MB, md5:10dfcdaab11d69d2b05ba52cd8289bcd)

