
Report Open Access

Improving the recognition of Dutch Gothic machine print, at four levels in the processing pipeline, in four days

Schomaker, Lambert; Ameryan, Mahya; Cuper, Mirjam; Dercksen, Koen; Guo, Jerry; Koert, Rutger van; Mendrik, Adriënne; Todorov, Konstantin; Wang, Xue

Libraries and archives struggle with optical character recognition (OCR) of old machine-print fonts such as Gothic, or 'fraktur'. This font was used in many important historical printed collections, such as administrative texts and the 'newspapers' that were newly invented in the 17th century, which carried interesting and detailed reports on important developments and events. When current state-of-the-art OCR tools are applied, or the scanned images are sent to large, well-known companies that provide OCR services, the returned results are still quite disappointing. Problems are observed at all levels in the processing pipeline: binarisation suffers from ink bleed-through; layout analysis suffers from deviating page designs, marginalia and graphics; character recognition suffers from a lack of pertinent font examples and from font variation (Roman/Gothic) within a document; and, finally, linguistic post-processing suffers from an utter lack of encoded digital text corpora of suitable size. Indeed, the OCR process is often intended to produce such corpora in the first place.

A team was formed to approach these problems in four days, with a fifth day for reporting (other teams were working on other industrial problems at the Lorentz Center that week). The team decided to address problems at all levels in the processing pipeline.

Files (2.0 MB)
Name: Schomaker-et-al-Lorentz-2020-ICT-with-Industry-report-OCR-Dutch-Gothic-TechReport.pdf
Size: 2.0 MB
md5:2b819fc299cd123eb8fa30c0b7eb70f2
