Published March 9, 2020 | Version v1
Other Open

Quality Measurement for Optical Character Recognition without ground truth data

Description

This document records most of the research I did for the National Library of the Netherlands (Koninklijke Bibliotheek) as part of a project for my Master's thesis. Although the project was terminated because it did not align with my study program, the research conducted so far remains useful to consider.

The purpose of the project was to measure the quality of documents processed with OCR by ABBYY FineReader, independently of both ABBYY's own reports and ground truth data, given that ground truth data will not be available for many documents in the future.

Files

notes.pdf (221.8 kB)
md5:d35dbe7f9aeaa16f87c94a710e691f54

Additional details

References

  • Feng, M.-L., & Tan, Y.-P. (2004). Contrast adaptive binarization of low quality document images. IEICE Electronics Express, 1(16), 501-506.
  • Kulp, S., & Kontostathis, A. (2007). On retrieving legal files: Shortening documents and weeding out garbage. In TREC.
  • Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling (Vol. 1).
  • Wudtke, R., Ringlstetter, C., & Schulz, K. U. (2011). Recognizing garbage in OCR output on historical documents. In Proceedings of the 2011 joint workshop on multilingual OCR and analytics for noisy unstructured text data (pp. 1-6).