Published March 9, 2020 | Version v1
Other
Open
Quality Measurement for Optical Character Recognition without ground truth data
Creators
Description
This document collects most of the research I did for the National Library of the Netherlands (Koninklijke Bibliotheek) as part of a project for my Master's thesis. Although I terminated the project because it did not align with my study program, the research conducted so far remains useful to consider.
The purpose of the project was to measure the quality of documents processed with OCR by ABBYY FineReader, independently of ABBYY's own reports and of ground truth data, given that ground truth will not be available for many documents in the future.
Files
notes.pdf (221.8 kB)
md5:d35dbe7f9aeaa16f87c94a710e691f54
Additional details
References
- Feng, M.-L., & Tan, Y.-P. (2004). Contrast adaptive binarization of low quality document images. IEICE Electronics Express, 1(16), 501-506.
- Kulp, S., & Kontostathis, A. (2007). On retrieving legal files: Shortening documents and weeding out garbage. In TREC.
- Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling (Vol. 1).
- Wudtke, R., Ringlstetter, C., & Schulz, K. U. (2011). Recognizing garbage in OCR output on historical documents. In Proceedings of the 2011 joint workshop on multilingual OCR and analytics for noisy unstructured text data (pp. 1-6).