Improving the Recognition Accuracy of Tesseract-OCR Engine on Nepali Text Images via Preprocessing

Umesh Hengaju; Dr Bal Krishna Bal

doi:10.5281/zenodo.4361896

Published December 19, 2020 | Version v1

Journal article Open

1. Information and Language Processing Research Lab, Department of Computer Science and Engineering, Kathmandu University, Dhulikhel, Kavre, Nepal.

Document images scanned or captured with mobile phone cameras suffer from a number of defects, such as geometric distortion, loss of focus, uneven lighting and low scanning resolution. These defects degrade image quality, which in turn lowers the recognition accuracy of OCR engines. This work focuses on improving the recognition accuracy of the Tesseract-OCR engine on Nepali document images through preprocessing. To this end, we developed an eight-step image preprocessing pipeline and tested it on Nepali text images collected from sources such as a Nepali news corpus, books and printed documents. In our tests, recognition accuracy improved from 90.69%, 54.34% and 38.45% to 94.84%, 71.15% and 51.21% for high-, medium- and low-quality images, respectively.
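To put the reported figures in perspective, the short Python sketch below derives the gain for each image quality tier. Only the six accuracy values come from the abstract; the names `REPORTED`, `gains` and `SUMMARY`, and the choice of the three derived measures (absolute gain in percentage points, relative gain, and share of the remaining error removed), are illustrative assumptions rather than metrics used in the paper.

```python
# Accuracy figures (percent) as reported in the abstract: (before, after).
# The dictionary and function names below are hypothetical, chosen for this sketch.
REPORTED = {
    "high": (90.69, 94.84),
    "medium": (54.34, 71.15),
    "low": (38.45, 51.21),
}


def gains(before: float, after: float) -> tuple[float, float, float]:
    """Return (absolute gain in points, relative gain in %, error reduction in %)."""
    absolute = after - before
    # Relative gain: improvement as a fraction of the baseline accuracy.
    relative = absolute / before * 100.0
    # Error reduction: share of the baseline error (100 - before) that was removed.
    error_reduction = absolute / (100.0 - before) * 100.0
    return round(absolute, 2), round(relative, 2), round(error_reduction, 2)


# Derived measures per quality tier, computed once for reuse.
SUMMARY = {quality: gains(*pair) for quality, pair in REPORTED.items()}

if __name__ == "__main__":
    for quality, (absolute, relative, error_reduction) in SUMMARY.items():
        print(
            f"{quality:>6}: +{absolute} points, "
            f"{relative}% relative gain, "
            f"{error_reduction}% of baseline error removed"
        )
```

Read this way, medium-quality images show the largest absolute gain (16.81 percentage points), followed by low-quality images (12.76 points) and high-quality images (4.15 points).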

Files

Improving the Recognition Accuracy -Formatted Paper 2.pdf (588.3 kB, md5:79b5b6dcddc2d9b1886d6b7d6868331f)

