Enhancing Hindi OCR Accuracy with Large Language Model-Based Post-Processing
Description
Optical Character Recognition (OCR) is a technology that extracts text from images, typically documents. However, the proliferation of fonts designed to make text visually appealing has made high-precision OCR increasingly difficult. This is especially true for Hindi, for which "robust and efficient recognizers are not yet commercially available," according to Venu G. in his book "Guide to OCR for Indic Scripts". Teja K. and Jyoti P. tackled this problem in their paper "Multi-font Devanagari Text Recognition"; however, the key issue of word-level errors was not addressed. In a paper by Ray Smith, Tesseract OCR for Hindi was found to have a word error rate (WER) of 69.44% and a character error rate (CER) of 15.41%. This paper implements Tesseract OCR for Hindi, trained on multiple Hindi fonts, and incorporates pre-processing techniques such as de-skewing, noise reduction, and binarization. It further investigates whether a Large Language Model (LLM), specifically GPT-4, can correct recognition errors produced by the standard Hindi Tesseract engine. Using a 125-page corpus comprising Mahabharata commentary, epics, and modern newspapers, we applied a lightweight image pre-processing pipeline, performed baseline OCR, and fed the raw output to the LLM for token-level correction. The integrated pipeline reduces the CER to 2.47% and the WER to 5.83% on held-out data.
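The CER and WER figures cited above are conventionally computed as Levenshtein (edit) distance over characters and words, respectively, divided by the reference length. The abstract does not specify the authors' exact scoring script, so the following is a minimal sketch of these standard metrics in plain Python; function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences via classic dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))           # distances for the empty-prefix row
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / len(ref_words)

# Example: one confused Devanagari character (व -> ब) in a 3-word sentence
# gives WER 1/3 and CER 1/9 (9 characters including spaces).
print(wer("राम वन गए", "राम बन गए"))  # 0.333...
print(cer("राम वन गए", "राम बन गए"))  # 0.111...
```

Scoring the raw Tesseract output and the GPT-4-corrected output against the same ground-truth transcription with these two functions is how a before/after comparison like 15.41% vs. 2.47% CER would typically be produced.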
Files
1-DEJ1635.pdf (1.0 MB)
md5:6d3822682bc7bbc2188e23342fde1e53