Enhancing Hindi OCR Accuracy with Large Language Model-Based Post-Processing
Description
Optical Character Recognition (OCR) is a technology that extracts text from images, typically documents. However, the proliferation of fonts designed to make text visually appealing has made high-precision OCR increasingly difficult. This is especially true for Hindi, for which "robust and efficient recognizers are not yet commercially available," according to Venu G. in his book "Guide to OCR for Indic Scripts". Teja K. and Jyoti P. tackled this problem in their paper "Multi-font Devanagari Text Recognition"; however, the key issue of word-level errors was not addressed. In a paper by Ray Smith, Tesseract OCR for Hindi was found to have a word error rate (WER) of 69.44% and a character error rate (CER) of 15.41%. This paper implements Tesseract OCR for Hindi, trained on multiple Hindi fonts, and incorporates pre-processing techniques such as de-skewing, noise reduction, and binarization. It further investigates whether a Large Language Model (LLM), specifically GPT-4, can correct recognition errors produced by the standard Hindi Tesseract engine. Using a 125-page corpus comprising Mahabharata commentary, epics, and modern newspapers, we applied a lightweight image pre-processing pipeline, performed baseline OCR, and fed the raw output to the LLM for token-level correction. The integrated pipeline reduces the CER to 2.47% and the WER to 5.83% on held-out data.
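The CER and WER figures cited above are conventionally computed as Levenshtein (edit) distance over characters and words, respectively, divided by the reference length. The abstract does not specify the authors' exact scoring script, so the following is a minimal sketch of these standard metrics in plain Python; function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences via classic dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))           # distances for the empty-prefix row
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / len(ref_words)

# Example: one confused Devanagari character (व -> ब) in a 3-word sentence
# gives WER 1/3 and CER 1/9 (9 characters including spaces).
print(wer("राम वन गए", "राम बन गए"))  # 0.333...
print(cer("राम वन गए", "राम बन गए"))  # 0.111...
```

Scoring the raw Tesseract output and the GPT-4-corrected output against the same ground-truth transcription with these two functions is how a before/after comparison like 15.41% vs. 2.47% CER would typically be produced.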
Files
1-DEJ1635.pdf (1.0 MB)
md5:6d3822682bc7bbc2188e23342fde1e53