An efficient extraction of information from Indian Government issued documents Aadhar and Pan Card

In today's world, everything is being digitized, and data scanning tools and photography are in widespread use. When a large amount of image data is available, it becomes important to collect that data in a form that is useful for the company or organization. Doing this manually is tedious and time-consuming. To simplify the job, we have developed a Flask API that takes a folder of images as input and returns an Excel sheet of the relevant data extracted from those images. We use optical character recognition (OCR), through software such as pytesseract, to extract text from the images. We then apply natural language processing and, finally, locate the relevant data using the glob and regex modules. The model is helpful for collecting data from registration certificates, making it easy to store fields such as chassis number, owner name, and car number, and it can also be applied to Aadhaar cards and PAN cards.


Introduction
DOI: 10.5281/zenodo.5196052. Received: April 12, 2021. Accepted: August 01, 2021.
In today's world, processing images such as invoices and handwritten bills has become essential in every sector, especially with the widespread use of data scanning tools and photography. In a fast-moving world, people cannot be expected to spend much time typing data into a particular form, which wastes time. However, extracting text from images with 100% accuracy is a demanding task [13]. As deep learning has expanded and OCR technologies have progressed, semi-automated and fully automated solutions for document information extraction are seeing wider adoption. Modern OCR software is quick and precise, and it can handle common document processing constraints such as poorly or imperfectly formatted scans, handwritten documents, low-quality images or scans, and blemishes that would traditionally have required extended manual intervention [14]. Organizations now prefer automated document processing methods in order to become paperless and adopt cloud-based digital solutions. We use optical character recognition (OCR), through software such as pytesseract, to extract data from images. We then apply natural language processing (NLP) and, finally, locate the relevant data using the glob and regex modules. NLP is used to clean the data: it removes irrelevant content from the text and keeps the useful information [3]. One of the major steps in NLP is to remove noise from the data so that it is easier for the machine to detect patterns or, in our case, textual characters. The noise in the data takes the form of special characters such as hashtags, punctuation, and numbers. These are not important for the data and should therefore be excluded, so we process the data to remove them.
Similarly, stop words in the text introduce unnecessary noise and should be removed. Tokenization and lemmatization are also applied to further clean the text and reduce words to their root forms [4]. Part-of-speech tagging and chunking are also used to clean the data and keep only the relevant information in the text [5]. All of this is done using the Natural Language Toolkit (NLTK) Python library. The paper is divided into seven sections: Introduction, Related Work, Dataset, OpenCV for OCR, Proposed Methodology, Results and Comparison with existing methodology, and Conclusion.
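The noise-removal and stop-word steps described in this section can be illustrated with a minimal sketch. It uses only the Python standard library so that it runs without any downloads; the tiny stop-word set and the function name clean_text are assumptions made for illustration. The paper's pipeline uses the Natural Language Toolkit for these steps, together with lemmatization and part-of-speech tagging, which are not reproduced here.

```python
import re
from typing import List

# Illustrative stop-word list (an assumption); NLTK ships a much larger one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def clean_text(raw: str) -> List[str]:
    """Return lowercase word tokens with noise and stop words removed."""
    # Replace everything except letters and whitespace (hashtags,
    # punctuation, digits) with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", raw)
    # Lowercase, then tokenize on whitespace.
    tokens = letters_only.lower().split()
    # Drop stop words.
    return [tok for tok in tokens if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(clean_text("Name: #Ravi Kumar, the owner of car no. 42!"))
```

Because this cleaning discards digits, a pipeline of this kind would presumably apply it only to free-text fields such as names, after numeric identifiers have been captured separately.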

Related Work
Earlier work on detecting and recognizing text has addressed recognizing data from handwritten or printed bills and updating it to a database automatically to reduce manual labor. Several methods, such as the EAST algorithm or SVM, are used for such processes [10]. Others use RNNs to recognize text from images [11]. There has also been work in which handwritten character recognition is performed using histogram of oriented gradients features followed by a support vector machine. Pratik Madhukar Manwatkar and Dr. Kavita R. Singh [15] reviewed different methods for extracting characters from images. Their work reports the basic architecture of the text recognition process and the order of the image processing methods used to extract textual content from a scanned image. The work of D.Y. Turdakov [16] is based on a text extraction pipeline for images of varying quality acquired from the internet. It focuses largely on dividing the input images into classes, with preprocessing chosen according to the class. This is followed by text recognition using an OCR engine.

Dataset
The dataset consists of registration and identification documents, such as registration certificates, Aadhaar cards, and PAN cards, which contain key information such as name, gender, date of birth, chassis number, and car number [12]. Aadhaar is a 12-digit unique identity number that residents or passport holders of India can obtain voluntarily, based on their demographic and biometric data. A PAN (Permanent Account Number) is a 10-character alphanumeric identifier issued by the Income Tax Department to taxpayers and is unique to each holder. We used 46 images for our project and ran them through pytesseract for text recognition; the recognized text is then filtered using the glob and regex modules. The filtered, relevant data is stored in the form of an Excel sheet.
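The identifier formats described above lend themselves to regular-expression matching. The sketch below is a minimal illustration, not the authors' exact patterns: it assumes the Aadhaar number appears as 12 digits, optionally split into three groups of four by whitespace, and that the PAN follows the layout of five uppercase letters, four digits, and one uppercase letter. The function name find_ids is hypothetical.

```python
import re
from typing import Dict, List

# 12 digits, optionally printed as three space-separated groups of four.
AADHAAR_RE = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")
# Five uppercase letters, four digits, one uppercase letter.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def find_ids(text: str) -> Dict[str, List[str]]:
    """Return all Aadhaar-like and PAN-like identifiers found in OCR text."""
    # Strip internal whitespace so every Aadhaar match is a plain 12-digit string.
    aadhaar = [re.sub(r"\s", "", match) for match in AADHAAR_RE.findall(text)]
    pan = PAN_RE.findall(text)
    return {"aadhaar": aadhaar, "pan": pan}


if __name__ == "__main__":
    sample = "Govt of India\n1234 5678 9012\nPermanent Account Number ABCDE1234F"
    print(find_ids(sample))
```

Pattern matching of this kind only checks the shape of an identifier; it does not confirm that the number is valid or belongs to the named person.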

OpenCV for OCR
OpenCV is a computer vision and machine learning software library. It has a large number of optimized algorithms that provide important functions such as tracking moving objects, recognizing faces, identifying objects, scanning real images, and processing and analyzing them [7].
Here we have used it to preprocess images before text detection [8]. Pytesseract requires a clean image to detect text, which is why OpenCV has been used.
Optical character recognition (OCR) is the process of detecting and recognizing text in images. It involves scanning and analyzing the image, detecting the textual content, extracting it, and finally translating it into electronic or encoded text [6]. The accuracy of OCR depends mainly on the text processing and segmentation algorithms. It is sometimes difficult to recover textual data from an image for reasons such as differences in size, style, and orientation, or a complex image background. With the world becoming more digitized by the day, the use of OCR has increased immensely in various organizations to cut down traditional workloads. OCR makes it efficient to extract and store information such as chassis number, owner name, and car number, and it can be applied to Aadhaar cards and PAN cards. We first processed the images using OpenCV libraries [9]; we then removed the unwanted data from the extracted data by segmentation, tokenization, and lemmatization. Together, these steps help in recognizing text from images and extracting useful information.

Proposed Methodology
Python-tesseract (pytesseract) is an optical character recognition tool for Python that scans an image and reads the text embedded in it. Tesseract works in a fixed sequence of steps. The first step is adaptive thresholding [1], which converts the image into a binary image. The next step is connected component analysis [2], which extracts character outlines. These outlines are then organized into text lines, which are further analyzed for fixed-pitch or proportional text [2]. Text is divided into words using definite spaces and fuzzy spaces. Recognition then begins, proceeding word by word. Each word that is recognized satisfactorily is passed to an adaptive classifier as training data. The whole processing model of Tesseract follows a conventional step-by-step process, as shown in Figure 3.
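The first two stages named above, adaptive thresholding and connected component analysis, can be illustrated on a tiny grayscale grid. This is a simplified sketch in plain Python and not Tesseract's actual implementation: the local-mean rule, the window radius, the offset, and the use of 4-connectivity are all assumptions chosen for brevity, and the function names are hypothetical.

```python
from collections import deque
from typing import List

Grid = List[List[int]]


def adaptive_threshold(gray: Grid, radius: int = 1, offset: int = 10) -> Grid:
    """Mark a pixel as ink (1) if it is darker than its local mean minus offset."""
    rows, cols = len(gray), len(gray[0])
    binary = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the neighborhood, clipped at the image border.
            window = [
                gray[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            binary[r][c] = 1 if gray[r][c] < local_mean - offset else 0
    return binary


def count_components(binary: Grid) -> int:
    """Count 4-connected groups of ink pixels using breadth-first search."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # A new, unvisited ink pixel starts a new component.
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and binary[nr][nc] == 1 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return components


if __name__ == "__main__":
    # Light background (200) with two dark vertical strokes (50).
    image = [[200] * 7 for _ in range(5)]
    for row in (1, 2, 3):
        image[row][1] = 50
        image[row][5] = 50
    print(count_components(adaptive_threshold(image)))
```

Comparing each pixel with its own neighborhood, rather than with one global cutoff, is what lets thresholding of this kind cope with uneven lighting across a scanned card.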

Results and Comparison with existing methodology
Optical character recognition, through software such as pytesseract, is used to detect and extract data from images. The goal is to extract the printed text and the relevant data from the images and convert them into an Excel sheet. The process starts by passing a folder of images to the Flask API, which uses OCR for text recognition and the glob and regex modules to filter out the relevant data. Our project used 46 images as input, consisting of various Aadhaar cards and PAN cards. The final output is the combined data in the form of an Excel sheet containing the relevant information from those cards. In Table 1, we compare our model with the latest existing models. Our model performs better than the other models on graphical components, but its text component accuracy is slightly lower than that of Seong Ah Chin et al. [18].
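The workflow described above, a folder of images in and one tabular file out, can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' code: the OCR step is passed in as a callable named ocr so that the example runs without Tesseract installed, the output is written as CSV with the standard library as a stand-in for the Excel sheet, and the helper names, file pattern, and column names are hypothetical.

```python
import csv
import glob
import os
import re
from typing import Callable, Dict, List

AADHAAR_RE = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def extract_fields(text: str) -> Dict[str, str]:
    """Pull the first Aadhaar-like and PAN-like identifier out of OCR text."""
    aadhaar = AADHAAR_RE.search(text)
    pan = PAN_RE.search(text)
    return {
        "aadhaar": re.sub(r"\s", "", aadhaar.group()) if aadhaar else "",
        "pan": pan.group() if pan else "",
    }


def process_folder(
    folder: str,
    ocr: Callable[[str], str],
    out_csv: str,
    pattern: str = "*.png",
) -> List[Dict[str, str]]:
    """Run OCR on every matching image and write one row per image."""
    rows: List[Dict[str, str]] = []
    # glob finds the image files; sorting keeps the row order stable.
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        row = {"file": os.path.basename(path)}
        row.update(extract_fields(ocr(path)))
        rows.append(row)
    # CSV is used here as a dependency-free stand-in for the Excel output.
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "aadhaar", "pan"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With Tesseract and Pillow installed, the ocr argument could be something like `lambda path: pytesseract.image_to_string(Image.open(path))`. Keeping the OCR step as a parameter also makes the filtering logic easy to check with known text.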

Conclusion
In conclusion, we have developed a Flask API that extracts text from formatted cards using OpenCV and OCR. We implemented this project using pytesseract. We have presented text recognition and the extraction of relevant data from cards using various software tools. The project aims to examine a method of classifying relevant text from the data into an Excel format. This information is often extracted manually in many organizations; our project automates the task, which could help reduce manual labor and speed up the process. The accuracy of our model is better than that of the existing models of Xiaojing Liu et al. [17] and Seong Ah Chin et al. [18]. The overall accuracy is 95%.