ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese
Description
Disclaimer: This is the version accepted as a full paper at ICDAR 2023.
Optical Character Recognition (OCR) is a technology that enables machines to read and interpret printed or handwritten text from scanned images or photographs. However, the accuracy of OCR systems varies with several factors, such as the quality of the input image, the font used, and the language of the document. In general, OCR algorithms perform better on resource-rich languages, which have more annotated data for training recognition models. In this work, we propose ESTER-Pt, an Evaluation Suite for TExt Recognition in Portuguese. Although Portuguese is one of the most widely spoken languages, OCR for Portuguese remains largely unexplored. Our evaluation suite comprises four types of resources: synthetic text-based documents, synthetic image-based documents, real scanned documents, and a hybrid set of real image-based documents that were synthetically degraded.
Files
ESTER-Pt.zip (19.6 GB)
md5:745b6e5b74357b116a24c92b6c606d02
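The MD5 checksum listed for ESTER-Pt.zip can be used to check that the archive downloaded intact. The following is a minimal sketch in Python, assuming the archive was saved as ESTER-Pt.zip in the current directory. The function names and the local path are illustrative assumptions and are not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing for ESTER-Pt.zip.
EXPECTED_MD5 = "745b6e5b74357b116a24c92b6c606d02"


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of bytes chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield the file at `path` in fixed-size chunks so a large archive
    never has to fit in memory at once."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_chunks(read_in_chunks(path)) == expected


if __name__ == "__main__":
    archive = Path("ESTER-Pt.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat, which matters for an archive of this size. A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.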