Published April 27, 2023 | Version 1.0
Dataset Open

ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese

Description

Disclaimer: Version accepted as a full paper at ICDAR 2023.

Optical Character Recognition (OCR) is a technology that enables machines to read and interpret printed or handwritten text from scanned images or photographs. However, the accuracy of OCR systems varies with several factors, such as the quality of the input image, the font used, and the language of the document. As a general tendency, OCR algorithms perform better in resource-rich languages, which have more annotated data for training recognition models. In this work, we propose ESTER-Pt, an Evaluation Suite for TExt Recognition in Portuguese. Although Portuguese is one of the most widely spoken languages, OCR in Portuguese remains largely unexplored. Our evaluation suite comprises four types of resources: synthetic text-based documents, synthetic image-based documents, real scanned documents, and a hybrid set of real image-based documents that were synthetically degraded.

Files

Name: ESTER-Pt.zip
Size: 19.6 GB
MD5: 745b6e5b74357b116a24c92b6c606d02
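
Because the archive is large (19.6 GB), it may be worth checking a download against the MD5 checksum listed for ESTER-Pt.zip before unpacking it. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_archive` and the local path `ESTER-Pt.zip` are assumptions made for illustration and are not part of the dataset.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed on the record page for ESTER-Pt.zip.
EXPECTED_MD5 = "745b6e5b74357b116a24c92b6c606d02"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical download location; adjust to wherever the file was saved.
    archive = Path("ESTER-Pt.zip")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.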