Is your OCR good enough? A comprehensive assessment of the impact of OCR quality on downstream tasks

Giovanni Colavizza; Mirjam Cuper

doi:10.5281/zenodo.4498186

Published February 3, 2021 | Version v1

Dataset Open

1. University of Amsterdam
2. National Library of the Netherlands

Is an average OCR quality of 70% enough for my study? What OCR quality should we ask from external suppliers? Should we re-do the OCR of our collections to bring it from 80% to 85%? Libraries and researchers alike face the same dilemma in our times of textual abundance: when is OCR quality good enough? User access, scientific results and the investment of limited resources increasingly depend on answering this question.

This project focuses on a comprehensive assessment of the impact of OCR quality in Dutch newspaper, journal and book collections, comparing it with published results for English and French. This is done via extrinsic evaluation: assessing results from a set of representative downstream tasks, such as text classification or clustering. The ultimate goal of the project is to contribute guidelines detailing when OCR quality is to be considered good enough, in order to inform the development and use of textual collections.

The datasets released here are described in this Wiki page. Please refer to the project's repository for more information.

Files

Name	Size	MD5
data_frames_evaluation.zip	440.4 MB	d1465593da6162cbbc3063e997b63c07

