Is your OCR good enough? Probably so. An assessment of the impact of OCR quality on downstream tasks for Dutch texts

Todorov, Konstantin; Cuper, Mirjam; Colavizza, Giovanni

doi:10.5281/zenodo.4843629

Published May 29, 2021 | Version v1

Conference paper, open access

1. University of Amsterdam
2. National Library of the Netherlands

We conduct an assessment of the impact of OCR quality in collections in Dutch, considering two tasks: document classification and document clustering via topic modelling. We find that for both topic modelling (using LDA) and document classification (using a variety of methods, including deep neural networks), working with an OCRed version of a corpus does not in general compromise results. On the contrary, it may sometimes lead to better results. While more work is needed, including on evaluating different datasets and methods, our results further confirm previous work in suggesting that the quality of existing OCR is often sufficient to apply machine learning techniques.

Files

Name	Size
DHBenelux 2021 abstract.pdf (md5:e2dac7eebd4c96b1c6f0893bcd9d9bb9)	138.6 kB

	All versions	This version
Views	105	105
Downloads	59	59
Data volume	8.7 MB	8.7 MB

DOI

10.5281/zenodo.4843629

Resource type

Conference paper

Publisher

Zenodo

Conference

DH Benelux 2020 #GoesOnline, World Wide Web, 2-4 June 2021

Languages

English

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: May 29, 2021
Modified: July 19, 2024
