Published May 29, 2021 | Version v1
Conference paper Open

Is your OCR good enough? Probably so. An assessment of the impact of OCR quality on downstream tasks for Dutch texts

  • 1. University of Amsterdam
  • 2. National Library of the Netherlands

Description

We assess the impact of OCR quality on Dutch-language collections, considering two downstream tasks: document classification and document clustering via topic modelling. We find that for both topic modelling (using LDA) and document classification (using a variety of methods, including deep neural networks), working with an OCRed version of a corpus does not in general compromise results. On the contrary, it may sometimes lead to better results. While more work is needed, including evaluation on different datasets and methods, our results corroborate previous work suggesting that the quality of existing OCR is often sufficient for applying machine learning techniques.

Files

DHBenelux 2021 abstract.pdf (138.6 kB)
md5:e2dac7eebd4c96b1c6f0893bcd9d9bb9