Published June 1, 2022 | Version v1
Presentation Open

Evaluating the performance and usability of a Tesseract-based OCR workflow on French-Dutch bilingual historical sources

Description

The study of texts using a qualitative approach remains the dominant modus operandi in humanities research (D. Nguyen et al., 2020). While most humanities researchers emphasize the critical examination of texts, digital research methodologies are gradually being adopted as complementary options (Levenberg et al., 2018). These computational practices allow researchers to process, aggregate and analyze large quantities of texts. Analytical techniques can help humanities scholars uncover principles and patterns that were previously hidden, or identify salient sources for further qualitative research (Bod, 2013; Aiello & Simeone, 2019). However, to support these and more advanced use cases such as Natural Language Processing (NLP), sources must be digitized and transformed into a machine-readable format through Optical Character Recognition (OCR) (Lopresti, 2009).

 

Although OCR software is frequently used to convert analogue sources into digital texts, off-the-shelf OCR tools are usually less well adapted to historical sources, which leads to errors in text transcription (Martínek et al., 2020; Nguyen et al., 2021; Smith & Cordell, 2018). Another disadvantage of these tools is that they are very susceptible to noise, resulting in relatively low text detection accuracy. Methods of digital text analysis have the potential to further expand the field of humanities (Blevins & Robichaud, 2011; Kuhn, 2019; Nguyen et al., 2021). However, as OCR quality has a profound impact on these methods, it is important that OCR-generated text is as accurate as possible to avoid bias (Traub et al., 2015; Strien et al., 2020). Adapting OCR systems to distinct historical sources is expensive and time-consuming, and the technical knowledge required to (re)train OCR models is often perceived as a hurdle by humanists (Nguyen et al., 2021; Smith & Cordell, 2018). Consequently, research efforts are often geared towards improving the output of off-the-shelf OCR tools through a process of error analysis and post-correction (Nguyen et al., 2019). These efforts have resulted in streamlined, domain-specific OCR workflows including OCR4all, eScriptorium and OCR-D (Reul et al., 2019; Kiessling et al., 2019; Neudecker et al., 2019). Despite these efforts, few OCR workflows exist for non-English and multilingual texts (Strien et al., 2020; Reynaert et al., 2020).


In this short paper we present an OCR workflow that offers a user-friendly solution for bilingual historical texts. We test it on a corpus of art exhibition catalogs from INSERT EXACT PERIOD. These texts date from the 19th and 20th centuries, a period marked by a major expansion of the printed word, which makes OCR highly meaningful, as manually processing these texts would be very laborious (Taunton, 2014). The catalogs record the works present at specific exhibitions, the so-called salontentoonstellingen, which were held from 1792 to 1914 in Antwerp, Ghent and Brussels. They are bilingual printed texts, in French and Dutch.
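As an illustration of how the accuracy of OCR-generated text might be quantified, the sketch below computes a character error rate (CER): the edit distance between a manually transcribed reference and the OCR output, divided by the length of the reference. This is an assumption made for illustration only; the abstract does not state which evaluation metric the workflow uses, and the function names and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical example: a Dutch word in which OCR misread "ij" as "y".
print(character_error_rate("schilderij", "schildery"))  # 0.2
```

A lower value indicates output closer to the reference. Because the measure works on characters, it applies equally to the French and the Dutch portions of a page.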

Files

DH Benelux 2022 Abstract-1.pdf

Files (4.4 MB)

Checksum Size
md5:409f53f70cb5d2c83bfe83015bf6700a 216.1 kB
md5:286cd0b0afd84aadff47b28e12e384b6 4.2 MB