10.5281/zenodo.6574958
https://zenodo.org/records/6574958
oai:zenodo.org:6574958
Nynke van 't Hof
University of Amsterdam
Vera Provatorova
University of Amsterdam
Mirjam Cuper
0000-0003-0187-9873
KB, National Library of the Netherlands
Evangelos Kanoulas
0000-0002-8312-0694
University of Amsterdam
OCR error detection and post-correction with Word2vec and BERTje on Dutch historical data
Zenodo
2022
OCR post-correction
Natural Language Processing
Word Embedding Models
historical data
digital heritage
2022-05-23
eng
Poster
10.5281/zenodo.6574957
https://zenodo.org/communities/dhbenelux2022
Creative Commons Attribution 4.0 International
High-quality OCR (Optical Character Recognition) output has many benefits: documents become more accessible to readers, and NLP tasks can make better use of the data. However, for many reasons, such as the physical condition of the documents, the OCR output of historical documents contains a significant number of errors. This study focuses on detecting and correcting these errors after the OCR process has taken place, specifically for Dutch historical data. It compares the performance of two methods that have often been used for this task in recent years: word2vec and BERT. While BERT has been shown to substantially outperform word2vec on OCR post-correction, the reasons behind this performance gap remain under-explored. Several general pitfalls of word2vec were identified from the related literature. This study attempts to find where these pitfalls occur and to determine whether BERT is affected by them less (or more) than word2vec. This will give insight not only into the advantages and disadvantages of these word embeddings for OCR post-correction on historical data, but also into the application of state-of-the-art methods to historical data, which these methods have often not been trained on or specifically designed for.