Extracting research data from historical documents with eScriptorium and Python

Kamlah, Jan; Schmidt, Thomas; Shigapov, Renat

doi:10.5281/zenodo.7373135

Published November 28, 2022 | Version v1

Presentation Open


This talk presents a workflow based on eScriptorium and Python for extracting research data from historical documents. eScriptorium is a relatively new transcription tool built on the OCR engine Kraken. Through training, the software allows both text recognition and layout recognition to be adapted to the source material. Because research data must meet high quality requirements, this step is necessary in many cases. Starting from existing base models greatly reduces the training effort. The text recognition results can then be exported in PAGE-XML format for further processing. To process these exports, the Python tool “blatt” was developed within the project. It can parse the PAGE-XML exports, sort and extract the contents using algorithms and templates, and convert them into a structured table format such as CSV. The first part of the presentation gives a short introduction to the topic, the source material, and the research question. We then show how to train from a base model with minimal training data in eScriptorium, and which problems to watch for. The last section presents the Python tool “blatt” together with its underlying ideas and algorithms.
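The parse, sort, and convert steps described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the “blatt” tool or its API. The function names `extract_lines` and `to_csv`, the sample document, and the simple top-to-bottom, left-to-right ordering are all assumptions made for illustration; real exports usually need layout-aware sorting and templates, as the abstract notes.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced PAGE-XML document (not taken from the project).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1000">
    <TextRegion id="r1">
      <Coords points="90,190 410,190 410,350 90,350"/>
      <TextLine id="l2">
        <Coords points="100,300 400,300 400,340 100,340"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l1">
        <Coords points="100,200 400,200 400,240 100,240"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Return an element's tag name without its XML namespace."""
    return tag.rsplit("}", 1)[-1]


def child(element, name):
    """Return the first direct child with the given local name, or None."""
    return next((c for c in element if local(c.tag) == name), None)


def extract_lines(page_xml):
    """Collect (x, y, text) for every TextLine in a PAGE-XML string."""
    root = ET.fromstring(page_xml)
    rows = []
    for line in root.iter():
        if local(line.tag) != "TextLine":
            continue
        coords = child(line, "Coords")
        equiv = child(line, "TextEquiv")
        if coords is None or equiv is None:
            continue
        unicode_el = child(equiv, "Unicode")
        text = (unicode_el.text or "").strip() if unicode_el is not None else ""
        # "points" is a space-separated list of "x,y" pairs.
        points = [
            tuple(int(float(v)) for v in pair.split(","))
            for pair in coords.get("points", "").split()
        ]
        if not points:
            continue
        # Use the top-left corner of the bounding box as the sort position.
        x = min(p[0] for p in points)
        y = min(p[1] for p in points)
        rows.append((x, y, text))
    # Assumed reading order: top to bottom, then left to right.
    rows.sort(key=lambda r: (r[1], r[0]))
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["x", "y", "text"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_lines(SAMPLE)))
```

Matching elements by local name keeps the sketch independent of the PAGE schema version used in the export, since the namespace URI differs between versions.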

Files (4.1 MB)

Name	MD5	Size
NFDI-Workshop-Research-Data-Maschinenindustrie-DE.pdf	3923796a6b4230070cbd37a579b83e89	2.0 MB
NFDI-Workshop-Research-Data-Maschinenindustrie-EN.pdf	3341d90c8cca41fd27db87b19a4d4844	2.0 MB

