Digitisation and automated transcription of historical auction catalogues
Description
The Royal Museum of Fine Arts Antwerp (KMSKA) conducted a pilot project with the AI software eScriptorium to investigate how historical auction catalogues with handwritten notes could be made digitally accessible. The pilot covered 65 digitised catalogues and focused on two processes: transcription and segmentation.
Transcription converts printed and handwritten text into machine-readable text. Initially, we used a model trained on 19th-century French. This worked reasonably well for printed text but failed on handwritten text. Through trial and error, we arrived at a third model based only on the essential types of data: lot number, artist name, artwork title, purchaser, and price. This model proved more efficient for both printed and handwritten text.
Segmentation is based on page layout analysis and aims to identify the different elements of a page, in our case the same five types of data. To do this, we linked the text in each segment to a category of data. Initially, we used the so-called ‘regions’ in eScriptorium to assign the categories manually, but this led to poor performance. An alternative approach, categorisation at the ‘baseline’ level, gave better results and was therefore adopted.
A Python script then converted the output of eScriptorium to a structured Excel file. Further development of the project can focus on refining the transcription and segmentation models for more accurate processing and categorization of printed and handwritten text. A collaboration with the Getty Research Institute offers opportunities for knowledge sharing and integration of the digitized auction catalogues as linked open data, increasing their impact and accessibility.
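The conversion step mentioned above can be illustrated with a minimal sketch. It is not the project's actual script. It assumes that the eScriptorium output is an ALTO XML export in which each categorised baseline is a `TextLine` whose `TAGREFS` attribute points to an `OtherTag` carrying the category in its `LABEL`. The five label names in `CATEGORIES` are hypothetical, as is the rule that every lot begins with a lot-number line. To stay within the standard library, the sketch writes CSV rather than Excel.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical labels for the five data types; the project's real tag names are not documented here.
CATEGORIES = ["lot_number", "artist_name", "artwork_title", "purchaser", "price"]


def _local(tag):
    """Return the tag name without its XML namespace, so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_to_rows(alto_xml):
    """Turn categorised text lines of one ALTO page into one dict per auction lot."""
    root = ET.fromstring(alto_xml)

    # Map tag IDs (e.g. "LT1") to their category labels.
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if _local(el.tag) == "OtherTag"
    }

    rows = []
    current = None
    for line in root.iter():
        if _local(line.tag) != "TextLine":
            continue
        refs = (line.get("TAGREFS") or "").split()
        category = next(
            (labels[r] for r in refs if labels.get(r) in CATEGORIES), None
        )
        if category is None:
            continue  # untagged or unknown line type: ignored in this sketch

        text = " ".join(
            s.get("CONTENT", "") for s in line.iter() if _local(s.tag) == "String"
        ).strip()

        # Assumption: a lot-number line opens a new lot (a new spreadsheet row).
        if category == "lot_number" or current is None:
            current = dict.fromkeys(CATEGORIES, "")
            rows.append(current)
        current[category] = (current[category] + " " + text).strip()
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with one column per category."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CATEGORIES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

To obtain an Excel file instead, the same rows could be passed to pandas, for example `pandas.DataFrame(rows).to_excel("lots.xlsx", index=False)`, which requires an Excel writer such as openpyxl to be installed.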
Files
24_Poster_VeilingcatalogiKMSKA.pdf (306.9 kB, md5:fd357216330198c058ce43b3202fa971)