Published May 7, 2025 | Version v1.0
Presentation Open

Leveraging AI for Enhanced Archaeological Data Extraction: Workflows for Textual and Image-Based Data

  • 1. Czech Academy of Sciences, Institute of Archaeology, Brno
  • 2. Czech Academy of Sciences, Institute of Archaeology, Prague
  • 3. Charles University

Description

Presentation from a talk given at the CAA2025 Digital Horizons conference in Session 19, Reusable Digital Research Workflows for Archaeology.

Abstract (En)

The digitization of archaeological archives, particularly grey literature and archival photographs, holds immense potential for knowledge discovery. However, manual processing of such data is labour-intensive and often lacks consistency, making it a prime candidate for automation. This paper presents a pilot implementation of reusable digital research workflows that integrate text and image recognition technologies and AI models to streamline the analysis of archaeological documentation. These workflows are being developed to enhance (meta)data quality in the Archaeological Map of the Czech Republic (AMCR) digital repository and the ARIADNE Knowledge Base and discovery service.

Our approach to textual data leverages OCR/HTR and NLP tools to process archival reports, generating machine-readable text from a combination of manuscripts, typescripts, and printed materials. Through AI-driven information extraction techniques, we prepare models for automated segmentation and OCR/HTR processing of documents. These are implemented through the e-Scriptorium service and a newly developed dashboard. Based on the recognition outputs, LINDAT/CLARIAH-CZ open-source tools are applied for enhanced full-text search (tokenization, tagging, lemmatization, etc.; UDPipe), keyword identification (KER), and named entity recognition (personal and place names, temporal data, AMCR vocabulary terms, identifiers, etc.; NameTag). The goal is to provide an integrated solution that enables the processing of legacy data and new uploads to the AMCR system and offers users more efficient services for searching and processing documents. A secondary objective is to simplify archival procedures by automating some of the steps involved in describing and archiving documents.
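As a rough illustration of the full-text search step, the sketch below builds a lemma index from CoNLL-U, the token-per-line format that UDPipe can output. It is a minimal sketch under stated assumptions, not the project's implementation: the sample text is hand-written for illustration (it is not real UDPipe output), and the names `parse_conllu`, `build_lemma_index`, and `keep_upos` are hypothetical.

```python
from collections import defaultdict

# Hand-written CoNLL-U sample for illustration only; not real UDPipe output.
SAMPLE_CONLLU = """\
# sent_id = 1
# text = Graves were excavated.
1\tGraves\tgrave\tNOUN\t_\t_\t3\tnsubj:pass\t_\t_
2\twere\tbe\tAUX\t_\t_\t3\taux:pass\t_\t_
3\texcavated\texcavate\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def parse_conllu(text):
    """Yield (sentence_id, form, lemma, upos) for each word line."""
    sent_id = None
    for line in text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # Skip malformed lines, multiword-token ranges ("1-2")
            # and empty nodes ("1.1"); keep plain integer IDs only.
            if len(cols) != 10 or not cols[0].isdigit():
                continue
            yield sent_id, cols[1], cols[2], cols[3]


def build_lemma_index(text, keep_upos=("NOUN", "PROPN", "VERB", "ADJ")):
    """Map each content-word lemma to the set of sentence IDs containing it."""
    index = defaultdict(set)
    for sent_id, _form, lemma, upos in parse_conllu(text):
        if upos in keep_upos:
            index[lemma.lower()].add(sent_id)
    return dict(index)


if __name__ == "__main__":
    print(build_lemma_index(SAMPLE_CONLLU))
```

Indexing lemmas rather than surface forms lets a query for one form of a word match its inflected variants, which matters for a highly inflected language such as Czech.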

Simultaneously, we implement an object recognition workflow for the detection and classification of archaeological objects, i.e. artefacts and other objects of interest, in archival photographs. By adapting and fine-tuning deep learning models (e.g., ResNet) for archaeology, we segment and annotate archival photographs according to AMCR controlled vocabularies. Two types of image datasets are used: first, images of single finds, often photographed on standardised backgrounds with scales; and second, images with varied content, including fieldwork photographs of trenches, burials, etc. Mappings of the vocabularies used across the datasets to Getty AAT terms ensure interoperability in the context of the ARIADNE infrastructure. This workflow streamlines the annotation of archival photographs with terms from domain-specific controlled vocabularies and allows identification of archaeological artefacts and other objects of interest. This simplifies the otherwise time-consuming task of creating metadata and, at the same time, opens new doors for connecting and cross-referencing image data with textual data, e.g. the grey literature find reports.
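The vocabulary mapping step can be pictured as a simple crosswalk applied to classifier output. The following is a minimal sketch under stated assumptions, not the project's implementation: the labels, the `<AAT-ID-...>` placeholders, the `min_score` threshold, and the names `CROSSWALK` and `annotate` are all hypothetical, and no real AMCR terms or AAT identifiers are reproduced here.

```python
# Getty AAT concepts are published as URIs under this base.
AAT_BASE = "http://vocab.getty.edu/aat/"

# Hypothetical crosswalk from local vocabulary labels to AAT concept URIs.
# The labels and "<AAT-ID-...>" placeholders are illustrative only.
CROSSWALK = {
    "vessel": AAT_BASE + "<AAT-ID-vessel>",
    "axe": AAT_BASE + "<AAT-ID-axe>",
}


def annotate(predictions, crosswalk, min_score=0.5):
    """Attach an AAT URI to each sufficiently confident prediction.

    predictions: iterable of (label, score) pairs from an image classifier.
    Returns (annotations, unmapped_labels); labels without a crosswalk
    entry keep aat_uri=None and are reported for manual review.
    """
    annotations, unmapped = [], set()
    for label, score in predictions:
        if score < min_score:
            continue  # discard low-confidence predictions
        uri = crosswalk.get(label)
        if uri is None:
            unmapped.add(label)
        annotations.append({"label": label, "score": score, "aat_uri": uri})
    return annotations, sorted(unmapped)


if __name__ == "__main__":
    preds = [("vessel", 0.9), ("trench", 0.8), ("axe", 0.2)]
    print(annotate(preds, CROSSWALK))
```

Reporting unmapped labels separately, rather than dropping them, keeps gaps in the crosswalk visible so they can be resolved by a curator.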

The talk summarises the journey towards the implementation of both workflows and discusses what has worked so far and what has not, including the dead ends we encountered and what we learned along the way. The current state of the workflows’ implementation will be demonstrated on pilot results based on archival textual and image documents, showcasing how AI technologies can enhance the processing of archaeological archives and foster further research.

Files

01 pajdla_etal_caa2025.pdf

Files (18.4 MB)

md5:256841f8f95f78ba04e1b5028df1270f
18.4 MB

Additional details

Funding

European Commission
ATRIUM – Advancing FronTier Research In the Arts and hUManities 101132163

Dates

Available
2025-06-03