Published May 6, 2026 | Version v1
Poster | Open Access

A Multimodal Workflow for Synchronizing Archaeological Track Data in Aldène Cave (CAA 2026 Vienna)

  • 1. Friedrich-Alexander-Universität Erlangen-Nürnberg
  • 2. Friedrich-Alexander-Universität Erlangen-Nürnberg, Institut für Ur- und Frühgeschichte

Contributors

Data manager:

Supervisor:

  • 1. Friedrich-Alexander-Universität Erlangen-Nürnberg

Description

This project presents a reproducible multimodal workflow for processing and integrating fragmented archaeological field recordings collected during the in-cave documentation of Mesolithic human footprints. The dataset consists of GoPro video segments, externally recorded high-quality audio, and multilingual transcripts stored as ELAN annotation files with time-aligned linguistic annotations and translations. During fieldwork, experts used red and green laser pointers to indicate specific footprints and movement patterns on the cave floor, linking their verbal interpretations directly to spatial features.

Due to technical constraints during data acquisition, video and audio were captured on separate devices, resulting in temporal misalignment between the modalities. Importantly, the transcripts were created from the external audio recordings rather than from the GoPro video audio. As a result, the transcript timing corresponded to the external audio stream, while the video's audio track was temporally offset, making it difficult to view the transcripts in alignment with the video.

To address this issue, synchronization was performed by aligning the external audio stream to the video timeline. A key step in the workflow is the automatic detection of hand-clap events recorded during fieldwork. Two complementary approaches were combined for robust detection: amplitude peak thresholding to identify high-energy signal peaks, and frequency-based filtering to isolate the characteristic acoustic signature of clap sounds. The detected events serve as temporal anchor points, enabling accurate estimation of the offset between the external audio and the video audio. The calculated offset was then applied programmatically to shift the external audio and all dependent data (including transcripts and subtitles), ensuring that the final output is fully synchronized with the video.
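
As an illustration of this step, the following Python sketch combines the two detection approaches: a band-pass filter isolates a clap-like frequency range, and an amplitude threshold with a minimum gap picks out clap onsets. The file names, frequency band, and threshold values are placeholders, not the project's actual parameters.

    # Illustrative sketch, not the project's actual script. Assumes numpy,
    # scipy, and the 'soundfile' package are installed.
    import numpy as np
    import soundfile as sf
    from scipy.signal import butter, sosfiltfilt

    def detect_claps(path, band=(2000.0, 6000.0), rel_threshold=0.6, min_gap_s=0.5):
        """Return clap onset times in seconds: band-pass filtering isolates
        the clap's spectral signature, then well-separated amplitude peaks
        above a relative threshold are kept as clap events."""
        audio, sr = sf.read(path)
        if audio.ndim > 1:                      # mix down to mono
            audio = audio.mean(axis=1)
        sos = butter(4, band, btype="bandpass", fs=sr, output="sos")
        envelope = np.abs(sosfiltfilt(sos, audio))
        threshold = rel_threshold * envelope.max()
        claps, last = [], -np.inf
        for idx in np.flatnonzero(envelope > threshold):
            t = idx / sr
            if t - last >= min_gap_s:           # keep one event per clap
                claps.append(t)
                last = t
        return claps

    # Offset from the first clap heard on both recordings: adding it to
    # external-audio timestamps places them on the video timeline.
    ext_claps = detect_claps("external_audio.wav")
    cam_claps = detect_claps("gopro_audio.wav")
    offset_s = cam_claps[0] - ext_claps[0]
    print(f"shift external audio by {offset_s:+.3f} s")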

In parallel, the ELAN annotation files were parsed as machine-readable XML to extract timestamps, speaker labels, and multilingual transcript content. The extracted timestamps were then shifted by the calculated audio-video offset, ensuring that the generated subtitles are fully aligned with the video. Finally, subtitle files were generated automatically in multiple versions: Ju’hoan, English, and bilingual.
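
Since ELAN's .eaf files are XML, the parsing and subtitle-generation steps can be sketched roughly as below. The tier name, offset value, and the decision to skip annotations without explicit time values are illustrative assumptions, not the project's exact implementation.

    # Rough sketch of EAF parsing and offset-shifted SRT export.
    import xml.etree.ElementTree as ET

    def fmt_srt(ms):
        """Format milliseconds as an SRT timestamp HH:MM:SS,mmm."""
        ms = max(int(ms), 0)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def eaf_to_srt(eaf_path, srt_path, tier_id, offset_s=0.0):
        root = ET.parse(eaf_path).getroot()
        # Map TIME_SLOT ids to milliseconds.
        slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
                 for ts in root.iterfind("TIME_ORDER/TIME_SLOT")
                 if ts.get("TIME_VALUE") is not None}
        shift = int(offset_s * 1000)
        cues = []
        for tier in root.iterfind("TIER"):
            if tier.get("TIER_ID") != tier_id:
                continue
            for ann in tier.iterfind("ANNOTATION/ALIGNABLE_ANNOTATION"):
                ref1 = ann.get("TIME_SLOT_REF1")
                ref2 = ann.get("TIME_SLOT_REF2")
                if ref1 not in slots or ref2 not in slots:
                    continue                    # no explicit alignment
                text = ann.findtext("ANNOTATION_VALUE", default="")
                cues.append((slots[ref1] + shift, slots[ref2] + shift, text))
        cues.sort()
        with open(srt_path, "w", encoding="utf-8") as out:
            for i, (start, end, text) in enumerate(cues, 1):
                out.write(f"{i}\n{fmt_srt(start)} --> {fmt_srt(end)}\n{text}\n\n")

    eaf_to_srt("session1.eaf", "session1_en.srt", tier_id="English", offset_s=2.417)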

To enhance the visual clarity of the laser pointers, a computer-vision laser-enhancement pipeline was implemented. The red and green laser points were detected using HSV colour filtering, contour detection, and size-based thresholding. Since the original laser signals are often weak or partially occluded, the detected regions were enhanced and rendered as visual overlays, making them clearly visible in the video frames. This strengthens the link between spoken interpretations and the corresponding visual features.
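
A minimal OpenCV sketch of this kind of detection-and-overlay step is shown below; the HSV bounds and size limits are assumptions that would need tuning to the actual footage.

    # Minimal laser-enhancement sketch with OpenCV. Red needs two hue
    # ranges because its hue wraps around 0/180 in OpenCV's HSV space.
    import cv2
    import numpy as np

    RANGES = {
        "red":   [((0, 120, 180), (10, 255, 255)),
                  ((170, 120, 180), (180, 255, 255))],
        "green": [((45, 100, 180), (85, 255, 255))],
    }
    OVERLAY = {"red": (0, 0, 255), "green": (0, 255, 0)}    # BGR colours

    def enhance_lasers(frame, min_area=3, max_area=400):
        """Detect small saturated red/green blobs and draw bold rings
        around them so the laser points stay visible in the output."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        out = frame.copy()
        for name, ranges in RANGES.items():
            mask = np.zeros(hsv.shape[:2], dtype=np.uint8)
            for lo, hi in ranges:
                mask |= cv2.inRange(hsv, np.array(lo), np.array(hi))
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_SIMPLE)
            for c in contours:
                if not (min_area <= cv2.contourArea(c) <= max_area):
                    continue                    # reject noise and glare
                (x, y), r = cv2.minEnclosingCircle(c)
                cv2.circle(out, (int(x), int(y)), int(r) + 6,
                           OVERLAY[name], thickness=2)
        return out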

In the final stage, all processed components were integrated into structured multimedia outputs using FFmpeg. Each output file includes synchronized video, dual audio tracks (the original GoPro audio and the aligned external audio), multiple subtitle streams, and the enhanced laser visualizations. Additionally, a chapter-based navigation structure was generated from the video segment names, supporting seamless transitions between cave sectors and recording parts and facilitating analysis across the segmented recordings.
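
A hedged example of what such an FFmpeg mux could look like, invoked from Python: the file names, stream titles, subtitle language tags, and the MKV container are illustrative, and chapters are assumed to come from an FFMETADATA text file (";FFMETADATA1" header with [CHAPTER] sections).

    # Illustrative final mux, not the project's exact command.
    import subprocess

    cmd = [
        "ffmpeg", "-y",
        "-i", "enhanced_video.mp4",          # 0: video with laser overlays
        "-i", "gopro_audio.wav",             # 1: original camera audio
        "-i", "external_audio_shifted.wav",  # 2: aligned external audio
        "-i", "session1_juhoan.srt",         # 3: Ju'hoan subtitles
        "-i", "session1_en.srt",             # 4: English subtitles
        "-i", "chapters.ffmetadata",         # 5: chapter definitions
        "-map", "0:v", "-map", "1:a", "-map", "2:a", "-map", "3", "-map", "4",
        "-map_metadata", "5",                # apply chapters/metadata from 5
        "-c:v", "copy", "-c:a", "aac", "-c:s", "srt",
        "-metadata:s:a:0", "title=GoPro audio",
        "-metadata:s:a:1", "title=External audio (aligned)",
        "-metadata:s:s:0", "language=ktz",   # assumed ISO 639-3 tag for Ju'hoan
        "-metadata:s:s:1", "language=eng",
        "output_sector1.mkv",
    ]
    subprocess.run(cmd, check=True)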

The workflow follows a modular and script-based architecture, where synchronization, annotation parsing, subtitle generation, laser detection, and multimedia encoding are handled as independent steps. This design ensures reproducibility, transparency, and flexibility, allowing individual components to be adapted or improved without affecting the entire pipeline.
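
One way to express that design, purely as an illustration: each stage is an independent callable with explicit inputs and outputs, and a thin driver chains them, so any stage can be rerun or swapped without touching the rest. All names here are hypothetical, and identity-style placeholders keep the sketch runnable.

    # Hypothetical modular-pipeline sketch.
    from typing import Callable, Dict

    Stage = Callable[[dict], dict]

    def make_pipeline(stages: Dict[str, Stage]) -> Stage:
        def run(state: dict) -> dict:
            for name, stage in stages.items():
                state = stage(state)            # each stage extends the state
                print(f"finished stage: {name}")
            return state
        return run

    pipeline = make_pipeline({
        "synchronize": lambda s: {**s, "offset_s": 2.417},
        "subtitles":   lambda s: {**s, "srt_files": ["session1_en.srt"]},
        "lasers":      lambda s: {**s, "video": "enhanced_video.mp4"},
        "encode":      lambda s: {**s, "output": "output_sector1.mkv"},
    })
    print(pipeline({"video": "gopro_video.mp4"}))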

By integrating audio, video, textual annotations, and spatial indicators into a unified and temporally aligned dataset, this approach enhances the interpretability and usability of complex archaeological field data. It also provides a practical framework for incorporating qualitative expert knowledge into structured, data-driven research workflows and can be adapted to other multimodal documentation scenarios in archaeology and cultural heritage contexts.

Files (1.6 MB)

Piroozfar_caa2026_20260331.pdf
1.6 MB | md5:f5c47ceed9a36d9257516475ea36f26b

Additional details

Funding

Deutsche Forschungsgemeinschaft
The Volp Caves: Contextualisation of Palaeolithic Rock Art (project no. 522090020)

Software