Published March 24, 2021 | Version v1
Project deliverable | Open

D4.4 Report on Cross-Lingual Content Retrieval Based on Automatic Translation

  • 1. University of Helsinki
  • 2. Aalto University
  • 3. Limecraft

Description

In this deliverable, we report on our automatic content retrieval experiments and their implications for improving the discoverability of archive content. The focus is on cross-lingual retrieval, with additional tests on cross-modal retrieval.

First, we introduce the methods we used to simulate a realistic mixed-language media archive using the raw data from a publicly available collection of annotated images. We discuss the ways in which automatic content retrieval on this archive parallels or diverges from content search in the MeMAD prototype platform (Limecraft Flow), to clarify the extent to which they overlap. Afterwards, we describe how we further processed the data, drawing on our expertise in machine translation and image processing to enrich the archive and improve content retrieval performance. Next, we present our experimental findings from using textual metadata translations and automatically generated image captions to expand the metadata, as well as our tests on retrieval that goes beyond simple textual search queries. Our findings unequivocally validate the utility of metadata translations for cross-lingual content retrieval, and point to additional avenues for cross-modal and multimodal retrieval methods. We describe these findings in detail alongside the empirical scores obtained from our own evaluations, and conclude the report with our general impressions and the lessons learned from this study.

Files

D4.4-Report on Cross-Lingual Content Retrieval Based on Automatic Translation.pdf

Additional details

Funding

European Commission
MeMAD - Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy (grant agreement 780069)