Published November 4, 2024 | Version v1
Conference paper · Open Access

Translating speech with just images

  • 1. POLITEHNICA Bucharest
  • 2. Stellenbosch University

Description

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach enables speech translation with just images: the audio is in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to learn in the low-resource regime. To limit overfitting, we find it essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
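The two-stage idea in the abstract, mapping speech into an image-embedding space and then decoding text from that space with an existing captioner, can be sketched with hypothetical stand-in components. Every function below is a toy placeholder for illustration only, not the paper's actual models:

```python
# Toy sketch of composing a visually grounded speech encoder with an
# image captioner. All components are hypothetical stand-ins.

def speech_to_image_embedding(audio_features):
    """Stand-in for a visually grounded speech encoder: maps a sequence
    of audio feature frames into the image-embedding space (here, just
    the per-dimension average of the frames)."""
    n = len(audio_features)
    dim = len(audio_features[0])
    return [sum(frame[d] for frame in audio_features) / n for d in range(dim)]

def caption_from_embedding(embedding):
    """Stand-in for a pretrained image captioner's decoder. A real
    captioner decodes text from the embedding; this placeholder returns
    a canned caption just to show the interface."""
    return "a person is speaking" if sum(embedding) >= 0 else "silence"

def translate_speech(audio_features):
    """Compose the two stages: speech -> image embedding -> caption.
    Because the captioner emits English text, audio in another language
    (e.g. Yorùbá) yields an English description, i.e. a loose translation."""
    emb = speech_to_image_embedding(audio_features)
    return caption_from_embedding(emb)

fake_audio = [[0.2, 0.1], [0.4, 0.3]]  # toy "audio features"
print(translate_speech(fake_audio))    # prints the canned English caption
```

The point of the composition is that no paired speech–text data is needed: only the speech–image link and the image–text link are trained, and text supervision comes for free from the captioner.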

Files

oneata24_interspeech.pdf (662.0 kB)
md5:47b9327b7b566d6a497d12758679607f

Additional details

Funding

European Commission
AI4TRUST – AI-based-technologies for trustworthy solutions against disinformation (grant 101070190)