Published September 21, 2023 | Version v1
Publication · Open

Weakly-supervised Automated Audio Captioning via text only training

  • 1. Institute for Language and Speech Processing

Description

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, a task known as Automated Audio Captioning (AAC). However, collecting a sufficient number of paired audio clips and captions is labor-intensive and time-consuming. Motivated by recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach that trains an AAC model using only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP: during training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embedding instead. To mitigate the modality gap between the audio and text embeddings, we employ strategies to bridge the gap during both the training and inference stages. We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to ~ compared to fully supervised approaches trained with paired target data.
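The training-time side of the described pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: random unit vectors stand in for real CLAP text embeddings, the embedding dimension (512) and noise scale are illustrative, and Gaussian noise injection is assumed as the gap-bridging strategy applied to text embeddings before they are fed to the caption decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    """L2-normalize embeddings along the last axis, as CLAP-style encoders do."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def inject_noise(text_emb, std=0.015):
    """Perturb text embeddings with Gaussian noise so a decoder trained on them
    tolerates the shift to audio embeddings at inference time.
    (One assumed gap-bridging strategy; `std` is an illustrative value.)"""
    return normalize(text_emb + rng.normal(0.0, std, size=text_emb.shape))

# Random vectors stand in for real CLAP text embeddings of 4 captions.
text_emb = normalize(rng.normal(size=(4, 512)))
decoder_input = inject_noise(text_emb)  # fed to the caption decoder during training
```

At inference, the decoder would instead receive the (normalized) CLAP audio embedding of the clip to be captioned, relying on the shared embedding space to make the swap viable.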

Files

wsac.pdf

542.0 kB · md5:c65a5879c176dc7e37beb091844343cf