Conference paper Open Access

WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

An Tran; Konstantinos Drossos; Tuomas Virtanen

Automated audio captioning (AAC) is a novel task in which a method takes an audio sample as input and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work, we present a novel AAC method explicitly focused on exploiting the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding: two for extracting temporal and time-frequency information, respectively, and one to merge the outputs of the previous two. To generate the caption, we employ the widely used Transformer decoder. We assess our method using the freely available splits of the Clotho dataset. Our results improve the previously reported highest SPIDEr score from 16.2 to 17.3 (higher is better).
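The three-process encoder described in the abstract can be illustrated with a minimal, framework-free sketch. Everything below is an illustrative assumption, not the authors' code: the paper's branches are learnable (e.g. convolutional) blocks, which are stood in for here by simple fixed transforms over a toy spectrogram, purely to show how two branches and a merge step compose.

```python
# Hypothetical sketch of a three-process audio encoder: a temporal branch,
# a time-frequency branch, and a merge step. The transforms below are
# illustrative stand-ins for the learnable blocks described in the paper.

def temporal_branch(x):
    # Stand-in for a temporal block: smooth each feature over time
    # with a 3-tap moving average.
    T, F = len(x), len(x[0])
    out = []
    for t in range(T):
        row = []
        for f in range(F):
            window = [x[tt][f] for tt in range(max(0, t - 1), min(T, t + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

def time_frequency_branch(x):
    # Stand-in for a time-frequency block: difference along the
    # frequency axis to emphasise spectral structure.
    return [[row[f] - (row[f - 1] if f else 0.0) for f in range(len(row))]
            for row in x]

def merge(a, b):
    # One simple way to fuse the branch outputs: concatenate along the
    # feature axis. In the paper this merge is itself learnable.
    return [ra + rb for ra, rb in zip(a, b)]

def encode(x):
    return merge(temporal_branch(x), time_frequency_branch(x))

# Toy "spectrogram" with T=4 time steps and F=2 features.
spec = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
enc = encode(spec)
print(len(enc), len(enc[0]))  # prints "4 4": T time steps, 2*F merged features
```

In the actual architecture, the merged encoding would then be fed to the Transformer decoder, which attends over it to generate the caption token by token.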

The authors wish to thank D. Takeuchi and Y. Koizumi for their input on previously reported results, and to acknowledge CSC-IT Center for Science, Finland, for computational resources. Part of the required computations was carried out on a GPU donated by NVIDIA to K. Drossos. Part of the work leading to this publication has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 957337, project MARVEL.