WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results improve the previously reported highest SPIDEr from 16.2 to 17.3.


Introduction
Automated audio captioning (AAC) is an intermodal translation task, where the system receives as an input an audio signal and outputs a textual description of the contents of the audio signal (i.e. outputs a caption) [1]. AAC is not speech-to-text, as the caption does not transcribe speech. In a nutshell, an AAC method learns to identify the high-level, humanly recognized information in the input audio, and expresses this information with text. Such information can include complex spatiotemporal relationships of sources and entities, textures and sizes, and abstract and high-level concepts (e.g. "several barnyard animals mooing in a barn while it rains outside").
There are different published approaches for AAC. Regarding input audio encoding, some approaches use recurrent neural networks (RNNs) [2][3][4], others 2D convolutional neural networks (CNNs) [5][6][7], and some others the Transformer [8] model [9]. However, RNNs are known to have difficulties in learning temporal information [10], 2D CNNs model time-frequency but not temporal patterns [11], and the Transformer was not originally designed for sequences of thousands of time-steps [8]. For generating the captions, the Transformer decoder [6,9,12] or RNNs [1,3,5] are mostly employed, and the alignment of input audio and output captions is typically implemented with an attention mechanism [7,12]. Also, some approaches adopt a multi-task scheme, where the AAC method is regularized by the prediction of keywords, based on the input audio [6,12,13].
In this paper we present a novel AAC approach, based on a learnable representation of audio that is focused on encoding the information needed for AAC. We adopt existing machine listening approaches where sound sources and actions are well captured by time-frequency information [11,14], and additionally exploit temporal information in audio using 1D dilated convolutions that operate on the time dimension [15,16], for learning high-level information (e.g. background vs foreground, spatiotemporal relationships). Additionally, we claim that these two types of information can be combined, providing a well-performing learned audio representation for AAC. To this end, we present an approach which explicitly focuses on the above aspects. We employ three different encoding processes for the input audio: one regarding temporal information, a second that considers the time-frequency information, and a third that merges the previous two, whose output is given as an input to a decoder which generates the output caption.
The contribution of our work is: i) we present the first method that explicitly focuses on exploiting temporal and local time-frequency information for AAC, ii) we provide state-of-the-art (SOTA) results using only the freely available splits of the Clotho dataset and without any data augmentation and/or multi-task learning, and iii) we show the impact on the performance of the different components of our method, i.e. the temporal and local time-frequency information, merging the previous two, or all of them. The rest of the paper is as follows. In Section 2 we present our method. Section 3 presents the evaluation process of our method, and the obtained results are in Section 4. Section 5 concludes the paper and proposes future research directions.

Proposed method
Our method takes as an input a sequence of T_a vectors of F audio features, X ∈ R^{T_a×F}, and outputs a sequence of T_w vectors of W one-hot encoded words, Y. To do so, our method utilizes an encoder-decoder scheme, where the encoder is based on CNNs and the decoder is based on feed-forward neural networks (FFNs) and multi-head attention. Our encoder takes X as an input, exploits temporal and time-frequency structures in X, and outputs the learned audio representation Z ∈ R^{T_a×F′}, which is a sequence of T_a vectors of F′ learned audio features. The decoder takes Z as an input and outputs Y. Figure 1 illustrates our proposed method.

Encoder
Our encoder, E(·), consists of three learnable processes, E_temp(·), E_tf(·), and E_merge(·). E_temp learns temporal context and frame-level information in X [16], and is inspired by WaveNet [15] but with non-causal convolutions, since in AAC there is no restriction of causality in the encoding of the input audio. E_tf learns time-frequency patterns in X, and is inspired by SOTA methods for sound event detection [11,14], while E_merge merges the information extracted by E_temp and E_tf.
N_t blocks of CNNs (called wave-blocks henceforth) in E_temp sequentially process X. Each wave-block consists of seven 1D CNNs, CNN^{n_t}_{t1} to CNN^{n_t}_{t7}, with n_t being the index of the wave-block. For example, CNN^2_{t3} is the third CNN of the second wave-block. The kernel size, stride, and dilation of CNN^{n_t}_{t1,t4,t7} are one and their padding is zero. The kernel size of CNN^{n_t}_{t2,t3} is three and their padding, dilation, and stride are one. The kernel size of CNN^{n_t}_{t5,t6} is three, their padding and dilation are two, and their stride is one. CNN^{n_t}_{t1} has C^{n_t}_in and C^{n_t}_out input and output channels, respectively, and the rest have C^{n_t}_out input and output channels. The above hyper-parameters are based on the WaveNet architecture [15]. The output of the n_t-th wave-block, H^{n_t}_t, is obtained through a WaveNet-style gated activation combining the outputs of these CNNs, where BN^{n_t}_t is the batch normalization process at the n_t-th wave-block, ReLU is the rectified linear unit, σ(·) is the sigmoid non-linearity, and ⊙ is the Hadamard product. The CNNs operate along the time dimension of X, allowing H^{N_t}_t to learn temporal information from X [15] and be used effectively in WaveTransformer for learning information that requires temporal context, e.g. spectro-temporal relationships. The time receptive field of each wave-block spans seven time-steps of its corresponding input, leading to a receptive field of 7N_t − 1 time-steps of X for the output of the N_t-th wave-block.

Figure 1: The WaveTransformer, with the encoder on the left-hand side and the decoder on the right-hand side.
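The wave-block described above can be sketched in PyTorch as follows. The exact gating arrangement is an assumption in the spirit of WaveNet's gated activations, since the paper's equation is only partially reproduced here; the class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn


class WaveBlock(nn.Module):
    """Sketch of one wave-block: non-causal dilated 1D CNNs with a
    sigmoid-gated activation, following the stated hyper-parameters."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # CNN_t1/t4/t7: kernel, stride, and dilation of one, zero padding.
        self.cnn_t1 = nn.Conv1d(c_in, c_out, kernel_size=1)
        self.cnn_t4 = nn.Conv1d(c_out, c_out, kernel_size=1)
        self.cnn_t7 = nn.Conv1d(c_out, c_out, kernel_size=1)
        # CNN_t2/t3: kernel 3, padding/dilation/stride of one (non-causal).
        self.cnn_t2 = nn.Conv1d(c_out, c_out, kernel_size=3, padding=1)
        self.cnn_t3 = nn.Conv1d(c_out, c_out, kernel_size=3, padding=1)
        # CNN_t5/t6: kernel 3, padding and dilation of two, unit stride.
        self.cnn_t5 = nn.Conv1d(c_out, c_out, kernel_size=3, padding=2, dilation=2)
        self.cnn_t6 = nn.Conv1d(c_out, c_out, kernel_size=3, padding=2, dilation=2)
        self.bn = nn.BatchNorm1d(c_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T_a); all convolutions act on the time axis.
        h = self.cnn_t1(x)
        # Gated activation over the dilation-1 pair (assumed form).
        g1 = torch.sigmoid(self.cnn_t2(h)) * torch.relu(self.cnn_t3(h))
        h = self.cnn_t4(g1) + h  # residual connection, as in WaveNet
        # Gated activation over the dilation-2 pair (assumed form).
        g2 = torch.sigmoid(self.cnn_t5(h)) * torch.relu(self.cnn_t6(h))
        h = self.cnn_t7(g2) + h
        return torch.relu(self.bn(h))
```

Because every convolution preserves the time dimension, N_t such blocks can be stacked to process X without shortening the sequence.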
E_tf employs N_tf blocks of 2D CNNs, called 2DCNN-blocks henceforth. Each 2DCNN-block consists of a 2D CNN (S-CNN^{n_tf}), a leaky ReLU (LU), and a 2D CNN (P-CNN^{n_tf}). Each 2DCNN-block is followed by a ReLU, a batch normalization (BN^{n_tf}) process, a max-pooling (MP^{n_tf}) process that operates only on the feature dimension (hyper-parameters according to [11]), and a dropout (DR) with a probability of p^{n_tf}. The 2DCNN-blocks are inspired by AAC and sound event detection and classification methods, and the recent, successful adoption of depth-wise separable convolutions [11,13,17]. The 2DCNN-blocks learn local time-frequency information from their input [11], allowing H^{N_tf}_tf to be used effectively for the identification of sources and actions [11,17].
S-CNN^{n_tf} consists of C^{n_tf}_in different (5, 5) kernels with unit stride and padding of 2, focusing on learning time-frequency patterns from each channel of its input. Each kernel of S-CNN^{n_tf} is applied to only one channel of the input to S-CNN^{n_tf}, according to the depth-wise separable convolution model and to enforce the learning of local time-frequency patterns [11]. P-CNN^{n_tf} consists of a square kernel of size K_P-CNN > 1, with unit stride and padding of 2, focusing on learning cross-channel information from the output of S-CNN^{n_tf}, since the kernels of P-CNN^{n_tf} operate on all channels of the input to P-CNN^{n_tf}. While the hyper-parameters of S-CNN^{n_tf} and P-CNN^{n_tf} are based on [11], the usage of K_P-CNN > 1 does not follow a typical point-wise convolution (i.e. with a (1, 1) kernel, unit stride, and zero padding), as it was experimentally found to perform better, using the training and validation data and the protocol described in Section 3. S-CNN^1 has C^{n_tf}_in = 1 input channels and C^{n_t}_out output channels. S-CNN^{n_tf>1} and P-CNN^{n_tf} have input and output channels equal to C^{n_t}_out. The output of the n_tf-th 2DCNN-block, H^{n_tf}_tf ∈ R_{≥0}^{C^{n_tf}_out×T_a×F_tf}, is obtained as

S^{n_tf}_tf = P-CNN^{n_tf}(LU(S-CNN^{n_tf}(H^{n_tf−1}_tf))) and
H^{n_tf}_tf = DR(MP^{n_tf}(BN^{n_tf}(ReLU(S^{n_tf}_tf)))),

where H^0_tf = X. The representations H^{N_t}_t and H^{N_tf}_tf are merged by E_merge, which employs a CNN, CNN_m, followed by an FFN, FNN_m; Z′ ∈ R^{1×T_a×C^{N_tf}_out} is the output of CNN_m. Z′ is then reshaped to T_a × C^{N_tf}_out and given as an input to FNN_m, as Z = FNN_m(Z′), where Z ∈ R^{T_a×F′}, with F′ = C^{N_tf}_out.
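A minimal PyTorch sketch of one 2DCNN-block and its ReLU/BN/max-pool/dropout tail follows. The (3, 3) P-CNN kernel and the pooling factor are assumed values standing in for K_P-CNN > 1 and the pooling hyper-parameters of [11]; names are illustrative.

```python
import torch
import torch.nn as nn


class TFBlock(nn.Module):
    """Sketch of one 2DCNN-block. S-CNN is a depth-wise (5, 5) convolution
    (groups = channels) learning per-channel time-frequency patterns;
    P-CNN mixes information across channels."""

    def __init__(self, c_in: int, c_out: int, pool_f: int, p_drop: float = 0.25):
        super().__init__()
        # Depth-wise conv: each (5, 5) kernel sees only one input channel.
        self.s_cnn = nn.Conv2d(c_in, c_in, kernel_size=5, padding=2, groups=c_in)
        self.lu = nn.LeakyReLU()
        # Cross-channel conv with K_P-CNN > 1 (here 3, an assumption).
        self.p_cnn = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(c_out)
        # Pool only along the feature (frequency) axis, keeping T_a intact.
        self.mp = nn.MaxPool2d(kernel_size=(1, pool_f))
        self.dr = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T_a, F)
        s = self.p_cnn(self.lu(self.s_cnn(x)))
        return self.dr(self.mp(self.bn(torch.relu(s))))
```

Note that the time dimension T_a is never pooled, so the encoder output stays aligned, time-step by time-step, with the input features.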

Decoder
We employ the decoder of the Transformer model [8] as our decoder, D(·). During training, D takes as an input Y and Z, and outputs a sequence of T_w vectors having a probability distribution over W words, Ŷ ∈ [0, 1]^{T_w×W}. We follow the implementation in [8], employing an FFN as an embedding extractor for one-hot encoded words, FNN_emb(·), a positional encoding process, P_enc(·), N_dec decoder blocks, D^{n_dec}(·), and an FFN at the end which acts as a classifier, FNN_cls(·). FNN_emb and FNN_cls have their weights shared across the words of a caption. Each D^{n_dec} consists of a masked multi-head self-attention, a layer-normalization (LN) process, another multi-head attention that attends to Z, followed by another LN, an FNN, and another LN. We model each D^{n_dec} as a function taking two inputs, U^{n_dec} ∈ R^{T_w×V^{n_dec}_e} and Z, and having as output H^{n_dec}_dec ∈ R^{T_w×V^{n_dec}_e}, with U^0 = Y and V^0_e = W. All FNNs of each D^{n_dec} have an input-output dimensionality of V^{n_dec}_e. We use N_att attention heads for the multi-head attention layers and a dropout probability of p_d. For the implementation details, we refer the reader to the paper of the Transformer model [8]. FNN_emb takes as an input Y and its output is processed by the positional encoding process, as H_dec = P_enc(FNN_emb(Y)), where P_enc is according to the original paper [8]. H_dec is processed serially by the N_dec decoder blocks, as H^{n_dec}_dec = D^{n_dec}(H^{n_dec−1}_dec, Z), with H^0_dec = H_dec, and then we obtain Ŷ as Ŷ = FNN_cls(H^{N_dec}_dec). We jointly optimize the parameters of the encoder and decoder by minimizing the cross-entropy loss between Y and Ŷ.
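The decoder can be sketched with PyTorch's built-in Transformer decoder modules. The sinusoidal positional encoding P_enc is omitted for brevity, the embedding lookup stands in for FNN_emb over one-hot words, and all names and default sizes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn


class CaptionDecoder(nn.Module):
    """Sketch of the caption decoder: word embedding, N_dec Transformer
    decoder blocks attending to the encoder output Z, and a classifier
    over W words shared across time-steps."""

    def __init__(self, w: int, d_model: int = 128, n_dec: int = 3, n_att: int = 4):
        super().__init__()
        self.emb = nn.Embedding(w, d_model)  # stands in for FNN_emb
        layer = nn.TransformerDecoderLayer(d_model, n_att, dropout=0.25)
        self.blocks = nn.TransformerDecoder(layer, num_layers=n_dec)
        self.cls = nn.Linear(d_model, w)     # FNN_cls, shared across steps

    def forward(self, y: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # y: (T_w, batch) token ids; z: (T_a, batch, d_model) encoder output.
        t_w = y.shape[0]
        # Causal mask so each word attends only to earlier words.
        mask = torch.triu(torch.full((t_w, t_w), float("-inf")), diagonal=1)
        h = self.blocks(self.emb(y), z, tgt_mask=mask)
        # Logits over W words per step; a softmax would yield Y-hat.
        return self.cls(h)
```

Training would then minimize the cross-entropy between these per-step logits and the ground-truth words.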

Evaluation
To evaluate our method, we employ the dataset and protocol defined at the AAC task at the DCASE2020 challenge. The code and the pre-trained weights of our method are freely available online 1 . We also provide an online demo of our method, with 10 audio files, the corresponding predicted captions, and the corresponding ground truth captions 2 .

Dataset and pre-and post-processing
We employ the freely available and well-curated AAC dataset Clotho, consisting of around 5000 audio samples of CD quality, 15 to 30 seconds long, where each sample is annotated by human annotators with five captions of eight to 20 words, amounting to around 25 000 captions [4,18]. Clotho is divided into three splits: i) development, with 14465 captions, ii) evaluation, with 5225 captions, and iii) testing, with 5215 captions. We employ the development and evaluation splits, which are publicly and freely available. We extract F = 64 log mel-band energies from the audio files, using a Hamming window of 46 ms with 50% overlap, resulting in 1292 ≤ T_a ≤ 2584 for audio samples whose length is between 15 and 30 seconds. We process each caption, prepending and appending the <sos> (start-of-sentence) and <eos> (end-of-sentence) tokens, respectively. Additionally, we process the development split, randomly selecting and reserving 100 audio samples and their captions to be used as a validation split during training. These 100 samples are selected according to the criterion that their captions do not contain a word that appears in the captions of fewer than 10 audio samples. We term the resulting training (i.e. development minus the 100 audio samples) and validation splits Dev tra and Dev val , respectively. We also provide the file names from the Clotho development split used in Dev val , at the online repository of WaveTransformer 2 . We post-process the output of WaveTransformer during inference, employing both greedy and beam-search decoding. Greedy decoding stops when the <eos> token is generated or when 22 words have been generated.
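The caption pre-processing described above, wrapping each caption with the <sos> and <eos> tokens, can be sketched as follows; whitespace tokenization and lower-casing are assumptions for illustration.

```python
def preprocess_caption(caption: str) -> list:
    """Tokenize a caption and wrap it with <sos>/<eos>, as done before
    training. Whitespace tokenization and lower-casing are assumed."""
    return ["<sos>"] + caption.lower().split() + ["<eos>"]
```

During training, the decoder input is the sequence starting at <sos>, and the target is the same sequence shifted by one word, ending at <eos>.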

Hyper-parameters, training, and evaluation
We employ the Dev tra (as training split) and Dev val (as validation split) to optimize the hyper-parameters of our method, using an early stopping policy with a patience of 10 epochs. We employ the Adam optimizer [19], a batch size of 12, and clipping of the 2-norm of the gradients to the value of 1. The employed hyper-parameters of our method are N_t = 4, N_tf = 3, C^{n_t}_out = V_e = 128, F_tf = 1, N_dec = 3, N_att = 4, p^{n_tf} = p_d = 0.25, and a beam size of 2. This leads E_temp to model 7N_t − 1 = 27 frames of the input X, equivalent to 0.7 seconds of audio.
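As a small sanity check, the receptive-field formula above can be written down directly; the function name is illustrative.

```python
def wave_receptive_field(n_t: int) -> int:
    """Receptive field of E_temp's output, in time-steps of X, per the
    formula stated in Section 2: 7 * N_t - 1."""
    return 7 * n_t - 1


# With the chosen N_t = 4, E_temp spans 27 frames of X at once.
frames = wave_receptive_field(4)
```

Larger N_t would widen the temporal context linearly, at the cost of more wave-blocks to train.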
To assess the performance of WaveTransformer (WT) and the impact of E temp , E tf , E merge , and beam search, we employ the WT, WT without E tf and E merge (WT temp ), without E temp and E merge (WT tf ), and without E merge (WT avg ), where we replace E merge with an average of E temp and E tf . We evaluate the performance of WT with greedy decoding and with beam search (indicated as WT-B) on the Clotho evaluation split, using the machine translation metrics BLEU 1 to BLEU 4 , METEOR, and ROUGE L [20][21][22], and the captioning metrics CIDEr, SPICE, and SPIDEr [23][24][25]. In a nutshell, BLEU n measures a weighted geometric mean of the modified precision of n-grams, METEOR measures a harmonic mean of recall and precision for segments between the two captions, and ROUGE L calculates an F-measure using the longest common sub-sequence. On the other hand, CIDEr calculates a weighted cosine similarity of n-grams, using term-frequency inverse-document-frequency weighting, SPICE measures how well the predicted caption recovers objects, attributes, and their relationships, and SPIDEr is the average of CIDEr and SPICE, exploiting the advantages of both.
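Beam-search decoding with the beam size of 2 used by WT-B can be illustrated with a toy sketch. The `step_log_probs(prefix)` interface, returning next-word log-probabilities given a prefix, is an assumption for illustration, not the authors' implementation; the 22-word cap mirrors the greedy decoding limit stated in Section 3.

```python
def beam_search(step_log_probs, beam_size: int = 2, eos: int = 0):
    """Toy beam search: keep the `beam_size` highest-scoring partial
    captions, expanding each by one word per step."""
    beams = [((), 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(22):  # same word cap as greedy decoding
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:  # finished hypothesis
                candidates.append((prefix, score))
                continue
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams[0][0]
```

Unlike greedy decoding, which commits to the single most probable word at each step, this keeps two hypotheses alive and can recover a globally better caption.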
Additionally, we compare our method with the two highest-performing AAC methods, NTT [12] and TRACKE [6], developed and evaluated using only the Clotho development and evaluation splits. NTT uses different components, like multi-task learning (MT), data augmentation (DA), and post-processing (PP), but its authors provide results without these components. TRACKE is the current SOTA; it also uses MT, but its authors provide results without MT. We compare our WT against TRACKE without MT and NTT without (w/o) DA. Table 1 presents the results of WT, NTT, and TRACKE. As can be seen, learning time-frequency information (WT tf ) leads to better results than learning temporal information (WT temp ) instead. We hypothesize that this is because the decoder can learn an efficient language model, filling the connecting gaps (e.g. interactions of objects) between sound events learned from E tf . However, the results show that employing both E temp and E tf further increases the performance of the WaveTransformer (WT).

Results
Comparing the scores of the employed metrics for the WT tf and WT cases, it can be seen that the utilization of E temp does not contribute much to the ordering of words, as indicated by the difference in the BLEU metrics between WT tf and WT. We can see that with E temp our method learns better the attributes of objects and their relationships, as indicated by the CIDEr and SPICE scores. Thus, we argue that E temp contributes to learning attributes and interactions of objects, while E tf contributes information about objects and actions (e.g. sound events). Also, by observing the results for WT avg , we can see that a simple averaging of the information learned by E temp and E tf leads to a better description of objects, attributes, and their relationships (indicated by SPICE). However, as can be seen by comparing WT avg and WT, E merge manages to merge the information from E temp and E tf even more successfully. The utilization of beam search (WT-B) gives a significant boost to the performance, reaching up to 18.2 SPIDEr. Compared to the TRACKE and NTT methods, we can see that when excluding DA, MT, and PP, our method (WT) performs better. Additionally, WT-B performs better than NTT with MT and PP. Our post-processing consists only of using beam search, whereas the NTT method involves a second post-processing technique, augmenting the input data and averaging the predictions. Thus, WT surpasses the NTT and TRACKE methods, setting the new SOTA for AAC.
Finally, we present two high-SPIDEr-scoring captions, for the files Flipping pages.wav and 110422 village dusk.wav of the Clotho evaluation split. Our predicted captions for these files, using WT-B, are "a person is flipping through the pages of a book" and "a dog is barking while birds are chirping in the background", respectively, and the best-matching ground-truth captions are "a person is flipping through pages in a notebook" and "a dog is barking in the background while some children are talking and birds are chirping", respectively.

Conclusion
In this paper we presented a novel architecture for AAC, based on convolutional and feed-forward neural networks, called WaveTransformer (WT). WT focuses on learning long temporal and time-frequency information from audio, and expressing it with text using the decoder of the Transformer model. We evaluated WT using the dataset and the metrics adopted in the AAC DCASE Challenge, and we compared our method against previous SOTA methods and the DCASE AAC baseline. The obtained results show that learning time-frequency information, combined with a good language model, can lead to good AAC performance, but incorporating long temporal information can boost the obtained scores.