Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes

The goal of Acoustic Scene Classification (ASC) is to recognise the environment in which an audio waveform has been recorded. Recently, deep neural networks have been applied to ASC and have achieved state-of-the-art performance. However, few works have investigated how to visualise and understand what a neural network has learnt from acoustic scenes. Previous work applied local pooling after each convolutional layer, thereby reducing the size of the feature maps. In this paper, we suggest that local pooling is not necessary, but the size of the receptive field is important. We apply atrous Convolutional Neural Networks (CNNs) with global attention pooling as the classification model. The internal feature maps of the attention model can be visualised and explained. On the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 dataset, our proposed method achieves an accuracy of 72.7 %, significantly outperforming the CNNs without dilation at 60.4 %. Furthermore, our results demonstrate that the learnt feature maps contain rich information on acoustic scenes in the time-frequency domain.


INTRODUCTION
To recognise acoustic environments automatically, Acoustic Scene Classification (ASC) [1] has been a main objective of research in computer audition [2,3]. It aims at classifying acoustic scenes through computational algorithms including signal processing and machine learning. A variety of applications could benefit from ASC, including mobile robots [4], context-aware computing [5], and wearable devices [6].
Previous methods applied Support Vector Machines (SVMs) [7] and Hidden Markov Models (HMMs) [8] to ASC. Recently, neural network based methods, including fully connected neural networks [9], Recurrent Neural Networks (RNNs) [10,11], and Convolutional Neural Networks (CNNs) [11], have achieved state-of-the-art performance in ASC. Neural networks are effective at extracting high-level features to classify unseen data. However, previous work on audio classification [12] did not visualise and analyse the internal layers of CNNs.
This paper aims to visualise high-level representations in CNNs. For example, when spectrogram images of audio waveforms are the input, the time-frequency units in a feature map can be localised according to their contribution. This idea of localisation is inspired by image-based object localisation [13]. Our work can help to explain which time-frequency components contribute to ASC and can be further used for sound segmentation or separation [14].
There are two difficulties in visualising high-level representations in CNNs. Firstly, the learnt representations depend on global pooling after the last convolutional layer. Global max or average pooling results in accurate classification, but tends to under- or overestimate the units in feature maps [14]. Global attention pooling has been proposed to adaptively attend to the units [15,16]. However, in [16], the learnt low-resolution representations lost the time-frequency details due to strided convolution, which has a similar effect to local pooling layers.
In this paper, we discover that local pooling is not necessary, but the size of the receptive field is important for ASC. We propose to use atrous CNNs [17] with a large receptive field instead of local pooling to fix the size of feature maps. Then, a global attention pooling layer is applied on the feature maps to learn the time-frequency units' contributions.

RELATED WORK
Our proposed attention-based atrous CNNs build on previous work using attention-based CNNs [16]. In that work, we extracted attention matrices with a size of 4 × 20 and applied a basic analysis. However, the resulting low-resolution feature maps could not describe the time-frequency properties of acoustic scenes in detail.
To fix the size of the feature maps at each convolutional layer, the simplest solution is a vanilla CNN model without local pooling layers. However, this increases both time and space complexity and can result in sub-optimal training. Encoder-decoder CNNs were proposed in [18], employing a decoder to up-sample the feature maps using transferred pool indices from the encoder. Similarly, in [19], Fully Convolutional Networks (FCNs) used deconvolutional layers to up-sample the feature maps. However, both encoder-decoder CNNs and FCNs are composed of pooling and up-sampling layers, and therefore require strongly labelled data for pixel-wise classification. Datasets in ASC can be considered weakly labelled, as only one acoustic class is annotated for each audio recording. In a separate study [17], atrous CNNs were used with a dilated receptive field instead of pooling and up-sampling, obtaining state-of-the-art results for the task of semantic image segmentation. Motivated by this success, we herein use atrous CNNs with a weakly labelled ASC dataset, combined with an attention model to improve the visualisation of representations.

Baseline CNNs
CNNs have been successfully used for tasks of audio classification [12,20,21]. In our work, log mel spectrogram images [22] are extracted from audio waveforms as the input of CNNs. The baseline CNN model consists of four convolutional layers. Low-level convolutional layers are designed to extract low-level features; high-level convolutional layers are good at learning more abstract representations such as acoustic sound patterns [23]. A local max pooling operation with a kernel size of 2 × 2 is applied after each convolutional layer to extract shift-invariant features [24] (Fig. 1 (a)). Then, a global pooling layer [12] is applied to the final feature maps. Finally, a softmax non-linearity is utilised to predict the probabilities of scene classes.

Atrous CNNs
However, the local max pooling operations in the baseline CNNs result in feature maps with a small size (Fig. 1 (a)). Therefore, the feature maps cannot be mapped pixel-wise to spectrogram images. The simplest solution is to remove all local max pooling layers so that the size of the feature maps is fixed (Fig. 1 (b)). In the experiment section, we will show that the CNNs in Fig. 1 (b) underperform the baseline CNNs.
Interestingly, we discover that this underperformance is not caused by removing local max pooling layers. Instead, it arises from the reduced size of the receptive field relative to the input of the CNNs. The size of a receptive field is the number of frequency bins × the number of time frames of the input covered by the convolution operation. Without local max pooling, the size of a receptive field increases linearly with the number of layers; with local max pooling, it increases exponentially with the number of layers.
We introduce atrous CNNs [25] to improve the performance without local max pooling. Atrous CNNs have been applied to high-resolution image segmentation [17] and audio generation [25] by fixing the size of feature maps. Atrous CNNs use dilated convolutional kernels (Fig. 1 (c)); therefore, the size of the receptive field increases exponentially with the number of layers. The dilated convolutional kernel is a sparse kernel, so the number of parameters does not increase compared to the baseline CNNs.
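The growth of the receptive field described above can be illustrated with a small calculation along one axis. The following is a minimal sketch, not the configuration used in the paper: the 3 × 3 kernel size and the dilation rates 1, 2, 4, and 8 are assumptions chosen for illustration, and the helper `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """Receptive field along one axis of a stack of layers.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    rf = 1    # receptive field of a single input unit
    jump = 1  # distance, in input units, between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


CONV = (3, 1, 1)  # assumed 3x3 convolution, stride 1, no dilation
POOL = (2, 2, 1)  # 2x2 local max pooling with stride 2

baseline = [CONV, POOL] * 4                       # conv + pooling, four times
no_pooling = [CONV] * 4                           # pooling layers removed
atrous = [(3, 1, rate) for rate in (1, 2, 4, 8)]  # assumed dilation rates

for name, layers in [("baseline", baseline),
                     ("no pooling", no_pooling),
                     ("atrous", atrous)]:
    print(name, receptive_field(layers))
```

Under these assumptions, the stack without pooling covers only 9 input units per axis, whereas the baseline and atrous stacks cover 46 and 31 units respectively, while the atrous stack keeps the feature map at the input resolution.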

Pooling Mechanism
The feature maps at the final convolutional layer have a size of C × F × T, where C, F, and T denote the number of channels, number of frequency bins, and number of time frames, respectively. Global pooling includes max [12], average [26], and attention pooling [16]. Then, a fully connected layer is applied to the output of global pooling to predict the probability of each class. Global max or average pooling has the drawback of under- or overestimating the units in feature maps.
On the other hand, attention pooling can adaptively learn the contributions of the time-frequency units. Attention pooling consists of an attention branch and a classification branch, where A, P, and C are the attention, probability, and classification matrices, respectively, and Y denotes the probabilities of classes.
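To make the idea of the two branches concrete, the sketch below shows one common formulation of attention pooling; it is an assumption for illustration and not necessarily the exact formulation used in this paper. Here the attention branch is normalised with a softmax over the time-frequency units, the classification branch uses a sigmoid, P is taken as the element-wise product of A and C, and Y is the sum of P over all units. The function name `attention_pool` and the toy inputs are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def attention_pool(att_logits, cls_logits):
    """Pool per-unit outputs into one score per class.

    att_logits, cls_logits: one list per class, each holding one value
    per time-frequency unit (the F x T grid flattened).
    """
    scores = []
    for att_row, cls_row in zip(att_logits, cls_logits):
        A = softmax(att_row)               # attention weights, sum to 1
        C = [sigmoid(v) for v in cls_row]  # per-unit class evidence
        P = [a * c for a, c in zip(A, C)]  # weighted contributions (assumed)
        scores.append(sum(P))              # Y for this class
    return scores


# Toy input: 2 classes, 4 time-frequency units.
att = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
cls = [[0.0, 0.0, 0.0, 0.0], [2.0, -2.0, -2.0, -2.0]]
Y = attention_pool(att, cls)
average = [sum(sigmoid(v) for v in row) / len(row) for row in cls]
print(Y, average)
```

In this toy input, the second class has strong evidence in a single unit. Attention pooling focuses on that unit, whereas global average pooling dilutes it across the three uninformative units, which is the underestimation effect mentioned above.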

[Figure: four convolutional (Conv) layers producing feature maps of size 64@64×320, 128@64×320, 256@64×320, and 512@64×320.]

In this paper, we additionally apply Region of Interest (ROI) pooling [27] followed by global max pooling for an experimental comparison. ROI pooling is implemented as local max pooling over 16 × 16 regions that evenly divide the feature map at the final convolutional layer, which gives the same down-sampling effect as the four 2 × 2 max pooling layers of the baseline CNNs.
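The equivalence between one 16 × 16 max pooling and four successive 2 × 2 max poolings can be checked directly on a 64 × 320 grid. This is a minimal sketch of the pooling arithmetic only; it ignores the convolutional layers that sit between the pooling layers in the baseline CNNs, and the helper `max_pool` and the random input are assumptions for illustration.

```python
import random


def max_pool(grid, size):
    """Non-overlapping size x size max pooling on a 2D list of numbers."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]


random.seed(0)
# Stand-in for one 64 x 320 feature map (frequency bins x time frames).
feature_map = [[random.random() for _ in range(320)] for _ in range(64)]

# ROI pooling as a single 16 x 16 local max pooling.
roi = max_pool(feature_map, 16)

# Four successive 2 x 2 max poolings, as in the baseline CNNs.
stacked = feature_map
for _ in range(4):
    stacked = max_pool(stacked, 2)

print(len(roi), len(roi[0]))  # resulting grid size
print(roi == stacked)         # whether both routes give the same values
```

Both routes reduce the 64 × 320 map to 4 × 20, matching the feature map size of the baseline CNNs.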

Database
Our proposed approach is evaluated on the development set of the ASC task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge [28]. The dataset contains 10 acoustic scene classes and each audio recording has a duration of 10 seconds. This ASC task consists of two subtasks defined by matching or mismatching devices.

Setup
Log mel spectrogram images with a size of 64 mel frequency bins and 320 time frames are extracted from the audio recordings with a Hamming window size of 2 048. The overlap is set so that 320 time frames are sampled in a single spectrogram. We train the models for 15 000 iteration steps with a batch size of 16 to make full use of a single Graphics Processing Unit (GPU). The 'Adam' optimiser [29] is employed with an initial learning rate of 0.001. The learning rate is decreased by a factor of 0.9 every 200 iteration steps to stabilise the training procedure. The number of mel frequency bins and the initial learning rate were chosen empirically.
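The learning rate schedule described above can be written as a small function. This is a sketch assuming a staircase decay, in which the rate is held constant within each block of 200 steps; the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, initial=0.001, decay=0.9, every=200):
    """Learning rate at a given iteration step (staircase decay assumed)."""
    return initial * decay ** (step // every)


for step in (0, 199, 200, 1000, 15000):
    print(step, learning_rate(step))
```

With these settings the rate is multiplied by 0.9 a total of 75 times over the 15 000 training steps, ending at roughly 3.7e-7.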

Results and Discussion
We apply different global pooling methods to the baseline CNNs, the CNNs without local max pooling, and the atrous CNNs. The results are shown in Table 1. To reduce the risk of overfitting caused by excessive parameters for 10-class classification, we experiment with flattening only on the baseline CNNs, whose feature maps have a small size of 4 × 20. In the baseline CNNs, [...]

Table 1. Performance comparison of CNN topologies with flattening and five global pooling models, including max, average ('avg'), ROI, attention ('att'), and the combination of ROI and attention ('roi+att'), evaluated in terms of accuracy on two subtasks (SUBA on device A and SUBB on three devices A, B, and C).

[...] the resolution of the feature maps as 64 × 320, which can be visualised to observe the contributions of the time-frequency components in a feature map. The class-wise accuracies are shown in Table 2. Our proposed model performs well for most classes on devices B and C, except tram. We suspect that this is caused by heavy noise in the tram recordings made by devices B and C.

Visualisation of the Feature Maps
The feature maps of the attention model are visualised in Fig. 3. For different acoustic scene classes, the contributions of each time-frequency unit are different. For example, airport, park, and street traffic mainly contain stationary background noise, so most time-frequency units have similar weight values. Temporal continuity at several fixed mel-frequency bins appears in the traffic environments, including bus, metro, and tram. The feature maps of public square, shopping mall, and street pedestrian indicate that some audio events, such as speech, occurred.

CONCLUSIONS
This paper proposed attention-based atrous convolutional neural networks (CNNs) to visualise and understand acoustic scenes. Four dilated convolutional layers followed by a global attention pooling model were used to fix the size of feature maps for visualisation. Our proposed model performed significantly better than the CNNs without dilation on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge tasks. Moreover, the time-frequency information in feature maps was visualised and analysed.
In future work, feature-level attention models will be investigated to reach a deeper visualisation of CNNs. Further, CNNs followed by sequence-to-sequence learning methods and 3D CNNs will be considered to investigate the temporal information in acoustic scenes.

Fig. 2. The framework of our proposed attention-based atrous CNNs. The log mel spectrogram images with a size of 64 × 320 are fed into CNNs with four dilated convolutional layers and a global attention pooling layer. The size of the feature maps is represented as number of channels@frequency bins × time frames, and the size of holes within a kernel is adapted by 'rate'.

Fig. 3. Heat maps with a size of 64 × 320 are the visualisation of the attention matrix A in our attention-based atrous CNNs. The horizontal and vertical axes represent the time frames and frequency bins, respectively.
This work was partially supported by the European Union's Horizon H2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 766287 (TAPAS), the EPSRC grant EP/N014111/1 "Making Sense of Sounds", and a Research Scholarship from the China Scholarship Council (CSC) No. 201406150082. We thank Judith Dineley for her proofreading work.

Table 2. The class-wise accuracies of the attention-based atrous CNNs, which lead to the best results on two subtasks (SUBA on device A; SUBB on devices A, B, and C).