The impact of non-target events in synthetic soundscapes for sound event detection

The Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently, only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial number of non-target events that may affect performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. Firstly, we investigate to what extent using non-target events alternatively during the training or validation phase (or in neither of them) helps the system to correctly detect target events. Secondly, we analyze to what extent adjusting the signal-to-noise ratio between target and non-target events at training improves the sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system to properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on the non-target subset and on the acoustic similarity between target and non-target events, which might confuse the system.


INTRODUCTION
The main goal of ambient sound and scene analysis is to automatically extract information from the sounds that surround us and to analyze them for different purposes and applications. Among the different areas of interest, ambient sound analysis has a considerable impact on applications such as noise monitoring in smart cities [1,2], domestic applications such as smart homes and home security solutions [3,4], health monitoring systems [5], multimedia information retrieval [6] and bioacoustics [7]. Sound Event Detection (SED) aims to identify the onset and offset of the sound events present in a soundscape and to correctly classify them, labeling the events according to the target sound classes that they belong to. Nowadays, deep learning is the main method used to approach the problem. However, one of the main limitations of deep learning models is the requirement of large amounts of labeled training data to reach good performance. The process of labeling data is time-consuming and bias-prone, mainly due to human errors and disagreement given the subjectivity in the perception of some sound event onsets and offsets [8]. To overcome these limitations, recent works investigate alternatives to train deep neural networks with a small amount of labeled data together with a bigger set of unlabeled data [3,9,10,8,11]. Among them, the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes [8]. The latter provide a cheap way to obtain strongly labeled data. Until recently, synthesized soundscapes were generated considering only target sound events. However, recorded soundscapes also contain a considerable number of non-target events that might influence the performance of the system.
The purpose of this paper is to study the impact on the system's performance when non-target events are included in the synthetic soundscapes of the training dataset. The study is divided into three stages. Firstly, we investigate to what extent using non-target events alternatively during training or validation helps the system to correctly detect the target sound events. Motivated by the results of the first experiment, in the second part of the study we focus on understanding to what extent adjusting the target to non-target signal-to-noise ratio (TNTSNR) at training improves the sound event detection performance. Results of a preliminary study on the evaluation of the system using clips containing only non-target events are also reported, opening questions for future studies on possible acoustic similarity between target and non-target sound events, which might confuse the SED system.

Problem definition
The primary goal of the DCASE 2021 Challenge Task 4 is the development of a semi-supervised system for SED, exploiting a heterogeneous and unbalanced training dataset. The goal of the system is to correctly classify the sound event classes and to localize in time the different target sound events present in an audio clip. Each audio recording can contain more than one event, and some of the events may overlap. The use of a larger amount of unlabeled recorded clips is motivated by the limitations related to annotating a SED dataset (the process is prone to human error and time-consuming). Alternatively, synthesized soundscapes are an easy way to obtain strongly annotated data, since the user can easily generate the soundscapes starting from isolated sound events. On the other hand, in most recorded soundscapes the target sound classes are almost never present alone. For this reason, one of the main novelties of the DCASE 2021 Challenge Task 4 is the introduction of non-target isolated events in the synthetic soundscapes. This paper explores the impact of the non-target sound events on the baseline system performance, with the final goal of understanding and highlighting how to correctly exploit them to generate realistic soundscapes.
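As an illustration of what localizing events in time means, the following minimal sketch converts frame-level activity probabilities for one class into (onset, offset) pairs. It is not the baseline's post-processing: the function name, the fixed threshold and the hop duration are assumptions made only for this example.

```python
def frames_to_events(probabilities, threshold=0.5, hop_seconds=0.1):
    """Turn per-frame probabilities for one class into (onset, offset) pairs in seconds.

    Hypothetical helper: threshold and hop_seconds are illustrative values.
    """
    events = []
    onset = None  # index of the first frame of the event currently open, if any
    for i, p in enumerate(probabilities):
        active = p >= threshold
        if active and onset is None:
            onset = i
        elif not active and onset is not None:
            events.append((onset * hop_seconds, i * hop_seconds))
            onset = None
    if onset is not None:
        # The event is still active at the end of the clip.
        events.append((onset * hop_seconds, len(probabilities) * hop_seconds))
    return events


# Toy frame-level output for a single class.
print(frames_to_events([0.1, 0.9, 0.8, 0.2, 0.7], hop_seconds=1.0))
```

Overlapping events of different classes are handled by running the same procedure independently on each class.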

Dataset generation
The dataset used in this paper is the DESED dataset [12,13], which is the same as the one provided for the DCASE 2021 Challenge Task 4. It is composed of 10-second audio clips either recorded in a domestic environment or synthesized to reproduce such an environment. The synthetic part of the dataset is generated with Scaper [14], a Python library for soundscape synthesis and augmentation, which allows the audio parameters to be controlled. The recorded soundscapes are taken from AudioSet [15]. The foreground events (both target and non-target) are obtained from the Freesound Dataset (FSD50k) [16], while the background sounds are obtained from the SINS dataset (activity class "other") [17] and the TUT scenes 2016 development dataset [18]. In particular, the non-target events are the intersection of the FUSS dataset [19] and the FSD50k dataset, in order to be compatible with the source separation baseline system.
In this article, we modify only the synthetic subset of the dataset. Starting from the synthetic part of the DESED dataset, we generated different versions of it in order to investigate how non-target events impact the system performance and to what extent their relationship with the target events affects the training phase of the system. The following subsections describe the different subsets used for the experiments, which have been generated using Scaper.

Synthetic training set
The synthetic training set is the same set of data released for the DCASE 2021 Challenge Task 4. It includes 10000 audio clips, each of which may contain both target and non-target sound events. The distribution of the sound events among the files has been determined considering the co-occurrences between the different sound events. The co-occurrences have been calculated from the strong annotations released for the AudioSet dataset [20]. A second version of this dataset has been generated where only target events are present. The datasets will be hereafter referred to as synth tg ntg (used by the official baseline system) and synth tg, for the synthetic subset including target and non-target events and the synthetic subset including only target events, respectively.
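The notion of co-occurrence used above can be sketched as follows. This is only an illustration under assumptions: the exact procedure applied to the AudioSet strong annotations is not described here, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(clip_labels):
    """Count how often each unordered pair of classes appears in the same clip."""
    counts = Counter()
    for labels in clip_labels:
        # A class counts once per clip, however many times it occurs in that clip.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy annotations: one list of event labels per clip.
clips = [
    ["Speech", "Dishes", "Speech"],
    ["Speech", "Dog"],
    ["Dishes", "Speech", "Frying"],
]
counts = cooccurrence_counts(clips)
print(counts[("Dishes", "Speech")])
```

Normalizing such counts per class would give conditional probabilities that can drive which events are placed together in a synthetic clip.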

Synthetic validation set
The synthetic validation set is the same as the synthetic validation dataset supplied for the DCASE 2021 Challenge Task 4. It includes 3000 audio clips containing target and non-target events, whose distribution has been defined by calculating the co-occurrences between sound events. We generated a second version of the dataset containing only target events. The datasets will be referred to as synth tg ntg val (used by the baseline system) and synth tg val (only target sound events).

Synthetic evaluation set
The synthetic 2021 evaluation set is composed of 1000 audio clips. In the context of the challenge, this subset is used for analysis purposes. We will refer to it as synth tg ntg eval. It contains target and non-target events distributed among the different audio clips according to the pre-calculated co-occurrences. Two additional versions of the synth tg ntg eval set have been generated: synth tg eval (only target sound events) and synth ntg eval (only non-target sound events).

Varying TNTSNR training and validation set
To study the impact of varying the TNTSNR on the system performance, different versions of synth tg ntg and synth tg ntg val have been generated. In particular, three versions have been created for each of them, in which the SNR of the non-target events has been decreased by 5 dB, 10 dB and 15 dB compared to its original value. The original SNR of the sound events is randomly selected between 6 dB and 30 dB, so the more we decrease the SNR, the less audible the sound will be, and some of the events will not be audible at all. These subsets will be subsequently referred to as synth 5dB, synth 10dB and synth 15dB for the training subsets and synth 5dB val, synth 10dB val and synth 15dB val for the validation subsets.
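The arithmetic behind these subsets can be sketched as follows. The helper names are hypothetical, and the uniform draw is an assumption: the text states only that the original SNR is randomly selected between 6 dB and 30 dB.

```python
import random


def shifted_snr(original_snr_db, reduction_db):
    """SNR of a non-target event after lowering it by reduction_db dB."""
    return original_snr_db - reduction_db


def amplitude_gain(reduction_db):
    """Linear amplitude factor equivalent to an attenuation of reduction_db dB."""
    return 10 ** (-reduction_db / 20)


rng = random.Random(0)
original = rng.uniform(6, 30)  # assumed uniform draw over the stated range
for reduction in (5, 10, 15):
    print(
        f"-{reduction} dB: SNR {shifted_snr(original, reduction):.1f} dB, "
        f"amplitude x{amplitude_gain(reduction):.3f}"
    )
```

In the worst case, an event drawn at 6 dB and lowered by 15 dB ends up at -9 dB, which is consistent with the remark that some non-target events become inaudible.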

Public evaluation set
The public evaluation set is composed of recorded audio clips extracted from YouTube videos that are under Creative Commons licenses. It is part of the evaluation dataset released for the evaluation phase of the DCASE 2021 Challenge Task 4 and is considered for ranking. The set will be referred to as public.

EXPERIMENTS TASK SETUP
In order to compare the results with the official baseline, we used the same SED mean-teacher system released for this year's challenge. More information regarding the system can be found in Turpault et al. [8] and on the official webpage of the DCASE Challenge Task 4. All the different models have been trained 5 times. This paper reports the average of the scores and the associated confidence intervals. We do not report confidence intervals for the baseline model only, because its results were obtained using the checkpoint made available for it. The metrics considered for the study are the two polyphonic sound detection score (PSDS) [21] scenarios defined for the DCASE 2021 Challenge Task 4, since these are the official metrics used in the challenge. The aim of these experiments is twofold: to understand the impact of non-target events on the system performance and to investigate to what extent the TNTSNR helps the network to correctly predict the sound events in both matched and mismatched conditions. To do so, we divided the experiment into three stages. The first part of the study focuses on understanding the influence of training the system with non-target events. This experiment is described and discussed in Section 4. Section 5 reports the results and the related discussion of the second part of the experiment, where we investigate whether a mismatch in terms of TNTSNR between datasets could have an impact on the output of the system. Section 6 reports preliminary results of the last stage of the experiment, regarding the evaluation of the system on the synth ntg eval dataset, formed by only non-target sound events, in order to investigate whether some classes could get acoustically confused at training, with a negative impact on the performance. The last stage has been motivated by the results of the second part of the experiment.
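The averaging over 5 runs can be sketched as follows. How the confidence intervals are computed is not specified in the text, so this example assumes a two-sided 95% interval based on Student's t distribution; the function name and the scores are hypothetical and are not results from this paper.

```python
import math
import statistics

# Two-sided 95% Student's t critical value for 4 degrees of freedom (5 runs).
T_CRIT_95_DF4 = 2.776


def mean_and_ci(scores):
    """Return (mean, half-width of the assumed 95% interval) for exactly 5 runs."""
    if len(scores) != 5:
        raise ValueError("the hard-coded critical value is only valid for 5 runs")
    mean = statistics.mean(scores)
    half_width = T_CRIT_95_DF4 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


psds_runs = [0.34, 0.35, 0.33, 0.36, 0.34]  # hypothetical scores for illustration only
mean, half_width = mean_and_ci(psds_runs)
print(f"{mean:.3f} +/- {half_width:.3f}")
```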

USING TARGET/NON-TARGET AT TRAINING
In the first experiment, we concentrate on training the system with different combinations of the training dataset. Table 1 reports the results of the experiment when evaluating the system on the public set. A check mark in the NT Train and/or NT Val column indicates that non-target sound events are present in the synthetic soundscapes used for training and/or validation. From the results it is possible to observe that using non-target sound events during training and validation improves the performance by a large margin with relaxed segmentation constraints (PSDS2) but only marginally with strict segmentation constraints (PSDS1). In the latter case, what matters most is the use of non-target sound events during validation. A possible explanation is that synthetic soundscapes with non-target sound events are actually too difficult and confuse the system when used during training, but they still help to reduce the mismatch with recorded soundscapes during model selection (validation).
Table 2 reports the results on the synth tg ntg eval and synth tg eval evaluation sets. In all cases, the best performance is obtained in matched training/evaluation conditions. The performance obtained on synth tg ntg eval is lower than the performance obtained on synth tg eval, even in matched conditions. Not surprisingly, this confirms that including non-target sound events makes the SED task more difficult. Interestingly, as opposed to the previous experiment, what matters most here is to have matched conditions during training and, to a lesser extent, during validation. In order to verify the low impact of non-target sound events at training when evaluating on recorded soundscapes, in the next experiment we investigate a possible mismatch in terms of TNTSNR.

VARYING TNTSNR AT TRAINING
The second part of the study focuses on understanding the impact of varying the TNTSNR at training and validation, with the aim of finding a TNTSNR condition that better matches the recorded soundscapes. For each TNTSNR, we use combinations similar to the ones used in Section 4, replacing the set without non-target sound events with a set with adjusted TNTSNR. For example, considering the 5 dB case, the combinations considered would be:
• training using the synth tg ntg set and validating with synth 5dB val;
• training with synth 5dB and validating with synth tg ntg val;
• training and validating with synth 5dB and synth 5dB val.
The fourth combination is the official DCASE Task 4 baseline. Repeating the experiment for each TNTSNR value allows us to analyse to what extent the loudness of the non-target events helps match the evaluation conditions on recorded clips. Tables 3, 4 and 5 report the performance on the public set when using a TNTSNR of 5 dB, 10 dB and 15 dB, respectively. When the TNTSNR is 5 dB or 10 dB, the performance changes only marginally between configurations. Increasing the TNTSNR to 15 dB leads to a behaviour more similar to the one observed in Table 1. The best performance is obtained when training with a TNTSNR of 15 dB and validating on synth tg ntg val. This could be explained by the fact that a TNTSNR of 15 dB is a condition closer to that of the recorded soundscapes and by the fact that it allows for selecting models that will be more robust towards non-target events at test time.
In the last experiment, we investigate the impact of varying the TNTSNR during the validation phase while using synth tg for training. Results are reported in Table 6, where it is possible to observe that all configurations outperform the baseline or are comparable with it, with the best performance obtained for a TNTSNR of 10 dB. These experiments could indicate that recorded soundscapes in public generally have a TNTSNR of about 10-15 dB, which should be confirmed by complementary experiments.
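The four training/validation combinations considered for each TNTSNR value can be enumerated as in the sketch below. The function name is hypothetical; the subset names are those defined in the dataset description.

```python
def tntsnr_configurations(reduction_db):
    """List the (training set, validation set) pairs for one TNTSNR value."""
    base_train, base_val = "synth tg ntg", "synth tg ntg val"
    adjusted_train = f"synth {reduction_db}dB"
    adjusted_val = f"synth {reduction_db}dB val"
    return [
        (base_train, adjusted_val),
        (adjusted_train, base_val),
        (adjusted_train, adjusted_val),
        (base_train, base_val),  # official baseline
    ]


for reduction in (5, 10, 15):
    for train_set, val_set in tntsnr_configurations(reduction):
        print(f"{reduction:>2} dB | train: {train_set:<12} | val: {val_set}")
```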

EVALUATING ON NON-TARGET EVENTS ONLY
Based on the previous experiments, TNTSNR could be one reason for the mismatch between the synthetic soundscapes and the recorded soundscapes. However, it cannot explain all the performance differences observed here, in particular why having a lower TNTSNR during training generally decreases the performance regardless of the validation set. One possibility is that the system gets confused by acoustic similarity between events when soundscapes tend to be less dominated by target events. We therefore evaluated the system using synth ntg eval, where only non-target events are considered, to see for which classes the system would output false positives. We evaluated the system on the public set, considering the systems trained for the first experiment (see Table 1). The results, reported in Table 7, show that some sound events are detected more than others. For some classes, such as Speech, this could be explained by the original event distribution (indicated in the first column), but for other classes, such as Dishes, there is a discrepancy between the original distribution and the number of false alarms. Interestingly, the number of false alarms decreases noticeably for most of the classes when non-target sound events are included during training.
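The per-class counting used in this kind of analysis can be sketched as follows. Since the clips contain no target events, every target-class detection is a false alarm. The function name and the detection tuples are hypothetical and do not correspond to the actual system output.

```python
from collections import Counter


def false_alarms_per_class(detections):
    """Count detections per class on clips that contain only non-target events.

    Each detection is a (clip_id, class_label, onset, offset) tuple. With no
    target event in the reference, every detection counts as a false alarm.
    """
    return Counter(label for _clip, label, _onset, _offset in detections)


# Hypothetical system output on non-target-only clips.
detections = [
    ("clip_001", "Speech", 0.5, 2.1),
    ("clip_001", "Dishes", 3.0, 3.4),
    ("clip_002", "Dishes", 1.2, 1.9),
    ("clip_003", "Speech", 4.0, 6.5),
    ("clip_003", "Dishes", 7.1, 7.3),
]
for label, count in false_alarms_per_class(detections).most_common():
    print(label, count)
```

Comparing these counts with the per-class event distribution of the original set is what reveals discrepancies such as the one noted for Dishes.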

CONCLUSIONS AND FUTURE WORK
This paper analyzes the impact of including non-target sound events in the synthetic soundscapes of the training dataset for SED systems trained on a heterogeneous dataset. The experiments are divided into three stages: firstly, we explore to what extent using non-target sound events at training has an impact on the system's performance; secondly, we investigate the impact of varying the TNTSNR; finally, we conclude the study by analyzing a possible confusion of the SED model in the case of false alarms triggered by non-target sound events.
From the results reported in this paper, we can conclude that using non-target sound events can help the SED system to better detect the target sound events, but it is not clear to what extent, nor what would be the best way to generate the soundscapes. Results show that the final SED performance could depend on mismatches between synthetic and recorded soundscapes, only part of which could be due to the TNTSNR. Results of the last experiment show that using non-target events at training decreases the number of false alarms at test time, but from this experiment it is not possible to draw conclusions on the impact of non-target sound events on the confusion between the target sound events. This is a first avenue for future investigation on the topic. Additionally, the impact of the non-target sound events at training on the ability of the system to better segment the target sound events in noisy soundscapes remains to be investigated. A final open question is the impact of the per-class distribution of the sound events (both target and non-target) and of their co-occurrence distribution on the SED performance.

Table 1: Evaluation results for the public set, considering the different combinations of using target and non-target sound events at training and validation.

Table 2: Evaluation results for the synth tg ntg eval set and synth tg eval set, considering the different combinations of using target and non-target sound events at training and validation.

Table 3: Evaluation results for the second part of the experiment, varying TNTSNR by 5 dB (synth 5dB and synth 5dB val). Evaluating with public set.

Table 4: Evaluation results for the second part of the experiment, varying TNTSNR by 10 dB (synth 10dB and synth 10dB val). Evaluating with public set.

Table 5: Evaluation results for the second part of the experiment, varying TNTSNR by 15 dB (synth 15dB and synth 15dB val). Evaluating with public set.

Table 6: Evaluation results of the SED system, training with synth tg, validating with the varying TNTSNR sets and evaluating with public set.

Table 7: Preliminary evaluation results by class, evaluating the system with synth ntg eval. Nsys (A): training with synth tg, validating with synth tg val; Nsys (B): training with synth tg ntg, validating with synth tg val; Nsys (C): training with synth tg, validating with synth tg ntg val; Base: baseline using target and non-target events for training and validation.