Underwater Acoustic Detection and Localization with a Convolutional Denoising Autoencoder

Detecting and tracking moving targets is a challenging task, which becomes even harder in underwater scenarios due to the extremely low signal-to-noise ratios associated with common acoustic measurements. In the context of continuous marine monitoring, a further challenge is posed by the need to deploy computationally efficient methods that guarantee minimal use of power resources on off-shore monitoring platforms. Here we present a novel approach to accurately detect and track moving targets from the reflections of an active acoustic emitter. Our system is based on a computationally- and energy-efficient deep convolutional denoising autoencoder. System performance is evaluated on both simulated and emulated data, and benchmarked against a probabilistic tracking method based on the Viterbi algorithm.


I. INTRODUCTION
The continuous long-term monitoring of the biomass of large mobile bio-fauna such as fish, sea turtles, or marine predators is a game-changer in understanding the health and balance of the marine ecosystem [1]. This requires effective statistical tools for estimating fish aggregation. Surveys of natural marine bio-fauna populations have traditionally focused on fish capture; however, modern acoustic methods allow for a more effective detection and quantification of fish abundance and distribution.
Detecting submerged targets through active acoustics involves identifying the target-specific reflection within the received signal, which also includes stationary reflections from, e.g., rocks or chains, as well as reflections from waves or volume scatterers, referred to as clutter noise. Some approaches for active acoustic detection involve an array of receivers to compensate for the high clutter (e.g., [2]). However, this greatly limits the system's setup, since arrays must be stationary in order to allow directionality. Instead, we focus on the practical setup of a single transceiver that can be deployed from small vessels or even from a kayak. Further, since real-time detection is needed both for online monitoring of marine animals and for detection of threats, the computational complexity of the detection system must be limited.
In this paper, we offer a novel machine-learning approach to identify patterns in real-time within a time-delay (TD) matrix formed by sequentially concatenating the matched filter's outputs. As presented in the example in Fig. 3, we view these patterns as curved lines in the matrix, represented as an image. We identify these lines using a convolutional denoising autoencoder (CDA) [3], whose objective is to produce as output a denoised image containing the target path cleaned from background and clutter noise. To the best of our knowledge, the proposed approach constitutes the first attempt to apply deep learning to identifying targets within a reflected acoustic signal. Our results show that even at low signal-to-clutter ratio (SCR), where the reflection pattern from the target is weak, our method yields a favorable trade-off between precision and recall, which exceeds the performance of probabilistic approaches (i.e., the Viterbi algorithm) at a much lower computational complexity, and also allows for a more accurate fine-grained tracking of the target path. Further, since detection and tracking are performed per sample within the TD matrix, our approach scales with the number of identified targets. Our contribution therefore holds the potential to detect submerged targets using a single transceiver in real-time systems.
The remainder of this paper is organized as follows. In Section II, we describe the state-of-the-art in underwater acoustic target tracking. Our system's model is described in Section III. In Section IV, we provide the details of the convolutional denoising autoencoder. Performance is analyzed in Section V, and conclusions are drawn in Section VI.

II. RELATED WORK
To identify a target within clutter, the common approach is to emit wideband signals and accumulate the matched filter (MF) responses of the received reflections [4]. The high processing gain of wideband signals highlights reflections above the ambient noise, allowing the detector to concentrate on identifying the target within the clutter reflections [5]. The reverberation patterns are accumulated in a TD matrix, and tracking is employed to identify the target as a pattern within the clutter [6]. Yet tracking requires prior knowledge of the motion pattern of the tracked target, which may not be available.
An alternative approach is probabilistic detection within a track-before-detect (TBD) framework. A maximum-likelihood TBD was offered in [7] and [8], where the probability of the target's reflections is evaluated and tracking is performed by data association. Dynamic programming is an alternative way to evaluate the track of the target by treating the matched filter's outputs as likelihood metrics. Examples are the use of the Viterbi algorithm in our recent publication [9] and the use of hidden Markov models for tracking [10], which consider the reflections' ranges as problem states. Yet, while robust to target types, the complexity of dynamic programming does not allow real-time analysis due to the high number of states (tens of thousands). Multi-hypothesis tracking, which parameterizes the probability density function of the target's reflection, is an alternative aimed at reducing this complexity (see [5] for a survey of such methods). Among these are the method in [11], which combats time variations in the reflection patterns, and the method in [12], which computes the multi-hypothesis probabilities with a histogram. Yet, for low SCR (as in the case of tracking marine animals, or scuba divers whose tanks are covered in neoprene), results are often poor. Further, the complexity of the solution increases with the number of targets.
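To make the dynamic-programming approach concrete, the following is a minimal sketch of a Viterbi track over a TD matrix, not the exact algorithm of [9]: it assumes the matched filter outputs can be used directly as per-state likelihood metrics and that the target moves at most `max_step` range bins between emissions (the function name and parameters are illustrative).

```python
import numpy as np

def viterbi_track(td, max_step=2):
    """Dynamic-programming (Viterbi) track over a TD matrix.

    Each column index is a candidate range state; td[i, j] is treated as
    the likelihood metric of the target being at range bin j in emission i.
    The target may move at most `max_step` bins between emissions.
    """
    n_rows, n_cols = td.shape
    score = td[0].astype(float).copy()
    back = np.zeros((n_rows, n_cols), dtype=int)
    for i in range(1, n_rows):
        new_score = np.full(n_cols, -np.inf)
        for j in range(n_cols):
            # Best predecessor within the reachable window of states
            lo, hi = max(0, j - max_step), min(n_cols, j + max_step + 1)
            k = lo + int(np.argmax(score[lo:hi]))
            new_score[j] = score[k] + td[i, j]
            back[i, j] = k
        score = new_score
    # Backtrack the highest-scoring path
    path = [int(np.argmax(score))]
    for i in range(n_rows - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```

The per-matrix cost is O(rows x cols x max_step), which with the tens of thousands of range states mentioned above illustrates why this class of methods struggles with real-time operation.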
A promising framework to overcome the limitations of traditional methods is given by machine learning, which allows pattern recognition to be performed efficiently without domain-specific knowledge of the signal characteristics. In particular, deep learning [13] represents the state-of-the-art in most challenging pattern recognition problems, such as image classification [14] and speech recognition [15]. The advantage of deep learning over other popular machine learning methods is that the data is encoded using multiple levels of representation, thus allowing for an effective extraction of the relevant features directly from the data. Moreover, once trained, deep neural networks are computationally very efficient, since signal processing can be carried out on parallel hardware using basic algebraic operations [16], [17]. Deep learning has also been successfully applied in telecommunication settings [18], [19] and for signal detection under very noisy conditions [20], making it a promising candidate for our challenging underwater scenario.

III. SYSTEM MODEL
A. Assumptions and Goals
Our system includes a single transceiver that can be deployed from a small surface vessel or a buoy. For simplicity, we assume the transceiver is stationary, although the extension to a mobile scenario is straightforward if the motion of the deploying vessel can be measured. The transmitted signals are short and have a narrow autocorrelation response to obtain a high processing gain. An example is the chirp signal, whose cross-correlation is also tolerant to Doppler shift. The signals are emitted periodically. For each emission, the transceiver records the signal reflected from the channel. To cover a large area, we consider an omni-directional transceiver.
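As an illustration of such a signal, the following sketch generates a linear frequency-modulated (LFM) chirp; the specific frequencies, duration, and sampling rate are illustrative assumptions, not the parameters used in our experiments.

```python
import numpy as np

def lfm_chirp(f0=5000.0, f1=15000.0, duration=0.05, fs=100000.0):
    """Linear frequency-modulated (LFM) chirp: a short wideband pulse
    with a narrow autocorrelation peak, yielding a high processing gain."""
    n = int(round(duration * fs))
    t = np.arange(n) / fs
    k = (f1 - f0) / duration          # frequency sweep rate in Hz/s
    return np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))
```

The narrow autocorrelation can be verified directly: the peak of the chirp's autocorrelation sits at zero lag and falls off sharply.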
For each emission i, the received signal buffer in the time domain, r_i(t), is matched with the transmitted signal s(t), 0 < t < T_s, of duration T_s, using the normalized matched filter (NMF), which suppresses noise variations within signal s(t) and smooths the reflection response. We filter the output of the NMF to leave non-zero elements in NMF(i) only for those samples that pass a threshold set by the analysis in [21]. The filtered NMF outputs are then combined to form a TD matrix, M, whose rows correspond to the emission index and whose columns to the NMF samples. Denote by buffers s and r_i the vectors containing the samples of s(t) and r_i(t), respectively. Since, for wideband signals, the autocorrelation s * s approximates a delta, the matched filter output satisfies NMF(i) ≈ I * h(i) = h(i), where * is the convolution operation, I is the Kronecker delta function, and h(i) is the reverberation channel impulse response for the ith signal. The TD matrix M is thus a representation of the arrival times and power of the reflected signals. Without knowledge of the target type, we rely only on the TD matrix, and avoid using features like the spectral response. The size of each TD matrix is fixed at 20 rows (corresponding to 20 emissions) and 10,000 columns (corresponding to a reflection response of 0.5 s at a sampling frequency of 20 kHz). To save power, the emissions are periodic, such that the TD matrix is replaced for each analysis.
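The processing chain above can be sketched as follows. The normalization and thresholding are simplified assumptions for illustration (the actual threshold is set by the analysis in [21]); function names are ours.

```python
import numpy as np

def nmf(received, replica):
    """Normalized matched filter: cross-correlate the received buffer
    with the transmitted replica and normalize by the local signal
    energy, so the output lies in [0, 1] and is insensitive to
    noise-level variations."""
    n = len(replica)
    corr = np.correlate(received, replica, mode="valid")
    # Energy of each length-n sliding window of the received buffer
    energy = np.convolve(received ** 2, np.ones(n), mode="valid")
    norm = np.sqrt(energy * np.sum(replica ** 2)) + 1e-12
    return np.abs(corr) / norm

def build_td_matrix(buffers, replica, threshold):
    """Stack thresholded NMF outputs row by row into a TD matrix:
    one row per emission, one column per NMF sample."""
    rows = []
    for r in buffers:
        out = nmf(r, replica)
        out[out < threshold] = 0.0  # keep only samples passing the threshold
        rows.append(out)
    return np.stack(rows)
```

A perfectly aligned replica produces an NMF output of 1 at the corresponding delay, so the TD matrix directly encodes arrival times and relative reflection strength.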
Two main tasks are defined:
• Detection, which operates image-wise and requires identifying the presence of a mobile target within the reflected pattern of the whole TD matrix. This is the first step in the processing chain to characterize the mobile target, and it is evaluated by measuring the receiver operating characteristic (ROC) curve.
• Localization, which operates pixel-wise and requires accurately identifying the target movement by tracking the curved lines in the TD matrix. This is the second step in the processing chain, and it is evaluated by measuring the summed (Euclidean) distances between the real and predicted target positions at each time step.

B. Data Sets
To train and validate our system, we consider both a simulated database and an emulated one. The simulated set is fully synthetic. It includes TD matrices whose clutter is generated from both a Gaussian distribution and a Beta distribution to represent reverberation noise, while the target is a fixed-width line of samples forming a Markov chain. The line's location in the first row is drawn uniformly at random. The locations in the following rows then evolve in a random-walk fashion while respecting constraints on the maximum speed of the tracked object. The process is repeated when more than one target is present. The result is a "line" for each of at least one target spanning the matrix rows. The synthetic data was mostly used for calibrating the autoencoder architecture and exploring the learning hyperparameters.
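A minimal sketch of this generation process, assuming Gaussian clutter and a single target; all parameter values are illustrative, not those of our database.

```python
import numpy as np

def simulate_td_matrix(n_rows=20, n_cols=200, max_step=3,
                       clutter_std=0.3, target_amp=1.0, rng=None):
    """Synthetic TD matrix: Gaussian clutter plus one target line whose
    column index follows a bounded random walk (max speed constraint).
    Returns the noisy matrix and the binary ground-truth path."""
    rng = np.random.default_rng(rng)
    td = np.abs(rng.normal(0.0, clutter_std, size=(n_rows, n_cols)))
    truth = np.zeros((n_rows, n_cols))
    col = int(rng.integers(0, n_cols))            # uniform start position
    for row in range(n_rows):
        td[row, col] += target_amp
        truth[row, col] = 1.0
        step = int(rng.integers(-max_step, max_step + 1))  # bounded speed
        col = int(np.clip(col + step, 0, n_cols - 1))
    return td, truth
```

Calling the generator repeatedly (optionally superimposing several walks) yields the training pairs of noisy input and clean target image used to train the denoising autoencoder.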
The second dataset allows us to test the performance of the system under more realistic conditions. The emulation includes both real clutter and real target-based reflections, both obtained from sea recordings. The targets' reflections were identified from 20 sea experiments using the process described in [9], where the procedure for identifying stationary reflections is also described. Each experiment included more than one hour of data collection for different kinds of targets, including sharks and pelagic fish. The experiments were performed under controlled conditions, where a fish was first caught, then measured, and then released again. Samples that were not identified as targets constitute clutter. To augment the emulation database, the emulation TD matrices are created from partial target detections, resulting in a dataset with more than 20,000 images. The process is similar to that of the simulation, but exchanges the random-type target-like reflections with real detections. The resulting SCR ranged between 20 and -10. An example of such a pattern is shown in Fig. 1. Besides images containing targets, both datasets also include a similar number of images containing no targets.

Fig. 2. Graphical representation of the CDA. The noisy TD matrix is given as input and processed by a hierarchy of convolutional layers that detect increasingly complex features in the signal to build a compressed, latent-space representation of the image. The representation is then projected back into the image space by deconvolutional layers to produce the denoised matrix.

IV. THE CONVOLUTIONAL DENOISING AUTOENCODER
A graphical representation of our deep convolutional denoising autoencoder (CDA) is given in Fig. 2. The network receives as input a noisy image (i.e., the TD matrix) and returns as output a denoised version of the same image, in which only the target path is present (output = 1) while the noise is suppressed (output = 0). The CDA is composed of four convolutional layers with, respectively, 24, 48, 72, and 96 filters of size 4 × 4, 6 × 6, 8 × 8, and 12 × 16. After each layer, a pooling layer with pool size 1 × 2 and stride 1 × 2 is included. The convolutional layers are followed by four deconvolutional layers of the same sizes, using nearest neighbor as the upsampling function. Rectified linear units were used in all hidden layers, while a logistic activation function was used in the output layer, which allows the outputs of the CDA to be interpreted as probability values.
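The architecture described above can be sketched in Keras as follows. This is an illustrative reconstruction, not our exact implementation: the padding scheme, the decoder realized as upsampling followed by convolution, and the 1 × 1 output layer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cda(input_shape=(20, 10000, 1)):
    """Sketch of the CDA: four conv + (1 x 2) max-pooling encoder stages
    and a mirrored decoder with nearest-neighbour upsampling; the final
    sigmoid lets each output pixel be read as a target probability."""
    filters = [24, 48, 72, 96]
    kernels = [(4, 4), (6, 6), (8, 8), (12, 16)]
    x = inputs = layers.Input(shape=input_shape)
    for f, k in zip(filters, kernels):                 # encoder
        x = layers.Conv2D(f, k, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2), strides=(1, 2))(x)
    for f, k in zip(filters[::-1], kernels[::-1]):     # decoder
        x = layers.UpSampling2D(size=(1, 2), interpolation="nearest")(x)
        x = layers.Conv2D(f, k, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(1, (1, 1), activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```

Since only the column dimension is pooled (pool size 1 × 2), the number of columns must be divisible by 2^4 = 16 for the decoder to restore the input shape; the 10,000-column TD matrix satisfies this (10000 / 16 = 625).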
1) CDA training: The autoencoder was implemented in TensorFlow [22]. The network was trained with error backpropagation, using as loss function the cross entropy between the network predictions and ground-truth binary images containing the real target paths. Due to the unbalanced distribution of active pixels (target positions) compared to clutter and background-noise pixels, a weighted cross-entropy function was used. Training occurred over mini-batches of size 100 and proceeded for 5000 epochs.
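A minimal NumPy sketch of the weighted cross-entropy loss follows; the positive-class weight is an illustrative assumption, since its value depends on the pixel imbalance of the dataset. In TensorFlow, `tf.nn.weighted_cross_entropy_with_logits` provides an equivalent built-in.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, pos_weight=50.0, eps=1e-7):
    """Pixel-wise weighted binary cross entropy: the rare target pixels
    are up-weighted by `pos_weight` to counter the class imbalance
    between the thin target path and the large clutter background."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(loss.mean())
```

With this weighting, missing a target pixel is penalized far more heavily than a false alarm of equal confidence, which keeps the network from collapsing to the trivial all-zero output.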
2) CDA testing: The autoencoder was tested on separate test sets for both tasks defined above. For the simulated dataset, generalization capability was assessed by training the CDA on images with only one target (or no target at all), while the test set included images with up to 10 targets. For the emulated dataset, the test set was created by randomly selecting 50% of the images in the database. For the binary detection task, which is defined image-wise, we computed the sum of the pixel-wise predictions and used it as a level of confidence for the presence of a target. As the SCR decreases, the network becomes less confident in predicting the presence of a path, which is reflected in a reduced activation of the output neurons. For localization, the predicted path corresponds to the set of neurons with the highest activation.
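The two read-outs described above can be sketched as follows. Since each TD-matrix row holds a single time step, the per-step Euclidean distance reduces to an absolute column difference; function names are illustrative, and the sketch assumes one target per image.

```python
import numpy as np

def detection_confidence(pred):
    """Image-wise detection score: sum of the pixel-wise target
    probabilities. Thresholding this score at different levels
    traces the ROC curve."""
    return float(pred.sum())

def localization_error(pred, truth):
    """Pixel-wise localization error: for each emission (row), take the
    column with the highest activation as the predicted target position
    and sum the per-step distances to the true positions."""
    pred_cols = pred.argmax(axis=1)
    true_cols = truth.argmax(axis=1)
    return float(np.abs(pred_cols - true_cols).sum())
```

A clean prediction thus yields a high confidence score and a near-zero localization error, while a weak or smeared path lowers the confidence score, moving the image toward the "no target" side of the detection threshold.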

V. PERFORMANCE ANALYSIS
A. Performance on Simulated Data
For the sake of brevity, for simulated data we only show some examples of denoised TD matrices in Fig. 3. Even from this qualitative assessment it is evident that the CDA is able to accurately reconstruct the underlying paths, even for challenging levels of SCR where the lines in the image are almost invisible to the human eye. As expected, predictions become blurred as the signal gets weaker, but overall the CDA output closely matches the ground truth. It is worth noticing that the CDA is able to track multiple targets even though it was only trained on images containing single paths.

B. Performance on Emulated Data
For emulated data, ROC curves for the detection task at four different levels of SCR are shown in Fig. 4. When the signal is clear (i.e., SCR = 10), the CDA detection is almost perfect: the false positive rate is close to zero, while the true positive rate is close to one. As the signal gets weaker, performance gradually deteriorates, but it remains fairly good even when the SCR is low. For comparison, in Fig. 5 we show the ROC of the Viterbi algorithm for the same levels of SCR. Performance for the highest SCR is still good (though far from ceiling), but all curves flatten out quickly as the SCR decreases. For low SCR (i.e., SCR ≤ 0) the Viterbi performance is close to chance level (cf. the random baseline).
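The ROC curves can be traced by sweeping a decision threshold over the image-wise confidence scores, as in this sketch (in practice a library routine such as `sklearn.metrics.roc_curve` serves the same purpose; the function name here is illustrative).

```python
import numpy as np

def roc_points(scores_pos, scores_neg, n_thresholds=100):
    """Trace an ROC curve from image-wise confidence scores: sweep a
    threshold and record (false positive rate, true positive rate),
    where scores_pos/scores_neg are the confidences assigned to images
    with and without a target, respectively."""
    all_scores = np.concatenate([scores_pos, scores_neg])
    thresholds = np.linspace(all_scores.min(), all_scores.max(), n_thresholds)
    points = []
    for t in thresholds:
        tpr = float((scores_pos >= t).mean())   # true positive rate
        fpr = float((scores_neg >= t).mean())   # false positive rate
        points.append((fpr, tpr))
    return points
```

Perfectly separated score distributions yield a curve passing through (0, 1), which is the behaviour Fig. 4 approaches at high SCR.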
For the accurate localization task, average Euclidean distances of track error for the same levels of SCR discussed above are reported in Fig. 6. Prediction error is very small for the CDA, regardless of the SCR level. On the other hand, the performance of the Viterbi algorithm greatly deteriorates as the signal gets weaker, suggesting that this method cannot be applied when the signal gets too noisy.

Fig. 3. Examples of noisy and denoised simulated matrices, along with ground truth, for a relatively high SCR (top) and for a low SCR (bottom). Columns: TD matrix, ground truth, CDA predictions.

VI. CONCLUSIONS
In this paper we described a novel application of deep learning for the acoustic detection and localization of moving targets in underwater scenarios. The proposed system is based on a convolutional denoising autoencoder, which takes as input a noisy image representing the time-delay matrix of the signal and returns as output a denoised matrix, in which only the targets' paths are highlighted. Our results on both simulated and realistic signals showed that this approach is a promising alternative to more traditional methods in terms of both performance and complexity. A promising research direction would be to combine deep learning with dynamic programming: the former can serve as an efficient pre-processing step; if targets are detected, the Viterbi algorithm can then be applied directly on the denoised matrix to provide a more precise path tracking.