River segmentation for flood monitoring

Floods are major natural disasters that cause deaths and material damage every year. Monitoring these events is crucial to reduce both the number of people affected and the economic losses. In this work we train and test three different Deep Learning segmentation algorithms to estimate the water area from river images, and compare their performance. We discuss the implementation of a novel data chain aimed at monitoring river water levels by automatically processing data collected from surveillance cameras, and at issuing alerts in case of large increases in the water level or flooding. We also create and openly publish the first image dataset for river water segmentation.


I. INTRODUCTION
Floods cause global annual economic losses of $96 billion [1]. They are mainly due to river water overflows, which are caused either by heavy precipitation or by rapid snow melt. To improve river monitoring and early warnings, video cameras can be installed in riverbeds to assess the water level status. The most straightforward way to visually determine whether an alert threshold has been reached is to use static cameras pointed toward the riverbed and compare the water level against historical observations. However, the manual monitoring of video cameras is very costly. In this work, we propose to use a water segmentation technique to analyze video streams in real time in order to automatically detect anomalies such as sudden increases in water extent. Hence, we train and test three different Deep Learning algorithms for the task of water segmentation and compare their performance.
In Section II we briefly summarize computer vision algorithms for flood prevention and detection, together with object segmentation algorithms. In Section III we describe the three algorithms used to perform the water segmentation, while in Section IV we introduce the dataset created for water segmentation. Section V is devoted to the evaluations carried out to compare the selected algorithms and to the performance metrics used. There we also discuss the results obtained and the potential use of this tool to implement early warning in case of floods. Finally, in Section VI we present our conclusions and outline future work.

II. RELATED WORK
The increasing use of surveillance cameras in areas prone to natural disasters has raised interest in such events in the scientific community, especially in the computer vision domain [2]. Background subtraction techniques, together with morphological operations and color probability, have been used to determine water presence in videos [3]. When it comes to static images, most algorithms are based on light, texture and color features, and on clustering or classification models to segment the regions containing water [4]. The main drawback of these algorithms is their use of handcrafted features, which work well only in the specific context in which they were created because they depend on image characteristics such as lighting conditions and water color. Moreover, comparison among such algorithms is difficult because all previous studies were evaluated on data that is not publicly available.
For the water segmentation task we investigate the use of Artificial Neural Networks, which are composed of a collection of connected computational units called neurons. The connection of several neurons forms a neural network, which is characterized by a set of weights (one for each connection) and biases (one for each neuron). Weights and biases are the parameters that define the differentiable function representing the neural network, and they are generally learned through a supervised approach. A neural network is commonly organized in layers, and a greater number of layers allows for more complex models. When a network is composed of more than one layer it is referred to as a Deep Neural Network, and such networks are part of the Deep Learning family of machine learning algorithms. Deep Learning methods have been applied in many different fields, such as medicine and medical diagnosis, data mining and pattern recognition.
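As a minimal sketch of the description above, the following code evaluates a tiny two-layer network in which every connection carries one weight and every neuron one bias. The sigmoid activation, the layer sizes and all numeric values are assumptions chosen only for illustration; they are not taken from any of the networks discussed in this paper.

```python
import math


def sigmoid(x):
    """Smooth, differentiable activation (an illustrative choice)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One layer: each neuron has one weight per incoming connection and one bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Apply the layers in sequence; the output of one layer feeds the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical parameters: a hidden layer with 2 neurons and an output layer with 1.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # 4 weights, 2 biases
    ([[1.0, -1.0]], [0.0]),                    # 2 weights, 1 bias
]
output = forward([1.0, 1.0], layers)
```

In supervised training, the weights and biases in `layers` would be adjusted by gradient descent so that `output` approaches the labeled target.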
In the last few years, the breakthrough of Deep Learning has stirred the field of computer vision. Recently, an algorithm based on deep features was introduced to determine the presence of floods in images from social media [5]. The reported results are promising; however, the output is only a probability that the image contains evidence of a flood. Therefore, this algorithm can only be used once the disaster has already occurred and does not support flood monitoring.
Semantic segmentation of images is the process of assigning a semantic class to every pixel of an image, in accordance with human perception. Semantic segmentation is a classic topic in computer vision and has seen great advances through the application of Deep Learning algorithms. Initially, algorithms that performed pixel-wise labeling through visual features were introduced [6], [7]. Then, fully convolutional networks were successfully applied to the problem of semantic segmentation, improving performance over previous methods. As a general rule, deeper networks extract higher-level semantic information but at the same time lose pixel location information. Several techniques have been studied to address this drawback, such as bilinear interpolation [8], unpooling operations [8], [9] or skip layers, which combine fine information from early layers with coarse information from deeper layers [10], [11]. Algorithms based on Adversarial Networks have also been applied to segmentation problems [12], [13].
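Bilinear interpolation, one of the resolution-recovery techniques mentioned above, can be illustrated in one dimension as a transposed convolution with a fixed triangular kernel. The sketch below is only an illustration under that assumption: the function name `transposed_conv1d`, the kernel values and the toy signal are hypothetical, and a real network would operate on two-dimensional feature maps.

```python
def transposed_conv1d(signal, kernel, stride):
    """Scatter each input value, weighted by the kernel, into a longer output."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


# Triangular kernel that performs linear interpolation for an upsampling factor of 2.
bilinear_kernel = [0.5, 1.0, 0.5]

coarse = [1.0, 2.0, 3.0]
upsampled = transposed_conv1d(coarse, bilinear_kernel, stride=2)

# Drop the two border samples, which only receive a partial contribution.
interior = upsampled[1:-1]
print(interior)  # [1.0, 1.5, 2.0, 2.5, 3.0]
```

The original samples are preserved and the new in-between samples are their averages. If the kernel values are treated as trainable parameters instead of being fixed, the same operation becomes a learned upsampling layer.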
Despite the advances of Deep Learning semantic segmentation algorithms, to the best of our knowledge they have not yet been applied to the problem of water identification in images. In this paper, we train three different state-of-the-art algorithms [10], [11], [12] on a novel publicly available dataset of river images, which we introduce in Section IV.

III. SEMANTIC SEGMENTATION ALGORITHMS
In order to anticipate floods caused by river overflows, we propose to automatically detect increases in river water levels through water segmentation of videos taken by surveillance cameras installed near riverbeds. The algorithm outputs a water percentage, which can be mapped to the water level increase. When this percentage rises above a certain threshold, computed from historical observations, a warning can be triggered. Therefore, it is crucial to achieve a fine water segmentation in order to perceive variations in water levels between frames in a time series.
Convolutional Neural Networks (CNN) typically use subsampling to keep filters small and to reduce computational costs. As a result, the feature maps output by deeper layers are reduced in size, yielding a segmentation prediction smaller than the input image. Early CNN-based algorithms for semantic segmentation were normally composed of a CNN phase that output a fixed-size coarse segmentation map, followed by an upsampling algorithm and, in some cases, a refinement phase. These algorithms were not trainable end-to-end and could only work with images of a fixed input size, due to the presence of fully connected layers operating on a fixed-size input. Later, Fully Convolutional Networks (FCN) were introduced to solve the problem of semantic segmentation [10], [11], [14]. A fully convolutional network can process input images of variable size: fully connected layers are transformed into convolutions with kernels that cover their entire input region. Moreover, these architectures introduced upsampling layers, to obtain an output of the same resolution as the input, and skip layers, which combine finer information from earlier layers with semantically more relevant information from deeper layers, making the whole algorithm trainable end-to-end.
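The alerting rule described at the start of this section (a water percentage compared against a threshold derived from historical observations) can be sketched as follows. The text does not specify how the threshold is computed, so the rule used here, the historical mean plus `k` standard deviations, is purely an assumption, as are the function names and the toy masks.

```python
from statistics import mean, stdev


def water_fraction(mask):
    """Fraction of pixels labeled as water (1) in a binary mask given as a list of rows."""
    total = sum(len(row) for row in mask)
    water = sum(sum(row) for row in mask)
    return water / total


def alert_threshold(history, k=3.0):
    """Hypothetical rule: historical mean plus k sample standard deviations."""
    return mean(history) + k * stdev(history)


def should_warn(mask, history, k=3.0):
    """True when the current water fraction exceeds the historical threshold."""
    return water_fraction(mask) > alert_threshold(history, k)


# Toy data: water fractions from past frames and two 4x4 segmentation masks.
history = [0.30, 0.32, 0.31, 0.29, 0.33]
normal = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
flooded = [[0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]

print(should_warn(normal, history))   # False
print(should_warn(flooded, history))  # True
```

In a deployed system the masks would come from the segmentation network applied to each video frame, and the history would be collected per camera, since the water fraction depends on the viewpoint.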
Given the advantages of FCN, we have chosen two state-of-the-art algorithms [10], [11] based on this technique, which have reported top performance on well-known segmentation datasets such as PASCAL [15] and CamVid [16]. Additionally, we compare these results with those of a segmentation algorithm based on Conditional Adversarial Networks [12], which has shown very good results in several computer vision tasks, including semantic segmentation. To the best of our knowledge, this algorithm has not yet been quantitatively compared with FCN segmentation algorithms.
Next, we briefly explain the three algorithms chosen for the task:
1) Fully convolutional networks for semantic segmentation (FCN-8s) [10]: One of the first FCNs proposed for semantic segmentation, obtained by adapting contemporary classification networks such as AlexNet [17], VGGnet [18] and GoogLeNet [19] into FCNs. To produce dense predictions, the authors combine bilinear interpolation with upsampling by convolutions with a fractional input stride of 1/f, where f is the upsampling factor; such a layer is implemented by reversing the forward and backward passes of a convolution. The deconvolutional filters are also learned. Skip layers are added to combine the prediction layer with earlier layers that have finer strides.
2) Fully Convolutional DenseNets for Semantic Segmentation (Tiramisu) [11]: As in FCN-8s, in Tiramisu a CNN proposed for image classification is adapted into an FCN with skip and upsampling layers to perform semantic segmentation. However, Tiramisu is based on a CNN architecture called DenseNet, which was first introduced in [20]. This architecture is built from a series of dense blocks, each of which iteratively concatenates previous feature maps, naturally introducing skip connections and multi-scale supervision. Additionally, upsampling operations called transition up are added to recover the resolution of the input image. These layers consist of a transposed convolution that upsamples the previous feature maps. In order to avoid linear growth in the number of features along the upsampling path, the input of a dense block is not concatenated with its output, and thus the transposed convolution is applied only to the feature maps obtained by the last dense block.
3) Image-to-Image Translation with Conditional Adversarial Networks (Pix2Pix) [12]: These networks learn both the mapping from an input image to an output image and the loss function used to train this mapping. They can be applied to multiple tasks, such as synthesizing photos, reconstructing objects from edge maps or semantic segmentation, among others. The main advantage of this architecture is that it does not rely on a manually fixed loss function to force the network to learn the task. The network is composed of a generator G, which produces outputs that ideally cannot be distinguished from the "real" images, and a discriminator D, which learns to distinguish between the images generated by G and the "real" ones. In order to overcome the loss of resolution due to the progressive downsampling of the features, the authors also use skip connections, which concatenate all the channels from layer i with those from layer n − i, where n is the total number of layers. Moreover, they combine the GAN objective with an L1 distance to the ground truth in order to force the generator to produce an output similar to the ground truth.
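The combination of the GAN objective with the L1 distance described for Pix2Pix can be written out as below. This follows the formulation published in [12]; the symbols x (input image), y (ground truth), z (noise vector) and λ (weight of the L1 term) are that paper's notation and are not otherwise used in this text.

```latex
\begin{align}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}\left[\log D(x, y)\right]
  + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_1\right] \\
G^{*} &= \arg\min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
\end{align}
```

For water segmentation, x would be the river image and y the binary water mask, so the L1 term penalizes per-pixel deviations from the annotated mask while the adversarial term is learned.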

IV. DATASET
We present a publicly available dataset for water segmentation in rivers [21]. It was created from self-gathered images, images retrieved from Google, and images from surveillance cameras in riverbeds, all labeled by a human annotator. The dataset contains a total of 300 images of different rivers. Every image has a corresponding ground truth file, which consists of a two-dimensional binary matrix with zeros for the pixels that contain background information and ones for the pixels that contain water information. In Figure 1 (a-c) we show some examples of the images in the dataset, while in Figure 1 (d-f) the corresponding water ground truth is displayed. Water regions are represented in white and the background in black. The dataset has a large variance among the images in terms of water color, turbulence, angle, and illumination. In Table I we report further information about the size of the images in the dataset and the amount of water present in them.

Figure 1: Qualitative results from the river segmentation. The first row corresponds to the original image, the second to the ground truth, the third to the result from FCN-8s, the fourth to the result from Tiramisu and the last row to the result from Pix2Pix.

We evaluate the results of the three algorithms on the test set using two common semantic segmentation metrics, the Mean Intersection over Union (MIoU) and the pixel-wise accuracy (Pa), which are given by:

$$\mathrm{MIoU} = \frac{1}{C} \sum_{i} \frac{n_{ii}}{t_i + \sum_{j} n_{ji} - n_{ii}}, \qquad \mathrm{Pa} = \frac{\sum_{i} n_{ii}}{\sum_{i} t_i}$$

where $n_{ij}$ corresponds to the number of pixels from class i which have been wrongly classified as belonging to class j, $n_{ii}$ represents the pixels from class i which have been correctly classified, C is the total number of classes and $t_i$ is the total number of pixels of class i. In Table II we report quantitative results over the test set, specifically the mean and standard deviation of the two metrics. The best results in terms of both mean IoU and accuracy are obtained by the Tiramisu framework, which performs over 5% better in both metrics than the second best performing algorithm, i.e., Pix2Pix. Moreover, Tiramisu achieves the lowest standard deviation, which suggests that its results are consistent across the different images. In Table III we show how often each algorithm is the best or the worst performer on individual images, considering the previous two metrics. Again, Tiramisu is the algorithm with the highest number of bests and the lowest number of worsts. In Figure 1 we show some qualitative visual results: the results from Tiramisu most closely resemble the ground truth, while the FCN-8s algorithm seems to perform worst.

Table III: Tiramisu is the best performing algorithm: around 75% of the bests and less than 15% of the worsts, considering both metrics.
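The two metrics can be computed from a confusion matrix whose entry n[i][j] counts the pixels of class i predicted as class j. The sketch below assumes the standard definitions (per-class intersection over union averaged over classes, and correctly classified pixels over all pixels); the function names and the toy masks are hypothetical, and skipping classes absent from both masks is a design choice made here, not something stated in the text.

```python
def confusion_matrix(truth, pred, num_classes=2):
    """n[i][j] = number of pixels of true class i predicted as class j."""
    n = [[0] * num_classes for _ in range(num_classes)]
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            n[t][p] += 1
    return n


def pixel_accuracy(n):
    """Correctly classified pixels divided by the total number of pixels."""
    correct = sum(n[i][i] for i in range(len(n)))
    total = sum(sum(row) for row in n)
    return correct / total


def mean_iou(n):
    """Average over classes of intersection / union."""
    num_classes = len(n)
    ious = []
    for i in range(num_classes):
        t_i = sum(n[i])                                        # pixels of true class i
        predicted_i = sum(n[j][i] for j in range(num_classes))  # pixels predicted as i
        union = t_i + predicted_i - n[i][i]
        if union > 0:  # assumption: ignore classes absent from both masks
            ious.append(n[i][i] / union)
    return sum(ious) / len(ious)


# Toy 2x4 masks: 0 = background, 1 = water.
truth = [[0, 0, 1, 1], [0, 0, 1, 1]]
pred = [[0, 1, 1, 1], [0, 0, 1, 0]]

n = confusion_matrix(truth, pred)
print(n)                  # [[3, 1], [1, 3]]
print(pixel_accuracy(n))  # 0.75
```

In this toy case each class has an intersection of 3 pixels and a union of 5, so both per-class IoU values, and hence the mean, are 3/5.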

VI. CONCLUSIONS AND FUTURE WORKS
In this work we have studied three state-of-the-art algorithms for semantic segmentation and applied them to water segmentation in rivers. We have introduced a new dataset for water segmentation, which we have used to train and test the algorithms. Tiramisu was the best performing algorithm for water segmentation, both quantitatively and qualitatively. This work is particularly relevant for implementing the automatic detection of water level increases using cameras in riverbeds, in order to improve early flood warnings.