

Neighboring Region Dropout for Hyperspectral Image Classification
Mercedes E. Paoletti, Student Member, IEEE, Juan M. Haut, Member, IEEE, Javier Plaza, Senior Member, IEEE, and Antonio Plaza, Fellow, IEEE

Abstract—Deep neural networks (DNNs) exhibit great performance in the task of hyperspectral image (HSI) classification. However, these models are usually overparameterized and require large amounts of training data in order to properly avoid the curse of dimensionality and the variability of spectral signatures. As a result, they suffer from overfitting when very few training samples are available, due to poor generalization ability in this particular case. The traditional regularization dropout (DO) strategy has been shown to be effective in fully connected DNNs but not in convolutional-based ones. This is mainly due to the way these architectures manage the spatial information. In this letter, we introduce a new approach to improve the generalization of convolutional-based models for HSI classification. Specifically, we develop a neighboring region DO technique that selectively cuts off certain neighboring outputs, creating spatially dropped regions. Our experimental results with two well-known HSIs reveal that the newly proposed method helps to achieve better classification accuracy than the traditional DO strategy, with a low computational cost.

Index Terms—Convolutional neural networks (CNNs), dropout (DO), hyperspectral images (HSIs), regularization.

I. INTRODUCTION
Hyperspectral images (HSIs) comprise large cubes of adjacent spectral bands, where each pixel records the electromagnetic interaction between the incident solar radiation and the observed objects as a spectral signature that can be considered unique for each material on the surface of the Earth. This allows a detailed characterization of the observed areas and enables their successful exploitation in a wide range of applications [1].
The huge amount of information contained in HSI data cubes has been exploited by a large variety of spectral, spatial, and spectral-spatial classification methods, offering models with good performance in the task of understanding the features and relationships contained in the image.
Traditionally, the following three kinds of methods have been established depending on the training procedure.
1) Unsupervised methods do not need to be trained, as they do not use labeled samples to fine-tune the model; quite popular are clustering methods such as k-means [2], linear discriminant analysis (LDA) [3], or probabilistic latent semantic analysis (PLSA) [4].
2) Supervised methods split the available data into labeled and unlabeled samples in order to perform the training and inference steps. Some widely used supervised classifiers are the multinomial logistic regression (MLR) [5] and the support vector machine (SVM) [6].
3) Semisupervised methods apply different strategies to include unlabeled data, such as active learning approaches [7], or to expand the training set using, for instance, generative adversarial networks (GANs) [8].

With the release of large and complex HSI data sets, the development of new classification algorithms is required in order to properly interpret the acquired data. Deep learning and convolutional-based approaches have been successfully used for this purpose, reaching excellent performance due to their inherent ability to exploit different spectral and spatial features through deep and hierarchical architectures made up of stacked feature extractors [9], [10]. However, the performance of these methods for HSI classification is bounded by the high dimensionality of the data, the limited number of available labeled samples, and, generally, the low spatial resolution of HSIs, leading to the curse of dimensionality (Hughes effect), overfitting, and data variability problems [1].
Deep neural networks (DNNs), in general, and convolutional neural networks (CNNs) [11], in particular, can be seen as approximators of the form $f: \mathcal{X} \rightarrow \mathcal{Y}$ that solve an optimization problem subject to a loss function $\mathcal{L}$. In this sense, supervised DNNs address a mapping problem in which, given an HSI data set, the corresponding labels are found by tuning a parameterized model $\mathcal{M}(\mathcal{X}, \theta) = \mathcal{Y}$, whose parameters $\theta$ (distributed among the layers' stack) should minimize the error between the predicted and expected outputs. Recent works claim that the deeper the $\mathcal{M}$, the better the accuracy that can be achieved [12]. This has a direct effect on the number of parameters, imposing severe restrictions on the amount of required training samples, apart from the data degradation factor that is directly related to the increase of the model's depth. In this regard, several data augmentation and regularization techniques have been explored to avoid these problems. Focusing on the former, some works propose to increase the training set by applying slight spectral modifications and spatial transformations to the available data [13], although these approaches are very time-consuming (simpler methods, such as the addition of random noise, do not take into account the spatial characteristics of the image). Haut et al. [14] introduced random occlusion (RO) as a new data augmentation method that drops certain areas of the CNN's inputs, maintaining the spatial consistency between the dropped zones. Although this approach exhibits good performance, it is not robust with respect to the network parameters, in the sense that relations between the weights of adjacent layers are not encouraged. On the other hand, regularization methods such as dropout (DO) [15] are able to strengthen the model, enforcing independence between adjacent layers' weights by setting to zero some randomly selected neural activations during the training stage. In this context, DO is widely used due to its simplicity and low computational cost [9], being particularly effective on the fully connected layers of a CNN architecture. However, its performance with convolutional layers is not that impressive, due to its random feature clipping, which does not take spatial implications into account. In fact, the effects of DO on a convolutional layer's output imitate the traditional "salt&pepper" noise, while the feature maps remain spatially correlated. In the end, the extracted information is still propagated to the following layers [16].
In this context, inspired by the original DO mechanism and the RO data augmentation approach, this letter introduces the neighboring region DO (NRDO), a new spatially correlated DO mechanism in which random neighboring kernel activations are dropped, creating occluded areas on the convolutional layers' output volumes that maintain spatial consistency while avoiding a significant increase in computational complexity [16]. The remainder of this letter is organized as follows. Section II describes the proposed method. Section III discusses the performance of the proposed NRDO using two HSIs, demonstrating its accuracy. Section IV concludes with some remarks and hints at plausible future research lines.

II. METHODOLOGY
Let us define $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2 \times n_{\text{bands}}}$ as an HSI cube, where $n_1 \times n_2$ are the spatial dimensions and $n_{\text{bands}}$ is the number of spectral bands. Each HSI pixel can be represented as a spectral vector $\mathbf{x}_{i,j} \in \mathbb{R}^{n_{\text{bands}}}$. An end-to-end spatial 2-DCNN model normally performs a preprocessing step, encoding the spectral information into one band by using, for instance, principal component analysis (PCA), and extracts, for each pixel $\mathbf{x}_{i,j}$, a neighboring window $\mathbf{p}_{i,j} \in \mathbb{R}^{d \times d \times 1}$ of $d \times d$ spatial dimensions, with $\mathbf{x}_{i,j}$ being the central pixel. During the training stage, pairs of patches and labels $\{\mathbf{p}_{i,j}, \mathbf{y}_{i,j}\}$ are used to create the training set, where $\mathbf{y}_{i,j} \in \mathbb{R}^{n_{\text{classes}}}$ represents the label of the $(i,j)$th patch's central pixel in one-hot encoding. These patches are fed to the 2-DCNN, which applies a hierarchical stack of feature extraction (FE) stages to obtain different levels of data representation, until reaching an abstract representation at the end that encodes the most descriptive features and internal nonlinear data relationships, which are employed by the final classifier layers to produce the classification output.
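As an illustration of this preprocessing and patch-extraction step, the following is a minimal sketch assuming a NumPy HSI cube and scikit-learn's PCA; the function name and the zero-padding of borders are our own assumptions, not part of the original pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_patches(hsi, gt, d=11, n_components=1):
    """Reduce the spectral dimension with PCA and extract d x d patches
    centered on each labeled pixel (background label assumed to be 0)."""
    n1, n2, n_bands = hsi.shape
    flat = hsi.reshape(-1, n_bands)
    reduced = PCA(n_components=n_components).fit_transform(flat)
    reduced = reduced.reshape(n1, n2, n_components)
    # Zero-pad the borders so every pixel has a full d x d neighborhood.
    r = d // 2
    padded = np.pad(reduced, ((r, r), (r, r), (0, 0)), mode="constant")
    patches, labels = [], []
    for i in range(n1):
        for j in range(n2):
            if gt[i, j] != 0:  # skip unlabeled pixels
                patches.append(padded[i:i + d, j:j + d, :])
                labels.append(gt[i, j] - 1)
    return np.stack(patches), np.array(labels)
```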
Each FE stage is usually composed of a set of different layers, the convolutional layer being the one mainly responsible for the extraction. Each layer $l$ defines $K^{(l)}$ filters, with $k^{(l)} \times k^{(l)}$ neurons each. In this sense, the kernel defined by the $l$th layer computes its operation over the input with sliding step $s$, being overlapped on local areas. At the end, the kernel performs the linear convolution ($*$) between the weights of the neurons $\mathbf{W}^{(l)}$, the input data volume $\mathbf{X}^{(l-1)}$, and the bias $\mathbf{b}^{(l)}$, obtaining an output volume $\mathbf{X}^{(l)} \in \mathbb{R}^{n^{(l)} \times n^{(l)} \times K^{(l)}}$ of $K^{(l)}$ feature maps with $n^{(l)} \times n^{(l)}$ extracted features

$$\mathbf{X}^{(l)} = \mathbf{W}^{(l)} * \mathbf{X}^{(l-1)} + \mathbf{b}^{(l)}. \tag{1}$$

After the FE performed by the convolutional layer, a nonlinear activation function $\mathcal{H}(\cdot)$ is applied to the output volume in order to extract the nonlinear features and relationships contained in the volume. In our case, we apply the well-known rectified linear unit (ReLU) function. Also, a downsampling operation (implemented by max or average pooling) is applied in order to reduce the spatial dimensions and summarize the obtained features.
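To make the FE stage concrete, the following is a minimal PyTorch sketch of the convolution-activation-pooling pipeline in (1); the filter count, kernel size, and stride are illustrative placeholders rather than the settings of Table I.

```python
import torch
import torch.nn as nn

# One feature-extraction stage: convolution (1), nonlinearity, downsampling.
fe_stage = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1),  # W * X + b
    nn.ReLU(),                    # H(.), the nonlinear activation
    nn.MaxPool2d(kernel_size=2),  # spatial downsampling / feature summarization
)

x = torch.randn(32, 1, 11, 11)    # a batch of 11 x 11 single-band patches
out = fe_stage(x)                 # -> (32, 16, 4, 4) feature volume
```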
From (1), it can be observed that the outputs of previous layers are refined by the following ones, i.e., the CNN's neurons are, in fact, working in a cooperative way [15], [17]. Although this hierarchical mechanism is appealing during the training stage, it introduces weak links between the neurons of adjacent layers, hampering the inference step [9]. In this context, the traditional DO [15] is applied between the activation and pooling layers as a regularization method to avoid overfitting and provide some independence between adjacent layers' neurons, by setting to zero some randomly selected neural activations. This improves the backpropagation procedure, where neurons should be adjusted in an individual way, instead of establishing trivial dependencies with other neurons. The main motivation behind this approach is to force the layer's neurons to extract more robust and discriminative features on their own. Mathematically, we can break down (1) in order to focus on the $(i,j)$th extracted feature of convolutional layer $l$ in its $z$th filter (with $z = \{1, \ldots, K^{(l)}\}$), to which a gating 0-1 Bernoulli variable $\delta^{(l)}_{i,j}$ is applied as the DO regularization term, following a probability percentage $p^{(l)}$ that is fixed for the $l$th layer [18]:

$$\tilde{x}^{(l)}_{i,j,z} = \delta^{(l)}_{i,j} \cdot \mathcal{H}\left(x^{(l)}_{i,j,z}\right), \quad \text{with } \delta^{(l)}_{i,j} \sim \text{Bernoulli}\left(1 - p^{(l)}\right). \tag{2}$$

Fig. 1 provides a graphical illustration of how the DO regularization method works, using the synthetic feature map given in Fig. 1(a). As can be observed in Fig. 1(b) and (c), DO injects random noise into the feature maps in order to disentangle the behavior of adjacent layers' neurons. However, this noise is not structured, which makes it not completely effective in the task of removing semantic information from the feature map, where nearby features still contain related spatial information.
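As a hedged illustration (not the authors' code), the element-wise gate in (2) corresponds to standard PyTorch dropout applied to a convolutional output; the tensor shapes below are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feature_map = torch.relu(torch.randn(1, 16, 17, 17))  # H(.) on a conv output

# Traditional DO: each activation is gated by an independent Bernoulli
# variable with probability p of being zeroed (and rescaled by 1/(1-p)).
do = nn.Dropout(p=0.2)
do.train()                       # DO is only active in training mode
dropped = do(feature_map)

# The zeros land on isolated positions, so nearby activations still carry
# spatially correlated information ("salt&pepper" noise on the map).
print((dropped == 0).float().mean())  # roughly p, plus the zeros from ReLU
```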
To overcome the limitations of the traditional DO strategy, we propose to inject spatially structured noise into every feature map by dropping the outputs of neighboring neural activations, obtaining fully dropped spatial regions on the output feature maps that effectively remove spatially correlated information. In this sense, the fraction of neural activations $\gamma^{(l)}$ that will be dropped, coupled with the surrounding window's spatial size $d^{(l)}$ (which will be set to zero), must be defined at each layer $l$. Following the DO method, for each position $(i,j)$ of the input feature map $\mathbf{X}^{(l-1)}$, our NRDO applies the gating variable $\delta^{(l)}_{i,j}$, obtained from a Bernoulli distribution with probability $\gamma^{(l)}$. In addition, for each zero variable $\delta^{(l)}_{i,j}$, a spatial square patch centered on $(i,j)$ is obtained as a zero-mask with dimensions $d^{(l)} \times d^{(l)}$. Finally, this mask is overlapped and applied over the input volume $\mathbf{X}^{(l-1)}$, dropping the corresponding window in all the $K^{(l)}$ filters of the $l$th layer. However, instead of setting a direct dropping probability, $\gamma^{(l)}$ is obtained as a correction of the traditional DO percentage $p^{(l)}$, the dropping window's size $d^{(l)}$, and the spatial dimension of the feature map $\mathbf{X}^{(l)} \in \mathbb{R}^{n^{(l)} \times n^{(l)} \times K^{(l)}}$ obtained by the $l$th convolutional layer. In this context, $\gamma^{(l)}$ is obtained as

$$\gamma^{(l)} = \frac{p^{(l)}}{\left(d^{(l)}\right)^2} \cdot \frac{\left(n^{(l)}\right)^2}{\left(n^{(l)} - d^{(l)} + 1\right)^2}. \tag{3}$$

It is recommended that $d^{(l)}$ not be greater than $n^{(l)}$. In fact, (3) makes an approximation between the desired amount of dropped data, indicated by the known $p^{(l)}$, and the dropped neighborhood for each zero variable $\delta^{(l)}_{i,j}$, so as to strike an equitable balance between the pixels and their surrounding windows to be dropped and the desired amount of spatially structured noise to be injected.
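As a quick worked example of (3), consider the configuration of Fig. 1(d), with $p^{(l)} = 20\%$, $d^{(l)} = 3$, and a $17 \times 17$ feature map ($n^{(l)} = 17$):

$$\gamma^{(l)} = \frac{0.2}{3^2} \cdot \frac{17^2}{(17 - 3 + 1)^2} = \frac{0.2}{9} \cdot \frac{289}{225} \approx 0.0285$$

that is, each position seeds a dropped window with a probability of roughly 2.85%, so that, up to border effects, the expected fraction of dropped activations approximately matches the requested 20%.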
Algorithm 1 gives a general overview of the proposed NRDO method, which is applied between the convolutional and nonlinear activation layers, following the scheme given in (2). An interesting aspect is the computation of the window to be dropped. As some HSI data sets are characterized by their low spatial resolution, our strategy can help in this particular case, since the dropped neighborhoods are adapted to the feature map's margins, taking advantage of all the available features, as can be observed in Fig. 1(d)-(f), where the dropped neighborhoods have been adjusted to the feature map's edges. Moreover, these dropped windows can also be spatially overlapped, as can be observed in Fig. 1(f), where two slightly overlapping dropped regions can be appreciated.

Algorithm 1 NRDO
1: procedure NRDO($\mathbf{X}^{(l)} \in \mathbb{R}^{n^{(l)} \times n^{(l)} \times K^{(l)}}$: feature map obtained from the $l$th layer, $p^{(l)}$: dropping percentage, $d^{(l)}$: dropping window's size)
2:   $\gamma^{(l)} \leftarrow \frac{p^{(l)}}{(d^{(l)})^2} \cdot \frac{(n^{(l)})^2}{(n^{(l)} - d^{(l)} + 1)^2}$  ▷ Dropping probability, as in (3)
3:   $\mathbf{M} \leftarrow \text{ones}(n^{(l)}, n^{(l)}, K^{(l)})$  ▷ Initializing mask
4:   for $i, j$ in $n^{(l)} \times n^{(l)}$ do
5:     $\delta^{(l)}_{i,j} \sim \text{Bernoulli}(1 - \gamma^{(l)})$
6:     if $\delta^{(l)}_{i,j} = 0$ then $\mathbf{M} \leftarrow$ Dropped_Window_on_M$(i, j, d^{(l)})$  ▷ For each zero $\delta^{(l)}_{i,j}$, a zeroed square window of $d^{(l)} \times d^{(l)}$ is set on the mask $\mathbf{M}$ centered on the $(i,j)$ position
7:   end for
8:   $\tilde{\mathbf{X}}^{(l)} \leftarrow \mathbf{M} \odot \mathbf{X}^{(l)}$
9:   return $\tilde{\mathbf{X}}^{(l)}$
10: end procedure
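The following is a minimal PyTorch sketch of Algorithm 1, under our own reading of the procedure (one Bernoulli seed per spatial position, with the resulting mask shared across the $K^{(l)}$ filters); it is not the authors' released code, and the max-pooling trick for expanding seeds into windows is an implementation choice of ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NRDO(nn.Module):
    """Neighboring region dropout: drops d x d windows of activations,
    shared across all K feature maps (a sketch of Algorithm 1)."""

    def __init__(self, p=0.2, d=3):
        super().__init__()
        self.p, self.d = p, d

    def forward(self, x):                      # x: (batch, K, n, n)
        if not self.training or self.p == 0.0:
            return x
        n = x.size(-1)
        # Eq. (3): corrected seed probability gamma.
        gamma = (self.p / self.d ** 2) * (n ** 2 / (n - self.d + 1) ** 2)
        # One Bernoulli seed per spatial position, shared over the K filters.
        seeds = torch.bernoulli(torch.full((x.size(0), 1, n, n), gamma,
                                           device=x.device))
        # Expand each seed into a d x d dropped window via max pooling;
        # windows near the borders are clipped to the map's edges.
        block = F.max_pool2d(seeds, kernel_size=self.d, stride=1,
                             padding=self.d // 2)
        if block.size(-1) != n:                # even d: crop back to n x n
            block = block[..., :n, :n]
        mask = 1.0 - block                     # zero-mask M
        # No rescaling is applied here; Algorithm 1 does not specify one.
        return x * mask
```

Following the scheme in (2) and Algorithm 1, such a module would sit between a convolution and its activation, e.g., nn.Sequential(conv, NRDO(p=0.4, d=3), nn.ReLU()).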
On the other hand, the application of an aggressive NRDO with a fixed value of $p^{(l)}$ can negatively affect the performance of the network, while the implementation of a soft NRDO may not provide the desired robustness. In order to overcome these limitations, our model is trained with a $p^{(l)}$ whose value increases linearly and progressively through the epochs [16], from zero probability to the maximum indicated value of $p^{(l)}$, with the goal of progressively adapting the network and extracting more robust and independent features at each epoch.
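A minimal sketch of this linear schedule, reusing the NRDO module sketched above; the epoch count and target percentage are placeholders.

```python
# Linearly anneal the dropping percentage from 0 to its target value, so
# early epochs train almost unregularized and later ones see the full
# spatially structured noise (placeholder epoch count and target p).
nrdo = NRDO(p=0.0, d=3)
target_p, num_epochs = 0.8, 300

for epoch in range(num_epochs):
    nrdo.p = target_p * (epoch + 1) / num_epochs  # 0 -> target_p
    # ... run one training epoch of the 2-DCNN here ...
```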

III. EXPERIMENTS

A. Experimental Configuration and Data Sets
In order to test the performance of the proposed regularization technique, a deep 2-DCNN model has been implemented for HSI classification. Inspired by previous works in [14], the proposed network is composed of four convolutional layers and two fully connected layers. Focusing on the convolutional layers, the second and third layers implement one of the two available dropping mechanisms, DO or NRDO, for comparative purposes. Table I describes the details of the configuration of the network (an illustrative sketch of such a network is given at the end of this subsection).

The network has been executed on a hardware environment composed of a sixth-generation Intel Core i7-6700K processor with 8 MB of cache and up to 4.20 GHz (four cores/eight-way multitask processing), an ASUS Z170 pro-gaming motherboard, an NVIDIA GeForce GTX 1080 GPU with 8 GB of GDDR5X video memory and 10 Gb/s of memory frequency, 40 GB of DDR4 RAM with a serial speed of 2400 MHz, and a Toshiba DT01ACA HDD with 7200 RPM and 2 TB of storage capacity. In addition, in order to efficiently implement the proposed approach, it has been parallelized on the GPU using CUDA over the PyTorch framework. Finally, all the codes and examples presented in this letter are available online.¹

The proposed method has been tested over two widely used HSI data sets. The first one is the AVIRIS Indian Pines (IP) scene, which has 145 × 145 samples with a low spatial resolution of 20 m per pixel and 200 spectral bands in the wavelength range from 0.4 to 2.5 μm. It was captured over an agricultural and forest area, and its ground truth comprises 16 different classes. The second one is the ROSIS University of Pavia (UP) scene, which contains 610 × 340 samples with a higher (1.3 m per pixel) spatial resolution and 103 spectral bands in the wavelength range from 0.43 to 0.86 μm. It was captured over an urban area, and its ground truth comprises nine different classes.
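Tying the configuration together, the following is an illustrative sketch of a four-convolutional-layer, two-fully-connected-layer 2-DCNN with the dropping mechanism on the second and third convolutional layers, as described above; all channel widths, kernel sizes, and the pooling choice are hypothetical placeholders, since Table I is not reproduced here.

```python
import torch.nn as nn

# Illustrative 2-DCNN with NRDO between the 2nd/3rd convolutions and their
# activations. Widths and kernel sizes below are placeholders, not Table I.
def build_2dcnn(n_classes, p=0.4, d=3):
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), NRDO(p, d), nn.ReLU(),
        nn.Conv2d(64, 128, 3, padding=1), NRDO(p, d), nn.ReLU(),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, n_classes),
    )
```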

B. Experimental Results and Discussion
1) Comparison Between Dropout and Neighboring Region Dropout: The first experiment compares the performance of the 2-DCNN model with each regularization method, considering the original nonspatially structured DO and the proposed NRDO. Each model has been trained over IP and UP with 1%, 3%, 5%, 10%, 15%, and 20% of randomly selected samples, an input patch size of 11 × 11, and different dropping percentages ($p^{(l)} = \{20\%, 40\%, 80\%\}$), fixing the dropping window's size to $d^{(l)} = 3$ in the case of NRDO. Table II shows the obtained results. Focusing on DO, this strategy is highly beneficial when the scene is spectrally mixed and contains few regular spatial structures (as is the case with the IP scene). Moreover, the bigger $p^{(l)}$, the larger the overall accuracy (OA) improvement. However, with the UP scene, the effectiveness of DO is appreciably lower than with the IP scene (in fact, only in the case of $p^{(l)} = 80\%$ do the OA values rise by more than one percentage point for small training sets), even reducing the overall performance with limited training samples (1% of IP and 3% of UP employing $p^{(l)} = 20\%$). In this sense, the proposed method exhibits a more consistent behavior with both data sets, being able to outperform the results obtained by DO, significantly improving the results obtained by the original 2-DCNN without any regularization method, and exhibiting a lower standard deviation. The effectiveness of this method is visibly high in IP and UP, in particular when small training sets are considered, reaching the best OA performance when $p^{(l)} = 80\%$. It must be noted that, since NRDO occludes entire windows, it prevents the model from seeing all the complete features of the input data, forcing the network to look for more robust parameters.
2) Comparison Between Neighboring Region Dropout and Several Classifiers: The second experiment compares the proposed NRDO, with $p^{(l)} = 40\%$ and $d^{(l)} = 3$, against six different classifiers: 1) random forest (RF); 2) SVM with radial basis function; 3) shallow multilayer perceptron (MLP); 4) basic and kernel extreme learning machines (ELM and K-ELM); 5) spectral 1-DCNN; and 6) spatial 2-DCNN. In addition, four 2-DCNN models have been considered: without data augmentation or DO methods (original data), with RO [14], with $p^{(l)} = 80\%$ of DO, and with $p^{(l)} = 80\%$ of NRDO. Table III reveals that spatial models are able to greatly outperform spectral methods, reaching 90% and 99% OA when classifying the IP and UP scenes, respectively. Focusing on the spatial models, the proposed NRDO is able to reduce the overfitting problem when lower training percentages are employed, achieving the best result in all the experiments. This suggests that neurons are able to learn independently while retaining a spatial context, so the final classification becomes more robust. Fig. 2 shows the classification maps obtained by the spatial classifiers. It can be seen that the proposed method is able to correctly classify even the smallest and most complex classes, thus providing a more detailed map. Finally, Fig. 3 shows the evolution of the loss and OA with increasing epochs, obtained by the 2-DCNN without any data augmentation/regularization method, and with RO, DO, and NRDO. Looking at the raw model, the loss grows as the epochs increase, indicating a clear overfitting problem. Although RO decreases the loss faster than DO, it also suffers from overfitting in the final epochs. However, the proposed method is able to achieve lower and more stable loss scores than DO and RO. This is also observed in the evolution of the OA, where NRDO enables a better tuning of the result.
IV. CONCLUSION

This letter evaluates a spatially structured regularization technique for HSI data classification, which is based on randomly dropping square windows of the convolutionally extracted feature maps, retaining spatial consistency and allowing a stronger, deeper, and more independent learning of the layers' neurons. The obtained results demonstrate that the proposed approach not only efficiently deals with the overfitting problem when little training data are available, but is also able to reach a better performance than the other compared techniques. Moreover, as the proposal improves the performance of the convolutional layer, it can be effectively used in more complex models, such as ResNets and DenseNets. Finally, since the approach is not restricted to spatial classifiers, in the future, we plan to incorporate it into spatial-spectral models too.

Fig. 1. Visualization of the original DO and the proposed NRDO over a feature map of size 17 × 17. The first row shows (a) the original feature map and the feature maps obtained after dropping isolated samples using a DO of (b) $p^{(l)} = 20\%$ and (c) $p^{(l)} = 40\%$. The second row shows the feature maps obtained after applying NRDO, configuring the dropping percentage and the dropping window size to (d) $p^{(l)} = 20\%$ and $d^{(l)} = 3$, (e) $p^{(l)} = 20\%$ and $d^{(l)} = 5$, and (f) $p^{(l)} = 20\%$ with $d^{(l)} = 10$.

Fig. 3. Evolution of the (left) loss and (right) OA as a function of the number of training epochs when using the 2-DCNN with and without the DO and NRDO regularization methods and the RO data augmentation method, over the IP data set, with 10% of training data and setting $p^{(l)} = 80\%$ and $d^{(l)} = 3$.

TABLE I
ARCHITECTURAL DETAILS OF THE PROPOSED MODEL

TABLE II
COMPARISON BETWEEN DO AND NRDO, WITH DIFFERENT PERCENTAGES OF $p^{(l)}$ AND SETTING $d^{(l)} = 3$