Mesh Saliency Detection Using Convolutional Neural Networks

Mesh saliency has been widely considered as a measure of the visual importance of parts of 3D geometries that are distinguishable from their surroundings, with respect to human visual perception. This work uses convolutional neural networks to extract saliency maps for large and dense 3D scanned models. The network is trained with saliency maps extracted by fusing local and global spectral characteristics. Extensive evaluation studies carried out on various 3D models, including visual perception evaluation in simplification and compression use cases, verify the superiority of our approach compared to other state-of-the-art approaches. Furthermore, these studies indicate that the CNN-based saliency extraction method is much faster on large and dense geometries, allowing the application of saliency-aware compression and simplification schemes in low-latency and energy-efficient systems.


INTRODUCTION
Visual saliency, investigated by numerous researchers, has been considered as the metric that evaluates the significance of a region or point in space with respect to human visual perception. Saliency is mainly a stimulus-driven process [1] where the salient part of the scene differs from its neighbouring region [2] due to lack of correlation [3]. Psychophysical studies have shown that low-level salient features are processed in the early visual cortex and extract the most important and basic information of the visual scene [4]. Saliency mapping has been employed in various applications in the area of image [5] and video [6] processing, providing a number of benefits in terms of computational and storage efficiency. Similarly, in the area of geometry processing, 3D mesh saliency, corresponding to a measure of regional importance of 3D meshes, was coined by Lee et al. [7]. Several other approaches followed, utilizing spectral methods [8,9], curvature-based methods [7,10], multiscale descriptors [11], entropy-based methods [12] and hybrid methods [13] taking into account both geometry and color.
Convolutional neural networks (CNNs) have also been extensively used for the extraction of saliency maps from images [14]. Recent works have also used CNNs to extract saliency maps for 3D representations generated by a multiview setup [15], though the literature on CNN-based saliency extraction for other 3D data representations, including meshes and point clouds, is limited. Motivated by the aforementioned challenges and open issues, this work employs CNNs to automatically extract saliency maps from 3D geometries, facilitating the processing of very large and dense meshes with lower complexity while enabling parallelization. To the best of our knowledge, this is the first CNN-based approach to extract 3D mesh saliency maps applied directly on the geometric data, thus contributing to the field of geometric deep learning, where the sampling of the latent space is non-uniform. More specifically, the contributions of the proposed approach can be summarized in the following points: i) We employ a method based on the fusion of robust principal component analysis and eigenvalues to extract saliency maps of 3D scanned meshes and formulate the training set for the effective training of a convolutional neural network. ii) We formulate a geometric 3D patch descriptor to facilitate the learning process, significantly reducing the required training dataset. iii) We provide an experimental evaluation of the proposed methods in terms of execution times and visual inspection. iv) We provide use case examples, namely compression and simplification, evaluated with relevant metrics. The rest of this paper is organized as follows: Section 2 focuses on preliminaries and basic assumptions. Section 3 describes in detail the workflow of the proposed approach. Section 4 is dedicated to the experimental evaluation, while conclusions are drawn in Section 5.

PRELIMINARIES AND BASIC ASSUMPTIONS
RPCA has been employed [16,17] for the decomposition of a matrix E into a low-rank matrix L and a sparse matrix S by solving: arg min_{L,S} ||L||_* + λ||S||_1, s.t. L + S = E, where ||L||_* is the nuclear norm of the matrix L (i.e., Σ_i σ_i(L), the sum of its singular values). We cope with this convex optimization problem using a very fast approach, as described in [18], which solves the problem via alternating minimization. In each iteration (t), the low-rank estimate is updated with rank = K. We define a condition on the singular values σ with respect to a small threshold; if the condition is satisfied, the rank is increased by one (i.e., K = K + 1) and the update is repeated.
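To make the decomposition concrete, the following sketch implements RPCA with a simple inexact-ALM-style alternating minimization (singular-value thresholding for L, soft shrinkage for S). It is not the fixed-rank fast solver of [18]; the λ default and step-size heuristics are common choices, not values from this paper.

```python
import numpy as np

def rpca(E, lam=None, n_iter=300, tol=1e-7):
    """Decompose E = L + S into low-rank L and sparse S by
    alternating singular-value thresholding and soft shrinkage
    (an inexact-ALM-style sketch, not the exact solver of [18])."""
    m, n = E.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))    # standard RPCA weight
    mu = 0.25 * m * n / np.abs(E).sum()   # step-size heuristic
    L = np.zeros_like(E)
    S = np.zeros_like(E)
    Y = np.zeros_like(E)                  # Lagrange multiplier
    for _ in range(n_iter):
        # low-rank update: singular-value thresholding
        U, s, Vt = np.linalg.svd(E - S + Y / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # sparse update: elementwise soft shrinkage
        R = E - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual update enforces the constraint L + S = E
        Y += mu * (E - L - S)
        mu = min(mu * 1.2, 1e7)           # gradually tighten the penalty
        if np.linalg.norm(E - L - S) <= tol * np.linalg.norm(E):
            break
    return L, S
```

Applied to the E_l matrices of Section 3, the low-rank part captures the geometrically coherent normals, while the sparse part isolates the intense features.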
A 3D mesh consists of vertices v_i = [x_i, y_i, z_i]^T, ∀ i = 1, · · · , n, and faces. Each face f_j is a triangle that can be described by its centroid c_j = (v_{j1} + v_{j2} + v_{j3})/3 and its outward unit normal n_{c_j}, where v_{j1}, v_{j2} and v_{j3} are the positions of the vertices that define face f_j = {v_{j1}, v_{j2}, v_{j3}}, ∀ j = 1, · · · , n_f. The first-ring area of a vertex v_i is defined as the neighborhood N_i in which the vertex v_i is connected to other vertices by only one edge (i.e., with topological degree equal to 1). Space-filling curves (SFCs) [19] are maps from a 1D interval to a region in n-dimensional space. Intuitively, they can be considered as curves that traverse all points of the n-dimensional space, thus inducing an order on those points. An SFC represents a bijective function f: I → Q mapping the unit interval I onto a common subset Q of R^n. A widely used SFC is the classical Hilbert space-filling curve, visualized in Figure 1. The Hilbert curve is defined by the partitioning of the interval I into 4^n subintervals of length 4^{-n} and the square Q into 4^n sub-squares of side 2^{-n}, and can be generated by recursion. A bidirectional mapping is generated between the sub-intervals of I and the sub-squares of Q. Even though the original Hilbert curve handles only regions of size 2^n × 2^n, this is not a problem in our case. The Hilbert curve exhibits the highest level of coherency, described by the metric defined as the distance along the curve with respect to the multi-dimensional Euclidean metric [20]. Voorhies [21] concludes that the Hilbert curve is superior to other curves in this respect. We will highlight the importance of this observation in the following sections.
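The index-to-coordinate mapping of the Hilbert curve can be sketched with the classic iterative algorithm below, where `order` is the recursion depth, so the curve fills a 2^order × 2^order grid:

```python
def hilbert_d2xy(order, d):
    """Map a distance d along the Hilbert curve of side 2**order
    to (x, y) grid coordinates (classic iterative formulation)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate the quadrant if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx              # move into the current quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Consecutive indices always land on grid cells that share an edge, which is exactly the coherency property exploited when patches are laid out for the convolutional kernels.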

PROPOSED CNN BASED SALIENCY EXTRACTION FOR 3D MESHES
This section presents the pipeline of the proposed approach for the extraction of the 3D saliency mapping. Training data were generated from 3D geometric data and a ground-truth saliency mapping, extracted via traditional methods [22]. After the training process, the generated output can be used for the automatic extraction of the saliency mapping of any other new 3D model. Figure 2 briefly presents the pipeline of our approach. We start by separating the whole mesh into n_f (i.e., equal to the number of centroids) overlapping and equal-sized patches. Then, we follow two different steps for the estimation of the spectral and geometrical saliency. The final result is a combination of these two values. Once the saliency mapping has been estimated, we use it as ground truth for the training of the CNN.

Saliency Analysis
For each face f_i of the mesh, we estimate the corresponding patch, represented as a matrix N_i ∈ R^{(k+1)×3}, consisting of the k+1 neighboring centroid normals: N_i = [n_{c_i}, n_{c_{i1}}, n_{c_{i2}}, · · · , n_{c_{ik}}]^T, ∀ i = 1, · · · , n_f. Each matrix N_i is used for the estimation of the covariance matrix R_i = N_i^T N_i. Then, R_i = UΛU^T is decomposed into a matrix U, consisting of the eigenvectors, and a diagonal matrix Λ = diag(λ_{i1}, λ_{i2}, λ_{i3}), consisting of the corresponding eigenvalues λ_{ij}, ∀ j = 1, 2, 3. Finally, the spectral saliency s_{1i} of a centroid c_i is denoted as the value given by the inverse l2-norm of the corresponding eigenvalues: s_{1i} = 1/√(λ_{i1}² + λ_{i2}² + λ_{i3}²) (6). Additionally, we normalize the values to the range [0, 1], according to: s̄_{1i} = (s_{1i} − min_j s_{1j})/(max_j s_{1j} − min_j s_{1j}) (7). Observing Eq. (6), we can see that large values of the term λ_{i1}² + λ_{i2}² + λ_{i3}² correspond to small saliency values, indicating that the centroid lies in a flat area, while small values correspond to large saliency values, characterizing the specific centroid as a geometric feature. This can be easily justified by the fact that the centroid normal of a face lying in a flat area is represented by one dominant eigenvector, whose corresponding eigenvalue has a very large value. On the other hand, the centroid normal of a face lying in a corner is represented by three eigenvectors that correspond to eigenvalues with small but almost equal amplitude.
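A minimal sketch of the spectral branch, assuming the 3×3 covariance is formed as N_i^T N_i and using min-max normalization:

```python
import numpy as np

def spectral_saliency(patches):
    """Per-face spectral saliency: inverse l2-norm of the eigenvalues
    of each patch's normal covariance matrix. `patches` is a list of
    (k+1, 3) arrays of unit centroid normals (the N_i matrices)."""
    s = np.empty(len(patches))
    for i, N in enumerate(patches):
        R = N.T @ N                   # 3x3 covariance of the normals
        lam = np.linalg.eigvalsh(R)   # eigenvalues λ_i1, λ_i2, λ_i3
        s[i] = 1.0 / np.linalg.norm(lam)
    # normalize to [0, 1]
    return (s - s.min()) / (s.max() - s.min() + 1e-12)
```

A flat patch (all normals parallel, one dominant eigenvalue) gets a low value, while a corner patch (normals spread over three directions) gets a high one, matching the discussion above.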
The geometrical saliency features are estimated by exploiting the sparsity of the centroid normals. We construct three matrices E_l ∈ R^{n_f×(k+1)}, ∀ l ∈ {x, y, z}, from the N_i matrices according to E_l = [N_{1l}; N_{2l}; · · · ; N_{n_f l}], each containing the centroid-normal values of the corresponding coordinate.
Then, we apply the RPCA approach to each of these matrices, as described in Section 2, taking advantage of the geometrical coherence between neighbouring guided normals. The decomposition yields the low-rank matrices L_x, L_y, L_z and the sparse matrices S_x, S_y, S_z. However, the estimation of the geometric saliency feature s_{2i} of the centroid c_i requires only the values of the first column of the sparse matrices, according to: s_{2i} = √(S_{i1x}² + S_{i1y}² + S_{i1z}²), where S_{i1x} denotes the scalar value of the i-th row of the 1st column of the S_x matrix. Finally, in a similar way to the spectral analysis, we estimate the normalized saliency s̄_{2i} (Eq. (7)). The saliency mapping is a combination of the normalized spectral s̄_{1i} and geometric s̄_{2i} saliency features, according to: s_{c_i} = (s̄_{1i} + s̄_{2i})/2, ∀ i = 1, · · · , n_f. The proposed method is robust even for complex surfaces with different geometrical characteristics, since it exploits spectral characteristics (i.e., sensitivity to the variation of neighboring centroid normals) and geometrical characteristics (i.e., the sparsity property of intense features). The saliency value of each vertex is estimated by averaging the saliency values of the faces of its first ring. Finally, s is quantized into four classes to facilitate CNN-based saliency prediction. For a face f_i with saliency value s_i, the corresponding class c_i is computed as c_i = ⌊s_i/∆⌋, where ⌊·⌋ is the floor function and ∆ = 1/4. For the rest of the paper we will refer to this fusion of geometrical and spectral saliency components as the fusion-based approach.
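The fusion and quantization steps can be sketched as follows; the l2-norm combination of the sparse-matrix first columns mirrors the spectral branch and is an assumption where the text elides the exact formula:

```python
import numpy as np

def geometric_saliency(Sx, Sy, Sz):
    """Geometric saliency from the first columns of the RPCA sparse
    components (one matrix per coordinate), min-max normalized."""
    s = np.sqrt(Sx[:, 0]**2 + Sy[:, 0]**2 + Sz[:, 0]**2)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse_and_quantize(s1, s2, n_classes=4):
    """Fuse normalized spectral (s1) and geometric (s2) saliency and
    quantize: s_c = (s1 + s2)/2, class = floor(s_c / Δ), Δ = 1/4."""
    s = 0.5 * (s1 + s2)
    delta = 1.0 / n_classes
    c = np.floor(s / delta).astype(int)
    return s, np.clip(c, 0, n_classes - 1)   # s = 1 maps to the top class
```

The clip keeps the maximum saliency value 1.0 in the highest class instead of spilling into a fifth one.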

CNN architecture, training and saliency map extraction
The CNN architecture consists of three convolutional layers, each one followed by a max-pooling layer and a ReLU activation function. The dimension of the convolutional kernels is 3 × 3 performing a padded convolution operation. A flatten layer succeeds the convolutional layers followed by three fully connected layers and a softmax activation function. An overview of the CNN architecture is visualized in Figure 4.
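Under the illustrative assumption of a 32×32×3 input (a 1024-face patch of normals laid out on a Hilbert grid) and filter counts (32, 64, 128), which are not specified in the text, the feature-map sizes through the three conv/pool stages can be traced as:

```python
def feature_shapes(side=32, channels=3, filters=(32, 64, 128)):
    """Trace feature-map shapes through three stages of
    [padded 3x3 conv -> 2x2 max-pool -> ReLU] plus a flatten.
    Input side, channels and filter counts are assumptions."""
    shapes = [(side, side, channels)]
    h = side
    for f in filters:
        # 'same' padding keeps the spatial size; pooling halves it
        h //= 2
        shapes.append((h, h, f))
    flat = h * h * filters[-1]    # input size of the first dense layer
    return shapes, flat
```

The flattened vector then feeds the three fully connected layers, with the softmax producing the four-class saliency prediction.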
The patch descriptor facilitates the training process by constraining the latent space. Each patch P_i consists of the face normals of the N topologically closest neighbours of face f_i, arranged by the distance of the face centroid c_i to the face centroids c_j, f_j ∈ P_i, and organized in rings as visualized in Figure 3. Working in the normal space makes the descriptor scale- and translation-invariant. To deal with rotation invariance, we assume a local coordinate system and rotate the patch by an angle δ_{n_i} around a rotation axis a_{n_1} so that the mean normal n̄_f = (1/N) Σ_{i∈N_i} A_i n_{c_i} coincides with c, where c is a user-defined vector. Such a transformation imbues the patch with rotation invariance across the local x and y axes, while the face arrangement itself renders the patch descriptor rotation-invariant w.r.t. the local z-axis. The motivation behind rotating each patch towards the same direction is that it allows efficient training with smaller training sets.
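The alignment step can be sketched with Rodrigues' rotation formula; the unweighted mean normal and the target c = (0, 0, 1) are simplifying assumptions (the text uses an area-weighted mean and a user-defined c):

```python
import numpy as np

def align_patch(normals, c=np.array([0.0, 0.0, 1.0])):
    """Rotate a patch's (m, 3) normals so that their mean direction
    coincides with the target vector c (Rodrigues' formula)."""
    n = normals.mean(axis=0)
    n /= np.linalg.norm(n)
    v = np.cross(n, c)
    s = np.linalg.norm(v)            # sin of the rotation angle
    d = np.dot(n, c)                 # cos of the rotation angle
    if s < 1e-12:                    # already aligned (or exactly opposite)
        return normals if d > 0 else -normals
    K = np.array([[0, -v[2], v[1]],
                  [v[2], 0, -v[0]],
                  [-v[1], v[0], 0]])  # cross-product matrix of v
    R = np.eye(3) + K + K @ K * ((1 - d) / s**2)
    return normals @ R.T
```

After this rotation, patches from differently oriented surface regions land in the same region of the descriptor space, which is what allows training with smaller sets.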
As a next step, to allow the convolutional kernels to capture the relation between locally consistent faces, we reshape the patch with a Hilbert curve function. Figures 5 and 6 highlight the different regions of a patch captured by a Hilbert-curve arrangement compared to a simple reshape function. Finally, to form the training set, we extract patches from 3D scanned meshes and assign a class to each patch, employing the saliency map extraction process presented in subsection 3.1.

Comparison Between the Traditional and the Deep Learning Approach
Our CNN-based saliency map extraction approach adequately extracts salient features, as Figure 7 presents. The first column presents the 3D scanned model and the second corresponds to the saliency map taken as ground truth. The third column corresponds to the Hilbert-curve-based patch face arrangement and the fourth column to the simple reshape-function patch face arrangement. Even though qualitative inspection shows that the saliency map is captured by both learning-based methods, the confusion matrices reveal that the Hilbert curve arrangement yields better accuracy. In terms of performance and computational complexity, even though traditional and CNN-based approaches exhibit similar performance for smaller meshes, the CNN approach is much faster on larger and denser models, as presented in Figure 8. It is important to highlight that the fusion-based method required a patch size of k = 58 faces for the head model, k = 80 for the centurion model and k = 150 for the stonecorner model. Patch size varies with model density and feature size. However, the patch size that the CNN can handle, in similar or better execution times, is many times larger, i.e., k = 1024 for the presented deep architecture. Fusion-based saliency mapping was executed in MATLAB, while the CNN ran in Python with TensorFlow. For the latter, the training took place on an NVIDIA GeForce GTX 1080 graphics card with 8GB VRAM and compute capability 6.1. For testing and evaluation, the saliency map extraction took place on an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz with 32GB of RAM, while I/O operations were not taken into account in the execution-time evaluation.

Simplification use case
Simplification of 3D meshes is a low-level application that focuses on representing an object using a lower resolution. The main objective of a successful simplification process is to preserve meaningful perceptual information. The most important geometrical information of a 3D object consists of the high-frequency spatial features, which are mostly represented by corners and edges. However, in some special cases of geometrically complex 3D meshes, this binary classification of points into features and non-features is not always well-defined, since more geometrical categories could be apparent. In Figure 10, we present the reconstructed results of different simplification scenarios (i.e., ∼75%, 85%, and 90%) using two different approaches. More specifically, (i) we use only the features extracted by the fusion-based method, as presented in [23], and (ii) we first use our approach for the estimation of the saliency and then take different percentages of random points from each one of the four classes, giving emphasis to the most salient class. We also provide the Hausdorff distance (HD) error applied to the triangle faces for easier comparison between the methods.
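A sketch of the class-aware point selection of approach (ii); the per-class budget split `weights` is an illustrative assumption emphasizing the most salient class, not the percentages used in the paper:

```python
import numpy as np

def saliency_sample(classes, keep_ratio, weights=(0.1, 0.2, 0.3, 0.4),
                    rng=None):
    """Pick vertex indices for simplification, drawing a different
    share of the kept budget from each saliency class (0 = least,
    3 = most salient). `weights` split the budget across classes."""
    rng = rng or np.random.default_rng(0)
    n_keep = int(round(keep_ratio * len(classes)))
    kept = []
    for c, w in enumerate(weights):
        idx = np.flatnonzero(classes == c)
        take = min(len(idx), int(round(w * n_keep)))
        kept.extend(rng.choice(idx, size=take, replace=False))
    return np.array(sorted(kept))
```

For a ∼75% simplification, `keep_ratio` would be 0.25, with most of the surviving points drawn from the highest-saliency class.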

Compression use case
Compression of 3D meshes is another low-level geometry processing application, allowing the reconstruction of a mesh given the delta coordinates and a subset of mesh vertices, according to the approach presented in [23]. To provide further evidence for the usability of CNN-based saliency map extraction, we use the saliency map to order the delta coordinates as the concatenation D̃ = [{d ∈ D | v_c = 4}, {d ∈ D | v_c = 3}, {d ∈ D | v_c = 2}, {d ∈ D | v_c = 1}], where D is the initial set of delta coordinates and v_c ∈ {1, ..., 4} is the predicted class index, with the largest value corresponding to the most significant salient features. The anchor points are uniformly selected, while only a portion of D is non-zero. Figure 9 presents a reconstruction-accuracy comparison between uniform, fusion-based and CNN-based non-zero delta selection, revealing that the CNN-based and fusion-based approaches converge, while a uniform selection setting exhibits worse performance. For the evaluation of the efficiency of the compression achieved with CNN-based saliency maps, we employ the mean theta error metric, defined as the mean value of the angle difference θ between the ground-truth face normals and the reconstructed ones.
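The ordering-and-truncation step can be sketched as follows, keeping only the most salient portion of the delta coordinates non-zero (the `keep_ratio` value is an illustrative assumption):

```python
import numpy as np

def compress_deltas(deltas, classes, keep_ratio=0.25):
    """Order delta coordinates by predicted saliency class, most
    salient (class 4) first, and zero everything past the budget."""
    order = np.argsort(-classes, kind='stable')   # class 4 ... class 1
    n_keep = int(round(keep_ratio * len(deltas)))
    out = np.zeros_like(deltas)
    keep = order[:n_keep]
    out[keep] = deltas[keep]                      # retained deltas only
    return out
```

Because the budget is spent on the highest classes first, the retained deltas concentrate on the salient features, which is why the CNN-based selection tracks the fusion-based one in Figure 9.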

CONCLUSIONS
In this work, we presented a CNN-based 3D mesh saliency extraction approach, formulating a training set from spectral and geometrical 3D mesh processing. More specifically, we use the outcome of these methods to train a CNN model which can then be used for the fast and accurate saliency map extraction of very dense 3D scanned models. Our aim is to provide a fast 3D saliency mapping method that could benefit applications that require very dense 3D models and very fast processing. Extensive evaluation studies, assuming a variety of evaluation scenarios including heatmap visualization for visual perception as well as simplification and compression use cases, verify the superiority of our approach compared to other state-of-the-art approaches.