Image-Based 3D Mesh Denoising Through a Block Matching 3D Convolutional Neural Network Filtering Approach

Throughout the years, several works have been proposed for 3D mesh denoising. Nevertheless, despite their reconstruction quality, challenges remain in preserving fine surface features. Motivated by the impressive results of image denoising by 3D transform-domain collaborative filtering (CF), we extend it to 3D mesh denoising. CF is capable of revealing the finest details shared by grouped blocks while at the same time preserving the unique features of each block. A promising recent approach suggests unrolling the computational pipeline of CF into a convolutional neural network (CNN) structure, significantly increasing the efficiency of this solution. In this paper, we successfully extend and apply this method to 3D meshes, making a transition from face normals to pixels. Extensive evaluation studies carried out on a variety of 3D meshes verify that the proposed approach achieves plausible reconstruction outputs and provides very promising results.


INTRODUCTION
Despite the impressive results of recent 3D mesh denoising approaches [1,2,3], algorithms used in the area of image processing continue to inspire the area of 3D mesh processing. Motivated by this observation, we focus on converting the problem of mesh denoising to image denoising, using very robust approaches that have been successfully tested in that area. 3D transform-domain CF (e.g., 3D block matching, BM3D) achieves state-of-the-art image denoising performance by providing a 3D estimate that consists of jointly filtered grouped image blocks. CF is a special procedure that deals with 3D groups formed by similar 2D fragments of the image, and includes the following steps: 3D transformation of a group, shrinkage of the transform spectrum, and inverse 3D transformation. The CF of the grouped blocks reveals the details that are common between blocks, since for each pixel we obtain several different estimates that need to be combined. This method has been successfully applied in a large number of applications, including image/video denoising [4,5], deblurring, super-resolution, and compression. CF approaches process a noisy image by successively extracting reference blocks and matching blocks that are similar to the reference one, forming a 3D group matrix. After applying a 3D transform to the group formed by the overlapping blocks and attenuating the noise by hard thresholding, an aggregation step takes place in order to form the estimate of the whole image. Recent approaches [6] suggest substituting the 3D transform-domain processing with traditional CNNs, outperforming the original BM3D approaches in terms of computational efficiency. The BM3D filtering of images and its adoption for meshes has an intuitive formulation, which leads to a simple data-driven method that addresses issues stemming from the transition from two dimensions to manifolds in three dimensions.
Data-driven methods are generally superior to traditional mesh denoising methods [7,8,9], since they do not require searching for ideal parameter values for each model. However, their limitations mainly stem from the large dataset required for the training process and the corresponding training complexity. Additionally, in real-case scenarios, the testing data generally have a different form from the training data (due to different light conditions, device characteristics, etc.), making the deployment of the CNN in real applications problematic. A possible solution to this problem is to remove any objective characteristic from the data (i.e., training and testing), in order to be independent of geometric constraints (density, connectivity, etc.), different quality of the scanner devices, or other external conditions that could affect the captured 3D model. In this paper, we take into account all the aforementioned drawbacks, introducing a novel pipeline for 3D mesh denoising that takes advantage of the benefits of the 3D transform-domain CF used for image denoising. Our main contributions can be summarized as follows:
• The proposed data-driven method gives the flexibility of fully automatic runtime execution, without the need for exhaustive searches of ideal parameters per model.
• Our method has no special requirements related to the form of the 3D mesh, since we uniformly create equal-sized images encoding the useful geometrical information of any mesh.
• The proposed image-based approach results in a more efficient training process, avoiding the need for large datasets by efficiently exploiting geometrical coherence.
• We show how a method, inspired by robust and well-known approaches used in the area of image processing, can also be efficiently used in the area of 3D mesh processing.

Fig. 1. Pipeline of the proposed method for image-based 3D mesh denoising using BM3D CNN filtering.
The rest of this paper is organized as follows: In Section 2, we discuss in detail each step of the proposed method. Section 3 presents the experimental results in comparison with other methods and in Section 4 we draw the conclusions.

IMAGE-BASED 3D MESH DENOISING
The proposed method starts with the training of the CNN. After that, the trained model can be used to denoise any new noisy 3D object. The most vital process is a preprocessing step that creates the appropriate and uniformly used form of the equal-sized images which efficiently encode the geometrical information of any noisy mesh. Fig. 1 briefly illustrates the pipeline of the proposed scheme. We work with triangle meshes M consisting of n vertices v and n_f faces f. A noisy mesh M̂ can be written as: v̂_i = v_i + ẑ_i, where ẑ_i ∈ R^{3×1} ∀ i = 1, ..., n represents a noise vector. Each face f_i is represented by its corresponding normal n_i and centroid point c_i. For each unit centroid normal n_i = {n_x, n_y, n_z}, ∀ i = 1, ..., n_f, where n_x, n_y, n_z ∈ [−1, 1] and n_x² + n_y² + n_z² = 1, we assign a corresponding RGB pixel p_i. However, a pixel of an image cannot take negative values; for this reason, we normalize the values of the normals to lie in the range [0, 1]. Then, we set the values of the {R, G, B} components of each pixel equal to {n_x, n_y, n_z}, correspondingly. In this way, a normal n_i and a pixel p_i are equivalent but with a different physical meaning.
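As a minimal illustration of this normal-to-pixel correspondence, the affine map from [−1, 1] to [0, 1] and its inverse can be sketched as follows (the helper names are ours, not from the paper):

```python
# Sketch of the normal-to-pixel mapping described above.
# Each component of a unit face normal lies in [-1, 1]; shifting and
# scaling it into [0, 1] lets the normal be stored as an RGB pixel.

def normal_to_pixel(n):
    """Map a unit normal with components in [-1, 1] to an RGB triple in [0, 1]."""
    return tuple((c + 1.0) / 2.0 for c in n)

def pixel_to_normal(p):
    """Inverse mapping: recover the normal components from the RGB triple."""
    return tuple(2.0 * c - 1.0 for c in p)

# Example: the normal (0, 0, 1) becomes the pixel (0.5, 0.5, 1.0).
```

The mapping is lossless up to the image's quantization, so the denoised pixels can later be converted back to centroid normals.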

Creation of Images Based on Normals
We start by assuming that for each centroid c_i there is a patch (enclosed in yellow border lines, as shown in Fig. 2) which is created based on its proximity to the k−1 geometrically nearest neighboring centroids. Then, we estimate the normals of the corresponding faces and form an image of size k × k, where k is an odd number {k = 2q + 1 : q ∈ Z}. In the central cell (k+1)/2 of the first row, we place the "normal-of-interest" (in red color). The normal n_12 of the closest centroid (i.e., at distance d_1) is placed in the first cell to the right, the normal n_13 of the next closest centroid (i.e., at distance d_2) is placed in the first cell to the left, and the completion of the row continues, in turn, using the normals of all centroids, where d_1 < d_2 < ... < d_{k−1}. The next rows are completed in the same way, assuming, however, that the central cell represents the normal of the closest centroid of n_1, increasingly for each row.

Fig. 2. Example of how a 9 × 9 image is created, using the normals of a neighborhood of 9 centroid points.
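The row-filling order described above (normal-of-interest in the center, then alternating right/left cells by increasing centroid distance) can be sketched as follows; the function and variable names are ours, and the handling of distance ties is an assumption:

```python
import math

def build_row(normal_of_interest, neighbors, centroid, k):
    """Order the k-1 neighboring normals around the central cell of a
    length-k row. `neighbors` is a list of (centroid, normal) pairs; the
    closest neighbor goes to the cell right of the center, the next closest
    to the left, alternating outward."""
    assert k % 2 == 1 and len(neighbors) == k - 1
    # sort neighbors by distance to the patch centroid: d1 < d2 < ... < d_{k-1}
    ordered = sorted(neighbors, key=lambda cn: math.dist(cn[0], centroid))
    row = [None] * k
    mid = (k - 1) // 2              # central cell (0-based index)
    row[mid] = normal_of_interest
    right, left = mid + 1, mid - 1
    for idx, (_, n) in enumerate(ordered):
        if idx % 2 == 0:            # even rank -> next free cell to the right
            row[right] = n; right += 1
        else:                       # odd rank -> next free cell to the left
            row[left] = n; left -= 1
    return row
```

Repeating this for the patches of the k−1 nearest centroids, in increasing distance order, yields the remaining rows of the k × k image.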

Creation of the Dataset Consisting of Ideal Images
Image-guided filtering has been presented in many papers, providing very good results [10,11,12]. Additionally, in the area of 3D mesh denoising, guided normals have also demonstrated good denoising performance in many recent works [13,2]. In this work, we follow the same line of thought, but using guided pixels and ideally-selected images for the filtering step of the mesh denoising.

Estimation of the Ideal-selected Row and Guided Pixel
As we discussed earlier, for each centroid normal n_i of the mesh, we estimate an image I_i, the first row I_1i of which represents the nearest neighborhood area, called a patch. Nevertheless, the same pixel p_i may appear in more than one row/patch, since each neighborhood area totally overlaps with the neighboring areas of other normals. In fact, some of these rows may represent the color of a pixel better than others. For this reason, we collect a set S_i = {I_1i1, I_1i2, ..., I_1in_p} of n_p candidate rows, and our main objective is to find which one of the I_1ij ∀ j = 1, ..., n_p is the ideal representative of the pixel p_i [13], in terms of color similarity. For each one of these candidate rows I_1ij, we estimate the corresponding covariance matrix C_ij = I^T_1ij I_1ij ∈ R^{3×3} ∀ i = 1, ..., n_f, ∀ j = 1, ..., n_p, and we decompose it as C_ij = UΛU^T into its eigenvectors U and eigenvalues Λ = diag(λ_1ij, λ_2ij, λ_3ij). Then, for the estimation of the ideal-selected row I*_1i, two parameters are investigated: (a) the 2-norm s_ij = ||λ_1ij − λ_2ij||_2 of the first two eigenvalues, and (b) the maximum color difference h_ij = max(||p_i − p_l||_2) ∀ p_l ∈ I_1ij between pixel i and the other pixels of the same row. Among all candidate rows, we pick the one with the smallest value of Eq. (1), ∀ i = 1, ..., n_f, ∀ j = 1, ..., n_p. Finally, the guided pixel g_i is estimated as the average pixel of this ideal row, according to Eq. (2).
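A sketch of the two per-row quantities and the selection step follows. Since the exact combination in Eq. (1) is not reproduced here, a plain sum of s and h is used as a stand-in assumption, and the helper names are ours:

```python
import numpy as np

def row_scores(row, i_pixel):
    """Compute, for one candidate row (a (k, 3) array of RGB pixels),
    the eigenvalue gap s and the maximum colour difference h to the
    pixel of interest, as described above."""
    I = np.asarray(row, dtype=float)              # (k, 3) row of pixels
    C = I.T @ I                                   # 3x3 covariance matrix
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]    # eigenvalues, descending
    s = abs(lam[0] - lam[1])                      # gap of the two largest
    h = max(np.linalg.norm(np.asarray(i_pixel, dtype=float) - p) for p in I)
    return s, h

def ideal_row(candidates, i_pixel):
    """Pick the candidate row minimising a combination of s and h
    (assumed here to be their sum), and return it together with its
    guided pixel, i.e. the row average (Eq. (2))."""
    best = min(candidates, key=lambda r: sum(row_scores(r, i_pixel)))
    return best, np.asarray(best, dtype=float).mean(axis=0)
```

In the paper, the selection is made over all n_p rows containing the pixel, so every pixel receives exactly one guided value.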

Estimation of Ideal Images Based on Guided Pixels
For the creation of the ideal images, we follow the same format as the one presented in subsection 2.1, but using the guided pixels g_i instead of the normals n_i. In Fig. 3, we present examples of images created by the RGB representation of: (i) the normals, and (ii) the guided pixels, for different features (i.e., flat area, edge, and corner) of the Fandisk model affected by different levels of Gaussian noise with intensity σ_E = σ E_mean, where σ is the variance of the Gaussian function and E_mean is the average edge length [2]. As we can observe, the images created by the guided pixels (Fig. 3-(ii) (b)-(d)) have a similar appearance to the original, despite the different levels of noise. On the other hand, images that have been created using the centroid normals are very noisy and differ from the original, making them difficult to denoise, even in cases of flat areas. This observation is very important, since it shows that it is not necessary to train different models for different noise patterns, as other data-driven methods do, but that a generic model suffices. This is especially useful in cases where the level or type of noise is not known a priori.

Basic Characteristics of BM3D
BM3D uses three basic steps: (1) "block matching", which finds groups of similar patches; (2) "3D wavelet shrinkage", which denoises each one of these groups in the 3D wavelet transform domain; and (3) "patch group aggregation", an averaging procedure in which all the estimated patches return to their original positions, re-weighting each pixel that appears in many instances, since the images are created by overlapping [14]. For a more efficient block matching step, we use a feature classification similar to [13]. More specifically, we classify each centroid i into three different categories of features (i.e., flat area (F), edge (E) and corner (C)), using κ-means (κ = 3) clustering applied to the vector λ_i = [λ_1i1 λ_2i1 λ_3i1] of its eigenvalues, as described in subsection 2.2.1. To create a more relevant dataset, we rotate all normals of each patch i by an angle θ_xi, as shown in Fig. 4, so that the i-th "normal-of-interest" (i.e., the one lying on the red face) becomes equal to the constant vector [1 0 0]. In this way, a more coherent dataset is created, facilitating the learning process. Otherwise, patches with every possible normal direction would need to be provided, resulting in a very large training dataset and increasing the execution time of the learning step.
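The alignment of each patch's normal-of-interest to the constant vector [1 0 0] can be sketched via Rodrigues' rotation formula; the function names and the convention that the normal-of-interest sits at index 0 are our assumptions:

```python
import numpy as np

def rotation_to_x(n):
    """Rotation matrix R with R @ n = [1, 0, 0] for a unit normal n,
    built with Rodrigues' formula: R = I + K + K^2 / (1 + c), where
    K is the cross-product matrix of v = n x e1 and c = n . e1."""
    n = np.asarray(n, dtype=float)
    e1 = np.array([1.0, 0.0, 0.0])
    v = np.cross(n, e1)
    c = float(np.dot(n, e1))
    if np.isclose(c, -1.0):                 # n = -e1: rotate pi about z
        return np.diag([-1.0, -1.0, 1.0])
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

def align_patch(normals):
    """Rotate all normals of a patch so the normal-of-interest (taken
    here as index 0) becomes [1, 0, 0], making patches pose-invariant."""
    R = rotation_to_x(normals[0])
    return [R @ np.asarray(n, dtype=float) for n in normals]
```

Because every training patch is expressed in this canonical frame, the network never has to learn the same local geometry under different orientations.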

Characteristics of the used BM3D CNN
The presented CNN (Fig. 5), similar to [6], implements the second step of BM3D (i.e., 3D wavelet shrinkage) and consists of two convolution layers and a nonlinear transform layer. More specifically, the first convolution layer is used as the wavelet transform step and the second as the inverse wavelet transform step. They are generally presented as Eq. (3): I_{l+1} = W_l ∗ I_l, where ∗ indicates multidimensional convolution, I_l denotes the group of images for layer l ∈ {1, 3}, and W_l is the filter with the weights of layer l after the concatenation process. The nonlinear transform layer is used as the wavelet shrinkage step. Similar to [6], radial basis functions (RBFs) are used instead of regular hard thresholding or Rectified Linear Units (ReLUs), according to Eq. (4).

Fig. 5. Patches of the mesh are converted to images and then they are fed into the CNN for denoising.
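For illustration, a pointwise shrinkage built from a weighted sum of Gaussian RBFs can be sketched as below; the exact parametrization of Eq. (4) follows [6] and is not reproduced here, so the form used is an assumption:

```python
import numpy as np

def rbf_shrinkage(x, weights, centers, gamma):
    """Pointwise nonlinearity as a learned weighted sum of Gaussian RBFs,
    in the spirit of the shrinkage layer above (assumed form):
        f(x) = sum_m w_m * exp(-(x - mu_m)^2 / (2 * gamma^2)).
    `weights` and `centers` are the learned parameters (shape (M,));
    `gamma` is a fixed kernel width."""
    x = np.asarray(x, dtype=float)[..., None]           # broadcast over centers
    phi = np.exp(-(x - centers) ** 2 / (2.0 * gamma ** 2))
    return phi @ weights                                # contract over the M kernels
```

Unlike a hard threshold, this mixture is smooth and differentiable everywhere, which is what makes the unrolled pipeline trainable end-to-end.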

3D Mesh Reconstruction by Updating Vertices
After the estimation of the denoised pixels p̂, each of which is the central pixel of the first row of the corresponding denoised image, we convert the RGB values back to the (n_x, n_y, n_z) values of the corresponding denoised centroid normals n̂. Then, we reconstruct the denoised mesh by updating its vertices, using the following robust and well-known equation [3]:

v_i^{(t+1)} = v_i^{(t)} + (1/|Ψ_i|) Σ_{j∈Ψ_i} n̂_j ⟨n̂_j, c_j − v_i^{(t)}⟩,   (5)

where ⟨a, b⟩ represents the inner product of a and b, (t) represents the t-th iteration, and the matrix Ψ_i represents the first-ring area of the vertex v_i. Algorithm 1 summarizes the basic steps of the proposed process.
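Assuming the classic normal-driven update of [3] (array layout, names, and signatures are ours), the vertex reconstruction can be sketched as:

```python
import numpy as np

def update_vertices(V, F, normals, rings, iters=10):
    """Move each vertex along the denoised normals of its first-ring faces,
    in the spirit of the update equation above. `rings[i]` lists the faces
    incident to vertex i; `normals[j]` is the denoised unit normal of face j.
    A sketch, not the paper's exact implementation."""
    V = np.asarray(V, dtype=float).copy()
    F = np.asarray(F)
    N = [np.asarray(n, dtype=float) for n in normals]
    for _ in range(iters):
        C = V[F].mean(axis=1)                 # face centroids at iteration t
        V_new = V.copy()
        for i, ring in enumerate(rings):
            if not ring:
                continue
            # displacement of v_i projected onto each incident denoised normal
            delta = sum(N[j] * np.dot(N[j], C[j] - V[i]) for j in ring)
            V_new[i] = V[i] + delta / len(ring)
        V = V_new
    return V
```

Each iteration pulls the vertices toward positions whose incident faces agree with the denoised normals; a surface already consistent with them is a fixed point of the update.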

Experimental Setup, Datasets and Metrics
The experiments were carried out on a PC with an Intel Core i7-4710 CPU @ 3.60 GHz and 16 GB of RAM. For the experiments, a variety of 3D meshes are used [15], affected by different levels of noise. For the training process, we use only 5 different models, consisting of ∼1.2 million (M) centroid points in total. This means that we have 1.2 M very relevant instances for training, which proves very effective according to the experimental results. For the evaluation of the reconstructed results, we use the following metrics: (i) the average one-sided Hausdorff distance HD from the denoised mesh to the known ground-truth mesh; (ii) the metric θ, representing the average angle difference between the normals of the ground truth and the reconstructed model [13]; and (iii) a heatmap visualization highlighting, in different colors, the difference |M − M̃| between the original mesh M and the reconstructed mesh M̃.

Algorithm 1: BM3D CNN for 3D Mesh Denoising
Input: Noisy 3D model M̂ ∈ R^{n×3};
Output: Denoised 3D model M̃ ∈ R^{n×3};
1 Estimate the n_f centroid normals n and convert them to corresponding pixels p;
2 for i = 1, ..., n_f do
3   Estimate the ideal-selected row I*_1i via Eq. (1) and the guided pixel g_i according to Eq. (2);
4   Estimate the filtered image I*_i based on guided pixels;
5 end
6 Create 3 different types of images, based on the feature classification, for a more efficient block matching;
7 Image denoising based on BM3D CNN via Eqs. (3)-(4);
8 Convert denoised pixels p̂ to denoised centroid normals n̂;
9 Reconstruct the denoised 3D mesh M̃ according to Eq. (5);
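The two numeric metrics above can be sketched as follows (brute-force nearest neighbours; the names are ours):

```python
import numpy as np

def mean_angle_error(N_gt, N_rec):
    """Average angle, in degrees, between corresponding unit normals of the
    ground truth and the reconstruction (the theta metric above)."""
    cos = np.clip(np.sum(np.asarray(N_gt, float) * np.asarray(N_rec, float),
                         axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def one_sided_hausdorff(A, B):
    """Average one-sided distance: mean over points of A of the distance to
    the nearest point of B (a point-sampled sketch of the HD metric above)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    return float(d.min(axis=1).mean())
```

In practice, the distance from the denoised mesh is measured to the ground-truth surface rather than to its vertex set alone, but the point-sampled version above conveys the idea.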

Experimental Results
The quality performance of the proposed technique is compared with other well-known and robust techniques from the literature, namely: (a) non-iterative smoothing (NIS) [16]; (b) fast and effective (FAE) mesh denoising [3]; (c) bilateral normal filtering (BNF) [17]; (d) guided normal filtering (GNF) [2]; and (e) two-stage graph spectral processing (TSGSP) [13]. We also compare the proposed approach to data-driven methods such as (f) deep autoencoders denoising (DAD) [18] and cascaded normal regression [15] in two different configurations: (g) using only a small dataset of models (CNR) (the exact same dataset we use), and (h) using a larger dataset (CNR+). Our method is denoted as (OUR), and (GEN OUR) represents our approach using a general training model that has been trained on different levels of noise at the same time. In Fig. 8, we present the original model and three corresponding noisy 3D models affected by three different levels of noise (i.e., σ_E = 0.3, σ_E = 0.4, σ_E = 0.5). We provide enlarged details of each reconstructed model for easier comparison among the methods, and we also report the metric θ. The benefits of our approach are more apparent when the level of noise is σ_E > 0.3. For lower levels of noise, the results are comparable with those of the other methods. We can draw similar conclusions by observing the heatmaps of Fig. 7, which visualize in different colors the differences between the reconstructed and the original mesh at three different levels of noise. Large differences are represented in red, while small differences are represented in blue. Finally, in Fig. 6, we present plots of the Hausdorff distance (HD) error per level of noise, for the models reconstructed by the different approaches. For a fairer comparison between data-driven and conventional methods, we set fixed parameter values that provide good results at all noise levels, and we do not search for the ideal values per parameter and model.
Our method avoids over-smoothing of the models and, at the same time, takes into account the special characteristics of the features, succeeding in preserving them.

CONCLUSIONS & FUTURE WORK
In this work, we presented an image-based 3D mesh denoising approach using BM3D CNN filtering applied to color images. We take advantage of the stable and robust behavior of BM3D, which has been exhaustively tested for image denoising, in combination with the fast and effective behavior of CNNs. To achieve this, we create an appropriate form of images representing patches of neighboring normals. Experimental analysis verified the correctness of our assumption, while comparison with other traditional state-of-the-art methods demonstrates the potential of our approach. The proposed method has many advantages: (i) only fixed parameter values are used; (ii) it requires a relatively small dataset for the training process; (iii) a general model, trained on different levels of noise, can be used for denoising, providing acceptable results. The proposed approach has not been tested under different noise patterns (e.g., scanned models affected by real noise); however, this extension is part of our future plans.

Fig. 7. Heatmaps of (a) NIS [16], (b) FAE [3], (c) BNF [17], (d) GNF [2], (e) TSGSP [13], (f) DAD [18], (g) CNR [15], (h) CNR+, (i) GEN OUR, (j) OUR.

Fig. 8. Denoising results of: (a) NIS [16], (b) FAE [3], (c) BNF [17], (d) GNF [2], (e) TSGSP [13], (f) DAD [18], (g) CNR [15], (h) CNR+, (i) GEN OUR, (j) OUR.