Robust3D: a robust 3D face reconstruction application

When a historical event such as a rock concert is reconstructed from video alone, reconstructing the faces and expressions of the musicians is important. However, because recorded concert video is often of low quality, the reconstructed appearance may be far from the real one. This paper describes a robust 3D face reconstruction application that can be applied to a video recording. The application first runs the DeblurGAN program to remove blur from the concert video. Then, a super-resolution program enlarges every frame of the video by a factor of four, making each frame clearer. Finally, the 3D faces are obtained by 3D reconstruction of the processed video frames with the 3DMM_CNN program.


Introduction
In recent years, three-dimensional face modeling and reconstruction have attracted increasing attention in computer vision and computer graphics. Many researchers have proposed ways to reconstruct three-dimensional shapes from two-dimensional images. However, most algorithms require multiple images or videos to initialize the 3D face reconstruction, whereas in many applications only one image is available. Although some methods use only a single image, the resulting three-dimensional face is not realistic enough [1]. Therefore, we need an algorithm that can reconstruct a realistic three-dimensional face from a single image. Some researchers have proposed more accurate three-dimensional face algorithms, the most successful of which are based on the three-dimensional morphable model (3DMM) [2][3][4][5][6][7][8][9][10] and reconstruct the face model by analysis-by-synthesis with stochastic optimization of a multivariate cost function. To improve the accuracy of facial feature fitting, the whole face is fitted first, and then specific areas such as the eyes, mouth and nose are fitted. The whole fitting process takes about half a minute, which makes the system difficult to use in practice. As a result, the 3DMM has been used for recognition only under limited, controlled viewing conditions [11][12][13][14][15]. Based on the 3DMM, an efficient, robust and accurate fitting algorithm, inverse compositional image alignment (ICIA), was proposed in [16] to fit two-dimensional images; it greatly improves fitting efficiency. Previous work [17] presents a linear shape and texture fitting algorithm that is similar to ICIA and five times faster than the stochastic optimization algorithm.
In [18], an efficient face reconstruction method that combines 2D and 3D is described. A single frontal face photograph is used to reconstruct a three-dimensional face model; only a general facial expression and normal illumination are required. From the personalized 3D face, a realistic virtual face can be obtained under different pose, illumination and expression (PIE) conditions.
The contribution of this paper is the design of a robust 3D face reconstruction application tailored to the characteristics of live video. First, the application runs the DeblurGAN program to remove blur from the concert video. Then, the super-resolution program VDSR enlarges every frame of the concert video by a factor of four, which makes every frame clearer. Finally, the 3D face is obtained by 3D reconstruction of the processed video frames with the 3DMM_CNN program.
Zhihan Lv: lyuzhihan@ub.edu

Related research
Many previous attempts have been made to estimate the 3D surface of a face appearing in a single image. Before considering these, it is important to mention recent multi-image reconstruction methods that use image sets (e.g., [19][20][21][22][23]). Recently, the generative adversarial network (GAN) has achieved good results in image super-resolution reconstruction and inpainting. A GAN can retain the rich details of an image and create images that are very similar to real ones. At present, 3DMM has not been applied to face recognition from concert video. One reason is that the faces reconstructed by this method are unstable under unconstrained viewing conditions: the 3D estimate is either unstable, so that reconstructions of the same individual differ widely, or over-generalized, so that most reconstructions look alike. This also explains why some recent work uses a rough, simple 3D shape approximation as a proxy when rendering faces to new views, rather than as a facial representation [24][25][26][27]. We adopt the 3DMM_CNN method, which can generate robust 3D face models from arbitrary face images. It uses a convolutional neural network (CNN) to regress the 3DMM face shape and texture parameters from the input image.
At present, 3D face reconstruction based on multiple face images can generate 3D face models with high accuracy, but it needs a large number of images. 3D face reconstruction from a single view is more difficult; existing methods can be divided into the following categories.
Firstly, statistical shape representations. For example, the widely used 3DMM method builds on many aligned 3D face shapes for 3D face reconstruction. This method cannot generate faces with individual features. Recently, CNNs have been used to regress the face parameters of the 3DMM [28]. However, the authors found that the lack of sufficient training data is a major problem in face recognition. Unlike the algorithm presented in this paper, they generate training face images by sampling 3DMM face models. Training on face images generated this way is prone to over-fitting [29], so they can only train a shallow residual network.
Secondly, scene assumption methods. To obtain a correct face reconstruction, this line of research estimates the scene and viewing angle of the input picture. Some methods use information such as the light source, facial reflectance and facial symmetry for the estimation [30]. However, such estimates often do not hold in real scenes [31].
Thirdly, example-based methods. These adjust a template 3D face according to the input image [32][33][34]. They can be used to generate the unseen side of a face in face recognition.
Fourthly, landmark fitting methods. These first detect facial landmarks [35,36] and then fit the 3D model to the detected landmarks [37,38].

Implementation of DeblurGAN
Given a blurred image $I_B$, we aim to reconstruct a sharp image $I_S$. To this end, we construct a generative adversarial network and train a CNN as the generator $G_{\theta_G}$ together with a discriminator network $D_{\theta_D}$ [39].

Network architecture
The overall structure is given in Fig. 1.
The structure of the generator CNN is given in Fig. 2.
The network structure is similar to that proposed by Johnson for the style transfer task. The authors added a global skip connection, referred to as ''ResOut''. With this connection, the CNN learns a residual correction $I_R$ to the blurred image, so that $I_S = I_B + I_R$. The network structure of the discriminator is the same as that of PatchGAN.

Loss function
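The loss function is the sum of the ''content loss'' and the ''adversarial loss'':

$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda \cdot \mathcal{L}_X$$

where $\mathcal{L}_{GAN}$ is the adversarial loss and $\mathcal{L}_X$ is the content loss. In this experiment, $\lambda = 100$.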

Adversarial loss
When training the original GAN (vanilla GAN), vanishing gradients and mode collapse may be encountered, which make training very difficult. The Wasserstein GAN (WGAN) proposed later uses the Wasserstein-1 distance to make training less difficult. Gulrajani et al. then proposed adding a ''gradient penalty'' term, which further improved the stability of training. WGAN-GP trains stably on various GAN structures and hardly needs any hyperparameter tuning. This paper uses WGAN-GP, with the adversarial loss computed over a batch of N blurred images as follows:

$$\mathcal{L}_{GAN} = \sum_{n=1}^{N} -D_{\theta_D}(G_{\theta_G}(I_B))$$
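For reference, the critic objective of WGAN-GP as introduced by Gulrajani et al. can be written as below. The symbols $P_r$, $P_g$, $\tilde{x}$, $\hat{x}$ and the penalty weight $\lambda_{gp}$ are notation chosen here for illustration and are not taken from this paper; $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
    + \lambda_{gp} \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]
```

The last term pushes the gradient norm of the critic towards 1, which is what replaces weight clipping in the original WGAN.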

Content loss
Content loss measures the difference between the generated sharp image and the ground truth. Two commonly used options are the L1 loss (also known as MAE, mean absolute error) and the L2 loss (also known as MSE, mean squared error). Recently, ''perceptual loss'' has been proposed; it is essentially an L2 loss, but it measures the distance between the CNN feature map of the generated image and that of the ground truth. It is defined as follows:

$$\mathcal{L}_X = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I_S)_{x,y} - \phi_{i,j}(G_{\theta_G}(I_B))_{x,y} \right)^2$$

where $\phi_{i,j}$ is the feature map output by the j-th convolution layer (after activation) before the i-th max pooling layer when the image is fed into VGG19 (pretrained on ImageNet), and $W_{i,j}$ and $H_{i,j}$ are the dimensions of the feature map.
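The way the two loss terms fit together can be sketched in a few lines of NumPy. This is a minimal illustration, not the DeblurGAN implementation: `toy_features` is a hypothetical stand-in for the VGG19 feature map, the critic scores are passed in as plain numbers, and the adversarial term is assumed to be the negative sum of critic scores on the restored images.

```python
import numpy as np

def toy_features(img):
    """Hypothetical stand-in for a VGG19 feature map:
    horizontal image gradient followed by a ReLU."""
    grad = img[:, 1:] - img[:, :-1]
    return np.maximum(grad, 0.0)

def content_loss(sharp, restored, features=toy_features):
    """Perceptual-style loss: squared distance between feature maps,
    normalised by the feature map size W * H."""
    f_sharp = features(sharp)
    f_restored = features(restored)
    h, w = f_sharp.shape
    return np.sum((f_sharp - f_restored) ** 2) / (w * h)

def adversarial_loss(critic_scores):
    """Generator-side WGAN loss (assumed form): negative critic score,
    summed over the batch of restored images."""
    return -np.sum(critic_scores)

def total_loss(critic_scores, sharp, restored, lam=100.0):
    """Adversarial loss plus lam times the content loss (lam = 100 in the text)."""
    return adversarial_loss(critic_scores) + lam * content_loss(sharp, restored)

sharp = np.array([[0.0, 1.0, 3.0], [2.0, 2.0, 5.0]])
restored = np.zeros_like(sharp)
print(content_loss(sharp, restored))                        # 3.5
print(total_loss(np.array([0.5, 1.5]), sharp, restored))    # 348.0
```

With a weight of 100 the content term dominates unless the restored image is already close to the ground truth in feature space, which is the intended balance between fidelity and realism.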

Motion blur generation
Compared with other image-to-image translation tasks, such as super-resolution and stylization, it is difficult to obtain sharp-blurred image pairs for training. A common method is to shoot video with a high-speed camera, take sharp images from the video frames and synthesize blurred images from them. Another method is to convolve sharp images with a variety of ''blur kernels'' to obtain synthetic blurred images. DeblurGAN extends the second method so that it can simulate more complex blur kernels. First, a random motion trajectory is generated by a Markov random process, following the method of Boracchi and Foi [40]; then the blur kernel is produced by ''sub-pixel interpolation'' of the trajectory.
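The two steps above (a Markov trajectory, then sub-pixel interpolation into a kernel) can be sketched as follows. This is a simplified illustration and not the Boracchi and Foi model itself: the velocity update rule and the parameters `inertia`, `jitter`, `step_size` and `size` are assumptions chosen to keep the example short.

```python
import numpy as np

def random_trajectory(num_steps=60, step_size=0.15, inertia=0.7, jitter=0.5, seed=0):
    """2D random motion as a Markov process: the next velocity depends
    only on the current velocity plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    pos = np.zeros((num_steps, 2))
    vel = rng.normal(size=2)
    vel = vel / (np.linalg.norm(vel) + 1e-12)
    for t in range(1, num_steps):
        vel = inertia * vel + jitter * rng.normal(size=2)
        vel = vel / (np.linalg.norm(vel) + 1e-12)   # keep (almost) unit speed
        pos[t] = pos[t - 1] + step_size * vel
    return pos

def trajectory_to_kernel(traj, size=21):
    """Rasterise the trajectory into a size x size kernel. Each point is
    split over its four neighbouring pixels with bilinear weights
    (sub-pixel interpolation), then the kernel is normalised to sum to 1."""
    kernel = np.zeros((size, size))
    pts = traj - traj.mean(axis=0) + (size - 1) / 2.0   # centre in the kernel
    for x, y in pts:
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        fx, fy = x - x0, y - y0
        weights = (
            (0, 0, (1 - fx) * (1 - fy)),
            (1, 0, fx * (1 - fy)),
            (0, 1, (1 - fx) * fy),
            (1, 1, fx * fy),
        )
        for dx, dy, w in weights:
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < size and 0 <= yi < size:
                kernel[yi, xi] += w
    return kernel / kernel.sum()

kernel = trajectory_to_kernel(random_trajectory())
print(kernel.shape, kernel.sum())
```

Convolving a sharp frame with such a kernel gives a synthetic blurred counterpart, so each sharp image yields a training pair.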

VDSR
VDSR is based on the residual network ResNet [41] proposed by Kaiming He et al. in 2015. ResNet solved the problem that very deep networks could not be trained effectively, and it improved performance. The residual network structure has since been applied in much subsequent work [42].
As the authors of the VDSR paper note, the input low-resolution image and the output high-resolution image are similar to a great extent. That is, the low-frequency information carried by the low-resolution image is similar to the low-frequency information of the high-resolution image. Carrying this shared part through the network during training takes a lot of time. In fact, we only need to learn the residual between the high-resolution image and the low-resolution image. The idea of the residual network structure is particularly suitable for super-resolution problems and has influenced subsequent deep learning super-resolution methods. VDSR is the most direct structure for learning residuals. Its network structure is shown in Fig. 3.
VDSR takes as input the low-resolution image interpolated to the target size, and adds this image to the residual learned by the network to obtain the final output. VDSR has four main contributions:
1. It deepens the network structure (20 layers); the deeper the network, the larger the receptive field. With 3 × 3 convolution kernels, a network of depth D has a (2D + 1) × (2D + 1) receptive field.
2. It uses residual learning. The residual image is sparse, with most values zero or small, so convergence is fast. VDSR also applies adaptive gradient clipping, which limits the gradients to a certain range and speeds up convergence.
3. It zero-pads the feature maps before each convolution, which keeps all feature maps and the final output image the same size and prevents the image from shrinking with each convolution. Experiments show that zero padding also improves the prediction of boundary pixels.
4. It trains on images of different scale factors together, so that a single trained model can handle super-resolution at different scale factors.
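The receptive field formula, the size-preserving zero padding and the residual addition can be checked with a small NumPy sketch. It is a single-channel toy under stated assumptions, not the VDSR network: the layer weights are supplied by the caller, and the function names are hypothetical.

```python
import numpy as np

def receptive_field(depth, kernel=3):
    """Side length of the receptive field of `depth` stacked stride-1
    convolutions; for 3x3 kernels this equals 2 * depth + 1."""
    return depth * (kernel - 1) + 1

def conv3x3_same(img, weights):
    """3x3 convolution (cross-correlation) with zero padding, so the
    output has the same height and width as the input."""
    padded = np.pad(img, 1, mode="constant")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += weights[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def vdsr_like_forward(interpolated_lr, layer_weights):
    """Conv + ReLU for all layers but the last; the last layer predicts
    the residual, which is added back to the interpolated input."""
    x = interpolated_lr.astype(float)
    for wgt in layer_weights[:-1]:
        x = np.maximum(conv3x3_same(x, wgt), 0.0)
    residual = conv3x3_same(x, layer_weights[-1])
    return interpolated_lr + residual

print(receptive_field(20))  # 41, i.e. (2 * 20 + 1)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
weights = [rng.normal(scale=0.1, size=(3, 3)) for _ in range(4)]
print(vdsr_like_forward(image, weights).shape)  # (8, 8)
```

If the last layer outputs zeros, the network simply returns the interpolated input, which is why learning only the residual is an easier starting point than learning the full high-resolution image.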

Regressing 3DMM parameters with a CNN
We use a CNN to regress the 3DMM face shape parameters [43] from the input face image. At present, data sets of unconstrained faces with 3D ground truth are too small for training deep neural networks. However, we make three observations that help.
Firstly, a 3D face can be accurately estimated from multiple images of the same face.
Secondly, many data sets containing multiple pictures of each individual are currently available.
Thirdly, very effective deep neural networks for face recognition are currently available.

Acquisition of training data
We adopted the recently published method for multi-image 3DMM generation [44] and applied it to the CASIA WebFace data set. The resulting 3D face models serve as ground truth for training our CNN. Multi-image 3DMM reconstruction consists of two steps. First, 3DMM parameters are estimated for each of the 500K images selected from the CASIA data set. Second, the 3DMMs generated from different photographs of the same individual are aggregated to obtain a single 3DMM per individual (about 10K individuals).

The 3DMM representation
Our system uses the popular Basel face model (BFM), which is the best publicly available single-view 3D face model at present. The generative model of a face has two parts, face shape and texture. The generating functions are:

$$S' = \hat{s} + W_S \alpha, \qquad T' = \hat{t} + W_T \beta$$

where $\hat{s}$ and $\hat{t}$ are the mean face shape and texture, $W_S$ and $W_T$ are the corresponding principal component matrices, and $\alpha$ and $\beta$ are the shape and texture parameters.

Single-image 3DMM fitting
Two different methods were used to fit a 3DMM to each training picture. For an image I, we estimate the parameters $\alpha^*$ and $\beta^*$ that represent a face similar to the one in I. The CLNF facial landmark detector is used to detect K = 68 facial landmarks $p_k \in \mathbb{R}^2$, $k \in 1 \ldots K$, together with confidence values $\omega$. The landmarks are used to initialize the pose of the input face in the 3DMM coordinate system. The pose is expressed with six degrees of freedom: rotation angles $r = [r_\alpha, r_\beta, r_\gamma]$ and a three-dimensional translation. Then, the face shape, texture, pose, illumination and color are estimated.
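A linear morphable model of this kind generates a face as a mean plus a weighted sum of principal components. The sketch below illustrates that idea with random stand-in data; it does not load the Basel face model. The function names, the number of vertices, and the split of the 198 parameters into 99 shape and 99 texture coefficients are assumptions made for illustration.

```python
import numpy as np

def make_toy_3dmm(num_vertices=5, num_components=99, seed=0):
    """Random stand-in for a morphable model: a mean vector of stacked
    (x, y, z) vertex coordinates and a matrix of principal components."""
    rng = np.random.default_rng(seed)
    mean = rng.normal(size=3 * num_vertices)
    basis = rng.normal(size=(3 * num_vertices, num_components))
    return mean, basis

def generate(mean, basis, coeffs):
    """Linear generative model: mean plus basis times coefficients.
    The same form applies to shape and to texture."""
    return mean + basis @ coeffs

mean_shape, shape_basis = make_toy_3dmm()
alpha = np.zeros(99)
alpha[0] = 2.0                        # move along the first component only
shape = generate(mean_shape, shape_basis, alpha)
print(shape.shape)                    # (15,): 5 vertices times 3 coordinates
```

Setting all coefficients to zero returns the mean face, so the magnitude of the coefficients measures how far a face departs from the average.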

Multi-image 3DMM fitting
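Multi-image 3DMM estimates are obtained by pooling the shape and texture parameters $\gamma_i = [\alpha_i, \beta_i]$ estimated from the N different images of the same individual:

$$\hat{\gamma} = \sum_{i=1}^{N} \omega_i \gamma_i, \qquad \sum_{i=1}^{N} \omega_i = 1$$

where $\omega_i$ are the normalized confidence values produced by CLNF facial landmark detection.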
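Pooling per-image estimates with detector confidences amounts to a weighted average. The following is a minimal sketch assuming that the pooled parameters are the confidence-weighted mean of the per-image parameter vectors; the function name `pool_3dmm` and the toy numbers are hypothetical.

```python
import numpy as np

def pool_3dmm(params, confidences):
    """Confidence-weighted average of per-image 3DMM parameter vectors.
    params: (num_images, num_params); confidences: (num_images,)."""
    params = np.asarray(params, dtype=float)
    weights = np.asarray(confidences, dtype=float)
    weights = weights / weights.sum()      # normalise so the weights sum to 1
    return weights @ params

# Two images of one individual; the second has three times the confidence.
per_image = [[1.0, 2.0], [3.0, 6.0]]
print(pool_3dmm(per_image, [1.0, 3.0]))    # [2.5 5. ]
```

Images with unreliable landmark detections therefore contribute less to the pooled estimate that later serves as the training target.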

Learning to regress pooled 3DMM
For each individual in the data set, there are multiple images and a single pooled 3DMM. We use these data to train the model so that it produces similar 3DMM feature vectors for different pictures of the same individual.
As shown in Figs. 1, 2 and 3, we use a 101-layer deep ResNet network for face recognition. The output layer of the neural network is a 198-dimensional 3DMM feature vector. The pooled 3DMMs generated from the CASIA images are then used as target values to fine-tune the network. We also tried the VGG-16 structure, which turned out to be slightly worse than the ResNet structure.

The asymmetric Euclidean loss
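In our experiments, we found that using the Euclidean loss results in a lack of detail in the output 3D faces. Therefore, we introduce an asymmetric Euclidean loss:

$$\mathcal{L}(\gamma_p, \gamma) = \lambda_1 \cdot \lVert \gamma^{+} - \gamma_{max} \rVert_2^2 + \lambda_2 \cdot \lVert \gamma_p^{+} - \gamma_{max} \rVert_2^2$$

using the element-wise operators:

$$\gamma^{+} = \mathrm{sign}(\gamma) \cdot \gamma, \qquad \gamma_p^{+} = \mathrm{sign}(\gamma) \cdot \gamma_p, \qquad \gamma_{max} = \max(\gamma^{+}, \gamma_p^{+})$$

Here, $\gamma$ is the target pooled 3DMM value, $\gamma_p$ is the value regressed by the network, and $\lambda_{1,2}$ balance the over- and under-estimation errors: the first term penalizes over-estimation and the second penalizes under-estimation. In practice, we set $\lambda_1 = 1$, $\lambda_2 = 3$ to encourage the model to learn more details.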
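The asymmetry can be made concrete with a short NumPy sketch. It assumes the loss takes the form of two squared-distance terms, one for over-estimation weighted by 1 and one for under-estimation weighted by 3, computed with element-wise sign and max operators; the function name is hypothetical.

```python
import numpy as np

def asymmetric_euclidean_loss(gamma_p, gamma, lam1=1.0, lam2=3.0):
    """Asymmetric Euclidean loss (assumed form).
    gamma:   target pooled 3DMM parameters
    gamma_p: parameters predicted by the network
    Under-estimating the magnitude of a parameter costs lam2 per unit
    squared error, over-estimating it costs lam1. Entries where the
    target is exactly zero contribute nothing because sign(0) = 0."""
    gamma = np.asarray(gamma, dtype=float)
    gamma_p = np.asarray(gamma_p, dtype=float)
    sign = np.sign(gamma)
    g_plus = sign * gamma                  # element-wise abs(gamma)
    gp_plus = sign * gamma_p               # prediction, flipped to the target's sign
    g_max = np.maximum(g_plus, gp_plus)
    over = np.sum((g_plus - g_max) ** 2)   # non-zero where the prediction is too large
    under = np.sum((gp_plus - g_max) ** 2) # non-zero where the prediction is too small
    return lam1 * over + lam2 * under

target = [2.0, -2.0]
print(asymmetric_euclidean_loss([3.0, -3.0], target))  # 2.0 (over-estimate)
print(asymmetric_euclidean_loss([1.0, -1.0], target))  # 6.0 (under-estimate)
```

Both predictions are off by 1 in each coordinate, yet the one closer to the mean face costs three times as much, which pushes the network away from bland, average-looking outputs.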

Fig. 3 VDSR network structure (Conv.1, ReLU.1, ..., Conv.D-1, ReLU.D-1, Conv.D (residual); output HR)

Neural network hyperparameters
We use the SGD optimizer with a mini-batch size of 144, momentum of 0.9 and L2 weight decay of 0.0005 to train the model. The initial learning rate is 0.01. When the validation loss saturates, we reduce the learning rate, and we repeat this until the validation loss stops decreasing.
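A single parameter update with these settings can be sketched as follows. The update convention (weight decay added to the gradient, then a momentum step) is one common choice and is an assumption here, as are the `patience` and `factor` values of the plateau rule; the text gives only the momentum, weight decay and learning rate.

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD step with momentum and L2 weight decay (assumed convention)."""
    g = grad + weight_decay * w            # L2 penalty contributes weight_decay * w
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity

def reduce_on_plateau(lr, val_losses, patience=3, factor=0.1):
    """Hypothetical plateau rule: shrink the learning rate when the last
    `patience` validation losses did not improve on the earlier best."""
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        return lr * factor
    return lr

# Minimise f(w) = 0.5 * w^2, whose gradient is w.
w, v = np.array([1.0]), np.zeros(1)
for _ in range(200):
    w, v = sgd_step(w, grad=w, velocity=v)
print(abs(w[0]) < 1.0)                     # True: the iterate moved towards 0

print(reduce_on_plateau(0.01, [1.0, 0.5, 0.5, 0.6, 0.55]))
```

In a real training loop the gradient would come from backpropagation over a mini-batch of 144 images rather than from a closed-form expression.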

Parameter-based 3D-3D recognition
After training, the CNN maps an input image to the 3DMM parameter vector $\gamma_p$, that is, $f: I \mapsto \gamma_p$. We use $\gamma_p$ as the face feature for face recognition. The similarity of two faces is computed as the cosine similarity of their parameter vectors:

$$s(\gamma_{p1}, \gamma_{p2}) = \frac{\gamma_{p1} \cdot \gamma_{p2}^{T}}{\lVert \gamma_{p1} \rVert \, \lVert \gamma_{p2} \rVert}$$

In some cases, a single individual is represented by a set of pictures. For example, in the YTF data set a video contains multiple frames of a single individual, and in the IJB-A data set an individual is described by multiple data sources (pictures, videos). In such cases, we estimate the 3DMM of each picture or video frame and average the results to obtain a single 3DMM parameter vector.
Figure 4 compares face detection on the original image and on the deblurred image. In the left image, the face detection result is not ideal: only the nose and mouth are detected, while the outlines of the eyes, eyebrows and face are missed. In the right image, the outline of the face, the eyebrows, the eyes, the mouth and the nose are all well detected. This shows that the deblurring program improves the sharpness of the blurred image and the detection rate. The face contours are rendered with OpenPose [45].
Figure 5 compares the results of Robust3D and PRNET. Here, (a1)-(e1) are screenshots of concert videos after deblurring and super-resolution; (a2)-(e2) are the three-dimensional face models generated by PRNET; (a3)-(e3) are the three-dimensional face models generated by Robust3D; and (a4)-(e4) are the results of facial feature detection with Robust3D. As the figure shows, the image generated by PRNET is more realistic and closer to the original image, but it is strongly affected by the environment and does not filter out environmental noise. For example, the right side of the three-dimensional face generated in (a2) has a black spot on it, which is the microphone. In (a3), there is no microphone noise. This is because (a3) uses the 3DMM method, which captures only the features of the face and other relevant factors, so the environmental noise is filtered out. Similarly, (c2) contains environmental noise such as the microphone, while (c3) does not. In addition, in (d2) facial features such as the eyes are unclear and partly missing, because the face in the original image (d1) is small. In (d3), the eyes and other features are very clear, which is another advantage of the 3DMM method.
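Returning to the parameter-based comparison described at the start of this section, the matching step can be sketched as below. The sketch assumes that two faces are compared by the cosine similarity of their 3DMM parameter vectors and that a set of frames is reduced to one vector by averaging; the function names and toy vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 3DMM parameter vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_3dmm(frame_params):
    """Average the per-frame parameter vectors of one individual."""
    return np.mean(np.asarray(frame_params, dtype=float), axis=0)

# Three video frames of one person, compared with a single still image.
video_person = average_3dmm([[1.0, 0.1], [0.9, -0.1], [1.1, 0.0]])
still_image = [2.0, 0.0]
print(round(cosine_similarity(video_person, still_image), 6))  # 1.0
```

Because cosine similarity ignores vector length, the score depends on the direction in which a face departs from the mean face rather than on how strongly it does so.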

Conclusion
By training a CNN, an input image can be converted into 3DMM parameters, from which the 3D face is reconstructed. The accuracy of the Robust3D method is higher than that of existing methods, and it performs well on the live concert video data set. The disadvantage of Robust3D is that some differences remain between the reconstructed face and the real face. In future work, we will improve the method using big data technology [46,47].