In-Domain Inversion for Improved 3D Face Alignment on Asymmetrical Expressions

—Facial landmark detection, often termed face alignment, is a well-studied research problem in computer vision. Nonetheless, face alignment on asymmetrical expressions has been overlooked in the literature, particularly for the unusual gestures observed in individuals with unilateral facial paralysis. In this paper, we explore in-domain inversion in a semi-supervised approach for face alignment and target the detection of 3D landmarks on symmetrical and extremely asymmetrical facial expressions due to paralysis. Our approach first leverages unlabeled face data to synthesize face images, while learning a compressed representation in the latent space. Then, it integrates in-domain inversion in the self-supervised stage, to make the latent space semantically meaningful. This is exploited in the supervised stage by a 2D face landmark detector, trained on labeled data. Finally, we extend the pipeline to 3D face alignment and regress the depth coordinate from the intermediate latent space and the predicted 2D landmarks. We evaluate and compare our method to related work on publicly available datasets, and demonstrate that our approach outperforms the state of the art in the detection of 3D facial landmarks on our newly introduced facial paralysis dataset, ParFace. Our implementation and dataset are available at https://github.com/jilliam/ParFace.


I. INTRODUCTION
Face alignment aims to register a predefined set of landmarks on a face image and is a key step for other face analysis tasks, such as head pose estimation [21], face synthesis [102], reconstruction [77], animation [22] and palsy assessment [42]. Many of these landmarks are semantically meaningful, referring, e.g., to the corners of the eyes and lips, the tip of the nose and the contours of the eyebrows.
In the past years, many researchers strove to unify and standardize the set of keypoints used for face alignment [20], [74], [75], [76], [97]. The most common set defines 68 fiducial points on the eyes, nose, lips, eyebrows and around the boundary of the face, following the convention proposed in Multi-PIE [31]. This number differs for profile faces, where 39 fiducial points are annotated instead. These landmarks, referred to as 2D facial landmarks, are defined around the face contour and do not always correspond to the projection of 3D landmarks onto a 2D image, specifically when the face is not frontal [45]. Although this convention is useful for tasks such as face segmentation, it is error prone for optimization problems, e.g., when minimizing the reprojection error [12], [49]. 3D Morphable Models (3DMMs) [8] and deep architectures have enabled the collection of datasets with 3D landmark annotations.

Fig. 1. Our approach learns to synthesize faces from unlabeled datasets and exploits the latent code to predict the landmarks.
With the introduction of large-scale datasets [89], [97], [107] for training deep neural networks (DNN), 2D face alignment gained a performance boost w.r.t. traditional computer vision approaches, especially for challenging images with varying illumination, large head poses and occlusion. These datasets, however, have relatively few samples of large asymmetrical expressions and even fewer of peripheral facial paralysis, or palsy, affecting current face alignment approaches (see Fig. 2). This limitation has a negative impact on palsy assessments that rely on face alignment [2], [34]. Such assessments usually require the patient to follow predefined facial expressions, e.g., raising the eyebrows, closing the eyes and smiling. Then, an asymmetry index is computed based on measurements between specific areas in the affected side w.r.t. the unaffected side or the face at rest. An automatic method for extracting features or parts of the face used in the evaluations would reduce the associated costs and observer dependence inherent to manual assessment [37], [55]. In addition, 2D-landmark-based palsy assessment requires fully frontal face images [34] or pose correction techniques [37], [71], while the assessment with 3D landmarks is less prone to measurement errors, since distances are not affected by the face orientation.
Fig. 2. Face alignment on a patient with palsy, with results from DECA [25], FAN [11], JVCR [99] and 3DDFA V2 [36]. Top row: landmarks extracted from SOTA architectures. Bottom row: close-up of the landmarks in the mouth. Note that the landmarks are defined around the contours of the lips.

In this work, we aim to detect 3D facial landmarks and
address the limitations of current approaches, to target cases with large asymmetrical facial expressions from patients with facial palsy, alongside healthy subjects. Our approach exploits unlabeled face data, with and without facial paralysis, to train an autoencoder and create an intermediate representation in a latent vector. In this stage, an in-domain inversion module is incorporated to ensure a smooth latent space and enhance the representation of the expressions. In the supervised stage, we integrate interleaved transfer layers into the decoder to regress 3DA-2D landmarks, inspired by the state-of-the-art (SOTA) 2D face alignment method 3FabRec [10]. Our approach additionally enables the detection of 3D landmarks by means of a newly proposed 3D landmark detector. By relying on unlabeled data, our approach seeks to alleviate the cumbersome landmark annotation task, particularly for clinical data.
The proposed approach is supported by multiple experiments and evaluation on public face alignment datasets, in addition to a newly introduced facial palsy dataset.
The main contributions of this work are:
• A novel approach for 3D face alignment which encompasses cases with large facial asymmetry (see Fig. 1).
• The novel integration of in-domain GAN inversion in the self-supervised stage, to enhance the detection of the facial landmarks.
• ParFace, a 3D face alignment dataset of patients with palsy. ParFace and our source code are publicly available for research purposes.
• Evaluation on public face alignment datasets, in addition to the proposed facial palsy dataset, with improvements w.r.t. the state of the art in 3D face alignment.

II. RELATED WORK
Face alignment has been widely studied in the computer vision community. We classify face alignment approaches based on the type of landmarks: 2D and 3D.
Face alignment for palsy assessment can be divided into two categories, based on 2D or 3D landmarks. 3D-based methods usually compute the landmarks from multi-camera systems [40], [101] and 3D sensors such as Kinect [26], deterring their adoption in a clinical setting. 2D landmarks, on the other hand, are extracted from grayscale or RGB images, captured with easily accessible smartphone [55], web [37] or digital [29] cameras.

III. METHOD
In this work, we explore the semantically meaningful latent space of a reconstruction-based architecture, to improve the detection of facial landmarks on faces with a wide range of expressions. The proposed semi-supervised architecture, ParFace-Net, is shown in Fig. 3. In the self-supervised stage, an autoencoder (AE) is trained with unlabeled face datasets, where the encoder E learns the mapping from the input data to a low-dimensional intermediate vector z. This latent code is further enforced to be semantically meaningful, through the feature disentanglement introduced by in-domain inversion. This is achieved by means of the discriminator D and adversarial training of E and D, while the decoder G is frozen. In the supervised stage, a 2D landmark detector learns to regress 3DA-2D landmark heatmaps from the semantically rich latent code, which in turn are used to predict the depth coordinate. We further fine-tune the encoder with the gradients from the landmark heads to improve the results.

A. Self-Supervised Stage
This stage consists of an adversarial AE, trained on large-scale face datasets. The encoder E learns to capture the most important facial attributes in an intermediate latent vector z, while the decoder G is posed as the generator of a GAN that reconstructs the original image from the latent code. The AE is trained on a combination of three losses, as follows:

$$\mathcal{L}_{AE} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{perc}\,\mathcal{L}_{perc} + \lambda_{adv}\,\mathcal{L}_{adv}, \quad (1)$$

where $\mathcal{L}_{rec}$ is the reconstruction loss, given by the L1 or L2 pixel-wise distance between the input $x$ and the reconstruction $\hat{x} = G(E(x))$; $\mathcal{L}_{perc}$ is the perceptual loss [48], in (2); $\mathcal{L}_{adv}$ is an adversarial image loss [28], which enforces the AE to produce realistic faces based on the output from D; and $\lambda_{(\cdot)}$ is the respective weight of each loss.
$$\mathcal{L}_{perc} = \sum_{i \in \phi} \frac{1}{C_i H_i W_i} \left\lVert V_i(x) - V_i(\hat{x}) \right\rVert_2^2, \quad (2)$$

where $C_i$, $H_i$, and $W_i$ are the depth, height and width of the feature map $V_i(\cdot)$ at layer $i$ of a VGG network [81]; $x$ and $\hat{x}$ are the input and reconstructed images; and $\phi$ is the set of selected VGG layers.
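As a concrete illustration, the following is a minimal PyTorch sketch of such a VGG-based perceptual loss; the choice of VGG-16 and of the layer indices is an assumption, since $\phi$ is not specified here.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """VGG feature-matching loss in the spirit of Eq. (2)."""
    def __init__(self, layer_ids=(3, 8, 15, 22)):  # assumed layer choice
        super().__init__()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layer_ids = sorted(layer_ids)

    def forward(self, x, x_hat):
        loss, h, h_hat = 0.0, x, x_hat
        for i, layer in enumerate(self.vgg):
            h, h_hat = layer(h), layer(h_hat)
            if i in self.layer_ids:
                # mse_loss averages over all elements, i.e. divides by C_i*H_i*W_i
                loss = loss + F.mse_loss(h, h_hat)
            if i == self.layer_ids[-1]:
                break
        return loss
```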
We introduce a discriminator D during the face reconstruction phase, trained with the Wasserstein loss with gradient penalty (WGAN-GP) [35], formulated as

$$\mathcal{L}_{D} = \mathbb{E}_{\hat{x}}\big[D(\hat{x})\big] - \mathbb{E}_{x}\big[D(x)\big] + \gamma\,\mathbb{E}_{\tilde{x}}\Big[\big(\lVert \nabla_{\tilde{x}} D(\tilde{x}) \rVert_2 - 1\big)^2\Big], \quad (3)$$

where the last term is the gradient regularization, computed on samples $\tilde{x}$ interpolated between $x$ and $\hat{x}$, and the hyper-parameter $\gamma = 10$. The AE and D are trained using a procedure similar to GANs, alternating gradient updates.
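A minimal sketch of the critic loss in (3), assuming the standard WGAN-GP interpolation scheme; D, real and fake are placeholders for the discriminator, the input images and the reconstructions.

```python
import torch

def gradient_penalty(D, real, fake, gamma=10.0):
    # Random interpolates between real images and reconstructions.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_tilde = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_tilde).sum(), x_tilde, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return gamma * ((grad_norm - 1.0) ** 2).mean()

def critic_loss(D, real, fake, gamma=10.0):
    # Wasserstein estimate plus the gradient regularization of Eq. (3).
    return D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake, gamma)
```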

B. In-Domain Inversion
We leverage the generative capabilities of the AE by incorporating in-domain GAN inversion. Inspired by Zhu et al. [105], we follow a domain-regularized approach that pushes the encoder to produce latent codes in the semantic domain. In [105], this module enables semantic editing of facial attributes such as expression and pose, while an additional optimization stage improves the reconstructed face at the pixel level. Unlike [105], our approach does not seek to edit facial attributes, nor does it aim to create a faithful reconstruction of the face. Instead, we propose to encode facial attributes in the latent vector that boost the alignment of the landmark detectors for a wide range of expressions.
The inversion is achieved in [105] by introducing a domain-guided encoder into the GAN-based formulation. We instead exploit the pre-trained encoder E from the previous step, as shown in red in Fig. 3. The discriminator D is then used to compete with E, which acts as the domain-guided encoder and refines the latent space z to be aligned with the semantic latent space of the reconstruction process. During this stage, the decoder G is fixed, and E and D take turns training with the loss functions in (1) and (3), respectively. To that end, the same unlabeled data as in the self-supervised stage is used, where E is fed the input image $x$, and the input of D is given by $x$ and the reconstruction $\hat{x}$.
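In code, the alternating update could look like the sketch below, reusing the critic_loss and PerceptualLoss sketches from Section III-A; the loader, optimizers and loss weights (unlabeled_loader, opt_E, opt_D, lam_*) are placeholder names, not the original implementation.

```python
import torch.nn.functional as F

for p in G.parameters():            # the decoder stays frozen in this stage
    p.requires_grad_(False)

perceptual_loss = PerceptualLoss()

for x in unlabeled_loader:          # same unlabeled data as the SSL stage
    x_hat = G(E(x))                 # reconstruction through the frozen decoder
    # Discriminator step with the WGAN-GP loss in (3).
    opt_D.zero_grad()
    critic_loss(D, x, x_hat.detach()).backward()
    opt_D.step()
    # Encoder step with the combined loss in (1); gradients flow through the
    # frozen G into E, pulling the latent code into the semantic domain.
    opt_E.zero_grad()
    loss_E = (lam_rec * F.mse_loss(x_hat, x)
              + lam_perc * perceptual_loss(x, x_hat)
              - lam_adv * D(x_hat).mean())
    loss_E.backward()
    opt_E.step()
```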
The asymmetrical features in the latent codes of palsy patients are refined in this stage, without affecting the reconstruction of symmetrical faces. Hence, the same trained model could be used to align the landmarks at different levels of palsy, to continuously track recovery.
In contrast to [105], we do not apply the final optimization step to enhance the output of the reconstruction, since we do not aim for an accurate reconstruction at the pixel level.

C. Supervised Stage
The supervised stage is composed of a 3DA-2D and a 3D landmark detector, where all the face information learned in the self-supervised stage and refined through the in-domain inversion module is available for generalized usage across various landmark datasets.
1) 3DA-2D Landmark Detector: In this stage, the landmark detector maps the disentangled latent code z to 2D heatmaps that represent the probability map of each landmark location. During training, the parameters of the autoencoder are fixed and the layers of the decoder G are interleaved with 3 × 3 convolutional layers, inspired by 3FabRec [10]. The last convolutional layer, which produces the face image, is then superseded by a convolutional layer that outputs the heatmaps, as shown in Fig. 3.
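A sketch of how this interleaving could be wired is shown below; the block structure and channel list are assumptions, since the exact decoder layout follows 3FabRec [10].

```python
import torch.nn as nn

class ITLDecoder(nn.Module):
    """Frozen decoder blocks interleaved with trainable 3x3 convolutions."""
    def __init__(self, frozen_blocks, channels, n_landmarks=68):
        super().__init__()
        for block in frozen_blocks:
            for p in block.parameters():
                p.requires_grad_(False)
        self.blocks = nn.ModuleList(frozen_blocks)
        self.itl = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=3, padding=1) for c in channels)
        # The final RGB layer is superseded by a heatmap head.
        self.head = nn.Conv2d(channels[-1], n_landmarks, kernel_size=3, padding=1)

    def forward(self, z):
        h = z
        for block, conv in zip(self.blocks, self.itl):
            h = conv(block(h))      # frozen block, then trainable transfer layer
        return self.head(h)         # one probability heatmap per landmark
```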
We propose to adopt the adaptive wing loss (AWing) [87] as the heatmap prediction loss, instead of the mean squared error (MSE) from [10]. Since background pixels on a heatmap dominate over foreground pixels, this loss function penalizes small errors on foreground pixels while tolerating small errors on background pixels. It is formulated as

$$AWing(h, \hat{h}) = \begin{cases} \omega \ln\left(1 + \left|\frac{h - \hat{h}}{\epsilon}\right|^{\alpha - h}\right) & \text{if } |h - \hat{h}| < \theta, \\ A\,|h - \hat{h}| - C & \text{otherwise,} \end{cases} \quad (4)$$

where $h$ and $\hat{h}$ denote the ground truth and predicted heatmap pixel values, and $\omega$, $\theta$, $\alpha$, and $\epsilon$ are positive values. $A$ and $C$ are set so that the loss function is continuous and smooth at $|h - \hat{h}| = \theta$.
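For reference, a compact PyTorch sketch of the AWing loss follows; the default hyper-parameter values are those suggested in [87] and are an assumption here.

```python
import torch

def adaptive_wing_loss(h_hat, h, omega=14.0, theta=0.5, eps=1.0, alpha=2.1):
    delta = (h - h_hat).abs()
    p = alpha - h                       # exponent adapts to the ground-truth value
    A = omega * (1.0 / (1.0 + (theta / eps) ** p)) * p \
        * ((theta / eps) ** (p - 1.0)) / eps
    C = theta * A - omega * torch.log1p((theta / eps) ** p)
    near = omega * torch.log1p((delta / eps) ** p)   # non-linear small-error regime
    far = A * delta - C                              # linear regime beyond theta
    return torch.where(delta < theta, near, far).mean()
```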
2) 3D Landmark Detector: We introduce a 3D landmark detector to regress the depth coordinate of the 3DA-2D landmarks.It takes as input the concatenation of the intermediate latent vector and the predicted 3DA-2D landmark heatmaps.
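The exact design of this head is not detailed above; a minimal sketch, assuming the latent vector is tiled spatially and fused with the heatmaps, could be:

```python
import torch
import torch.nn as nn

class DepthRegressor(nn.Module):
    """Regresses one depth value per landmark from (heatmaps, latent code)."""
    def __init__(self, z_dim=512, n_landmarks=68):   # z_dim is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_landmarks + z_dim, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, n_landmarks))             # one z-coordinate per landmark

    def forward(self, z, heatmaps):
        b, _, h, w = heatmaps.shape
        z_map = z[:, :, None, None].expand(b, z.size(1), h, w)  # tile latent code
        return self.net(torch.cat([heatmaps, z_map], dim=1))    # (b, n_landmarks)
```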
3) Encoder Fine-Tuning: This strategy further optimizes the encoder E along with the interleaved transfer layers (ITL) in tandem [10]. The fine-tuning encourages the encoder to embed more features in the latent code that enhance the landmark predictions. By integrating this step, the identity of the reconstructed face no longer resembles the original image and the reconstruction tends towards an average face, as shown in Fig. 1. Nonetheless, other attributes such as the expression and pose are enhanced.

IV. EXPERIMENTS AND RESULTS
ParFace-Net was implemented in Python using PyTorch. The AE was trained on an Nvidia A100, while the face alignment networks were trained on an Nvidia RTX 2080 Ti.

A. Datasets
ParFace-Net is trained on well-known public datasets for face analysis. Table I lists the datasets and the stage in which they were used. In the self-supervised stage and during the in-domain inversion, the AE is trained with multiple datasets, without any type of landmark annotation. We introduced palsy datasets in these stages, namely Toronto NeuroFace, MEEI and the unlabeled set of ParFace. The 3DA-2D and 3D landmark detectors are trained with 300W-LP. We separately train the 3DA-2D detector with 2D landmarks, to investigate the performance on 2D face alignment. These results are reported in the Supplementary Material.

Palsy Dataset. We introduce ParFace, a new dataset for palsy face alignment with 3D landmark annotations in video sequences. We collected 28 YouTube videos of 150 frames each, where the subjects are usually talking to the camera or making a wide range of facial expressions. The videos have varying resolution and cover a wide range of ages, ethnicities, poses, illumination settings and backgrounds. We provide 68-landmark annotations for 1350 frames in 9 videos, for a total of ∼92K annotations.
We developed an annotation tool, which provides an initial 3D landmark estimate via 3D-FAN [11]. Since 3D-FAN was trained on datasets without palsy, each landmark was manually refined to match the asymmetrical facial expressions and provide high-quality annotations. This refinement affected most of the 3DA-2D landmarks, and the depth coordinate to a lesser extent. Sample images are shown in Figure 5 and in the Supplementary Material.
The annotated set of ParFace can be used as a benchmark to evaluate palsy alignment, as in Section IV-E, or to fine-tune semi- or fully supervised approaches, as in Section IV-G. The unlabeled set of ParFace can be used for training semi- or self-supervised architectures, similarly to ParFace-Net.

B. Implementation Details
The AE takes as input a cropped version of the face. For labeled datasets, we use the ground-truth landmarks to compute the bounding box, following related works. Otherwise, we use the MTCNN face detector [100]. Faces with a height of less than 100 px are discarded. The data is augmented with random horizontal flipping (50%), translation (±4%), scale jittering (94% to 103%) and rotation (between ±45°).
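A torchvision sketch of this augmentation recipe (the exact implementation may differ; landmark coordinates must be transformed consistently, and horizontal flipping additionally requires remapping the landmark indices):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=45,              # rotation in [-45°, +45°]
                            translate=(0.04, 0.04),  # ±4% translation
                            scale=(0.94, 1.03)),     # scale jittering 94%-103%
    transforms.ToTensor(),
])
```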
2) Training Details: We use the Adam optimizer [56] with a learning rate of 2e-5, $\beta_1 = 0.0$ and $\beta_2 = 0.999$. The autoencoder is trained with input and output images of size 256 × 256. We train for 50 epochs with (1), where $\mathcal{L}_{rec}$ is the L2 loss, followed by 50 epochs with the L1 loss as $\mathcal{L}_{rec}$. After that, we fix the decoder G and optimize the encoder E for feature disentanglement against the discriminator D, with the L2 loss as $\mathcal{L}_{rec}$, for 50 epochs.
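In code, the optimizer setup and the two-phase reconstruction loss might look like this sketch, where ae is a placeholder for the autoencoder:

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(ae.parameters(), lr=2e-5, betas=(0.0, 0.999))

def rec_loss(x_hat, x, epoch):
    # L2 for the first 50 epochs, then L1 for the following 50.
    return F.mse_loss(x_hat, x) if epoch < 50 else F.l1_loss(x_hat, x)
```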
The 3DA-2D landmark detector is trained for 100 epochs to predict the heatmaps. We fine-tune the encoder with gradients from the landmark head for 100 epochs. A similar procedure is followed in the experiments to train the 2D landmark detector. For 3D face alignment, the 3D landmark detector is trained with the ground-truth 3DA-2D landmark heatmaps for 50 epochs.

C. Evaluation Metrics
Following the standard protocol, we adopt the normalized mean error (NME) to evaluate 3DA-2D face alignment on AFLW2000-3D and ParFace. We additionally report the failure rate (FR) and the area under the curve (AUC) at 10% of the cumulative error distribution (CED) on ParFace. 3D face alignment is evaluated using the ground truth error (GTE) on AFLW2000-3D and ParFace. The GTE is equivalent to the NME, but evaluates the full 3D coordinates. The GTE is normalized by the inter-ocular (IO) distance, while the NME is normalized by the square root of the bounding box size enclosing the landmarks, following related works. We report the standard deviations σ of the NME and GTE on ParFace.
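A NumPy sketch of both metrics under the normalizations stated above; the outer-eye-corner indices (36, 45) follow the 68-point convention and are an assumption for the IO distance:

```python
import numpy as np

def nme_2d(pred, gt):
    """pred, gt: (N, 68, 2); bounding-box normalization sqrt(w * h)."""
    wh = gt.max(axis=1) - gt.min(axis=1)      # per-image box width and height
    norm = np.sqrt(wh[:, 0] * wh[:, 1])
    err = np.linalg.norm(pred - gt, axis=2).mean(axis=1)
    return 100.0 * (err / norm).mean()        # reported in percent

def gte_3d(pred, gt):
    """pred, gt: (N, 68, 3); inter-ocular (IO) normalization."""
    iod = np.linalg.norm(gt[:, 36, :2] - gt[:, 45, :2], axis=1)
    err = np.linalg.norm(pred - gt, axis=2).mean(axis=1)
    return 100.0 * (err / iod).mean()
```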

D. Evaluation on AFLW2000-3D
3DA-2D and 3D face alignment are evaluated on the widely used benchmark AFLW2000-3D, where the landmark detectors are trained using 300W-LP. We employed the AE with in-domain inversion to train the landmark detectors. Furthermore, we refined the 3DA-2D landmarks with encoder fine-tuning. The results are shown in Table II.
We observed that our models outperform the SOTA in 3DA-2D face alignment (NME) for frontal and near-frontal faces (0° to 30°), and our ParFace-Net with the AWing loss has the 2nd best GTE in 3D face alignment among the reported methods. For larger poses, we noticed a decreased performance of ParFace-Net. This could be attributed to the small portion of non-frontal faces in the self-supervised stage, where face semantics are learned mostly for near-frontal poses. We also observed that model-based approaches tend to be more robust to large head poses, since they are trained with additional 3DMM parameters such as head orientation and face shape. However, as shown in the next section, model-based methods do not cope well with a wide range of facial expressions, including asymmetrical expressions.

E. Evaluation on ParFace
The annotated set of ParFace is employed to evaluate our models from Section IV-D. Note that they were trained on 300W-LP and without annotated palsy data. We additionally report the performance of different SOTA model-based and model-free methods for 3D face alignment, which have also been trained on 300W-LP or related 3DMM datasets. To discard alignment errors due to face detection inaccuracies, we replaced the face detectors in every method and provided the bounding boxes from the ground-truth landmarks to crop the input images. The results are shown in Table III. The CED curves for the normalized 3DA-2D and 3D RMSE are shown in Figure 4. Our models achieve the lowest NME and FR, the highest AUC, and the 2nd and 3rd lowest GTE on ParFace. As mentioned in Section IV-A, for labeling ParFace, an initial landmark prediction was computed using 3D-FAN. While the 3DA-2D landmarks were heavily refined, the z coordinates were refined to a lesser extent. As expected, 3D-FAN thus has the lowest GTE on this dataset. Qualitative results are shown in Figures 5 and 6. Table III and Figure 5 show that model-free methods in general perform better and are more flexible on asymmetrical expressions than model-based pipelines.

F. Runtime and Model Parameters
We measured for ParFace-Net an average runtime of ∼230 FPS over 1K repetitions, for 2D and 3DA-2D face alignment, on an Nvidia RTX 2080 Ti. To estimate the full 3D coordinates, ParFace-Net runs at ∼156 FPS.
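The timing protocol could be approximated as in the following sketch; model and images are placeholders for the network and a batch of cropped faces:

```python
import time
import torch

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    for _ in range(1000):          # 1K repetitions, as reported above
        _ = model(images)
torch.cuda.synchronize()
fps = 1000 * images.size(0) / (time.perf_counter() - start)
```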
ParFace-Net is composed of two ResNet-18 networks and an inverted ResNet-18, with a total of ∼24.14M parameters. 3D-FAN is composed of four HG networks with ∼24M parameters and a ResNet-152 with ∼58.5M parameters to compute the depth coordinate (in total ∼82.5M). Likewise, JVCR uses four stacked HGs and an additional network to map the voxels to coordinates, with 32.47M parameters in total. SynergyNet has 4.6M parameters, 3DDFA V2 has 3.27M, and DECA uses two ResNet-50 networks with more than 25M parameters each, plus multiple decoders to retrieve the parameters of the 3DMM.

G. Ablation Study
We evaluate the contribution of each module in the face alignment process.
1) Training the Self-Supervised Stage: We analyze the impact of the self-supervised stage on the landmark detection task. For that purpose, we trained the landmark detectors omitting the self-supervised stage, with only the encoder pre-trained on ImageNet. Since the latent code does not encode face information in this case, the in-domain inversion is not applied either. The encoder is later fine-tuned after training the 3DA-2D landmark detector, as detailed in Section III-C. The results are shown in Table IV, in the rows without check marks in the columns 'Self-Supervision' and 'In-Domain Inversion'. For every metric, there is a large decline in performance when the self-supervised stage is omitted.

2) Training with In-Domain Inversion:
We investigated the effect of the in-domain inversion module as well. To that end, we trained the landmark detectors before and after the in-domain inversion is applied. The results for 3DA-2D and 3D face alignment are shown in Table IV. We observed that in-domain inversion boosted the performance in every metric w.r.t. the model without inversion. The improvement is more noticeable for ParFace, both in the NME and GTE.
3) Effect of the AWing Loss: We additionally examined the performance of the 3DA-2D landmark detector using the MSE loss and the proposed AWing loss. The results are reported in Table IV. During the experiments, we observed that the MSE loss converged faster, but overall the AWing loss leads to improved accuracy in most of the metrics. We hypothesize that this is due to the AWing loss being more sensitive to foreground pixels than to background pixels, considering that background pixels predominate in the heatmaps.

4) Training the AE with Portions of the Data:
As an additional ablation study, we explore how the performance of the landmark detectors is affected when the self-supervised stage is trained with different portions of the data. The quantitative results for AFLW2000-3D and ParFace are reported in Table V. To that end, we trained the AE with multiple combinations of the datasets from Table I, where the total amounts to ∼590K images. Note that only the models with 3% and 100% included palsy data, and that all the models were trained with in-domain inversion, the AWing loss and encoder fine-tuning. As part of the experiments, we trained the AE only with the palsy datasets at our disposal: Toronto NeuroFace, MEEI and the unlabeled set of ParFace. These results correspond to 3% of the total data in Table V. From this experiment, we observed a performance on ParFace comparable to the model trained with 1% of the data, with a minimal improvement for the model trained with palsy data. However, on AFLW2000-3D, the model trained with palsy data showed in general a slightly lower performance than the model trained with only 1% of the data. The main reason is that Toronto NeuroFace and MEEI are clinical datasets collected in controlled conditions, with little diversity in terms of pose, lighting and background. Therefore, a model trained with relatively few in-the-wild images (in this case from ParFace) would not perform well on challenging images with large poses, occlusion and varying lighting, such as in AFLW2000-3D, due to insufficient data to generate a compact face representation embedded in the latent code.

Fig. 5. Qualitative results on ParFace: ground truth, 3D-FAN [11], JVCR [99], 3DDFA V2 [36], DECA [25], SynergyNet [88], PF-Net_MSE and PF-Net_AWing.
Overall, the alignment performance shows a gradual improvement as more data is added to the self-supervised stage. These results suggest that the landmark detectors can be further enhanced as more unlabeled data with large diversity is used for training the AE.

5) Training with Labeled Palsy Data:
The results in Section IV-E were computed from models that were not trained using labeled data from ParFace. To evaluate the influence of labeled palsy data on our approach, we split the dataset into a training and a test set and fine-tune the previously trained models with portions of the data. The results are shown in Table VI. We use 6 sequences for training and 3 for testing. We split the training set into six subsets, each containing N sequences, where N is in the range [1, 6]; the number of sequences used is indicated as a numeral in the model names. We also evaluate the models without palsy data on the test set of ParFace (450 images in total). The results show an overall improvement as more data is added to fine-tune the models. By adding 6 × 150 = 900 training images with palsy data to a model trained with more than 61K labeled images (representing less than 2%), we obtained performance gains of around 20% in the NME and more than 10% in the GTE.
These experiments validate the use of ParFace to fine-tune semi- or fully supervised DNNs for 3DA-2D and 3D face alignment on asymmetrical facial expressions.

H. Discussion and Limitations
Similarly to most heatmap-based methods for face alignment, our approach fails under extreme occlusions, as shown in Figure 7. Other failure cases occur when the face synthesis fails due to unusual facial expressions, large head poses, lighting and low contrast. 3DMM-based methods are more robust in such cases, keeping the spatial structure of the landmarks even if the face is not properly aligned. However, they are not able to align the landmarks correctly for unseen faces, such as in asymmetrical facial expressions (see DECA in Fig. 5).
We noticed that our method heavily depends on the training data in the self-supervised stage. When using datasets with less diversity in terms of pose, expression, occlusion and illumination, the performance of the landmark detectors drops. This dependency on unlabeled data is not to be seen as a drawback, since collecting such datasets is much less expensive than collecting labeled data. We also observed that using a small set of unlabeled palsy faces (∼3% of the total amount, see Table I) to train the AE enabled the in-domain inversion module to encode asymmetrical features in the latent vector, improving the landmark detection.
Dedicated architectures for high-quality face reconstruction, such as StyleGAN2 [52], could replace the inverted ResNet-18, as in [24], to improve the reconstruction. This comes at the cost of increased complexity and more trainable parameters in the pipeline: while StyleGAN2 has ∼28M parameters and a computational complexity of 143.15 Giga Multiply-Accumulate Operations (GMACs), ResNet-18 has ∼11M parameters and a complexity of 1.82 GMACs.

V. CONCLUSIONS
This work introduced a pipeline for 3D face alignment, targeting faces with symmetrical and asymmetrical expressions. We proposed a semi-supervised architecture which exploits large unlabeled datasets and integrates face alignment with smaller labeled datasets. We explored the latent space in the self-supervised stage, and optimized the encoder to produce a disentangled latent space with in-domain inversion. Our landmark detector uses the AWing loss to regress 3DA-2D landmark heatmaps, and a newly introduced separate branch computes the depth of the 3D landmarks. A future direction would be to exploit additional 3DMM parameters, enabling the autoencoder to learn pose, expression and shape from 2D images under large head poses and extreme occlusion.

Fig. 3. Architecture of ParFace-Net (PF-Net). Our pipeline consists of a self-supervised stage to train an autoencoder, where the latent code z is disentangled via in-domain inversion. In the supervised stage, z is leveraged by the landmark detector to retrieve 3DA-2D and 3D landmarks from dedicated networks.

Fig. 6. 3D face alignment on ParFace with different SOTA methods. Ground truth in red and predictions in green.

TABLE I. Publicly available datasets used for training the autoencoder (AE), the 2D and 3D landmark detectors, and for testing the current model.

TABLE III. Evaluation on ParFace. The NME, AUC and FR evaluate 3DA-2D landmarks, while the GTE evaluates 3D alignment.

TABLE V. Ablation study on 3DA-2D and 3D face alignment after training the self-supervised stage with portions of the data. The AE marked with * was trained only with palsy data.