Age Invariant Face Recognition using Convolutional Neural Network

ABSTRACT

Deep learning with Convolutional Neural Networks (CNN) has become very popular. Its advantage is that it provides feature extraction and classification in a single structure. However, there is no standard rule or principle for deciding the details of a CNN architecture. Researchers who proposed CNN-based methodologies focused only on their own architectures, without giving specific reasons for the design details chosen. These details include: (a) the number of layers in the network, (b) the sequence of these layers, (c) the dimensions of the filters applied, and (d) the number of neurons used. Hence, we likewise propose our own methodology, AIFR-CNN, to design a CNN architecture for age invariant face recognition (AIFR).
The main contributions of this paper are: (a) a novel 7-layer CNN architecture for AIFR, and (b) the use of a smaller image size of 32×32 pixels to reduce time and space complexity. The rest of the paper is organized as follows. The second section reviews related work in this area. The third section gives complete details of the proposed methodology for age invariant face recognition. Section four presents experimental details on the two standard datasets, FGNET and MORPH (Album II). Finally, the fifth section presents the conclusion.

RELATED WORK
This section presents related work in this area. Some researchers have focused on the face identification (recognition) problem and others on the face verification problem. Approaches are broadly categorized into two types: generative and discriminative methods. Generative methods synthesize images of the person at the required age and then match those synthetic images against the given image. Discriminative methods devise their own feature extraction and classification schemes so that two images of the same person can be matched.

Generative methods
Recently, the method in [11] presented a hierarchical model based on two-level learning with a new feature descriptor called Local Pattern Selection (LPS) for solving the problem of aging face recognition. The method in [12] focused on the role of facial asymmetry in recognizing age-separated face images based on matching-score space (MSS). In [13], the authors used a minimal set of geometric features for age invariant face recognition, based on selected feature points, with performance evaluated on the FGNET dataset. Park et al. [14] proposed a generic method that uses a 3-D aging model to improve face recognition performance; they used a pose correction step and separate modeling for shape and texture.

Discriminative methods
Gong et al. [15] presented a novel discriminative feature descriptor named the maximum entropy feature descriptor (MEFD) to recognize age invariant face images. To improve recognition accuracy, a new feature-matching framework, Identity Factor Analysis (IFA), is also presented. Ali et al. [16] focused on a combination of shape and texture features for age-invariant face recognition; they adopted the phase congruency feature for shape and LBP variance for texture. Bouchaffra [17] introduced a novel framework for reducing dimensionality and extracting topological features such as shape for age invariant face recognition. It combines a Kernelized Radial Basis Function (KRBF) for dimensionality reduction, the construction of an α-shape for feature extraction, and a mixture of multinomial distributions for object classification.
Tandon et al. [18] attempted a novel approach using the LBP of a particular region of interest (ROI) for age invariant face recognition; the chi-square measure is used as the dissimilarity measure to calculate the distance between two histograms. Yadav et al. [19] presented a system that improves face recognition across age progression by using a bacteria foraging fusion algorithm, which reduces aging effects by combining LBP features of global and local facial regions. Xiao et al. [20] presented a novel method for face recognition using a combination of texture and shape descriptors, called the Biview face recognition algorithm; subspace learning methods are used for texture features, and a graph is constructed to capture the shape topology of face images. Li et al. [21] proposed a discriminative approach for face recognition across aging. In this model, they used the Scale-Invariant Feature Transform (SIFT) and Multi-scale Local Binary Patterns (MLBP) as feature descriptors, with multiple LDA-based classifiers generating a decision via a fusion rule. Ling et al. [22] proposed a discriminative method for face verification across age progression, using Gradient Orientation (GO) and the Gradient Orientation Pyramid (GOP) as feature descriptors and a Support Vector Machine (SVM) as the classifier.

Using convolutional neural networks (CNN)
Recently, CNNs have become a very popular technique for computer vision applications, and many researchers have used them for face recognition. In [23], a method is proposed that fuses 2-D face images and the motion history image (MHI) for face recognition based on a 7-layer deep neural network. In [24], the authors presented the novel use of deep learning with CNNs for automatic feature extraction for robust face recognition across time lapse; they used the VGG very-deep 16-layer CNN architecture in their experiments. Li et al. [25] proposed a new deep CNN model for age-invariant face verification with a 7-layer CNN architecture. Parkhi et al. [26] presented an 11-layer architecture for face recognition either from a single image or from a series of faces tracked in video. Sun et al. [27] proposed two very deep neural network architectures for face recognition, named DeepID3 net1 and DeepID3 net2; half of the features from DeepID3 net1 and the other half from net2 are concatenated into a long feature vector. Hu et al. [28] proposed three CNN architectures and conducted an extensive evaluation of CNN-based face recognition: small (CNN-S), medium (CNN-M), and large (CNN-L), using the LFW dataset for experimentation. Xinhua et al. [29] addressed the face recognition problem using a CNN on the LFW dataset, applying the Sobel operator to improve accuracy. Taigman et al. [30] proposed a 9-layer deep neural network for face verification, with alignment and representation steps that apply a piecewise affine transformation. Yi et al. [31] developed effective representations for both face identification and verification with deep learning, named DeepID2. Many researchers have thus presented work on AIFR using various methods, but only a few studies address age invariant face recognition specifically using convolutional neural networks.

PROPOSED METHODOLOGY FOR AGE INVARIANT FACE RECOGNITION
This section describes the proposed methodology for age invariant face recognition using Convolutional Neural Networks (AIFR-CNN). The network is designed to recognize a person despite aging variations. The overall process contains the traditional steps: image preprocessing, feature extraction, and classification. Image preprocessing improves the performance of the system; we use three basic preprocessing steps. Feature extraction captures the desired feature descriptors using the CNN rather than extracting them manually; in this model, we use a 7-layer CNN architecture. Classification recognizes the identity of the person, which here is a multi-class classification problem. The overall process for AIFR-CNN is shown in Figure 1.

Image preprocessing
Standard datasets may contain images of different sizes and illumination, which can lead to recognition problems. Image preprocessing keeps the dataset in a normalized format. It includes detection and cropping of the facial portion from the given image; for this purpose, we use the popular Viola-Jones face detection algorithm. The next step converts the RGB image to grayscale. Finally, images are resized to 32×32 pixels (and 64×64 pixels for some experiments). In this work, we have not performed any further preprocessing such as histogram normalization or head pose correction.
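The preprocessing chain can be sketched as follows. The Viola-Jones detection step is available, for example, in OpenCV as `cv2.CascadeClassifier`; the grayscale conversion and block-average resizing below are a minimal NumPy stand-in, assuming the cropped face is square with a side length that is a multiple of 32:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale (ITU-R BT.601 weights)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_by_block_mean(img, out_size=32):
    """Downscale a square grayscale image to out_size x out_size by averaging
    non-overlapping blocks. Assumes the side length is a multiple of out_size."""
    h, w = img.shape
    assert h == w and h % out_size == 0
    f = h // out_size
    return img.reshape(out_size, f, out_size, f).mean(axis=(1, 3))

# Example: a hypothetical 128 x 128 cropped face region down to 32 x 32
face = np.random.rand(128, 128, 3)
gray = to_grayscale(face)
small = resize_by_block_mean(gray, 32)
```

In practice an interpolating resize (e.g. bilinear) would be used for arbitrary input sizes; block averaging is shown only because it is self-contained.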

Network architecture for feature extraction
After the basic preprocessing steps, the next step is to extract features as per our requirements. In our proposed work (AIFR-CNN), we use a deep learning approach based on a Convolutional Neural Network (CNN) for this purpose. It has several advantages. First, the CNN itself handles both feature extraction and classification within a single structure. Second, the network extracts deep 2-D features. Third, it is fully adaptive and invariant to local and geometric changes in the image.
A CNN has three main types of layers: (a) convolution layers, (b) pooling (subsampling) layers, and (c) the output layer. These layers are arranged in a feed-forward structure: each convolution layer is followed by a pooling layer, while the last convolution layer is followed by the output layer. Convolution and pooling layers are 2-D layers, whereas the output layer is 1-D. Every 2-D layer of a CNN contains several planes; a plane is a 2-D array of neurons, and its output is called a feature map. In AIFR-CNN, we propose a 7-layer architecture comprising 3 convolution layers (C1, C3, C5), 2 pooling layers (P2, P4), and 2 fully connected output layers (F6, F7). The proposed architecture is shown in Figure 3.

Convolution layer
Each plane of a convolution layer is associated with one or more feature maps of the previous layer. The connection is a convolution mask: a 2-D weight matrix with adjustable entries. Each plane computes the convolution between its 2-D inputs and its convolution masks; the convolution outputs are summed together with an adjustable scalar called the bias, and an activation function is applied to obtain the plane's output, known as a feature map. A convolution layer may have one or more feature maps, each connected to exactly one plane of the next (sub-sampling) layer. Each plane in the last convolution layer is associated with the feature map of exactly one plane of the preceding layer and produces one scalar output; these outputs from all planes are fed to the output layer. The purpose of this layer is to extract low-level features such as edges and texture. The feature map of a convolution layer is calculated as

y_j = f( Σ_{i ∈ L_j} x_i ∗ w_{ij} + b_j ),

where w_{ij} is the convolution mask connecting input plane i to map j, b_j is the bias term, L_j is the list of connected planes, and f is the activation function [32].
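A minimal NumPy sketch of this computation (mask and input values are hypothetical; tanh is assumed as the activation, and the flipped-kernel distinction between convolution and cross-correlation is ignored, as is usual for learned masks):

```python
import numpy as np

def conv2d_valid(x, w):
    """'Valid' 2-D convolution of input plane x with mask w (no padding)."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise product of the window with the mask, then sum
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def conv_feature_map(inputs, masks, bias, f=np.tanh):
    """y_j = f( sum over connected planes of conv(x_i, w_ij) + b_j )."""
    return f(sum(conv2d_valid(x, w) for x, w in zip(inputs, masks)) + bias)

x = np.random.randn(32, 32)        # one 32 x 32 input plane
w = np.random.randn(5, 5) * 0.1    # one 5 x 5 convolution mask
fmap = conv_feature_map([x], [w], bias=0.1)   # 28 x 28 feature map
```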

Pooling (sub-sampling) layer
Spatial pooling reduces the dimensionality of each feature map while retaining the most valuable information. It can be of three types: max pooling (takes the largest element), average pooling (takes the average of the elements), and sum pooling (takes the sum of all elements). The main function of pooling is to progressively reduce the spatial size of the input representation, making it smaller and more manageable. A pooling layer has the same number of planes as the preceding convolution layer. The pooled result is passed through the activation function to produce the output feature map, which is connected to one or more planes of the next convolution layer. Pooling makes the output of the convolution layer more robust to local distortions. The feature map of a sub-sampling layer is calculated as

y_j = f( w_j s_j + b_j ),

where s_j is the matrix obtained by summing the four pixels of each 2×2 block of the input map, w_j is the weight, and b_j is the bias term [32].
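The sum-pooling step with a trainable weight and bias can be sketched in NumPy as follows (tanh is assumed as the activation; the weight and bias values are illustrative):

```python
import numpy as np

def sum_pool(x, p=2):
    """Sum each non-overlapping p x p block of feature map x."""
    H, W = x.shape
    return x.reshape(H // p, p, W // p, p).sum(axis=(1, 3))

def pool_feature_map(x, weight, bias, f=np.tanh):
    """y = f( w * s + b ), where s holds the 2 x 2 block sums of the input."""
    return f(weight * sum_pool(x) + bias)

fmap = np.random.randn(28, 28)
pooled = pool_feature_map(fmap, weight=0.25, bias=0.0)   # 28 x 28 -> 14 x 14
```

Note that with weight 0.25 the block sum becomes a block average, so average pooling is a special case of this weighted sum pooling.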

Output layer (fully connected layer)
In AIFR-CNN, the output layer is constructed from sigmoidal neurons. Generally, the outputs of this layer are the outputs of the network. A traditional multi-layer perceptron uses the softmax activation function in the output layer; other classifiers, such as SVM, can also be used. The fully connected layers capture correlations between features of various parts of the face, such as the shape and location of the eyes and mouth. The convolution and pooling layers together perform feature extraction, while the fully connected layers perform classification. The output of sigmoidal neuron n is calculated as

y_n = f( Σ_m w_{m,n} x_m + b_n^L ),  n = 1, …, N,

where N is the number of output sigmoidal neurons, w_{m,n} is the weight from feature map m of the last convolution layer to neuron n of the output layer, x_m is the scalar output of feature map m, and b_n^L is the bias of neuron n in output layer L [32].
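A sketch of the fully connected sigmoidal output computation. The 120 input planes are an assumption made only for illustration (the source does not state how many planes C5 has), and 84 outputs match layer F6:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer(features, W, b):
    """y_n = sigmoid( sum_m W[m, n] * x_m + b[n] ) for each output neuron n."""
    return sigmoid(features @ W + b)

x = np.random.randn(120)            # one scalar per plane of the last convolution layer (assumed 120)
W = np.random.randn(120, 84) * 0.1  # weights from feature maps to output neurons
b = np.zeros(84)                    # one bias per output neuron
y = output_layer(x, W, b)           # 84 sigmoidal outputs, as in layer F6
```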

7-Layer Architecture for AIFR-CNN
In our implementation, we use the 7-layer CNN architecture for age invariant face recognition shown in Figure 4. The network consists of a sequence of convolution, sub-sampling, and fully connected output layers. The convolution layers use 5×5 filters, and the sub-sampling layers use 2×2 regions. The input to the network is a 32×32-pixel grayscale image, which is convolved with a 5×5-pixel filter. Convolution is a linear operation that performs element-wise matrix multiplication and addition.
Convolving the 32×32-pixel image with a 5×5 filter yields a filtered image of 28×28 pixels. The first convolution layer C1 has 6 distinct planes, generating 6 separate feature maps, i.e. a 28×28×6 output. Layer 2 is the sub-sampling layer S2; the pooling operation we use is summation over 2×2-pixel regions, which reduces each feature map by a factor of 2 in both dimensions and yields a 14×14×6 output. The next layer is another convolution layer, C3, again with a 5×5 filter; it has 16 distinct planes of 10×10. Layer 4 is another sub-sampling layer, S4, with 2×2 regions, reducing each plane to 5×5. It is followed by the last convolution layer C5 and the fully connected layers F6 and F7, where each output unit is connected to all inputs; F6 contains 84 neurons and F7 contains 10. The output of the last fully connected layer F7 is provided to the classifier. Table 1 shows the details of CNN architectures used earlier for the face recognition problem; some address only face recognition or verification, while others target age invariant face recognition. The table lists the size of the input image, the number of layers in the architecture, the datasets used, and the length of the feature vector. It shows that while some work has been done on face recognition, there is still much scope to improve performance for age invariant face recognition. Moreover, the architectures vary in image size and number of layers, so the size of the feature vector varies, and no specific reason is given for the choice of these parameters.
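The layer sizes quoted above can be checked with a short shape walkthrough: a 'valid' convolution with a k×k filter shrinks each side by k−1, and non-overlapping p×p pooling divides it by p:

```python
def conv_out(n, k):
    """Output side length of a 'valid' convolution with a k x k filter."""
    return n - k + 1

def pool_out(n, p):
    """Output side length after non-overlapping p x p pooling."""
    return n // p

size = 32                    # input image side length
c1 = conv_out(size, 5)       # C1: 28, with 6 planes -> 28 x 28 x 6
s2 = pool_out(c1, 2)         # S2: 14 -> 14 x 14 x 6
c3 = conv_out(s2, 5)         # C3: 10, with 16 planes -> 10 x 10 x 16
s4 = pool_out(c3, 2)         # S4: 5 -> 5 x 5 x 16
c5 = conv_out(s4, 5)         # C5: 1, each plane reduces to a single scalar
sizes = [c1, s2, c3, s4, c5]
print(sizes)                 # [28, 14, 10, 5, 1]
```

The same walkthrough with a 64×64 input gives 60 → 30 → 26 → 13, where 13 is no longer evenly divisible by the 2×2 pooling, which is one concrete way the fixed architecture is tied to the 32×32 input size.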

Classification techniques
In this work, we use a multi-class Support Vector Machine (SVM) as the classifier for identifying a person over a long period. SVM is a supervised learning algorithm, applicable here since data labels are available. SVMs are effective in high-dimensional spaces, memory efficient, and versatile. In another set of experiments, we replace the SVM with Euclidean distance and the nearest neighbor (NN) rule for classification.
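As an illustrative sketch (not the paper's exact setup), the multi-class SVM stage could look like the following, assuming scikit-learn's SVC (which handles the multi-class case one-vs-one internally) and toy 10-dimensional stand-ins for the real CNN feature vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-ins for CNN feature vectors of three subjects:
# five 10-D samples clustered around means 0.0, 1.0, and 2.0
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(5, 10)) for c in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 5)        # subject identity labels

clf = SVC(kernel='linear')         # multi-class SVM classifier
clf.fit(X, y)
pred = clf.predict(X)              # predicted identities for the training features
```

With well-separated clusters like these, the fitted classifier separates the three identities perfectly; real CNN features would of course be far less clean.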

EXPERIMENTAL DETAILS
In this section, we describe the implementation and experimental details for AIFR-CNN. In our experiments, we used a Leave One Person Out (LOPO) scheme for testing, where one person from the dataset is kept out for testing. In our earlier experiments, we used 3-fold cross validation, which requires keeping 3 images of the same person at different ages in three separate folders. In this approach, instead of keeping one image of the person in the testing folder, we kept two images of the same person with an age gap of at least 10 years. The reason is to avoid repetitive testing using different folders: as all images are in a single testing folder, all persons are considered as different individuals. The remaining images of each person are used for training, so that the same image never appears in both folders. We use the Rank-1 recognition rate as the performance evaluation parameter. The experiments were performed in MATLAB 2015a (64-bit) on a 2.60 GHz Intel(R) Core(TM) i5 CPU with 8 GB of RAM.
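The partitioning protocol described above (two age-separated images of each subject for testing, the rest for training) can be sketched as follows. The subject IDs, ages, and the choice of the youngest/oldest pair as the test images are illustrative assumptions, since the source does not say exactly which two images are selected:

```python
def split_by_age_gap(images, min_gap=10):
    """For each subject, place the two most age-separated images (gap >= min_gap
    years) in the test set and the remaining images in the training set.
    `images` maps subject -> list of (age, image_id) pairs."""
    train, test = [], []
    for subject, shots in images.items():
        shots = sorted(shots)                       # sort by age
        youngest, oldest = shots[0], shots[-1]
        if oldest[0] - youngest[0] >= min_gap:
            test += [(subject, youngest[1]), (subject, oldest[1])]
            train += [(subject, s[1]) for s in shots[1:-1]]
        else:                                       # gap too small: keep all for training
            train += [(subject, s[1]) for s in shots]
    return train, test

# Hypothetical gallery: subject s1 has a 15-year gap, s2 only 3 years
gallery = {'s1': [(5, 'img_a'), (12, 'img_b'), (20, 'img_c')],
           's2': [(30, 'img_d'), (33, 'img_e')]}
train, test = split_by_age_gap(gallery)
```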

Datasets
We use two publicly available datasets, FGNET [33] and MORPH (Album II) [34], for AIFR-CNN. Both datasets contain many images of the same person with variation in age, expression, illumination, and head position. FGNET consists of 1002 images of 82 subjects, with 12 images per person on average and an age range of 0 to 69 years. MORPH Album II contains more than 55000 images of 13000 subjects, with an age range of 16 to 77 years. Figure 5 and Figure 6 show sample images illustrating the variations in illumination, expression, head position, and age in both datasets.

Experiments on FGNET dataset
In our experiments, we used a total of 980 images of 82 subjects from the FGNET dataset for AIFR-CNN, of which 852 images were used for training and 128 for testing. We performed several experiments on this dataset. First, we used images of size 32×32 after performing all preprocessing steps; these images include head pose variations. We used Rank-1 recognition as the performance measure. From the results obtained, 98 of the 128 images were correctly recognized, i.e. 76.6% Rank-1 recognition. Second, we used the same procedure and the same number of images but with image size 64×64. In this experiment, 87 of the 128 images were correctly recognized, i.e. 68.8% Rank-1 recognition accuracy. This may be because the network architecture is not capable enough for this image size. In the next experiment, we used only straight-pose (frontal) images, eliminating non-frontal images from the dataset, and resized the images to 32×32 pixels as this size gave comparatively good performance. In this case, 61.2% Rank-1 recognition was obtained for age invariant face recognition using the proposed methodology. In the last experiment, we tested the proposed system without the SVM classifier, using Euclidean distance with the nearest neighbor (NN) classification rule instead. As 32×32-pixel images give better results, we used the same size in this experiment and obtained 75% Rank-1 recognition. Table 2, Table 3, and Table 4 show this comparative analysis of Rank-1 recognition on the FGNET dataset.

Table 4. Comparative Rank-1 recognition with SVM/NN on the FGNET dataset (all images, with variation in head pose).

Figure 7 shows correctly recognized results for some sample images using AIFR-CNN on the FGNET dataset. The first column shows the images used for testing, and the remaining columns show images of the same person at different ages available in the training folder.
We used SVM, a supervised learning algorithm, for classification since labels are available; it outputs the class label to which the test image belongs. The results show that AIFR-CNN is capable of recognizing images of the same person at different ages, and is therefore a good approach. From Figure 7, we can see that the FGNET dataset contains many images of the same subject with a large age gap, and the larger the gap, the greater the variation between images. Figure 8 shows Rank-1 results for some images beyond the standard dataset; for this, we added our own sample images and tested them with AIFR-CNN.
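The Euclidean-distance nearest-neighbor rule used in the last FGNET experiment can be sketched as follows (the gallery features and subject labels are illustrative):

```python
import numpy as np

def nn_classify(query, gallery_feats, gallery_labels):
    """Rank-1 match: label of the gallery feature closest in Euclidean distance."""
    dists = np.linalg.norm(gallery_feats - query, axis=1)
    return gallery_labels[int(np.argmin(dists))]

# Hypothetical 2-D stand-ins for CNN feature vectors in the training gallery
gallery = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
labels = ['subj_a', 'subj_b', 'subj_c']
match = nn_classify(np.array([0.9, 1.2]), gallery, labels)   # closest to [1, 1]
```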

Experiments on MORPH (Album II) dataset
For the proposed age invariant face recognition using CNN, we used another publicly available dataset, MORPH (Album II), on which we performed three experiments. First, we used 1005 images of 255 subjects with all head poses: 750 images for training and 255 for testing. The results show 92.5% Rank-1 recognition using the CNN. Second, we used 2084 frontal images of 575 subjects: 1509 images for training and 575 for testing. In this case, we obtained 92.8% Rank-1 recognition. In the last experiment, as with the FGNET dataset, we tested performance using the CNN with Euclidean distance and nearest neighbor (NN) classification, obtaining 91.3% Rank-1 recognition. Table 5, Table 6, Table 7, and Table 8 demonstrate this comparison. Figure 9 shows some correctly recognized results for AIFR-CNN on the MORPH (Album II) dataset. From this figure, we observe that there is less variation in the age of each person than in the FGNET dataset; in addition, the MORPH dataset contains fewer images per person. Figure 10 shows the Cumulative Match Characteristic (CMC) curves of the proposed AIFR-CNN and the other methods mentioned, on the FGNET and MORPH (Album II) datasets. Figure 11 shows the comparative performance analysis of the proposed AIFR-CNN on the FGNET and MORPH datasets.

Overall comparative discussions
In this section, we compare our proposed methodology with the existing state of the art. Face recognition is a vast area within image processing and pattern recognition, and several factors make it genuinely difficult: variations in head position, facial expression, and aging effects. Many researchers have proposed methodologies for recognizing facial images across aging. The process generally includes the steps of face detection, preprocessing, feature extraction, and classification, and system performance depends directly on the algorithms used for feature extraction and classification. The appeal of convolutional neural networks is that they provide feature extraction and classification in a single structure. Although the CNN is a very powerful tool, it is difficult to decide the number of layers, the number of neurons, and the size of the input image; unfortunately, no formula is available, and previous work has not addressed these issues but simply proposed architectures in an ad hoc way. We follow the same process to decide the number of layers, their dimensions, and the size of the input image.
Figure 12 and Figure 13 show some failed Rank-1 retrievals from FGNET and MORPH (Album II), respectively. In each figure, the first row shows the input images used for testing, the second row shows the (incorrect) Rank-1 output of our method, and the third row shows the ground-truth matching images available in the gallery. The results show that there are large intra-class differences and inter-class similarities in both datasets; even manually, it is difficult to identify the persons, as some of them look similar to others.

Figure 12. Some examples of failed Rank-1 retrievals from the FGNET dataset. The first row shows input images, the second row the Rank-1 results of our method, and the third row the ground truth, i.e. the correct matching images available.

Figure 13. Some examples of failed Rank-1 retrievals from the MORPH dataset. The first row shows input images, the second row the Rank-1 results of our method, and the third row the ground truth, i.e. the correct matching images available.

In this study, we used preprocessed images of size 32×32. To the best of our knowledge, this image size is used here for the first time to recognize facial images across aging; it requires less computation time and less storage for a large database. Compared to other studies, it gives better performance on both datasets, as shown in Table 9. We also performed further experiments on: (a) all images (frontal and non-frontal), (b) only frontal images on both datasets, and (c) the same architecture with SVM and NN as classifiers. The results were obtained on FGNET with 980 images of all 82 subjects, and on MORPH (Album II) with 1005 images (frontal and non-frontal) and 2084 images (only frontal).

Table 9. Comparative Rank-1 recognition (%) with state-of-the-art methods.

  FGNET dataset                        MORPH (Album II) dataset
  NTCA [17]                48.96       Facial Asymmetry [12]    69.40
  Graph based view [35]    64.47       NTCA [17]                83.80
  MDL [36]                 65.2        HFA [38]                 91.14
  PCA & WLBP [37]          67.30       MDL [36]                 91.8
  HFA [38]                 69.0        CNN [24]                 92.2
  Facial Asymmetry [12]    69.51       MEFD [39]                92.26
  MEFD [37]                76.

From the experiments, it is observed that:
a. The CNN achieves better Rank-1 recognition than the available methods, without complicated preprocessing steps such as histogram normalization and head pose correction.
b. It performs better on 32×32 images than on 64×64 images; as the image size increases, more execution time is needed.
c. It gives better results on the MORPH dataset than on the FGNET dataset, since MORPH contains fewer age-variant images; moreover, FGNET contains larger intra-personal differences, while MORPH contains fewer inter-personal similarities.
d. It gives better results with SVM, a supervised learning algorithm, than with NN as the classifier.
e. There is little difference in CNN performance when only frontal images are considered and non-frontal images are excluded.

CONCLUSION
In this paper, we proposed a novel methodology for age invariant face recognition using a Convolutional Neural Network, named AIFR-CNN. Experiments were performed on two image datasets, FGNET and MORPH (Album II). Our goal in this approach is to provide a simple network by using fewer layers and a small image size (32×32) for processing. The system preserves simplicity, as no separate algorithm is required for feature extraction. The results demonstrate that it outperforms the current state of the art in Rank-1 recognition on both datasets. Moreover, no complicated preprocessing steps such as head pose correction are used. Resized images of 32×32 pixels show better results than 64×64-pixel images on both datasets, and AIFR-CNN with SVM as the final classification stage shows a significant improvement over AIFR-CNN with NN as the final classification stage.