Exploring Deep Learning Image Super-Resolution for Iris Recognition

In this work we test the ability of deep learning methods to provide an end-to-end mapping between low- and high-resolution images, applying it to the iris recognition problem. We propose the use of two deep learning single-image super-resolution approaches, Stacked Auto-Encoders (SAE) and Convolutional Neural Networks (CNN), with the most lightweight structure possible in order to achieve fast speed, preserve local information and reduce artifacts at the same time. We validate the methods on a database of 1,872 near-infrared iris images with quality assessment and recognition experiments, which show the superiority of the deep learning approaches over the compared algorithms.


I. INTRODUCTION
Iris recognition is considered one of the most accurate and reliable biometric modalities for authentication today, mainly due to its stability and the high degrees of freedom of the iris texture [1] [2]. Currently, most systems require the user to present their iris to the sensor at a close distance; however, there is constant pressure to relax the acquisition conditions of such systems [3]. One of the major problems under these conditions (for example, at a distance or on the move) is that image quality is degraded and resolution becomes low, i.e. the number of pixels in the iris region falls below what is needed for a good recognition rate, and recognition performance degrades as the resolution decreases, as shown in [1].
Several methods have been proposed for example-based single-image super-resolution (SR), using different approaches such as internal patch recurrence [4], regression functions [5] [6] and sparse dictionary methods [7]. The application of SR techniques to biometric systems is limited, with most research concentrated on faces [8]. In the case of iris, some approaches exist [9], but they use whole images for reconstruction. Recently, a method based on PCA eigen-transformation of local patches was proposed [3], in which each patch is reconstructed separately, providing better quality and detail and lower distortion.
The first studies applying deep learning to problems related to super-resolution were performed for image restoration. For example, fully-connected multilayer perceptrons were used for image denoising [10] and Convolutional Neural Networks (CNN) were applied to natural image denoising [11].
Stacked Auto-Encoders (SAE) were also used for example-based super-resolution in [12], where in each layer a non-local self-similarity search is combined with a collaborative local auto-encoder to suppress noise and enhance the high-frequency texture details of patches.
Robust deep learning methods were also implemented to learn a mapping from low-resolution to high-resolution patches by finding the best regression functions for this mapping, as in [13], [14], [15], [16]. Among these successful examples, the Super-Resolution Convolutional Neural Network (SRCNN) [17] has proved to be a good alternative for an end-to-end approach to super-resolution.
In this work, we explore two typical deep learning approaches, Stacked Auto-Encoders and Convolutional Neural Networks, to increase the resolution and quality of low-resolution images that simulate acquisition by long-distance sensors. We use the CASIA-IrisV3-Interval database [18] of NIR images to validate the methods. Tests were carried out on both image quality and iris recognition accuracy to check whether performance degrades significantly at high upscaling factors.

II. METHODOLOGY
The single-image super-resolution methods presented in this paper aim at generating a high-resolution (HR) image from one low-resolution (LR) input. For this purpose, the image is first upscaled by the desired factor using bicubic interpolation; the upscaled image then passes through the deep learning procedure (CNN or SAE), which tries to correct imperfections and noise to reconstruct the final super-resolution image.
This reconstruction requires learning a mapping function F such that, given an LR image Y (upscaled by bicubic interpolation), F(Y) is as close as possible to the ground-truth HR image X.
For the evaluation of the methods on the CASIA-IrisV3-Interval database, the images were first downscaled through bicubic interpolation by factors of 2 (115x115), 4 (57x57), 8 (29x29) and 16 (15x15) and then re-upscaled through bicubic interpolation to the original size (231x231) before passing through the deep learning procedure. If the CNN and SAE are trained only with factor 2, the input images have to pass through the network log2(n) times to achieve a greater factor n. For example, with a CNN trained with factor 2, to achieve factor 8 the input image first passes through the CNN to achieve factor 2, the resulting image passes through the CNN again to achieve factor 4, and so on.
In this work we take advantage of a common strategy in image restoration: the extraction of patches and their representation by a set of pre-trained bases (such as PCA, DCT and Haar, among others). Such filters are convolved with the image and, in this work, are optimized so that the mapping is the best possible. This can be done in one, two or more layers, which in this work are followed by a reconstruction step in which the predicted overlapping high-resolution patches are averaged to produce the final image. This strategy is used in both the SAEs and the CNNs, which are explained in the next subsections.
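The repeated application described above can be sketched as follows. This is a minimal illustration, not the authors' code: `network` stands for any callable trained for a single factor, and the function names are hypothetical. The image is assumed to be already upscaled to full size by bicubic interpolation.

```python
import math


def num_passes(target_factor, trained_factor=2):
    """Number of times a network trained for `trained_factor` must be applied."""
    passes = round(math.log(target_factor, trained_factor))
    if trained_factor ** passes != target_factor:
        raise ValueError("target factor must be a power of the trained factor")
    return passes


def super_resolve(image, network, target_factor, trained_factor=2):
    """Pass an already bicubic-upscaled image through `network` repeatedly."""
    for _ in range(num_passes(target_factor, trained_factor)):
        image = network(image)
    return image


print(num_passes(8))   # 3 passes for factor 8
print(num_passes(16))  # 4 passes for factor 16
```

With a network trained for factor 2, factor 8 therefore needs three passes and factor 16 needs four.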

A. Convolutional Neural Networks
CNNs basically consist of a series of convolutional layers in the first levels (usually with a subsampling step), followed by one or more fully-connected layers similar to those of multilayer neural networks [19].
The input of a CNN is an (m × m × d) patch, where (m × m) is the dimension of the patch and d is the number of channels (depth) of the image. In this work, for the CNN training, patches with m = 33 and d = 1 are extracted from the HR images; the patches are then downscaled (depending on the factor chosen for the method) and re-upscaled to the original size, both using bicubic interpolation, as can be seen in Figure 1.
In this work, the implemented CNN has three convolutional layers: the first layer consists of 64 filters of size 9x9x1, the second of 32 filters of size 1x1x64, and the last of 1 filter of size 5x5x32, all with stride 1 and padding 0. With all paddings set to zero, the feature maps decrease in size, so a 33x33 input patch results in a 21x21 output patch. In the test phase, overlapping patches are extracted with stride 1 and only the central pixel of the resulting feature map is used, which means that the smaller size of the resulting feature map does not influence the final result image.
After each convolutional layer, a non-linearity (or activation) function is applied to the feature maps, mainly to accelerate the convergence of the stochastic gradient algorithm. We use the rectified linear unit (ReLU): f(x) = max(0, x), where x is the neuron input.
For training with the high-resolution patches and their corresponding low-resolution patches, we use the Mean Squared Error (MSE) as the loss function, aiming to achieve the best possible PSNR when the CNN is completely trained. The loss is minimized using stochastic gradient descent with standard backpropagation.
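The shrinking of the feature maps can be checked with a short calculation. The sketch below only reproduces the size arithmetic for the three layers listed above. The helper names are hypothetical, and the bias terms in the parameter count are an assumption, since the text does not state whether biases are used.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (number of filters, kernel side, input channels) for the three layers
LAYERS = [(64, 9, 1), (32, 1, 64), (1, 5, 32)]


def trace_sizes(patch_size=33):
    """Feature-map side length after each layer, with stride 1 and padding 0."""
    sizes = []
    size = patch_size
    for _, kernel, _ in LAYERS:
        size = conv_output_size(size, kernel)
        sizes.append(size)
    return sizes


def count_parameters(with_bias=True):
    """Total number of weights; one bias per filter is an assumption."""
    total = 0
    for filters, kernel, channels in LAYERS:
        total += filters * kernel * kernel * channels
        if with_bias:
            total += filters
    return total


print(trace_sizes())                      # [25, 25, 21]
print(count_parameters(with_bias=False))  # 8032
```

The 9x9 layer removes 8 pixels per dimension, the 1x1 layer removes none and the 5x5 layer removes 4, which gives 33 - 12 = 21.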
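The link between the MSE loss and PSNR can be made concrete with a small sketch. It operates on flat lists of pixel values, and the peak value of 255 assumes 8-bit images, which the text does not state. Because PSNR is a decreasing function of MSE, minimizing the loss is equivalent to maximizing PSNR.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length pixel sequences."""
    if len(pred) != len(target):
        raise ValueError("inputs must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` assumes 8-bit pixel values."""
    error = mse(pred, target)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)


target = [0, 128, 255, 64]
close = [1, 127, 254, 65]
far = [10, 118, 245, 74]
print(mse(close, target), mse(far, target))  # lower MSE for `close`
print(psnr(close, target) > psnr(far, target))  # True
```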
In this work we tested three different approaches for the CNN training: • From scratch (CNN FS): When the CNN weights are initialized randomly and trained according to the target image database (in the case of this work: the CASIA

B. Stacked Auto-Encoders
An auto-encoder is, by definition, a simple Neural Network (NN) designed to rebuild its own input in its output layer. For this reason, the number of neurons in the input layer is always the same as in the output layer. Successful applications demonstrate that Stacked Auto-Encoders can be a powerful deep learning alternative [21].
For the layer-wise pre-training of the Stacked Auto-Encoders we use the HR patches downscaled and upscaled again using bicubic interpolation, in the same way as for the CNN; in this case, however, each patch matrix is reshaped into a vector in order to fit the auto-encoder architecture. These vectors are the input of the first auto-encoder, as can be seen in Figure 2, which is trained until a threshold is reached. The second auto-encoder takes as input the hidden-layer output of the previously trained auto-encoder and is trained in the same way as the first. The same process is applied to the third layer and so on. We then use the original images (HR patches) as the targets in the output layer of the last auto-encoder. These targets are used to update the parameters of the deep multilayer neural network (the Stacked Auto-Encoders) by means of supervised error backpropagation. This process tries to reconstruct the image patch by generalizing the missing pixels with the auto-encoder weights learned from all the images of the training database.
When the training is completed, the network is used to propagate all the LR patches upscaled using bicubic interpolation, resulting in reconstructed super-resolution patches at a magnification of 2 (when the training is done with this magnification). To achieve a magnification factor of 4, the reconstructed super-resolution images are fed back into the network, in the same way as explained for the CNN approach.
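The order of the greedy layer-wise pre-training can be sketched as a simple schedule. The hidden-layer sizes below are purely hypothetical placeholders chosen for illustration and are not the configuration used in this work. Only the input length follows from the text, since a flattened 33x33 patch has 1089 elements.

```python
def flatten_patch(patch):
    """Turn a 2D patch (list of rows) into a single vector."""
    return [pixel for row in patch for pixel in row]


def pretraining_schedule(layer_sizes):
    """(input, hidden, output) sizes of each auto-encoder, trained in order.

    Auto-encoder k reconstructs its own input, so its output size equals its
    input size; its hidden activations become the input of auto-encoder k + 1.
    """
    schedule = []
    for n_in, n_hidden in zip(layer_sizes, layer_sizes[1:]):
        schedule.append((n_in, n_hidden, n_in))
    return schedule


PATCH_SIDE = 33
# Hidden sizes 512 and 256 are illustrative assumptions only.
HYPOTHETICAL_SIZES = [PATCH_SIDE * PATCH_SIDE, 512, 256]

patch = [[0.0] * PATCH_SIDE for _ in range(PATCH_SIDE)]
print(len(flatten_patch(patch)))  # 1089
for step, shape in enumerate(pretraining_schedule(HYPOTHETICAL_SIZES), start=1):
    print("auto-encoder", step, shape)
```

After this unsupervised stage, the stacked layers would be fine-tuned with the HR patches as targets, as described above.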
For the experiments we trained four auto-encoders with the empirically chosen configuration: 1089

III. EXPERIMENTAL SETUP
For the experiments we use the CASIA-IrisV3-Interval iris database, which contains a total of 2,655 NIR images of size 280x320 pixels from 249 subjects, captured with a self-developed close-up camera, resulting in 396 different eyes. Manual segmentation annotation of the database is available and is used as input for our experiments. In the pre-processing step, all images are resized via bicubic interpolation so that they have the same sclera radius, and are aligned by extracting a square region of 231x231 pixels around the pupil center. All images that do not meet this requirement (for example, when the eye is close to the image border) are discarded. The 1,872 remaining images are used in the experiments. For the deep learning training and tests, the pre-processed dataset is divided into two separate sets: 925 images from the first 116 users for training and 947 images from the remaining 133 users for testing (we consider each eye as a different user). This division by users is important to ensure that the same pattern (in the patches) is not used in both the training and testing steps.
To evaluate the performance of the methods with quality assessment algorithms we use three metrics: the Peak Signal to Noise Ratio (PSNR), which is the ratio between the peak signal power and the power of the corrupting noise that affects the fidelity of its representation; the Structural Similarity Index Measure (SSIM), which extracts three separate scores (luminance, contrast and structure) and combines them into the final score; and the Visual Information Fidelity (VIF), which relates the mutual information between the input and the output of the human visual system (HVS) channel when no distortion is present to the mutual information between the input of the distortion channel and the output of the HVS channel for the test signal [22]. For all these metrics, a higher score reflects higher quality. For the quality tests, all images from the database were used in high resolution as reference images. We compare our methods with bilinear and bicubic interpolation as well as with the PCA hallucination of local patches used in [3].
We also conduct recognition experiments using the reconstructed images to evaluate iris recognition performance. In this procedure, the iris is first unwrapped to a normalized rectangle of 20x240 pixels using Daugman's rubber sheet model [23]; then a 1D Log-Gabor (LG) wavelet is applied, with binary quantization of the phase to 4 levels [24]. The comparison between the binary vectors uses the normalized Hamming distance [23], where rotation is accounted for by shifting the grid of the query image in counter-clockwise and clockwise directions and selecting the lowest distance, which corresponds to the best match. We also implemented a SIFT comparator, in which SIFT feature points in scale space are extracted from the iris region (without unwrapping) and the comparison is based on the texture information around the feature points using the SIFT operator [25].
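A user-disjoint split of this kind can be sketched as below. The sample representation, a list of `(user_id, image_name)` pairs, and the function name are assumptions made for illustration only.

```python
def split_by_user(samples, train_users):
    """Split (user_id, image_name) pairs so that no user appears in both sets."""
    train_users = set(train_users)
    train = [s for s in samples if s[0] in train_users]
    test = [s for s in samples if s[0] not in train_users]
    return train, test


# Toy data: three users with one or two images each.
samples = [(1, "a.png"), (1, "b.png"), (2, "c.png"), (3, "d.png")]
train, test = split_by_user(samples, train_users=[1, 2])
print(len(train), len(test))  # 3 1
```

Splitting by user rather than by image ensures that patches from the same iris never appear in both the training and the test set.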
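As an illustration of how SSIM combines its terms, the sketch below computes a simplified, single-window version over whole images given as flat pixel lists. This is an assumption-laden simplification: the standard SSIM averages the same expression over local windows, and the constants 0.01 and 0.03 and the peak value of 255 are the commonly used defaults, not values taken from this work.

```python
def global_ssim(x, y, peak=255.0):
    """Single-window SSIM over two equal-length pixel sequences."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * peak) ** 2  # stabilizes the mean-intensity term
    c2 = (0.03 * peak) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [10, 50, 90, 130]
print(global_ssim(reference, reference))        # 1.0 for identical inputs
print(global_ssim(reference, [130, 90, 50, 10]) < 1.0)  # True
```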
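The rotation-compensated comparison can be sketched as follows. This is a minimal illustration under stated assumptions: iris codes are lists of rows of bits, rotation corresponds to a circular shift along the angular axis, occlusion masks are omitted, and the shift range `max_shift` is a hypothetical parameter.

```python
def rotate(code, shift):
    """Circularly shift every row of a binary iris code by `shift` columns."""
    rotated = []
    for row in code:
        n = len(row)
        s = shift % n
        rotated.append(row[n - s:] + row[:n - s])
    return rotated


def hamming(code_a, code_b):
    """Normalized Hamming distance: fraction of disagreeing bits."""
    total = 0
    differing = 0
    for row_a, row_b in zip(code_a, code_b):
        for a, b in zip(row_a, row_b):
            total += 1
            differing += a != b
    return differing / total


def match(enrolled, query, max_shift=8):
    """Lowest distance over shifts of the query in both directions."""
    return min(
        hamming(enrolled, rotate(query, s))
        for s in range(-max_shift, max_shift + 1)
    )


enrolled = [[1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1]]
query = rotate(enrolled, 3)  # same iris, rotated by three columns
print(hamming(enrolled, query) > 0)         # True: unaligned codes disagree
print(match(enrolled, query, max_shift=4))  # 0.0 once the shift is compensated
```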

IV. RESULTS
The results of the quality assessment for the test images and for the normalized iris region (20x240) are shown in Table I and Table II. Table I shows that the Convolutional Neural Networks outperform the traditional interpolation methods (bicubic and bilinear) as well as the eigen-patch hallucination (PCA) method, mainly for small downscaling factors. The Fine Tuning strategy improves the results by merging the use of natural and iris images during the CNN training. Also, when the CNN is trained with the same downscaling factor as the tests, the results become more resilient at lower resolutions. Finally, for low resolutions the quality assessment algorithms disagree on the best method, which can make the results difficult to interpret.
In iris recognition verification we consider two scenarios: 1) enrollment samples taken from the original HR input images and query samples taken from the reconstructed super-resolution results (Table III), simulating a controlled enrollment scenario (for example, when the user is registered using an HR sensor and then uses the system through a cellphone camera at a certain distance); and 2) both enrollment and query samples taken from the reconstructed super-resolution results (Table IV), simulating a totally uncontrolled scenario (for example, when the user is registered using a cellphone and also uses the system through a cellphone camera at a certain distance). In general, the CNNs perform best for small downscaling factors in both scenarios, despite the diversity of good results among the training approaches. With the Log-Gabor comparator, the CNNs using the Fine Tuning and Transfer Learning approaches beat the other methods except at the lowest resolution, where PCA does best. With the SIFT comparator, the CNNs are better but there is no particular winning training approach; in this case, with a downscaling factor of 2, the SAE method presents the best result for scenario 2. With the SIFT comparator, the performance of the bicubic and bilinear methods also degrades rapidly when the resolution decreases, whereas the CNN methods show high resiliency.
It is interesting to notice that in scenario 1 (Table III) the CNN methods perform better at factors 2 and 4 than the original images without downscaling, which means that, in terms of recognition, it is better to downscale the original image from the sensor (i.e. apply a blur filter) and then apply the deep learning methods before comparison.
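Verification results of this kind are commonly summarized by the Equal Error Rate (EER), the operating point at which the false acceptance and false rejection rates coincide. The sketch below is one simple way to estimate it from lists of genuine and impostor distances (lower means more similar). The threshold sweep and the function names are assumptions, not the evaluation code used for the tables.

```python
def error_rates(genuine, impostor, threshold):
    """False acceptance and false rejection rates at a distance threshold."""
    frr = sum(d > threshold for d in genuine) / len(genuine)
    far = sum(d <= threshold for d in impostor) / len(impostor)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate EER: mean of FAR and FRR where their gap is smallest."""
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]


# Toy distances: one genuine comparison overlaps the impostor range.
genuine = [0.1, 0.2, 0.45, 0.3]
impostor = [0.25, 0.4, 0.5, 0.6]
print(equal_error_rate(genuine, impostor))  # 0.25
```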

V. CONCLUSION
In this work we investigated deep learning single-image super-resolution methods using Stacked Auto-Encoders and Convolutional Neural Networks to increase the resolution of iris images. To address the problem, we tested whether the end-to-end mapping between low- and high-resolution images can be successfully applied, using different strategies such as transfer learning and fine-tuning to improve the results.
Evaluation on a database of near-infrared iris images, with different upscaling factors in both the training process and the tests, shows the superiority of the tested methods over the compared methods in terms of quality assessment, with the CNN using the Fine Tuning approach presenting the best results on average. When we evaluate the recognition rate through iris comparison experiments, the CNNs in general presented better results, but no particular CNN approach was the best in all scenarios. We also showed that an uncontrolled scenario (scenario 2 in the EER verification results) is feasible, since the deep learning approach presented better accuracy in scenario 2 than in scenario 1.
It is also important to notice that recognition performance is not considerably degraded until the image is downscaled to 1/8 of its original size or smaller, which allows the use of both query and test images of reduced size; this can be an advantage for systems with limited storage or data transmission capabilities.
In future work we intend to focus on the Convolutional Neural Network approach, trying new methods such as recursive layers, investigating other loss functions such as perceptual loss functions, and exploring other datasets with different semantic knowledge for the fine-tuning approach.

Fig. 1: An illustration of the Convolutional Neural Network architecture for Iris Super-Resolution.

Fig. 2: An illustration of the Stacked Auto-Encoder architecture for Iris Super-Resolution.

TABLE I: Results with different downscaling factors and two different factors (average values on the test dataset).

TABLE II: Results with different downscaling factors and two different factors for the unwrapped iris region (average values on the test dataset).