A comparative analysis of automatic deep neural networks for image retrieval

Feature descriptors and similarity measures are the two core components of content-based image retrieval, and both remain challenging because of the “semantic gap” between human conceptual meaning and low-level machine features. Recently, deep learning techniques have attracted great interest in image recognition, especially for extracting feature information from images. In this paper, we investigate, compare, and evaluate different deep convolutional neural networks and their applications to image classification and automatic image retrieval. The approaches are: a simple convolutional neural network, AlexNet, GoogleNet, ResNet-50, Vgg-16, and Vgg-19. We compare the performance of these approaches with prior work in this domain using known accuracy metrics and analyze the differences between them. The performance is investigated on the public image datasets Corel 1K, Corel 10K, and Caltech 256. We find that the GoogleNet approach yields the best overall results. In addition, we investigate and compare different similarity measures. Based on these exhaustive investigations, we develop a novel algorithm for image retrieval.


INTRODUCTION
Today, digital photographic devices are widely used, and as a result large volumes of digital images are acquired and stored in databases in fields such as scientific research, medicine, forensic analysis, and social networking. The retrieval of these images should therefore be effective and fast. Information retrieval (IR) attempts to find unstructured material, such as images or text documents, that satisfies an information need within large collections [1,2]. In early image retrieval systems, images were indexed in a database using textual annotations such as keywords or phrases. A user asks the system to find similar images by entering a textual annotation, and the system retrieves images ordered by their degree of match to the annotation. However, this method has limitations. For instance, annotating images manually in a large-scale database is time consuming, and textual information may not be available when the image is captured. Consequently, content-based image retrieval (CBIR) was introduced: a process that extracts image features (visual content) to represent images automatically and index them in a database [3]. Figure 1 illustrates a typical diagram of a CBIR system, which stores images in the database by extracting image features in an off-line phase [4]. Meanwhile, the system extracts a feature vector from a query image in the on-line phase.
The main contributions of this paper are as follows. First, convolutional neural networks (CNNs) are investigated for classifying a large number of images, using different deep learning approaches. Second, the CNN approaches are exploited to learn image features for image retrieval. Third, different distance functions are tested as similarity measures. The aim is to judge which deep learning approach produces effective features and which distance function is more accurate, in order to reduce the semantic gap in CBIR. Consequently, a novel algorithm for image retrieval is developed. The remainder of this paper is structured as follows. The relevant literature is presented in section 2. The CNNs used in this paper are described in section 3, while section 4 describes the image datasets used in the investigation and presents the analysis of the experimental results of our evaluation system. Finally, section 5 summarizes the findings of the paper and gives recommendations for further work.

RELATED WORK
Numerous studies have investigated CNNs for image retrieval. In this section, we present some of them. For example, in [5], features from three CNNs are fused for IR using the product rule and a weighted average of feature similarities. The authors extract image features using three kinds of CNNs. The weighted feature similarities between the query and the database images are then calculated using the product rule. Finally, the retrieval result is obtained by returning the images with the top-n scores. In [6], image features are extracted with a classical CNN and the results are compared with three classical algorithms. The performance of the retrieval system is improved by combining it with a cosine similarity measure. A deep CNN model is utilized in [7] to extract the feature representation from the activations of the convolutional layers on a large image dataset, for applications such as remote sensing and plant biology. A database indexing structure and recursive density estimation are then established to retrieve the images quickly and efficiently. To improve the accuracy of image retrieval and prevent overfitting when training a CNN, the authors in [8] propose a deep CNN with L1 regularization and an activation function named PReLU. The deep network simulates the human brain by receiving and transferring information, and it contains a convolution operation that is well suited to image processing.
In [1], a deep belief network is trained to learn large-scale representations from images for CBIR applications. In that work, similarity measures are applied to CBIR tasks. The authors in [9] also investigate the use of CNNs for CBIR tasks, where different settings are implemented and tested. A hybrid model of a CNN and a support vector machine (SVM) is proposed in [2], using minimal material and time resources. The last output layer of the proposed CNN is replaced with an SVM-based classifier. That work has two parts, a convolutional part and a recognition part. In the convolutional part, the images are passed through a sequence of filters, forming new images called convolution maps. In the recognition part, an SVM classifier is trained; features are then extracted automatically from the test images and the final decisions are taken. Deep learning is applied to image classification in [10], where the AlexNet network is used effectively on images selected from the ImageNet database. The experiments are conducted on the images after cropping them to different areas. In [11], the semantic features of the images are extracted using a CNN model. A distance function is then computed to find the similarity between the semantic features of the images.
In [12], a CNN called ConvNet is trained to classify medical images. The images are acquired using computed tomography and are classified by organ or body part. The classification performance is improved using data augmentation. Deep CNNs are also proposed in [13] for content-based medical image retrieval, with two approaches to the retrieval process. In the first approach, the network is trained to predict the class of the query image, and then only that class is searched for relevant images. In the second approach, the whole dataset is searched for relevant images without using information about the class of the query image.
A CBIR system is built in [14] using a combination of deep features generated by a CNN and an SVM that trains a linear hyperplane. The authors use the CNN for feature extraction, while the SVM is applied to find the similarity between image pairs. A deep representation for image retrieval called regional maximum activations of convolutions (R-MAC) is built in [15]. Using R-MAC, a number of image regions are aggregated into a compact, fixed-length feature vector that is robust to scale and translation. This deep CNN gives high accuracy because it can handle high-resolution images with different aspect ratios. In [16], a CNN model is trained on ImageNet-2012. Then, for the CBIR task, the four layers extracted as the feature representation of the data are evaluated by their retrieval performance. Finally, the original features are compared with the binarized feature representation.
Different CNNs applied to CBIR tasks are examined and compared under varied settings in [9]. The feature representations of the images and the similarity measures between image pairs are learnt to perform the CBIR tasks. The authors attempt to verify whether CNNs are effective in learning image features when applied to CBIR tasks. A deep CNN model is proposed in [17] to learn feature representations from the activations of the convolutional layers. The authors suggest three retraining methods to improve the retrieval performance and reduce the amount of memory required: fully unsupervised retraining, when no information is available other than the dataset itself; retraining with relevance information, when labels of the training data exist; and relevance feedback-based retraining, when feedback from users is available.

DEEP CONVOLUTION NEURAL NETWORKS
Over the past years, there have been extensive studies using deep learning networks (DLNs), for example deep belief networks, Boltzmann machines, restricted Boltzmann machines, deep Boltzmann machines, and deep neural networks (DNNs) [9]. In this study, we investigate, compare, and evaluate some common DLNs and their applications to image classification and automatic image retrieval: AlexNet, VGG-16 and VGG-19, GoogleNet, and ResNet. We also compare the performance of these networks with prior work in this domain using known accuracy metrics and analyze the differences between the approaches. The following subsections explain these DLNs.

AlexNet
AlexNet is a DLN introduced by Alex Krizhevsky et al. [18]. The architecture of the AlexNet convolutional network is illustrated in Figure 2. As shown in this figure, convolution and max pooling operations are implemented at the first convolutional layer together with local response normalization (LRN). The parameters of a convolutional layer consist of a set of learnable filters, which are used to compute the image features for classification. The filters of a convolutional layer are updated by performing the full convolution operation on the feature maps between that layer and its immediately preceding layer. In the first layer, 96 different receptive filters of size 11*11 are used. Max pooling is performed with 3*3 filters and a stride of 2. The job of the pooling layer is to reduce the computational complexity by performing nonlinear down-sampling. The same operations are implemented with 5*5 filters in the second layer, and with 3*3 filters and 384, 384, and 256 feature maps in the third, fourth, and fifth convolutional layers respectively. More image details and local features are extracted because the filter sizes and strides are small. Two fully connected (FC) layers are used with dropout. In AlexNet, dropout addresses the problems of long training time and over-fitting. Finally, a softmax layer is used. AlexNet has been used in a wide range of applications such as object detection, video classification, and image segmentation [6,12,19-22].
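The effect of filter size and stride on the feature-map size can be illustrated with a minimal sketch. The 227*227 input size and the stride of 4 for the first 11*11 convolution are assumptions taken from common AlexNet implementations; they are not stated in the text above.

```python
def output_size(input_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial size of a feature map after a convolution or pooling layer."""
    return (input_size - kernel + 2 * padding) // stride + 1


# Assumed values (not given in the text): 227x227 input, stride 4 for conv1.
conv1 = output_size(227, kernel=11, stride=4)
# Max pooling as described in the text: 3x3 filters with a stride of 2.
pool1 = output_size(conv1, kernel=3, stride=2)

print(conv1, pool1)  # 55 27
```

Under these assumptions the first convolution reduces a 227*227 image to 55*55 feature maps, and the 3*3 max pooling with stride 2 halves this to 27*27.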

VGG-E Net
The VGG-E net was proposed by Simonyan et al. to study the relation between the depth of a network and its capacity; with 19 layers, it is deeper than AlexNet. Figure 3 shows the architecture of the VGG net. Its basic building block consists of two convolutional layers that use the ReLU activation function, followed by a single max pooling layer; several fully connected layers, which also use ReLU, come at the end. Max pooling is placed after the convolutional layers to tune the network, and padding is applied to preserve the spatial resolution. The last layer is a softmax layer, which is used for classification. The convolution filters are 3x3 with a stride of 1. Such small filters keep the computational complexity low and reduce the number of parameters. Several VGG-E models were proposed: VGG-11, VGG-16, and VGG-19, with 11, 16, and 19 layers respectively. All three models end with three fully connected layers; VGG-11 contains 8 convolution layers, VGG-16 has 13 convolution layers, and VGG-19 has 16 convolution layers and is reported to contain 138M weights and 15.5M MACs [21,23].
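The claim that small filters reduce the number of parameters can be checked with a short calculation. The sketch below compares two stacked 3x3 convolution layers with a single 5x5 layer covering the same 5x5 receptive field (assuming a stride of 1); the channel width of 64 is a hypothetical value chosen only for illustration, and biases are ignored.

```python
def conv_weights(kernel: int, channels_in: int, channels_out: int) -> int:
    """Number of weights (biases ignored) in one kernel x kernel convolution layer."""
    return kernel * kernel * channels_in * channels_out


channels = 64  # hypothetical channel width, for illustration only

two_3x3 = 2 * conv_weights(3, channels, channels)  # two stacked 3x3 layers
one_5x5 = conv_weights(5, channels, channels)      # one 5x5 layer, same receptive field

print(two_3x3, one_5x5)  # 73728 102400
```

The stacked 3x3 layers need 18 weights per channel pair against 25 for the 5x5 layer, a saving of 28%, while also adding an extra non-linearity between the two layers.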

GoogleNet
GoogleNet was proposed by Christian Szegedy et al. [22]. It was designed to reduce the computational cost while achieving high accuracy compared with traditional CNNs. It introduces the concept of the inception block, which combines multi-scale convolutional transformations by exploiting the idea of split, transform, and merge operations. Thus, different types of variations among images of the same category with diverse resolutions are learnt. Inception blocks replace the conventional layers. They encapsulate filters of different sizes (1*1, 3*3, and 5*5) to capture spatial information [23,21].
The architecture of GoogleNet is illustrated in Figure 4. The network uses nine inception modules and consists of 22 layers when counting only layers with parameters. Although GoogleNet has more layers than the networks before it, its number of parameters is much lower than that of the AlexNet and VGG networks: it has 7M parameters, while AlexNet and VGG have 60M and 138M parameters respectively. GoogleNet also has four max pooling layers and one average pooling layer. The average pooling layer, which is used before the classifier, has a filter of size 5*5 and a stride of 3. The network also uses a dropout layer with a ratio of 70% of dropped outputs. All convolutional layers and inception modules use ReLU [21,22].

ResNet
Deep residual networks, called ResNet, were proposed by Kaiming He et al. [24]. ResNet is one of the state-of-the-art CNNs used for image recognition. It won the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC-15) with a top-5 error of 3.57%. For instance, ResNet-50 reached an average top-5 error of 5.25% when trained on 1.28 million training images in 1000 classes. It has shown high accuracy in computer vision. Figure 5 shows the architecture of ResNet-50. In this study, ResNet-50 has been used for image classification. In this network, 5 convolutional stages are used and the input images are of size 224*224*3. ResNet, of which ResNet-50 is the 50-layer variant, is considered to be the first deep CNN that applied residual learning [24,25].
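The idea of residual learning can be sketched in a few lines: a block outputs the sum of its input and a learned residual, so the stacked layers only need to model the difference from the identity. This is a simplified illustration on plain vectors; the function `zero_residual` is a hypothetical stand-in for the stacked convolution layers of a real block and is not part of ResNet-50 itself.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return relu(F(x) + x): the shortcut carries x, the layers learn F."""
    fx = residual_fn(x)
    return relu([a + b for a, b in zip(fx, x)])


def zero_residual(v: Vector) -> Vector:
    """Hypothetical residual function that has learned nothing (F(x) = 0)."""
    return [0.0 for _ in v]


x = [0.5, 1.0, 2.0]
print(residual_block(x, zero_residual))  # [0.5, 1.0, 2.0]
```

With a zero residual the block simply passes its (non-negative) input through, which is why adding such blocks does not, in principle, make a deeper network harder to fit than a shallower one.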

PROPOSED METHOD
In this work, two scenarios are followed: image classification and image retrieval. Figure 6 shows the stages of the framework: training, CNN model training, image classification, feature extraction, similarity measure, and image retrieval. For image classification, different CNN approaches are investigated to classify a large number of images and are exploited to learn image features. Image classification is achieved in two stages. First, a set of training images associated with class labels is used to train a classifier. Second, the trained classifier is used to predict the class label of a query image based on its learned knowledge of the classes. The accuracy of the classifier is then evaluated according to its correct predictions. Image retrieval is implemented using the features learned by the CNN approaches, and the results are compared. Based on the outcomes and analyses, a new algorithm for image retrieval is developed (see subsection 4.2.3).

Data sets
Different datasets have been used for testing algorithms or approaches in CBIR. The datasets used in this paper to evaluate the performance of the CNNs are high-quality datasets whose images are unlabeled and compressed. The Corel 1K [26], Corel 10K [26], and Caltech 256 [27] datasets are used in this work to validate the proposed system.

Corel 1K
The Corel 1K dataset [26] consists of 1000 images, with 100 images for each class. The images are of size 256x384 or 384x256, and each image carries one of ten class labels (African people, beach, buildings, buses, dinosaurs, elephants, flowers, horses, mountains, and foods). These labels are annotated manually using an Excel file. A sample of 20 images is shown in Figure 7 with their labels.

Caltech 256
The Caltech 256 dataset [27] consists of 30,607 images of objects with different sizes. The images are divided into 256 classes. Researchers select some classes to evaluate their approaches or algorithms. In our experiment, we chose 50 classes with 100 images for each class. Figure 9 shows a sample of the images.

Experimental results and analysis
In this section, we present the results of the experiments conducted to evaluate the accuracy of IR and the computational efficiency of the proposed CNNs in terms of image classification and image retrieval. In image classification, a classifier trained on labeled images predicts the class label of a query image, and its accuracy is evaluated according to its correct predictions. IR returns the top T images as a ranked list of the database images that are most similar to a query image, using a similarity measure and no class labels. The accuracy is evaluated according to how many of the T images in the ranked list are correct. All experiments are performed using MATLAB 2018a on a computer with an Intel Core i7 CPU (2.5 GHz 2.6 GHz) and 8 GB RAM.

Evaluation of the performance
In image classification, a confusion matrix is usually used to evaluate the performance of a classifier. Table 1 shows a confusion matrix for two classes; it can be extended to m classes (i.e., m x m). True positive (TP), true negative (TN), false negative (FN), and false positive (FP) are the terms given to the outcomes of an image classification test [28]. Precision or accuracy is calculated as follows:

$$AC = \frac{TP + TN}{TP + TN + FP + FN}$$

where $AC$ is the precision or accuracy. In image retrieval, the mean average precision (MAP) is used for evaluation, based on the precision (P) and the average precision (AP) [28]:

$$P = \frac{N_r}{N_t}$$

where $P$ is the precision of image retrieval, $N_r$ is the number of relevant retrieved images, and $N_t$ is the total number of retrieved images;

$$AP = \frac{1}{N_c} \sum_{i=1}^{N_c} P_i$$

where $AP$ is the average precision of image retrieval, $P_i$ is the precision of image $i$ in the class, and $N_c$ is the total number of images in the class; and

$$MAP = \frac{1}{C} \sum_{k=1}^{C} AP_k$$

where $MAP$ is the mean average precision of image retrieval, $AP_k$ is the average precision of image class $k$, and $C$ is the total number of classes in the database.
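The evaluation measures described above can be sketched as follows. The accuracy function assumes the usual confusion-matrix definition (correct predictions over all predictions); the function names and the toy labels are illustrative and do not come from the experiments.

```python
from typing import Sequence


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Assumed definition: correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(retrieved_labels: Sequence[str], query_label: str) -> float:
    """P: relevant retrieved images divided by total retrieved images."""
    relevant = sum(1 for label in retrieved_labels if label == query_label)
    return relevant / len(retrieved_labels)


def average_precision(precisions: Sequence[float]) -> float:
    """AP: mean of the precisions of all query images in one class."""
    return sum(precisions) / len(precisions)


def mean_average_precision(class_aps: Sequence[float]) -> float:
    """MAP: mean of the APs over all classes in the database."""
    return sum(class_aps) / len(class_aps)


# Toy example: a "bus" query returns four images, three of them buses.
p = precision(["bus", "bus", "beach", "bus"], "bus")
print(p)                                   # 0.75
print(average_precision([0.75, 0.25]))     # 0.5
print(mean_average_precision([0.5, 1.0]))  # 0.75
```

Note that AP here is the plain mean of per-query precisions within a class, as described in the text, rather than the rank-weighted average precision used in some other retrieval benchmarks.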

Image classification
Many experiments are conducted on the image datasets. The training models of the networks are set up as follows: the datasets are divided into 70% for training, 15% for validation, and 15% for testing. In addition, the training parameters of the CNNs are set as follows: the learning rate is 0.00001 and the maximum number of epochs is 435. Also, the weight learning rate factor and the bias learning rate factor of the fully connected layer are set to 20.
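A minimal sketch of the 70/15/15 split is given below. It assumes a simple random shuffle of the whole dataset; whether the original split was random or stratified per class is not stated, so this is only one plausible reading. The dictionary keys are hypothetical names used to collect the training settings quoted above.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence, train: float = 0.70, val: float = 0.15,
                  seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle the items and split them into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Training settings quoted in the text (key names are hypothetical).
training_options = {
    "learning_rate": 0.00001,
    "max_epochs": 435,
    "fc_weight_learn_rate_factor": 20,
    "fc_bias_learn_rate_factor": 20,
}

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 700 150 150
```

For a 1000-image dataset such as Corel 1K this yields 700 training, 150 validation, and 150 test images.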
The CNNs used in this paper, as mentioned in the previous section, are: a simple CNN, AlexNet, GoogleNet, ResNet-50, Vgg-16, and Vgg-19. These models are compared with conventional methods used for IR, such as the hue saturation value (HSV) colour feature, gray level co-occurrence matrix (GLCM) features, and the scale invariant feature transform (SIFT) [6]. The accuracy on the testing and validation sets is used to evaluate the performance of these methods. The results of the conventional methods and of the CNN models as feature extractors on the Corel 1K dataset, with data augmentation, are shown in Table 2. As can be seen from this table, the best accuracies (99%, 97%, and 95%) are achieved by the CNN models when the training, validation, and testing data are augmented, and they exceed those of the conventional approaches. For the Corel 1K dataset, the CNN-based models converged to excellent accuracy and demonstrated high performance in the training stage with the least number of epochs. Although there are no significant differences in the convergence of the models (more than 95%), they took more training time to converge as the complexity of the CNNs increased. On the other hand, the results for the same dataset without data augmentation are not good. For example, the testing accuracy is 10%, 46%, and 68% for the simple CNN, AlexNet, and GoogleNet respectively. This shows that data augmentation can be used successfully to improve the performance of CNNs.
Figure 10 shows a sample of 30 class-probability results for the AlexNet and GoogleNet convolutional neural networks used as feature extractors with augmentation. From the results, it is observed that most classes have high accuracy, so the classification is largely successful. The results of the CNN models are also superior to those of the three conventional methods. The results of the CNN models as feature extractors on the Corel 10K and Caltech 256 datasets, with data augmentation, are shown in Table 3 and Table 4. From the experiments, it is apparent that the Corel 1K images are classified correctly by the CNN models with high accuracy, while the Caltech 256 data with 50 classes has low accuracy. It is concluded that the image features are learnt by the pre-trained models and that there is no need to design features manually.

Image retrieval
As mentioned earlier, feature representation is one of the challenges of the semantic gap in CBIR. Recently, CNNs have been used to learn more accurate features. Hence, our aim in these experiments is to use the above CNN approaches to learn features and to perform image retrieval with them without using class labels. The features resulting from five CNNs (AlexNet, GoogleNet, ResNet-50, Vgg-16, and Vgg-19) are tested separately according to the framework in Figure 6.
First, image retrieval experiments are conducted on the standard Corel 1K database to judge which deep learning approach produces more effective features than the others. Precision (P) is calculated for each image in a leave-one-out manner, and then the MAPs are computed. The city-block (L1) distance function is used to compute the similarity between a query image feature vector and the database image feature vectors. The resulting distance values are ranked in ascending order. The MAPs of the CNN approaches for the Top 5-100 retrieved images are calculated and illustrated in Figure 11.
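The leave-one-out retrieval procedure described above can be sketched as follows: each image in turn serves as the query, the remaining images are ranked by city-block distance in ascending order, and the precision of the top T results is averaged per class and then over classes. The feature vectors and labels below are toy values chosen for illustration; they are not CNN features.

```python
from typing import Dict, List, Sequence


def city_block(a: Sequence[float], b: Sequence[float]) -> float:
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def retrieve(query_index: int, features: Sequence[Sequence[float]], top_t: int) -> List[int]:
    """Indices of the top_t database images closest to the query (query left out)."""
    candidates = [i for i in range(len(features)) if i != query_index]
    candidates.sort(key=lambda i: city_block(features[query_index], features[i]))
    return candidates[:top_t]


def leave_one_out_map(features: Sequence[Sequence[float]], labels: Sequence[str],
                      top_t: int) -> float:
    """MAP over classes, where each image is used once as the query."""
    per_class: Dict[str, List[float]] = {}
    for q in range(len(features)):
        ranked = retrieve(q, features, top_t)
        p = sum(labels[i] == labels[q] for i in ranked) / top_t
        per_class.setdefault(labels[q], []).append(p)
    class_aps = [sum(ps) / len(ps) for ps in per_class.values()]
    return sum(class_aps) / len(class_aps)


# Toy data: two well-separated classes with three images each.
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
            [5.0, 5.1], [5.1, 5.0], [5.2, 5.1]]
labels = ["bus", "bus", "bus", "beach", "beach", "beach"]

print(leave_one_out_map(features, labels, top_t=2))  # 1.0
```

Class labels are used here only to score the ranked list, not to produce it, which matches the retrieval setting described in the text.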
It is clear that the 10D feature produced by GoogleNet with 22 layers is more effective and robust than the others across the Top 5-100 retrieved images. Meanwhile, AlexNet with 20 layers extracted a 4096D feature that has the lowest performance. Vgg-16 and Vgg-19 produced features of the same length as that of AlexNet, but they performed better. ResNet-50 extracted a feature of smaller dimension (2048) than the AlexNet, Vgg-16, and Vgg-19 approaches, but its features are more robust, especially for the Top30-100 ranked lists of images. Therefore, it is of interest to analyze the individual image classes for GoogleNet and ResNet-50 at the Top100 retrieved images. The APs are shown in Figure 12.
At first glance, there is a clear difference between the two approaches: the performance of GoogleNet is higher than that of ResNet-50 for all classes except the bus class, where the rates are equal. To judge whether the difference is significant, a t-test is used, calculated as [29]:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the sample mean precision rates ($P$), $s_1$ and $s_2$ are the standard deviations, and $n_1$ and $n_2$ are the sample sizes. Two hypotheses are considered: the null hypothesis (H0), $\bar{x}_1 - \bar{x}_2 = 0$, and the alternative hypothesis (HA), $\bar{x}_1 - \bar{x}_2 \neq 0$. The p-value of the test is the probability of observing a test statistic at least as extreme as the observed one under the null hypothesis. A small p-value means that the null hypothesis is rejected at the 0.05 significance level. The test was computed for each class in the Corel 1K database, so each sample contains 100 elements (i.e., precision values). The first sample (S1) and the second sample (S2) contain the precision rates of the Top100 retrieved images using the GoogleNet feature and the ResNet-50 feature respectively. The t-test showed that all differences between precision values are significant, even for the Buses class. Figure 13 shows the two samples, where most values of S1 are 99%, compared with S2. We can conclude that GoogleNet learned a low-dimensional feature (10D), which means less computation, with high accuracy, owing to the inception block that exploits split, transform, and merge operations to combine multi-scale convolutional transformations. Therefore, different types of variations among images of the same category with diverse resolutions are learnt. In other words, GoogleNet at layer 22 is able to extract more discriminative information about the objects of interest than ResNet-50.
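The t statistic used in this comparison can be sketched as below, assuming the unpooled two-sample form (with equal sample sizes, as here, the pooled form gives the same value). The precision values are hypothetical and serve only to show the call; obtaining a p-value additionally requires the t distribution, which a statistics package would provide.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def two_sample_t(sample1: Sequence[float], sample2: Sequence[float]) -> float:
    """t statistic for the difference between the means of two samples."""
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = stdev(sample1) ** 2, stdev(sample2) ** 2  # sample variances
    return (mean(sample1) - mean(sample2)) / math.sqrt(var1 / n1 + var2 / n2)


# Hypothetical precision values for two feature extractors (not measured data).
precisions_a = [0.99, 0.98, 1.00, 0.97, 0.99]
precisions_b = [0.90, 0.85, 0.95, 0.88, 0.92]

t_value = two_sample_t(precisions_a, precisions_b)
print(round(t_value, 2))
```

A large absolute value of t corresponds to a small p-value, so the null hypothesis of equal means would be rejected at the 0.05 level.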
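We conducted further retrieval experiments to investigate the second issue in CBIR, namely similarity measures. In the literature, different measures have been used to compute the similarity between a query image and database images, depending on the image descriptor; for instance, whether the descriptor is represented as a single vector or a set of vectors, in a linear space or on a non-linear manifold [29,30]. Hence, the correlation (D1), cosine (D2), and Euclidean (D3) distances were each applied separately in our system in place of city-block (D4). Suppose $x$ and $y$ refer to the query image and database image feature vectors respectively, each of dimension $n$; then D1, D2, D3, and D4 are defined as follows [31]:

$$D_1(x, y) = 1 - \frac{\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_{j=1}^{n} (x_j - \bar{x})^2}\,\sqrt{\sum_{j=1}^{n} (y_j - \bar{y})^2}}$$

$$D_2(x, y) = 1 - \frac{\sum_{j=1}^{n} x_j y_j}{\sqrt{\sum_{j=1}^{n} x_j^2}\,\sqrt{\sum_{j=1}^{n} y_j^2}}$$

$$D_3(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}$$

$$D_4(x, y) = \sum_{j=1}^{n} |x_j - y_j|$$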

TELKOMNIKA Telecommun Comput El Control
where $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$ and $\bar{y} = \frac{1}{n} \sum_{j=1}^{n} y_j$.

Figure 11. MAPs for CNN approaches using Corel 1K

Table 5 shows the APs for individual Corel 1K classes for the Top20 retrieved images, compared with recent work from 2020. It is clear that the ability of the GoogleNet approach to learn a low-dimensional feature (10D) reduced the semantic gap across all classes using the above four distances. Hence, our proposed method achieved remarkable rates compared with the recent methods [32,33], which are more complicated. The method in [32] fuses two features: the first is produced by detecting salient objects together with spatial color and texture features, and the second by the ResNet CNN approach. In our experience, fusing conventional and CNN features degraded or did not affect the retrieval rates when we fused the feature learned by the GoogleNet CNN with a global local binary patterns (LBP) colour texture feature (177D) from YCbCr images. Meanwhile, the method in [33] fuses two conventional features. The first can detect shapes, objects, and texture by locating interest points, and the second consists of color features extracted from spatially arranged L2-normalized coefficients. This evidence supports the view that the features learned by CNN approaches are more effective. Table 6 lists the APs of image retrieval for the Top100 retrieved images using the above four distance functions to calculate the similarity between the query image and database feature vectors. Correlation and cosine perform equally and are higher than city-block over all classes, because the cosine similarity between two images is the cosine of the angle formed by the two feature vectors that describe their visual content, and the correlation similarity between two vectors is a mean-centered cosine similarity. Both similarity measures are subtracted from 1 to obtain the distances D1 and D2. Meanwhile, the Euclidean distance approaches the correlation and cosine distances.
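The four distance functions compared above can be sketched in plain Python as follows. The sketch follows the standard definitions of these distances and assumes non-zero, non-constant feature vectors so that the denominators do not vanish; the example vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """D2: one minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def correlation_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """D1: cosine distance between the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_distance([a - mean_x for a in x], [b - mean_y for b in y])


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """D3: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """D4: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


query = [1.0, 2.0, 3.0]
database_image = [2.0, 4.0, 6.0]  # same direction as the query, twice the length

for distance in (correlation_distance, cosine_distance,
                 euclidean_distance, city_block_distance):
    print(distance.__name__, distance(query, database_image))
```

In this example the correlation and cosine distances are (numerically) zero because the two vectors point in the same direction, whereas the Euclidean and city-block distances are not, since they are sensitive to the magnitude of the feature vectors.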
To judge whether the differences between D1 and D4 are significant, a t-test was applied to samples of the precision values achieved with D1 and D4 for each class, as shown in Figure 14. The t-test supported the alternative hypothesis (HA), i.e., the difference between the means is not zero; the p-values are small, so the null hypothesis is rejected at the 0.05 significance level for all classes, as shown in Table 7. We extended the experiment to the Corel 10K and Caltech 256 databases with 50 classes using the best approach (i.e., GoogleNet). The approach produced a learned feature of 50 dimensions for both databases. Image retrieval was then applied using the above four distances. The results showed that D1 and D2 perform about 6% better than D3 and D4 for the Top100 retrieved images, as shown in Table 8. Hence, we arrived at a novel algorithm that uses the GoogleNet CNN approach to learn image features and the correlation or cosine distance function to compute the similarity between query and database images, as shown in Figure 15.

CONCLUSION
In this paper, a novel algorithm for image retrieval using deep neural network learning was developed, based on experience from exhaustive experiments in image classification and retrieval, where class labels are available and unavailable respectively. Different CNNs were used and compared with conventional IR methods. The developed algorithm uses the GoogleNet CNN approach to learn features and the correlation/cosine distance function to compare two images. Remarkable rates were achieved compared with recent methods, owing to the effective learned features and the accurate distance function. The semantic gap challenge was consequently reduced. We plan to evaluate this algorithm on face and medical image databases. In future work, we will also implement the CNN approaches using different colour spaces, such as YCbCr and HSV, to examine the impact on accuracy.