Image Caption Generation and Comprehensive Comparison of Image Encoders

Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet human-generated captions are still considered better, which makes image captioning a challenging application for interactive machine learning. In this paper, we compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models, which are fed, along with embedded text, into an encoder-decoder network based on stacked LSTMs with soft attention to generate high-accuracy captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.


Introduction
One of humans' abilities is to describe the conditions they are present in. Given an image, it is easy for humans to tell, at a glance, everything about it [1]. Developing machines with the ability to understand and interpret the real world is one of the driving forces for researchers in the domain of artificial intelligence.
Even though extensive research has been done on different computer vision problems, such as object recognition [2], [3], attribute classification [4], [5], action classification [6], [7], image classification [8], and scene recognition [9], [10], making a computer automatically describe an image with human-like sentences is a comparatively new task. Using a machine to automatically create a natural language description for an image, known as image captioning, is challenging. Spanning the research communities of computer vision and NLP, captioning an image not only requires a notable understanding of the visual contents of an image but also requires turning that understanding into a human-like sentence. Determining the behaviors, attributes, and relations of objects in an image is not an easy task, and converting the visual understanding into human-readable sentences makes this task even more difficult. Since natural language constitutes most human interaction, whether written or spoken, enabling machines to describe the visual world will lead to a substantial number of feasible applications, such as natural human-robot interaction, early childhood learning, information retrieval, aid for the visually impaired, and more.

DOI: 10.5281/zenodo.5196025. Received: March 01, 2021. Accepted: July 27, 2021.
Being both challenging and significant, the image captioning field is receiving widespread attention around the globe. Image captioning has a wide range of applications, including self-driving cars and aids for the blind. Given an image, the goal of image captioning is to form a sentence that is grammatically credible and semantically valid with respect to the content of the image, as shown in Figure 1. This process involves two steps: visual processing and linguistic processing. Techniques from computer vision and NLP are combined to ensure that the generated captions are grammatically and semantically correct and to deal with the problems arising from each modality; to this end, many methods have been proposed, as discussed below. Though the image captioning task is complex, the latest breakthroughs in deep neural networks [11][12][13][14][15][16], used extensively in computer vision [17][18][19][20] and NLP [21][22][23][24], have made it more tractable, and image caption generators based on deep neural networks came into existence. Robust deep neural networks provide effective solutions for visual and language modelling; they are therefore used both to supplement existing systems and to design many new approaches. Applying deep neural networks to the image captioning task has produced state-of-the-art results [25][26][27][28][29][30].
With the recent progress in transfer learning and image captioning, we propose a novel architecture and compare multiple transfer learning models on different metrics, including the BLEU score.

Related Work
Progress in the field of machine learning has opened new avenues for using deep neural networks instead of the hand-engineered features and shallow models used earlier.
To generate a descriptive caption for a given image, Socher et al. applied dependency-tree recursive neural networks to convert phrases and sentences into compositional vectors. They used another deep neural network [31] to convert images into feature vectors. Ma et al. proposed a multimodal convolutional neural network [32] to measure the relationship between images and captions based on different levels of interaction between them. This framework included CNNs to encode the image [33], [34], a matching CNN to relate visual and textual data [35][36][37][38], and multilayer perceptrons for scoring the compatibility of image and caption data. The authors used various modifications of the matching CNN to establish the correct relationship between images and captions, and an ensemble of multimodal convolutional neural networks determines the final matching score. Taking into consideration recent advances in neural machine translation [22], [39], [40], the encoder-decoder framework has been applied to generate captions for images. Kiros et al. introduced an encoder-decoder framework in the field of image captioning that merges joint image-text embedding models and multimodal neural language models, so that a sentence is generated word by word [33] for a given image, as in language translation. To encode the data, Kiros et al. encode images with a CNN and, via a pairwise ranking loss, project this encoded visual data into an embedding space spanned by LSTM hidden states that encode textual data. Finally, a structure-content neural language model decodes the visual features, conditioned on context word feature vectors, to generate the caption word by word. Inspired by the human visual attention mechanism, [42], [43] utilized attention mechanisms to guide image caption generation. By adding an attention mechanism to the encoder-decoder framework, caption generation depends not only on the values of the hidden states but also on the different parts of the image highlighted by the attention mechanism.

Methodology
In this paper, we propose a novel architecture for image caption generation that extracts image feature vectors using different pretrained models and uses an encoder-decoder model for caption generation. The following tasks are performed, in chronological order, to achieve this goal.

Data Collection
In this paper, we compare the results of different pre-trained models on benchmark datasets, namely Flickr8k and Flickr30k. Flickr8k [26] contains 8,000 images obtained from Flickr, mostly depicting people and animals. Five sentences correspond to every image, collected through the Amazon Mechanical Turk crowdsourcing service. During the annotation process, workers were instructed to focus only on the image itself, ignoring the context behind it. Flickr30k [44] is an extension of the Flickr8k dataset, with 31,783 images in total. Every image is annotated with five captions deliberately written for it. The images in this dataset are about, but not limited to, people involved in normal activities and daily events.

Text Cleaning
For text cleaning, some basic operations are performed: converting all words to lowercase (otherwise "hello" and "Hello" would be treated as two different words), deleting special tokens (such as '%', '$', '#'), and eliminating alphanumeric words (such as 'hey199'). A vocabulary is then created from all the words present in all the captions, storing each word together with its frequency. For the predictive caption model, only words with a sufficiently high occurrence rate are kept, by choosing a threshold, i.e. the minimum frequency of a word in the entire dataset. This removes the model's dependency on outliers and makes the model more robust. The maximum caption length, i.e. the largest number of words in any caption, is also computed. The total number of distinct words is called the vocab size, which determines the number of neurons in the output layer of the merge model.
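The cleaning and frequency-thresholding steps above can be sketched as follows (the example captions and the threshold value are illustrative, not from the paper):

```python
from collections import Counter

def clean_caption(caption):
    """Lowercase, strip surrounding special characters, drop alphanumeric tokens."""
    words = caption.lower().split()
    words = [w.strip("%$#.,!?'\"") for w in words]   # remove special tokens like '%', '$', '#'
    return [w for w in words if w.isalpha()]          # drop alphanumeric words like 'hey199'

def build_vocab(captions, min_freq=2):
    """Keep only words whose corpus frequency reaches the chosen threshold."""
    freq = Counter(w for c in captions for w in clean_caption(c))
    return {w for w, n in freq.items() if n >= min_freq}

captions = ["A dog runs", "the dog jumps", "A cat sleeps"]
vocab = build_vocab(captions, min_freq=2)   # only words seen at least twice survive
```

With a threshold of 2, rare words such as "runs" or "cat" above are excluded, which is exactly the outlier-removal effect described.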

Data Preprocessing
Data available from the datasets needs to be preprocessed before being fed to the models for training. There are two types of data available to us: images and their corresponding captions. A detailed explanation of the preprocessing of each input follows.

Captions
The main goal is to generate an appropriate caption for every image, so during the training phase the model learns to generate the captions as the target variables (Y). The caption for a given image is not predicted all at once; it is predicted iteratively, word by word. Thus each word needs to be encoded as a fixed-size word vector. Two tables are maintained: word-to-index and index-to-word. The word-to-index table stores all words, assigning each an integer value, and the index-to-word table stores the reverse mapping. For example, if the word-to-index table maps a word "abc" to an integer K, then the index-to-word table maps the integer K back to the word "abc". The captions are preprocessed to maintain some basic uniformity: a special token is added at the start and end of each caption, which helps the model recognize where a caption begins and ends. These special start ("<seqstart>") and end ("<seqend>") tokens are also added to both the word-to-index and index-to-word tables. Finally, all captions are made the same length: so as not to waste any of the collected data, every caption is padded to the length of the longest caption. The padding word is stored in the word-to-index table with the integer 0 and in the index-to-word table at index 0.
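The lookup tables, special tokens, and padding described above can be sketched as follows (the token spellings "<seqstart>"/"<seqend>" follow the text; the "<pad>" name and index assignments are illustrative):

```python
def build_lookup_tables(captions):
    """Map every word to an integer and back; index 0 is reserved for padding."""
    words = sorted({w for c in captions for w in c.split()})
    word_to_idx = {"<pad>": 0, "<seqstart>": 1, "<seqend>": 2}
    for w in words:
        word_to_idx.setdefault(w, len(word_to_idx))
    idx_to_word = {i: w for w, i in word_to_idx.items()}  # reverse mapping
    return word_to_idx, idx_to_word

def encode_caption(caption, word_to_idx, max_len):
    """Wrap with start/end tokens, convert words to indices, pad with 0 up to max_len."""
    tokens = ["<seqstart>"] + caption.split() + ["<seqend>"]
    idx = [word_to_idx[t] for t in tokens]
    return idx + [0] * (max_len - len(idx))

captions = ["a dog runs", "a cat sleeps on the mat"]
w2i, i2w = build_lookup_tables(captions)
encoded = encode_caption("a dog runs", w2i, max_len=8)
```

Every encoded caption now has the same length, so a whole batch can be fed to the model as one matrix.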

Images
All the datasets consist of images of real-world entities. To make things easier, we can use transfer learning, as pre-trained models such as ResNet, Inception, and EfficientNet are available that have already been trained on millions of similar images and achieve remarkable accuracy in classifying them. These models can be used to extract feature vectors from images. The datasets contain multi-channel images stored as 3-D matrices holding the red, green, and blue channel values, each in the range 0 to 255. To speed up computation, the values are scaled so that each pixel lies in the range -1 to 1. After removing the final layer from the pretrained model, a bottleneck layer produces the information vector. This image feature vector is then passed as an input to the final merge model.
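The pixel scaling described above can be sketched as follows (the toy image values are illustrative; the commented backbone call shows where a headless pretrained model would consume the scaled image, under the assumption of a Keras-style API):

```python
import numpy as np

def scale_pixels(image):
    """Map 8-bit RGB values in [0, 255] to [-1, 1], as many pretrained models expect."""
    return image.astype(np.float32) / 127.5 - 1.0

# A hypothetical 2x2 RGB image with values in 0..255:
img = np.array([[[0, 127, 255], [255, 0, 0]],
                [[51, 102, 204], [255, 255, 255]]], dtype=np.uint8)
scaled = scale_pixels(img)

# In practice the scaled image is fed to a pretrained backbone whose final
# classification layer has been removed, e.g. (sketch only, not run here):
#   base = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
#   feature_vector = base.predict(scaled[np.newaxis])
```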

Model Architecture
In this paper, we propose a novel architecture for image caption generation, as shown in Figure 2. The model first consists of a transfer learning model that converts the input image into a feature vector, and an embedding layer that embeds the captions corresponding to each image; this part of the model is called the Feature Extraction Model. The embedded data is then sent through an encoder-decoder network with soft attention, which generates the next word given the image and the partial sentence; this model is known as the Merge Model. For a given image, the model generates a sequence of words as an output caption y encoded as y = {y_1, y_2, ..., y_n}, y_i ∈ R^K, where K is the vocabulary size and n is the maximum caption length. A detailed explanation of these models and the different techniques used follows.

Feature Extraction Model
This model consists of a pre-trained deep learning model to convert images into feature vectors and a word embedding layer to convert captions into feature vectors. Image feature vectors are extracted by slicing off the final softmax layer of the transfer learning model. Using the different pre-trained models to encode the image, we extract V vectors of D dimensions each from the lowest convolution layer, i.e. the global average pooling layer, such that each vector represents a part of the image. Stated formally, we extract an image representation p = {p_1, p_2, ..., p_V}, p_i ∈ R^D. Different pre-trained models were used to extract feature vectors from the images.

ResNet
A residual neural network (ResNet) [48] is a deep learning model whose construction is inspired by the pyramidal cells of the cerebral cortex. ResNet models achieve their capability by using skip connections. Generally, double- or triple-layer skips are used, together with non-linearities (ReLU) and batch normalization, to build standard ResNet models. HighwayNets are a related model that learns separate weights to gate the skip connections, and DenseNets are models with several parallel skips. Despite their considerably increased depth, residual networks are easier to optimize and achieve high accuracy.
Residual networks with a depth of 152 layers have lower complexity despite being 8 times deeper than VGG nets. An ensemble of these ResNets achieved a 3.57% error rate on the ImageNet test set.

Inception
The Inception network [49] was a principal invention in the evolution of convolutional classifiers. It moved away from the prevailing idea of stacking up convolutional layers to make a CNN deeper for better performance: its authors proposed making the network wider rather than deeper. An Inception module applies filters of three different sizes, 1x1, 3x3, and 5x5, in parallel, alongside an additional max-pooling path; their outputs are concatenated and sent to the next Inception module. The Inception network has evolved iteratively, with each version improving significantly on the previous one. Inception v3 inherited all updates from Inception v2 and added the RMSProp optimizer, factorized 7x7 convolutions, batch normalization in the auxiliary classifier, and label smoothing.

Efficient Net
EfficientNets [50] are a new series of models that use neural architecture search to obtain and scale new baseline networks. EfficientNets have achieved much better accuracy and efficiency than previous convolutional networks. In particular, EfficientNet-B0, while being 4.9x smaller and 4x faster at inference, achieved a state-of-the-art 77.1% top-1 accuracy on ImageNet. EfficientNets use a multi-objective neural architecture search that optimizes both accuracy and FLOPS. Using the compound scaling method, EfficientNet models can be scaled up effectively, surpassing state-of-the-art accuracy with fewer training parameters and FLOPS.
We use pre-trained word embeddings for the input sequence of words. For each word, we obtain an embedding vector of length m, where m = 300 in our case; hence for a sequence of n words we get an n x m matrix as input to the LSTM, where m is the number of features and n is the length of the sequence. More formally, given an input caption of n words represented as one-hot vectors w_1, ..., w_n with w_i ∈ R^K, the embedding layer generates the sequence e_1, ..., e_n with e_i = W_e w_i, where W_e ∈ R^{m×K} is the embedding matrix, K is the size of the vocabulary, and m is the embedding dimension.
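The embedding lookup described above can be sketched in numpy (the sizes K, m, n are toy values, and the randomly initialized W_e stands in for a pre-trained embedding matrix):

```python
import numpy as np

K, m, n = 10, 4, 3                    # vocab size, embedding dim, sequence length (toy)
rng = np.random.default_rng(0)
W_e = rng.normal(size=(m, K))         # embedding matrix; pre-trained in practice

def embed(word_indices):
    """Look up an m-dim embedding for each word index; result is an n x m matrix."""
    one_hot = np.eye(K)[word_indices]     # (n, K) one-hot rows w_i
    return one_hot @ W_e.T                # (n, m): each row is e_i = W_e w_i

seq = embed([2, 7, 5])                # embeds a 3-word caption
```

Real implementations index W_e directly instead of materializing one-hot vectors, but the result is the same n x m matrix.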

Merge Model
So far, we have encoded images and captions into feature vectors. The output of the image encoder and the encoded form of the caption generated so far are merged in the merge model to predict the caption; Figure 3 shows a visual representation of how the merge model works. The combination of these two encoded inputs is used to generate the next word in the sequence.
In the merge architecture, we use three stacked deep Long Short-Term Memory (LSTM) layers with soft attention to combine the encoded image input with the caption generated so far.

Long Short Term Memory Network
We developed a deep Long Short-Term Memory (LSTM) network by stacking multiple LSTM layers on top of each other, such that the input of an LSTM layer is the output of the previous layer. The basic structure of an LSTM cell is depicted in Figure 4. An LSTM cell has a single cell state and three gates: the input, output, and forget gates. At each time-step t, the cell state c_{t-1} and the hidden state h_{t-1} generated during the previous time-step t-1 are fed back into the LSTM, and the input u_t is received at the present time t. Using f_LSTM(·) to denote the feed-forward function of the LSTM, the state update can be written as

(h_t, c_t) = f_LSTM(u_t, h_{t-1}, c_{t-1})

The internal computations on the gates and memory cell are as follows:

i_t = σ(W_i u_t + R_i h_{t-1} + b_i)
f_t = σ(W_f u_t + R_f h_{t-1} + b_f)
o_t = σ(W_o u_t + R_o h_{t-1} + b_o)
z_t = tanh(W_z u_t + R_z h_{t-1} + b_z)
c_t = f_t * c_{t-1} + i_t * z_t
h_t = o_t * tanh(c_t)

Here R denotes the recurrent weight matrices, W the input weight matrices learned during training, and b the bias vectors. σ denotes the sigmoid function, σ(x) = 1/(1 + exp(-x)); it has a squashing effect, condensing its input into the range (0, 1). tanh is the hyperbolic tangent function, producing values in the range (-1, 1) to avoid explosive growth of values over time.
Both functions are applied element-wise. i_t, o_t, and f_t denote the input, output, and forget gates, respectively; each is computed by adding linear projections of u_t and h_{t-1} and applying the sigmoid function. The input transformation z_t, the previous cell value c_{t-1}, and the output non-linearity are modulated by the input, forget, and output gates, respectively, via element-wise multiplication, denoted by *.
The annotation vectors a_i, i = 1, ..., L, are the features that correspond to different image sub-regions. Also, since we use a multi-layer LSTM stack, the first layer LSTM_1 takes the word embeddings of the input sequence as input; for layers LSTM_2 and LSTM_3, the input is the output of the previous layer.
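A single LSTM time-step of the kind described above can be sketched in numpy (dimensions and the random initialization are toy values; real weights are learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(u_t, h_prev, c_prev, W, R, b):
    """One LSTM time-step; W, R, b hold the stacked [i, f, o, z] parameters."""
    d = h_prev.shape[0]
    acts = W @ u_t + R @ h_prev + b       # all four linear projections at once, (4d,)
    i_t = sigmoid(acts[0:d])              # input gate
    f_t = sigmoid(acts[d:2*d])            # forget gate
    o_t = sigmoid(acts[2*d:3*d])          # output gate
    z_t = np.tanh(acts[3*d:4*d])          # input transformation
    c_t = f_t * c_prev + i_t * z_t        # new cell state (element-wise *)
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
m, d = 5, 3                               # input size and hidden size (toy)
W = rng.normal(size=(4 * d, m))
R = rng.normal(size=(4 * d, d))
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=m), h, c, W, R, b)
```

Stacking layers means feeding the h_t of one such cell as the u_t of the next, exactly as in the three-layer stack above.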

Soft Attention
In the soft attention approach, in addition to the word embeddings of the input sequence, we also use a context vector s_t, a representation of the image for that time-step which provides information about the relevant portion of the image. Hence the LSTM computations take the context vector as an additional input:

i_t = σ(W_i u_t + R_i h_{t-1} + Z_i s_t + b_i)
f_t = σ(W_f u_t + R_f h_{t-1} + Z_f s_t + b_f)
o_t = σ(W_o u_t + R_o h_{t-1} + Z_o s_t + b_o)
z_t = tanh(W_z u_t + R_z h_{t-1} + Z_z s_t + b_z)

where Z denotes the weight matrices applied to the context vector. For the computation of the context vector s_t, for each annotation vector a_i corresponding to a location in the image, a positive weight α_{ti} is calculated by the attention mechanism, denoting the relative importance of that location for generating the word at the present time-step:

α_{ti} = exp(m_att(a_i, h_{t-1})) / Σ_{j=1}^{L} exp(m_att(a_j, h_{t-1}))

The attention model m_att is a function of a_i and h_{t-1}, i.e. the annotation vector and the hidden state at time-step t-1. From the weights associated with each image region, the context vector s_t can be calculated as

s_t = φ({a_i}, {α_{ti}}) = Σ_{i=1}^{L} α_{ti} a_i

where φ returns a single vector for the image.
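The weighting-and-summing step of soft attention can be sketched in numpy (the region features and raw scores are toy values; in the real model the scores come from the learned attention network m_att):

```python
import numpy as np

def soft_attention(annotations, scores):
    """Turn raw relevance scores into positive weights and a context vector.

    annotations: (L, D) array of feature vectors a_i for L image regions.
    scores:      (L,)  raw outputs of the attention model m_att(a_i, h_prev).
    """
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()                    # positive weights summing to 1
    s_t = alpha @ annotations               # context vector: weighted sum of regions
    return alpha, s_t

a = np.array([[1.0, 0.0],                   # L = 3 regions, D = 2 features (toy)
              [0.0, 1.0],
              [1.0, 1.0]])
alpha, s_t = soft_attention(a, np.array([2.0, 0.5, 0.5]))
```

The region with the highest score receives the largest weight, so the context vector leans toward that part of the image.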

Result
We perform extensive experiments to evaluate the proposed models. We report all results on the Flickr8k and Flickr30k datasets using BLEU and METEOR.

BLEU
BLEU [45] is an evaluation metric that matches variable-length phrases of a predicted or generated sentence against original human-written sentences to measure their correlation. The BLEU score is calculated by comparing a generated sentence with the original sentences in n-grams: BLEU-1 compares the predicted sentence with the original sentences using unigrams, while BLEU-2 uses bigrams for matching. The best correlation with human judgement is obtained empirically with BLEU of maximum order four. In BLEU, the higher-order n-gram scores account for fluency, while the unigram score accounts for adequacy.
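The core of BLEU-1, the modified (clipped) unigram precision, can be sketched as follows; full BLEU additionally combines higher-order n-grams geometrically and applies a brevity penalty, both omitted here, and libraries such as NLTK provide complete implementations. The sentences are illustrative:

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clip each candidate word's count by its maximum count in any reference."""
    cand = Counter(candidate.split())
    refs = [Counter(r.split()) for r in references]
    clipped = sum(min(n, max(r[w] for r in refs)) for w, n in cand.items())
    return clipped / max(1, sum(cand.values()))

p1 = modified_unigram_precision(
    "the dog runs on the red grass",
    ["the dog runs across the grass", "a dog is running on grass"],
)
```

Clipping prevents a candidate from scoring well by simply repeating a common reference word.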

METEOR
METEOR [47] is an automatic machine translation evaluation metric. Its calculation has two steps: first, generalized unigram matching between a predicted sentence and the original human-written sentence; second, computing a score from the matching results, which involves the recall, precision, and alignment of the matched words. When there is more than one original sentence, the highest of the independently calculated scores is taken as the final result for the predicted sentence. The metric was introduced to address a shortcoming of BLEU, which is based only on the precision of the matched n-grams.
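The precision-and-recall step of METEOR can be sketched as follows, using the recall-weighted harmonic mean from the original METEOR formulation; the alignment-based fragmentation penalty is omitted, and the counts are illustrative:

```python
def meteor_fmean(matches, cand_len, ref_len):
    """METEOR's recall-weighted harmonic mean of unigram precision and recall."""
    if matches == 0:
        return 0.0
    p = matches / cand_len            # unigram precision
    r = matches / ref_len             # unigram recall
    return 10 * p * r / (r + 9 * p)   # recall weighted 9x more heavily

score = meteor_fmean(matches=5, cand_len=6, ref_len=7)
```

Weighting recall this heavily is what lets METEOR penalize captions that are precise but incomplete, which BLEU's precision-only view misses.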

Performance on Flickr 8k
We calculated different evaluation metrics for the different models to find the best pretrained model for image captioning.
The model based on ResNet-50 performed well on the Flickr8k dataset, with a BLEU-1 score of 0.62 and a METEOR score of 0.153. The Inception model performed better than ResNet-50, with a BLEU-1 score of 0.627, a difference of 0.007 from ResNet, and a METEOR score of 0.157. EfficientNet performed best of the three models compared, with a BLEU-1 score of 0.636 and a METEOR score of 0.16. Despite being about 4 times lighter than the others, EfficientNet achieved an increase of roughly 0.01 in BLEU-1. Table 1 and Figure 5 depict the detailed comparison between these models over different metrics on the Flickr8k dataset.

Performance on Flickr 30k
The models performed better on the Flickr30k dataset due to the availability of roughly 4 times more training samples. The model based on ResNet-50 performed well on the Flickr30k dataset, with a BLEU-1 score of 0.651 and a METEOR score of 0.172. The Inception model performed better than ResNet-50, with a BLEU-1 score of 0.657, a difference of 0.006 from ResNet, and a METEOR score of 0.179. EfficientNet performed best of the three models compared, with a BLEU-1 score of 0.665 and a METEOR score of 0.184. Despite being 75% lighter than the others, EfficientNet achieved an increase of roughly 0.01 in BLEU-1. Table 2 and Figure 6 depict the detailed comparison between these models over different metrics on the Flickr30k dataset.

Quantitative Results
We compare our methods with comparable methods proposed in the literature in Table 3 and Figure 7. Images with their generated captions, as predicted by the system, are shown in Figure

Conclusion
After evaluating different pre-trained models on the available benchmark datasets, we conclude that the latest pretrained model, EfficientNet, performed best for image captioning. Being 75% lighter than the other models, its training time was reduced to 12 seconds per epoch, and its BLEU score increased by up to 0.015 compared to the others. The METEOR score of the EfficientNet model surpassed the other models by up to 0.012, making it the best image feature extractor so far.