Incorporating Textual Similarity in Video Captioning Schemes

The problem of video captioning has been heavily investigated by the research community in recent years, especially since Recurrent Neural Networks (RNNs) were introduced. The aforementioned video captioning approaches are usually based on sequence-to-sequence models that aim to exploit the visual information by detecting events and objects, or by matching entities to words. However, the contextual information that can be extracted from the vocabulary has not been investigated yet, except for approaches that make use of parts of speech such as verbs, nouns, and adjectives. The proposed approach is based on the assumption that textually similar captions should represent similar visual content. Specifically, we propose a novel loss function that penalizes or rewards wrongly or correctly predicted words based on the semantic cluster they belong to. The proposed method is evaluated using two widely-known datasets in the video captioning domain, Microsoft Research - Video to Text (MSR-VTT) and Microsoft Research Video Description Corpus (MSVD). Finally, the experimental analysis shows that the proposed method outperforms the baseline approach in most cases.


I. INTRODUCTION
The number of videos captured on a daily basis and then uploaded to the internet has increased dramatically due to the wide usage of smartphone devices. These videos are usually uploaded without a description; video captioning approaches aim to generate sentences (captions) that describe the visual content of videos. Broadly, video captioning approaches comprise two separate components: a feature extractor, which typically extracts features from the whole video by sampling frames with a fixed step, and an encoder-decoder. The second component, inspired by Natural Language Processing (NLP) [1] networks, first encodes the visual content, in the form of features, and then maps it to words included in the vocabulary.
In [2], a sequence-to-sequence model that converts video input to text is proposed. Specifically, the authors use a feature extractor to extract the features of the videos and then feed them to an encoder-decoder module. The input features are encoded using a Long Short-Term Memory (LSTM) [3] network and then, in the decoding phase, mapped to specific words, also using an LSTM network. Additionally, the authors incorporate another modality of information, optical flow, and show that it can improve the accuracy of the predicted captions.
Based on the aforementioned scheme, a variety of methods have been proposed so far. Recently, methods that make use of bidirectional LSTMs and methods that solve the video captioning problem by incorporating a paragraph module have been proposed. Moreover, attention mechanisms have been widely explored in the video captioning domain, and more effective feature extractors have also been investigated. Furthermore, reinforcement-learning-based approaches and methods based on event detection have been introduced in [4], [5], [6], [7], [8], [9], [10]. From the above analysis, it can be deduced that the video captioning literature has in principle focused on visual information analysis, while the similarity of the respective video captions has not been investigated, leaving great potential for further performance improvement.
To address the above issue, a novel method that takes into account the textual similarity of the videos' captions in order to enhance the training process of captioning architectures is proposed. Based on the hypothesis that textually similar captions describe visually similar videos, a method that penalizes or rewards the predicted captions is introduced in this work. Specifically, the proposed method assigns the words of the vocabulary to specific clusters, and a new loss function is introduced in order to penalize or reward the videos that are predicted with a wrong or a correct caption, respectively. The main contributions of this paper are summarized below.
• The proposed method takes into account the textual similarity of the captions, extracted from the dictionary in the form of cluster vectors, in order to drive video captioning architectures to encode the visual content and decode it to text more effectively.
• The proposed method is modeled as an additional penalty/reward function, which makes it agnostic of the feature extractor and the dataset used. Therefore, it can be utilized in conjunction with any baseline architecture.
• The proposed method is evaluated on two video captioning datasets: MSR-VTT [11] and MSVD [12]. A detailed analysis shows that the proposed method significantly improves the results compared to the baseline approach.
The remainder of the paper is organized as follows: related work is discussed in Section II. In Section III, the proposed method is detailed, while experimental results are presented in Section IV. Finally, conclusions are drawn in Section V.

II. RELATED WORK
Venugopalan et al. [2] proposed a method for video captioning that learns to map a sequence of frames directly to a sequence of words. Specifically, an encoder-decoder LSTM architecture is proposed that not only takes the video features as input, but also incorporates the optical flow modality for generating more accurate captions. More specifically, an architecture comprising two stacked LSTMs is proposed: the first LSTM network encodes a sequence of frames into a hidden representation, while the second one decodes it into a sentence. Methods based on attention mechanisms have also been proposed. Gao et al. [6] proposed an architecture that incorporates an attention mechanism operating on salient features extracted from a Convolutional Neural Network (CNN) [13]. Additionally, they proposed a cross-view model in order to enforce consistency between the predicted sentences and the visual features. Pu et al. [8] proposed an attention-based architecture adaptable to different levels of CNN features.
Bin et al. [14] were the first to utilize bidirectional recurrent neural networks in order to explore the temporal structure of the video captioning problem. Additionally, Wang et al. [15] incorporated a bidirectional model in order to better capture temporal action proposals from the past, current, and future events of the videos. Moreover, they handled overlapping events in order to improve the predicted captions. Yao et al. [16] paid attention to the feature extraction part. Specifically, they incorporated a 3-D CNN followed by an encoder-decoder for capturing the local spatio-temporal information. Additionally, an attention mechanism is proposed and the whole framework is evaluated on the video description domain. Yu et al. [4] introduced a hierarchical structure in the decoder stage. Specifically, the method consists of two parts: a sentence generator and a paragraph generator. More specifically, the paragraph generator takes as input the embeddings of a sentence and generates the paragraph state via a recurrent layer. Finally, the output of the paragraph layer is used as the initial state of the sentence generator.
Recently, Shetty et al. [17] proposed a method that uses two different kinds of video features: one consisting of features and attributes of objects, and one capturing motion and action information. Additionally, the architecture is based on an encoder-decoder scheme, and the authors also proposed an evaluation model in order to pick the best caption from the pool of generated candidates. Similar to the aforementioned approach, Ma et al. [18] proposed a method, named SINet-Caption, that takes into account the interaction among groups of objects. Moreover, the authors explored the effectiveness of coarse-grained and fine-grained information from the key-frames using an attention mechanism.
Hierarchical structures have also been explored in the video captioning domain. Pan et al. [19] proposed a hierarchical recurrent neural encoder in order to exploit the temporal information of videos at the encoding stage. The proposed method is able to exploit the temporal structure of long videos more effectively; furthermore, actions that are part of a global action can also be exploited. Song et al. [20] considered that a caption contains visual and non-visual words, such as articles, and that the latter can be easily predicted using a natural language model that does not use visual features. Specifically, they proposed a hierarchical LSTM framework that can automatically select the frames that correspond to 'visual' words in order to generate words for video captioning. Finally, Baraldi et al. [21] proposed a method that detects discontinuities in the input video and enables the encoding layer to modify its temporal connectivity by resetting its internal state and memory. Reinforcement learning approaches have also been investigated for the video captioning problem. Phan et al. [22] proposed a reinforcement-based method that exploits, during training, the sentences obtained from the annotated captions. Wang et al. [9] have also proposed a reinforcement learning approach whose architecture consists of two parts: a high-level module, called the Manager, which learns to design sub-goals, and a low-level module, named the Worker, which learns to recognize the actions needed to achieve each sub-goal. Moreover, PickNet [23], proposed by Chen et al., addresses the video captioning problem with an encoder-decoder architecture that, based on reinforcement learning, tries to pick the most informative frames. However, to the best of our knowledge, the aforementioned methods do not exploit the frequency of each word in the vocabulary and do not take into account the word context within the vocabulary.

III. PROPOSED METHOD
In this section, the baseline architecture is previewed and, subsequently, the proposed method is outlined. Additionally, the pre-processing steps are described.

A. Pre-processing steps

In this section, the steps required to transform the data into a suitable form are presented. First of all, each word of the vocabulary is mapped to a word embedding using the word2vec algorithm proposed by Mikolov et al. [24]. Specifically, each embedding represents a word as a 300-dimensional real-valued vector. Due to the fact that video captioning datasets provide only small vocabularies, the usage of generic embeddings is necessary. Therefore, we make use of embeddings trained on the Google News dataset, which consists of 1 billion words, in order to obtain more comprehensive word embeddings. More specifically, each word of the dataset's vocabulary is mapped to one of the 692K generated embeddings. For words not included in the Google News vocabulary, we apply a string similarity measure, as presented in [25], in order to assign the most relevant embedding.
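The out-of-vocabulary fallback described above can be sketched as follows. This is an illustrative sketch only: Python's difflib ratio stands in for the string similarity measure of [25], the function name `lookup_embedding` and the toy embedding dictionary are our own, and real embeddings would be the 300-dimensional word2vec vectors.

```python
import difflib

def lookup_embedding(word, embeddings):
    """Return the embedding for `word`; when `word` is out of vocabulary,
    fall back to the most string-similar in-vocabulary word."""
    if word in embeddings:
        return embeddings[word]
    # Illustrative string-similarity fallback (the paper uses the measure of [25]).
    best = max(embeddings,
               key=lambda w: difflib.SequenceMatcher(None, word, w).ratio())
    return embeddings[best]

# Toy 2-dimensional "embeddings" for demonstration.
emb = {"running": [0.1, 0.2], "jumping": [0.3, 0.4]}
print(lookup_embedding("runing", emb))  # nearest match is "running" -> [0.1, 0.2]
```

In practice the fallback only matters for the few dataset words absent from the pre-trained vocabulary; all other words are looked up directly.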
As mentioned above, the main goal of the pre-processing steps is to map each word of the vocabulary to a specific cluster. To this end, the k-means clustering algorithm proposed by Hartigan et al. [26] is adopted. Specifically, k-means is run for 25 iterations, taking into account the cosine similarity between the clusters' centroids and the word embeddings.
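A cosine-based k-means of this kind can be sketched as below. This is a minimal illustration under the assumption that assignments maximise cosine similarity over L2-normalised vectors; the fixed iteration count mirrors the 25 steps mentioned above, while the function name and toy data are our own.

```python
import numpy as np

def spherical_kmeans(X, k, iters=25, seed=0):
    """k-means over L2-normalised vectors: for unit vectors, maximising
    cosine similarity is equivalent to minimising Euclidean distance."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        sims = X @ centroids.T                  # cosine similarity to each centroid
        labels = sims.argmax(axis=1)            # assign to most similar centroid
        for j in range(k):
            members = X[labels == j]
            if len(members):                    # skip empty clusters
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)
    return labels, centroids

# Toy example: two well-separated directions in the plane.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels, _ = spherical_kmeans(X, k=2)
print(labels)  # the first two and the last two vectors share a cluster
```

With 300-dimensional word2vec embeddings as rows of `X`, the returned `labels` array provides the word-to-cluster assignment used in the remainder of the method.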

B. Proposed approach
As is common in the video captioning domain, a baseline architecture comprising a feature extractor, an encoder, and a decoder module has been selected. In order to simplify the implementation of the proposed approach, the method proposed by Venugopalan et al. [2], named Sequence to Sequence - Video to Text (S2VT), has been selected as the baseline. As mentioned, the proposed method can be applied to any video captioning architecture in the form of an extra penalty/reward function. Due to the fact that each dataset comprises a different number of words, the pre-processing steps should be performed on each dataset separately.
The proposed approach takes into account the words' context, encoded in word2vec embeddings, and their frequency of appearance. Each word of the vocabulary is mapped to a specific cluster. This information, in the form of cluster vectors that contain the frequency of appearance of each word, is used as the criterion of the introduced loss function. Equation (1) describes the function g that is used in order to decide whether word x belongs to cluster j. The formulation of the predicted cluster vector is presented in (2), where j denotes the number of clusters and i denotes the maximum length of the generated caption. Equation (3), similarly to the previous ones, denotes the cluster vector generated from the ground truth caption.
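Since Equations (1)-(3) are not reproduced here, the following sketch shows one plausible count-based realisation of a cluster vector: a k-dimensional vector whose j-th entry counts how many of the caption's words belong to cluster j. The names `cluster_vector` and `word2cluster` and the toy assignment are our own.

```python
def cluster_vector(caption, word2cluster, k):
    """Map a caption to a k-dimensional vector whose j-th entry counts
    the caption's words assigned to cluster j (cf. Eqs. (1)-(3))."""
    v = [0] * k
    for word in caption.split():
        j = word2cluster.get(word)
        if j is not None:       # ignore words absent from the vocabulary
            v[j] += 1
    return v

# Toy word-to-cluster assignment for demonstration.
word2cluster = {"a": 0, "man": 1, "is": 0, "running": 2}
print(cluster_vector("a man is running", word2cluster, k=3))  # -> [2, 1, 1]
```

Both the predicted caption and the ground truth caption are summarised this way, so that the loss below can compare the two vectors instead of individual words.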
Equation (4) denotes the Euclidean distance between the predicted and the ground truth cluster vectors, (2) and (3) respectively, while λ controls the effect of the penalty/reward in the proposed loss function. Specifically, (4) calculates the global distance of the predicted caption from the ground truth caption, using as criterion the distance between the two cluster vectors. It should be noted that this value balances the penalty/reward functionality: if the value is < 1, the loss value of the predicted caption is decreased (reward); if the value is > 1, the loss value is increased (penalty); and if the value is equal to 1, there is no penalty/reward, so only the cross-entropy loss is applied. The introduced loss function is applied in combination with the cross-entropy loss by simple multiplication. In Fig. 1 the proposed architecture is presented, depicting the basic processing steps for two videos. The main modules (feature extractor, encoder, decoder) are shown on the left; the processing of the vocabulary, the generated clusters, and the cluster vectors are shown on the right; and the Euclidean distance between the ground truth and predicted cluster vectors, together with the cross-entropy loss, is placed in the center of the figure.
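The multiplicative penalty/reward can be sketched as below. Note the hedge: Equation (4) itself is not reproduced in this text, so the exact mapping from the Euclidean distance and λ to the multiplicative factor is an assumption chosen only to reproduce the described behaviour (factor < 1 rewards, > 1 penalizes, = 1 is neutral); the function name `penalized_loss` is our own.

```python
import math

def penalized_loss(ce_loss, pred_vec, gt_vec, lam):
    """Scale the cross-entropy loss by a penalty/reward factor derived from
    the Euclidean distance between the predicted and ground truth cluster
    vectors. The distance-to-factor mapping below is an assumption:
    a perfect match (distance 0) yields factor 1 - lam < 1 (reward),
    while large distances yield factors > 1 (penalty)."""
    dist = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_vec, gt_vec)))
    factor = 1.0 + lam * (dist - 1.0)
    return ce_loss * factor

# Matching cluster vectors reward the caption: 2.0 * (1 - 0.3) = 1.4
print(penalized_loss(2.0, [1, 1, 0], [1, 1, 0], lam=0.3))  # -> 1.4
```

In training, this factor would multiply the per-caption cross-entropy loss, leaving the rest of the S2VT pipeline unchanged.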

IV. EXPERIMENTAL RESULTS

A. Employed Datasets
In order to evaluate the performance of the proposed approach, two widely-used video captioning datasets are employed: MSR-VTT [11] and MSVD [12]. The MSR-VTT dataset contains 10000 video clips from 20 categories. Additionally, each clip has been manually annotated with a set of 20 captions. Furthermore, the split settings proposed in [11] are adopted: 6513 videos comprise the training set, 497 videos the validation set, and the remaining 2990 the test set. The vocabulary of the MSR-VTT dataset consists of 16860 unique words.
The second dataset used during the evaluation is MSVD. The collection comprises 1970 videos from YouTube, while the sentence annotations are provided by the dataset's owners. Additionally, the annotation process was carried out by multilingual workers, and the videos have been annotated in more than 20 languages. In this work, only the videos annotated in the English language were used, amounting to 1517. Each video is described by an average of 22 captions, and their durations range between 10 and 25 seconds. Due to the fact that some videos are no longer available for download, the total number of videos actually used equals 931. However, we follow the splits (using the same percentages) proposed by Venugopalan et al. [27]: the training set consists of 60% of the videos, the validation set of 5%, and the test set of 35%. Finally, 40 frames were sampled from each video, while the vocabulary consists of 5821 unique words.

B. Implementation details
For training both the baseline and the proposed approach, the number of epochs was set to 3000. Adam [28] was selected as the optimizer, with an initial learning rate of 10^-4 and a weight decay of 10^-5. The learning rate was scheduled to decay by a factor of 0.8 every 200 epochs. Moreover, the batch size was set to 512. We compared the models that achieved the minimum loss value on the validation set. Finally, all implementations were carried out using the PyTorch [29] library on an Nvidia GTX 1070 GPU with 8 GB of memory.
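The stated learning-rate schedule (initial rate 10^-4 decayed by a factor of 0.8 every 200 epochs) corresponds to the following arithmetic, sketched here in plain Python rather than through PyTorch's scheduler API; the function name is our own.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.8, step=200):
    """Step-decay schedule: the initial rate is multiplied by
    gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

print(learning_rate(0))    # initial rate, 1e-4
print(learning_rate(400))  # after two decays: 1e-4 * 0.8**2
```

In a PyTorch training loop, the same schedule is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.8)`.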

C. Evaluation metrics
For the evaluation, metrics commonly used in the video captioning domain were employed. Specifically, both the baseline and the proposed models were evaluated using the METEOR [30], BLEU@1-4 [31], ROUGE-L [32] and CIDEr-D [33] metrics.

D. Evaluation results
In order to transform the visual input into feature vectors, the features were extracted using two different networks as feature extractors: Inception-v4 [34] and ResNet-152 [35]. The dimensions of the extracted features are 40 × 2048 when ResNet-152 is used and 40 × 1536 when Inception-v4 is used. Both networks were pre-trained on the ImageNet [13] dataset, and 40 frames were sampled from each video. Since each video has a duration of 10 to 30 seconds, at least one frame per second of video is taken into account when the MSR-VTT dataset is processed. The impact of the number of clusters has also been investigated; specifically, the experiments were conducted with 10, 20 and 100 clusters. Furthermore, the value of λ was selected experimentally: the experiments were carried out using λ values equal to 1.0, 0.7 and 0.3.
In Table I the results of the different experimental settings using Inception-v4 as feature extractor are presented. From a detailed examination of the provided results, it can be seen that the proposed method performs better than the baseline approach. Specifically, the proposed method achieves improvements of 44%, 10%, 12% and 30% when evaluated using BLEU@4, METEOR, ROUGE-L and CIDEr-D, respectively. More specifically, the proposed method exhibits better results when the number of clusters is low, 10 or 20. We attribute this to the fact that the introduced loss function acts as a global penalty/reward in combination with the existing cross-entropy loss; a small number of clusters thus allows the proposed model to minimize the introduced loss value, which explicitly describes a global assignment of words to clusters. Additionally, the optimal value of the factor λ can be observed in Table I: with λ set to 0.3, the proposed method exhibits significantly better results. This performance improvement is expected for a small value of λ. As mentioned, the introduced loss function describes a global sentence loss, and therefore a higher value leads the model to learn more abstract sentences, which is penalized during evaluation.
In Table II, the evaluation results using ResNet-152 as feature extractor are depicted. The experiments were carried out using the same configuration as the Inception-v4 experiments. As can be seen, the proposed method outperforms the baseline approach. Specifically, when the factor λ is equal to 0.3 and the number of clusters is low, 10 or 20, the proposed approach exhibits a significant improvement over the baseline. More specifically, the proposed method improves the results of BLEU@4, METEOR, ROUGE-L and CIDEr-D by 37%, 11%, 10% and 30%, respectively.
The results of Tables I and II show that the proposed method is agnostic to the feature extractor used. Additionally, a low value of the factor λ contributing to the overall cross-entropy loss is more efficient. Moreover, the number of clusters to which the vocabulary is assigned must be small. A detailed analysis of the experiments shows that the length of the generated cluster vectors should be no greater than the maximum caption length, which in our experiments equals 28. This is because a large number of clusters generates sparse cluster vectors, which increase the introduced global loss and make the minimization problem more difficult; consequently, the model generates more abstract sentences. Furthermore, in Fig. 2 indicative results of the proposed method compared to the baseline approach are presented, together with the most relevant of the 20 ground truth captions. Specifically, in the first row, both the baseline and the proposed approach produce a satisfying caption prediction, while the second and third rows represent promising and unsatisfying results, respectively. In order to verify the robustness of the proposed method, experiments have been carried out on an additional dataset, using the best configuration settings obtained on the MSR-VTT dataset: the factor λ is set to 0.3 and the number of clusters to 10, 20 and 100. In Table III the performance of the proposed method on the MSVD dataset is depicted. As can be observed, the proposed method outperforms the baseline approach in all cases, with the most significant improvement obtained when the number of clusters is 20. More specifically, the proposed method increases the performance by 17%, 4%, 5% and 14% in terms of BLEU@4, METEOR, ROUGE-L and CIDEr-D, respectively.
V. CONCLUSION
In this work, a novel loss function is proposed in order to improve the performance of video captioning techniques using textual information. In particular, a supervision mechanism was proposed for guiding the video captioning learning process by taking into account the similarity of the video captions in correspondence with the visual content. More specifically, the proposed method makes use of the textual information in the form of cluster vectors so as to impose a kind of global sentence similarity. It is shown that the proposed approach is agnostic of the feature extractor that may be used. Furthermore, the introduced loss function not only penalizes captions that are misclassified with respect to the predefined clusters, but also rewards captions that are predicted correctly.
The experimental results also demonstrate that the optimal number of clusters depends on the size of the dataset's vocabulary. Future work will include the investigation of end-to-end architectures that could generate clusters while the video captioning problem is being solved, and vice versa.