Efficient Modeling of Future Context for Image Captioning

Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context, i.e., the given image and the previously generated words. Many studies have aimed to exploit global information during decoding, e.g., via iterative refinement. However, how to effectively and efficiently incorporate future context remains under-explored. To address this issue, inspired by the observation that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; then the AIC model is encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption.

Figure 1: Overview of conventional image captioning, refinement-based image captioning, and our future context modeling with causal dynamics calibration from a non-autoregressive decoder. Note that the non-autoregressive decoder is not involved at the inference stage, maintaining computation efficiency.

INTRODUCTION
Image captioning, which aims to describe image content with natural language, has seen rapid development in the past several years [6]. In a conventional image captioning system, a visual encoder first transforms the given image into a sequence of intermediate hidden representations, based on which a language decoder generates the sentence word by word. Such an encoder-decoder paradigm is usually implemented with a CNN-LSTM [53,58] or Transformer [51] network architecture and optimized with teacher-forcing objectives [3,9,22,24,40,63]. Despite its success, the autoregressive, left-to-right structure gives the model access only to local context, i.e., the previously generated words and the given image, at each decoding step. This unidirectional property prevents the model from exploiting global context effectively, yielding unsatisfactory descriptions [54,55,67].
To address this issue, many researchers have attempted to exploit global information during sentence generation. Typically, refinement-based methods are introduced [29, 47-49, 56, 57, 59, 62], which usually consist of two networks: the first is a primary generator or an image-text retrieval module that generates or retrieves a coarse related template; a refiner then follows in series to produce the final caption by attending to the previously produced sentence. Such iterative refinement helps the model look at both past and future semantic context and thus improves decoding at every time step. However, most of these works rely on multi-pass decoding or specially customized decoding algorithms, which leads to a significant increase in training and inference costs. On the other hand, modeling global context in the reverse direction by pairing the conventional left-to-right captioning model with a right-to-left auxiliary model has also been explored [11,48,50,55]. However, in these methods the reverse context is still conditioned on local context via a separate network, so they cannot sufficiently encourage the captioning model to exploit a truly flexible global context.
In pursuit of effectively and efficiently incorporating global information into image captioning models, we conduct well-designed pilot experiments and find some interesting phenomena: i) even conditioned on absolutely correct context, i.e., the historical words and the given image, there is a certain proportion of ground-truth words that the captioning model predicts with relatively low probability; ii) the probability assigned to ground-truth words varies with the caption length. In contrast, under factorized probability modeling, a good image captioning model should assign the highest probability to the correct word given accurate historical information. Consistent with [19,66,67], we believe the reasonable cause of this phenomenon is that the captioning model cannot confidently predict these words from local context alone. Therefore, the model should be improved on these unconfident words with sufficient distribution calibration.
In this paper, we introduce efficient modeling of future context information for image captioning, referred to as FutureCap. In general, the architecture of the original autoregressive image captioning (AIC) model is kept untouched and jointly optimized with an additional mask-based non-autoregressive image captioning (NAIC) model [17,18,20], which essentially performs cross-modal understanding and contains global context. As shown in Figure 1, the AIC and NAIC models are first trained jointly in a multi-task manner with a shared visual encoder. The visual encoder is additionally supervised by the signal from the NAIC decoder so that it encodes sufficient future context information. Then, we employ causal dynamics calibration, which pushes the student AIC model to faithfully learn the causal effect of the teacher NAIC model's representations on its unconfident outputs via cross-layer interchange alignment. This further helps the AIC model leverage knowledge about information dynamics. Experimentally, we evaluate our approach on the MS COCO dataset. According to both automatic metrics and human evaluations, captioning models equipped with future context modeling evidently outperform the baselines. The major contributions of our paper are as follows:
• We focus on the efficient modeling of future information for better image caption decoding and clearly analyze the necessity of global context with pilot experiments.
• We introduce causal dynamics calibration, which encourages the student AIC model to learn interchange alignment from the teacher NAIC model on unconfident words and adjusts knowledge routing with a shared visual encoder, to more effectively exploit future contextual information.
• Experiments on the MS COCO dataset demonstrate that image captioning models equipped with our future context modeling framework significantly outperform those without it. More encouragingly, whereas most previous literature improves performance by increasing model capacity, our approach represents a new optimization paradigm that incurs no additional inference cost.

BACKGROUND AND PILOT ANALYSIS
To investigate the potential impact of future context in image captioning, we first describe the basic architectures of conventional autoregressive and non-autoregressive image captioning models, both of which follow the Transformer-based encoder-decoder paradigm.
After that, we conduct pilot experiments as well as empirical analyses on the effects of context information for caption decoding.

Model Architecture
Generally, the AIC and NAIC models share the same visual encoder architecture while differing in their decoders, specifically in the mask matrices of their self-attention mechanisms and their prediction manners.
Visual Encoder. The visual encoder aims to learn high-level visual representations of the given image and consists of $N$ identical network layers. Each layer contains two sub-layers: a self-attention sub-layer and a position-wise feed-forward network (FFN) sub-layer. The input of each layer is the hidden states of the previous layer, on which multi-head scaled dot-product attention is performed. Letting $h_v^l$ denote the hidden states of the $l$-th encoder layer, the visual encoder layer can be computed as:
$$h_v^l = \mathrm{FFN}(\mathrm{MHA}(h_v^{l-1}, h_v^{l-1}, h_v^{l-1})),$$
where layer normalization with a residual connection is added after both sub-layers. Note that $h_v^0$ is initialized as the patch embedding of the extracted image region features, and the hidden states of the $N$-th layer $h_v^N$ serve as input to the language decoder.
Language Decoder. The decoders of the AIC and NAIC models are introduced separately. The autoregressive decoder usually consists of three sub-layers: a masked self-attention sub-layer, a cross-attention sub-layer, and an FFN sub-layer. In particular, to maintain the autoregressive property at each time step, the masked self-attention sub-layer performs self-attention with a causal attention mask that prevents the decoder from seeing subsequent words. Writing $h_t^l$ for the hidden states of the $l$-th decoder layer at step $t$, the autoregressive decoder can be formulated as:
$$h_t^l = \mathrm{FFN}(\mathrm{MHA}(\mathrm{MHA}(h_t^{l-1}, h_{\le t}^{l-1}, h_{\le t}^{l-1}), h_v^N, h_v^N)),$$
where layer normalization with a residual connection is again added after each sub-layer. Finally, given the image $I$, the generated prefix $y_{<t}$, and the learned top-layer hidden states $h_t^N$, the decoder models the probability distribution as:
$$p(y_t \mid y_{<t}, I) = \mathrm{softmax}(W h_t^N),$$
where $W$ denotes a learnable parameter.
The non-autoregressive decoder aims to predict a set of masked target words $Y_m$ given an image $I$ and a set of observed target words $Y_o$. The NAIC decoder contains the same $N$ identical layers, each of which also includes a self-attention sub-layer, a cross-attention sub-layer, and a feed-forward sub-layer. Unlike the masked self-attention sub-layer of the AIC decoder, the causal attention mask is removed in the NAIC decoder. Finally, with the learned top-layer hidden states $h$ of the NAIC decoder and the partially observed sentence $Y_o$, the predicted probability distribution for every masked word $y \in Y_m$ can be calculated as:
$$q(y \mid Y_o, I) = \mathrm{softmax}(W h),$$
where $W$ is a learnable parameter. Note that since the NAIC decoder takes $Y_o$ rather than $y_{<t}$ as input, which includes both history and future words with respect to every masked target word, it embodies global contextual information.
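The only structural difference between the two decoders is the self-attention mask. A minimal sketch (illustrative, not the authors' implementation) of the two masks:

```python
def attention_mask(seq_len, autoregressive):
    """Build a decoder self-attention visibility mask.

    AIC uses a causal (lower-triangular) mask so position i attends only
    to positions j <= i; NAIC removes the mask so every position can see
    both past and future words. 1 = visible, 0 = blocked.
    """
    if autoregressive:
        return [[1 if j <= i else 0 for j in range(seq_len)]
                for i in range(seq_len)]
    return [[1] * seq_len for _ in range(seq_len)]

aic_mask = attention_mask(4, autoregressive=True)    # causal
naic_mask = attention_mask(4, autoregressive=False)  # fully visible
```

With the causal mask, position 0 cannot attend to position 3 (`aic_mask[0][3] == 0`), whereas the NAIC mask exposes the full sentence, which is what lets the masked words condition on future context.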

Are History Contexts Enough for Prediction?
A high-quality image captioning model is expected to assign the highest probabilities to the ground-truth words given the correct historical context. In this section, we conduct careful experiments to explore the disadvantages of conventional Transformer image captioning under the teacher-forcing training framework.
Experimental Setting. We adopt the basic configuration of the Transformer-based image captioning model without the mesh-memory module, which is publicly available on GitHub 1. Specifically, the model comprises 6 standard Transformer layers for both the visual encoder and the language decoder. Moreover, regional image features extracted by Faster R-CNN [44] with a ResNet [21] backbone are used to retrain the captioning model with the current configuration [51]. For training, the model is first trained with cross-entropy loss and then fine-tuned with a sentence-level self-critical reward [46], following the default settings with the Adam [30] optimizer on the Karpathy training split of the MS COCO [6] dataset.
After obtaining a fully-trained image captioning model, we record the predicted probability of each ground-truth word given the correct context, i.e., the image and the previous sub-sentence, on the MS COCO training set. To characterize the results, we plot the proportion of words falling into each predicted-probability chunk in Figure 2. Meanwhile, we plot the average predicted probability of ground-truth words at different caption positions in Figure 3, where caption lengths are normalized to eliminate the influence of absolute sentence length.
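The bucketing behind Figure 2 can be sketched as follows; this is a small illustrative helper, assuming the per-word teacher-forced probabilities have already been collected:

```python
from collections import Counter

def probability_histogram(gt_probs, num_bins=10):
    """Bucket teacher-forced ground-truth word probabilities into
    equal-width chunks (0.0-0.1, 0.1-0.2, ...) and return the
    proportion of words falling into each chunk."""
    counts = Counter(min(int(p * num_bins), num_bins - 1) for p in gt_probs)
    total = len(gt_probs)
    return [counts[b] / total for b in range(num_bins)]
```

For example, `probability_histogram([0.05, 0.15, 0.95, 0.99])` places half of the words in the top chunk and a quarter each in the two lowest chunks, mirroring how the paper reports 25.67% of words landing in the 0.0-0.1 chunk.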
Results Discussion. According to the estimation results in Figure 2, even when provided with a completely correct context, there is a considerable portion of ground-truth words that the conventional captioning model predicts with relatively low probability. For instance, the model predicts 25.67% of ground-truth words with probabilities between 0.0 and 0.1. A reasonable cause is that the captioning model cannot confidently predict these ground-truth words from the local context of the image and history words alone.
To further examine where the low-confidence words are located, we calculate the average predicted probability at various relative positions in the caption. We observe that as the relative position increases, i.e., generation moves from left to right, the average probability of ground-truth words gradually increases. We attribute this to the fact that as the generated prefix grows, the determined context increases and the future context shrinks, strengthening the model's confidence. Given these results, it is natural to improve the captioning model on these correct yet unconfident words with effective future information to assist the current decision.

Shared Visual Encoder Supervision
To encourage the visual encoder to contain sufficient global information, we first train the AIC and NAIC models with a shared visual encoder in a multi-task manner, optimizing the combined training objective:
$$\mathcal{L} = \mathcal{L}_{AIC}(\theta_{enc}, \theta_{aic}) + \lambda \, \mathcal{L}_{NAIC}(\theta_{enc}, \theta_{naic}),$$
where $\theta_{enc}$, $\theta_{aic}$, and $\theta_{naic}$ denote the parameters of the shared visual encoder, the autoregressive language decoder, and the mask-based non-autoregressive decoder, respectively, and $\lambda$ is a balancing factor between the two losses. As the visual encoder is additionally supervised by the signal from the mask-based NAIC decoder, the AIC model is able to disentangle future information from the extracted visual representation. Specifically, the AIC model is first optimized with the time-wise cross-entropy loss:
$$\mathcal{L}_{XE} = -\sum_{t=1}^{T} \log p(y_t^* \mid y_{<t}^*, I),$$
and then fine-tuned with the CIDEr score reward $r$ and mean baseline $b$. The gradient expression for SCST [46] training is:
$$\nabla_{\theta} \mathcal{L}_{SCST} = -\,(r(\hat{y}) - b)\, \nabla_{\theta} \log p(\hat{y} \mid I).$$
For the NAIC decoder, we adopt the strategy in [18]. Concretely, we randomly select $m$ words and replace each selected word with a special symbol [MASK], splitting the sentence $y$ into an observed set $Y_o$ and a masked set $Y_m$. We then minimize the following training objective over the masked words:
$$\mathcal{L}_{NAIC} = -\sum_{y \in Y_m} \log q(y \mid Y_o, I).$$
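The random masking step that produces the NAIC training input can be sketched as follows; a minimal illustration, assuming captions are pre-tokenized into word lists:

```python
import random

MASK = "[MASK]"

def mask_sentence(words, num_mask, rng=random):
    """Split a caption into the observed input Y_o (with [MASK] holes)
    and the masked target set Y_m, as in mask-based NAIC training.

    Returns (observed, masked) where `masked` maps position -> word,
    so Y_o and Y_m together reconstruct the original sentence."""
    idx = set(rng.sample(range(len(words)), num_mask))
    observed = [MASK if i in idx else w for i, w in enumerate(words)]
    masked = {i: words[i] for i in sorted(idx)}
    return observed, masked
```

The NAIC decoder is then trained to predict each word in `masked` given `observed` and the image, which is exactly what forces the shared encoder to carry future context.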

Causal Dynamics Calibration
We then use the NAIC model as a teacher to transfer knowledge to the student AIC model for its decisions on unconfident words, i.e., to help the AIC model capture and consider more global information from the visual representation and the generated words. The parameters of the teacher NAIC model are frozen in this stage. Figure 4 depicts the training procedure of this stage with an easy-to-understand example. Formally, given the image $I$ and the ground-truth prefix $y_{<t}$ at each time step $t$, we first ask the AIC model to make a prediction for every word using Equation 6, generating the prior word-level probability distributions $\{p_1, p_2, \ldots, p_{|y|}\}$. The teacher's activations are aligned with the student's through a mapping function $f(\cdot)$ over sampled neurons, excluding the future neurons in the teacher NAIC model. Similar to conventional knowledge distillation [23], we also constrain the output distribution on unconfident words with a KL-divergence term:
$$\mathcal{L}_{KD} = \sum_{y \in Y_m} \mathrm{KL}(q_y \,\|\, p_y).$$
The final training objective for the student AIC model combines the three terms reviewed above:
$$\mathcal{L} = \mathcal{L}_{align} + \mathcal{L}_{KD} + \mathcal{L}_{XE},$$
where the last term keeps the AIC model stable on high-confidence ground-truth words. By doing so, we can fully strengthen the ability of the AIC model to leverage the global context contained in the NAIC model. On the other hand, to avoid making the student rely too heavily on the teacher's decisions, we employ a teacher annealing strategy that linearly decreases the knowledge distillation weight in favor of ground-truth supervision with the sentence-level reward [46] throughout training. Note that the NAIC model is not involved at the inference stage, preserving efficiency.
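The teacher annealing schedule can be sketched as below. This is an illustrative sketch only: the paper states the annealing is linear, but the exact weighting between the distillation term and the reward-based supervision is an assumption here.

```python
def annealed_objective(l_align, l_kd, l_reward, step, total_steps):
    """Combine the calibration terms under teacher annealing.

    alpha starts at 1 (full distillation from the teacher NAIC model)
    and decays linearly to 0, handing supervision over to the
    sentence-level reward term (assumed linear trade-off)."""
    alpha = max(0.0, 1.0 - step / total_steps)
    return l_align + alpha * l_kd + (1.0 - alpha) * l_reward
```

At step 0 the objective is alignment plus pure distillation; by the final step the distillation term has been fully replaced by ground-truth reward supervision.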

EXPERIMENTS

Experimental Preparation
Dataset. We evaluate the proposed method on MS COCO [6], a standard benchmark for image captioning. Consistent with previous work [9,24], we adopt the Karpathy split [28], which contains 113,287 training images and 5,000 images each for the validation and test splits. Each image is paired with 5 different captions. We omit words that occur fewer than 5 times, yielding a vocabulary of 10,369 words. Image features are extracted with CLIP [3] as 512-dimensional vectors.
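The vocabulary filtering step is standard; a minimal sketch (the `<unk>` fallback is an assumed convention, not stated in the paper):

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep words occurring at least `min_count` times across the
    training captions; rarer words would typically fall back to an
    <unk> token (assumed handling)."""
    freq = Counter(w for caption in captions for w in caption.split())
    return {w for w, n in freq.items() if n >= min_count}
```

Applied to the MS COCO training captions with `min_count=5`, this kind of filtering is what produces the 10,369-word vocabulary reported above.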
Implementation Details. Our implementation is based on PyTorch and the repository of [9], building the model under the Transformer-base configuration with the memory module, where the AIC and NAIC models hold identical architectures. Concretely, both networks comprise 6 visual encoder and 6 language decoder layers, each with a hidden size of 512, FFN sub-layers of 2,048 dimensions, and 8 heads in multi-head attention. We set the dropout rate to 0.1. Neurons are sampled from the top decoder layer with a uniform distribution. For parameter updating, we employ the Adam optimizer [30] with its default settings. For the learning rate schedule, we adopt the same strategy as [9,51] and set the warm-up steps to 4,000. In the first stage, we train both models with a shared encoder for 300k steps. In the second stage, we separate the encoders and fix the parameters of the NAIC model; the AIC model is then optimized alone alongside the fixed NAIC model for an additional 200k steps.
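The warm-up schedule referenced from [51] is the standard inverse-square-root Transformer schedule; a sketch with the paper's settings (hidden size 512, 4,000 warm-up steps):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Transformer learning-rate schedule: linear warm-up for `warmup`
    steps, then inverse-square-root decay.

    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)"""
    step = max(step, 1)  # avoid step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate rises linearly until step 4,000 and decays as `1/sqrt(step)` afterwards, peaking exactly at the warm-up boundary.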

Comparison with State-of-the-Art Models
Performance on MS COCO. We compare the results of our FutureCap model with those of several recent image captioning models trained without large-scale vision-and-language pre-training on the offline MS COCO dataset. The evaluation results are listed in Table 1. First, we can see that FutureCap surpasses the original
What's more, as most of the previous literature has boosted caption quality by increasing model capacity, which burdens application devices, our approach is an outlier in this trend and demonstrates that state-of-the-art CIDEr levels can be obtained even with a very lightweight, efficient model.
Online evaluation. We also report the performance of our method on the online MS COCO test server, for which ground-truth annotations are not publicly available. In this case, we employ an ensemble of four models trained with the same configuration of NAIC decoder assistance. Comparison results with the top-performing approaches on the leaderboard are reported in Table 2. As can be seen, our method surpasses the current state-of-the-art model on all metrics, achieving an advancement of 1.3 CIDEr points over the best performer.

Model Analysis
Ablation Study. To better understand the influence of each design in our FutureCap model, we conduct ablation studies on the offline MS COCO dataset. Table 3 reports the evaluation results on the test set. We first validate the necessity of shared visual encoder supervision by training the AIC and NAIC models with separate visual encoders, denoted as "w/o. VES". The final performance decreases by 1.1 CIDEr. This shows that the visual encoder of the AIC model benefits substantially from the global supervision signal provided by joint training with the NAIC decoder. As for "w/o. CDC", which means not performing causal dynamics calibration on any target words at the fine-tuning stage, performance also decreases, e.g., by 0.4 BLEU-4 and 1.5 CIDEr. Moreover, to illustrate the superiority of CDC, we replace it with conventional knowledge distillation, i.e., remove the interchange alignment term in Equation 14; the results show poor performance. These findings demonstrate the effect of each part of our design, as well as of incorporating specific future context into the captioning model on its unconfident words.
Effect of Hyper-parameters λ and β. There are important hyper-parameters in the FutureCap framework that we tune on the validation set to achieve good performance: the balancing factor λ in Equation 8 and the confidence threshold β for determining the masked word set $Y_m$. To balance the training of the AIC and NAIC models at the pre-training stage, we select the λ that brings steady improvements to the AIC model. Specifically, we vary λ from 0.5 to 1.0 in increments of 0.1 and evaluate performance on the validation set; the results are shown in Figure 5. The final captioning model peaks at λ = 0.7, so λ is set to 0.7 by default. Given the selected λ, at the fine-tuning calibration stage we also analyze the impact of β on the validation set. Practically, we vary β from 0.0 to 0.3 in intervals of 0.05. As shown in Figure 6, the AIC model performs best when β = 0.2. Therefore, we set β = 0.2 as the confidence threshold for the causal dynamics calibration stage.
Effect of Mask Selection Strategy. In our future context modeling framework, for each generated caption pair we apply causal dynamics calibration to transfer knowledge from the NAIC model to the AIC model only on the masked word set $Y_m$. The set is determined by masking words whose AIC-predicted probabilities for the corresponding ground truths fall below a preset threshold β. It is natural to ask whether other masked-word selection patterns exist and how they perform. We therefore investigate the following four variants:
• Random: for the given sentence, randomly select $m$ words to mask and feed to the teacher NAIC model for causal dynamics calibration.
• Highest: as a contrast, mask the words whose AIC-predicted probabilities for the ground-truth words are higher than the preset threshold β.
• Wrong: since the ground-truth labels are given, mask the words where the AIC model's highest-probability predictions differ from the corresponding labels.
• OnlyOne: to illustrate the necessity of selectively distilling knowledge on a portion rather than all of the target words, we generate NAIC-predicted probability distributions for all target words; as an extreme case, we iteratively mask exactly one word at a time, feeding the image and the residual sentence to the NAIC model.
The evaluation results for the different masked-word selection strategies are presented in Table 4. We observe that: 1) both the "Random" and "Highest" variants are inferior to our threshold-based causal dynamics calibration. In particular, the results of "Highest" indicate that calibrating on confident words is less effective, causing a decrease of 0.6 CIDEr; meanwhile, heuristic selection of masked words is necessary rather than random choice. 2) The result of "Wrong" is lower than our approach, possibly due to the
distribution difference between low-confidence and incorrectly predicted words. 3) "OnlyOne" represents one approach to generating NAIC-predicted probability distributions for all target words iteratively. It is also reasonable that "OnlyOne" performs worse, since some words are easy to generate from local context and over-calibration has side effects. All these results demonstrate that it is crucial for the AIC model to exploit the global context on its unconfident words. At the same time, a more advanced, learnable selection strategy may yield better captioning performance.
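The selection strategies above reduce to simple index filters over the per-word ground-truth probabilities; a sketch covering the variants that depend only on probabilities ("Wrong" and "OnlyOne" additionally need model predictions and are omitted here):

```python
import random

def select_masked_indices(gt_probs, strategy, threshold=0.2, rng=random):
    """Return positions to mask under the masked-word selection
    strategies compared in the ablation (illustrative sketch)."""
    n = len(gt_probs)
    if strategy == "threshold":  # default: low-confidence words
        return [i for i, p in enumerate(gt_probs) if p <= threshold]
    if strategy == "highest":    # contrast: high-confidence words
        return [i for i, p in enumerate(gt_probs) if p > threshold]
    if strategy == "random":     # match the default's mask count
        k = sum(p <= threshold for p in gt_probs)
        return sorted(rng.sample(range(n), k))
    raise ValueError(f"unknown strategy: {strategy}")
```

For probabilities `[0.1, 0.9, 0.15, 0.8]` with β = 0.2, "threshold" masks positions 0 and 2 while "highest" masks positions 1 and 3, which is exactly the contrast the ablation measures.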
Influence on Model Confidence Distribution. As the prior experiments show that one drawback of conventional image captioning lies in low confidence on correct words, we also investigate how the model's confidence in ground-truth words changes on the MS COCO training set with future context modeling. Table 5 lists the percentage of words within each interval of AIC-predicted probability. Since a probability higher than 0.5 is necessarily the maximum across the vocabulary, we treat 0.5-1.0 as a high-confidence interval and subdivide the rest into low-confidence intervals. According to the results, the number of words in the low-confidence intervals clearly drops with FutureCap. For instance, the number of words in [0.1, 0.2) falls by 0.35%. This indicates that our FutureCap model becomes more confident about the ground-truth words given accurate context.

Case Study
To qualitatively show the effectiveness of future context modeling, we showcase in Figure 7 several image descriptions generated by the conventional Transformer with mesh-memory and by our FutureCap model, along with the human-annotated ground-truth sentences (GT). Generally, both approaches produce linguistically coherent descriptions. Nevertheless, when examining fine-grained image content, our future-information-incorporated method produces more accurate

RELATED WORKS
Image Captioning. In recent years, a large number of neural systems have been proposed for the image captioning task [3,9,16,22,24,40,53,58]. State-of-the-art approaches rely on the encoder-decoder framework to translate an image into a descriptive sentence. Specifically, the encoder network computes visual representations of the image, and the decoder network generates a target sentence based on those representations. To make more effective use of the visual representations, a series of attention models have been proposed and have achieved great success in multiple sequence-to-sequence learning tasks [4,38]. More recently, Transformer-based architectures [9,14,17,26,32,40,59] have been introduced to replace conventional RNNs, achieving new state-of-the-art performance. On the other hand, many mask-based non-autoregressive decoding methods have been studied for inference acceleration with a global perspective [13,15,17,18,20]. However, to the best of our knowledge, improving the original language decoding with supervised future information from a NAIC decoder has never been studied in image captioning, which motivates our exploration in this paper.
Training Procedure. Training strategies for image captioning models usually follow the word-level cross-entropy paradigm from left to right. This was later combined with a fine-tuning phase based on the REINFORCE method, allowing captioning metrics to be used directly as optimization objectives [36,46] and boosting final performance. As a strategy to improve both training phases, [25] proposes to exploit a teacher model trained on image attributes to generate additional supervision signals for the captioning model: soft labels that the captioning model aligns with during the cross-entropy phase, and a re-weighting of caption words to guide the fine-tuning phase. [5] improves quality through the interaction of two interconnected language models that learn from each other. Further performance gains in recent self-attention-based captioning approaches come from large-scale vision-and-language pre-training [8,33,43,62,65], which can be conducted on noisy, weakly annotated image-text pairs and can exploit pre-training losses other than cross-entropy, such as the masked word loss [62]. Different from all previous methods, our approach relies on the assistance of an additional non-autoregressive image captioning model trained with multi-task learning and dynamic distribution calibration, without changing the internal model architecture or relying on a prior large-scale pre-trained model.
Future Information Incorporation. Numerous works [7,11,12,27,39,42,45] have sought to exploit future information to boost performance in sequence-to-sequence learning. However, their modeling differs from ours. Specifically, [7] adopts a fine-tuned BERT [10] to encode the words that will be generated in the future, acquiring a global cost that is then exploited as extra supervision to guide current word generation. [1,64] employ an extra teacher
network to help the neural machine translation model capture global information with knowledge distillation. Given the previous history, [11,42,45] predict future words in addition to the current target: [11,42] look one step ahead, and [45] predicts the rest of the sequence. [39] only considers the current target when modeling future information, and [48,50,55] regularize right-to-left generation, whereas we directly leverage effective knowledge to enhance the modeling of future information. The most similar work is [67]; both works point out the importance of bi-directional context and employ it for improved image captioning. In contrast, [67] introduces a compact directional Transformer for parallel decoding, while we devise causal dynamics calibration without extra parameters.

CONCLUSION
In this paper, we focus on enabling a conventional image captioning model to effectively exploit global context without any extra inference cost. To this end, we resort to a mask-based non-autoregressive decoder for future information modeling during training. Specifically, we introduce multi-task learning that benefits the AIC model by sharing its visual encoder with an auxiliary NAIC model. Next, we distill a teacher NAIC model by training the student AIC model to capture the causal dynamics for unconfident words. Experimental results on the MS COCO dataset show that our future information incorporation framework significantly improves captioning performance. More importantly, no additional carefully designed network is needed, and only the original image captioning model is involved during inference.

Figure 2 :
Figure 2: Predicted probabilities of the ground-truth words on the MS COCO training set.

Figure 3 :
Figure 3: Average predicted probability of the ground-truth words at different normalized sentence positions on the MS COCO training set.

Figure 4 :
Figure 4: Illustration of future context modeling with a non-autoregressive decoder for the language decoder. Assuming the confidence of the predicted word $y_4$ from the language decoder is lower than the threshold β, the masked word set $Y_m$ becomes $\{y_4\}$. Thus the input $Y_o$ to the NAIC decoder is $\{y_1, y_2, y_3, [\mathrm{MASK}], y_5, y_6\}$, and the output hidden states $h_4$ and probability $q_4$ are used to calibrate the causal dynamics of the original $y_4$. Note that all parameters of the NAIC model are frozen.

Figure 5 :
Figure 5: Evaluated CIDEr scores at the combined training stage on the MS COCO offline test set with different values of λ, the balancing factor between the two losses.

Figure 6 :
Figure 6: Evaluated CIDEr scores on the MS COCO offline test set with different values of β, the confidence threshold for masked words.

Figure 7 :
Figure 7: Case studies of the original Transformer and our FutureCap model, coupled with the corresponding ground-truth sentences (GT).
and fluent descriptive sentences by exploiting global information for different word predictions. For example, the plain Transformer generates the phrase "on a bicycle", which is inconsistent with the visual relationship in the second image, while the words "next to a bicycle" from our model depict it more precisely. This again confirms the advantage of capturing global context with the proposed FutureCap method.
$\{p_1, p_2, \ldots, p_{|y|}\}$, where $|y|$ is the sentence length. Then, the masked word set $Y_m$ is built from the positions whose predicted probabilities $p_i$ for the corresponding ground-truth words fall below a threshold value $\beta$:
$$Y_m = \{y_i \mid p_i \le \beta, \; 1 \le i \le |y|\}.$$
Next, we obtain the observed set $Y_o$ for the NAIC model input by replacing the selected low-confidence ground-truth words in the original sentence $y$ with a special symbol [MASK]. Note that $y = Y_o \cup Y_m$ always holds in our framework. We can then obtain the predicted probability distribution $q$ from the teacher NAIC model for every word in $Y_m$ using Equation 7. Once the knowledge routing of the AIC and NAIC models is obtained, to improve the decisions on the unconfident set $Y_m$, we introduce causal dynamics calibration to assist the future context modeling of the AIC model. The detailed procedure is shown in Algorithm 1, in which the GET operation is defined as an activation-value retriever for a neural model: given a model $M$ containing a set of neurons $N$, i.e., internal representations, and an input context $C$ including the image $I$ and the generated words $y_{<t}$, $\mathrm{GET}(M, N, C)$ is the set of activation values that the neurons $N$ take on when processing the context $C$. For context $C$, with $N_s$ the set of neurons sampled from the student AIC model, the interchange alignment loss is:
$$\mathcal{L}_{align}(\theta_{aic}, \theta_{naic}) = \sum_{y \in Y_m} \| \mathrm{GET}(M_s, N_s, C) - \mathrm{GET}(M_t, f(N_s), C) \|_2^2.$$

Algorithm 1: Causal Dynamics Calibration between the student AIC model and the teacher NAIC model
Input: unconfident masked data $Y_m$, student AIC model $M_s$ with output neurons $N_s$, teacher NAIC model $M_t$, neuron alignment $f$
1: Fix the parameters of $M_t$;
2: while not converged do
3:   for each $y$ in $Y_m$ do
4:     $N_t = f(N_s)$;
5:     Compute the alignment loss $\| \mathrm{GET}(M_s, N_s, C) - \mathrm{GET}(M_t, N_t, C) \|_2^2$;
6:     Compute the KD loss $\mathrm{KL}(q_y \| p_y)$;
7:     Compute the combined loss;
8:     Backward and step the optimizer;
9:   end for
10: end while
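A toy sketch of the interchange alignment term: the `get_student`/`get_teacher` callables stand in for the GET activation retriever, and the mapping `f` plays the role of the neuron alignment; all names here are illustrative, not the authors' code.

```python
def l2_sq(u, v):
    # squared L2 distance between two activation vectors
    return sum((a - b) ** 2 for a, b in zip(u, v))

def interchange_alignment_loss(get_student, get_teacher, f, masked_words, context):
    """Sketch of the alignment term of Algorithm 1: for each unconfident
    word, pull the student AIC activations toward the frozen teacher NAIC
    activations on the mapped neurons f(y), summed over the masked set."""
    return sum(
        l2_sq(get_student(y, context), get_teacher(f(y), context))
        for y in masked_words
    )
```

In training, this term would be differentiated with respect to the student activations only, since the teacher's parameters are frozen.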

Table 1 :
Performance comparisons of our FutureCap model and other state-of-the-art image captioning models with different evaluation metrics on the MS COCO Karpathy test set.All values are reported as a percentage (%).

Table 2 :
Leaderboard of different image captioning models on the online MS COCO test server.

Table 3 :
Ablation studies on the MS COCO test set.
memory-incorporated Transformer by +0.6 BLEU-4 and +7.1 CIDEr, respectively, verifying that modeling future information brings a significant performance improvement. Next, it is encouraging that our proposed framework outperforms the most recent competitive models: our proposal reaches 136.3 CIDEr points, beating almost all compared approaches. It is also encouraging that our strategy can be combined with other advanced strategies while leaving internal structures untouched.

Table 4 :
Performance comparisons of incorporating different distribution calibration on the MS COCO test set.

Table 5 :
The percentage of words within each probability interval on the training set.