A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization

In this paper we present our work on improving the efficiency of adversarial training for unsupervised video summarization. Our starting point is the SUM-GAN model, which creates a representative summary based on the intuition that such a summary should make it possible to reconstruct a video that is indistinguishable from the original one. We build on a publicly available implementation of a variation of this model, that includes a linear compression layer to reduce the number of learned parameters and applies an incremental approach for training the different components of the architecture. After assessing the impact of these changes to the model's performance, we propose a stepwise, label-based learning process to improve the training efficiency of the adversarial part of the model. Before evaluating our model's efficiency, we perform a thorough study with respect to the used evaluation protocols and we examine the possible performance on two benchmarking datasets, namely SumMe and TVSum. Experimental evaluations and comparisons with the state of the art highlight the competitiveness of the proposed method. An ablation study indicates the benefit of each applied change on the model's performance, and points out the advantageous role of the introduced stepwise, label-based training strategy on the learning efficiency of the adversarial part of the architecture.


INTRODUCTION
Recent advances in video capturing and storage technology and the widespread use of social networks (e.g. Facebook, Twitter), video sharing platforms (e.g. YouTube) and online video archives, facilitate the recording and sharing of huge volumes of video content. Thousands of hours of video are uploaded every single day on the Web, aiming to attract the viewers' attention. Nevertheless, in several cases, browsing through long videos to find the content that a viewer prefers is a highly time-consuming and tedious process. Hence, the provision of a concise summary that adequately conveys the main concept of the video, enables the viewer to quickly grasp an idea without having to watch the entire content. Given the plethora of videos on the Web and the limited time spent by viewers on deciding whether to watch or skip a video, an effective video summary allows time-efficient browsing of large video collections and increases the potential of a video to be consumed.
Video summarization aims to provide a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video. The application domain is wide and includes the use of such technologies by video sharing platforms that aim for higher viewer engagement and content consumption, and by the content management systems of media organizations to allow effective indexing, browsing and retrieval of video content. Moreover, video summarization that takes into account the diversity of the current content distribution environment enables effective sharing of video content across different channels (e.g. 3G/4G/5G WANs, LANs, etc.) and presentation devices (e.g. desktops, laptops, tablets, smart-phones), in forms (storyboards, skims, excerpts) that are tailored to the needs of each viewer, thus facilitating content presentation and consumption.
Several methods aimed to tackle the task of video summarization, and deep learning approaches were the main focus of researchers over the last years. In this direction, a number of datasets were built to facilitate training and evaluation of video summarization algorithms. However, driven by the fact that video summarization is a highly subjective task, we argue that supervised learning, which relies on the use of a single ground-truth summary, cannot fully explore the potential of deep learning architectures. The latter, in combination with the limited amount of available annotated data for training a video summarization algorithm in a supervised manner, directed our focus on improving the performance of an unsupervised method. Starting from the work of [16] and building on a PyTorch implementation of a variation of this model [3], we perform a thorough study with respect to parts and procedures that could be further fine-tuned for improving the model's performance. In particular, after evaluating the implemented modifications, namely the addition of a linear compression layer that reduces the number of trainable parameters and the application of an incremental training method for the model's components, we propose a stepwise, label-based learning approach for the adversarial part of the architecture. Experiments on the SumMe and TVSum benchmarking datasets showed that the proposed method, called "SUM-GAN-sl" in the remainder of the paper, exhibits significantly improved performance compared to the original one, and is highly competitive against other state-of-the-art methods. In a nutshell, our contributions include:
• the evaluation of how the variations introduced by the developer of [3], i.e. the addition of the linear compression layer and the applied incremental process for training the architecture, influence the performance of the original model;
• the proposal of a stepwise, label-based approach for training the adversarial module of the network in a more fine-grained manner, and the assessment of the advantage this update brought to the algorithm's efficiency;
• a thorough study of the relevant literature that allowed us to gather information about the utilized evaluation protocols and spot the differences in the assessment of state-of-the-art summarization algorithms;
• experiments on the SumMe and TVSum benchmarking datasets that resulted in estimates regarding the lower and upper bounds of summarization performance and the suitability of the used evaluation metrics.

RELATED WORK
Several approaches were proposed over the last couple of decades for addressing the task of video summarization, with the majority of them being trained in a supervised manner using ground-truth data. For the sake of brevity, here we report only on machine learning methods that exploit the learning efficiency of neural networks. A group of supervised algorithms was based on the use of Convolutional Neural Networks (CNN). For example, in [19] video summarization is addressed as a weakly-supervised learning problem and solved via a deep 3D convolutional neural network architecture that learns the notion of importance using only video-level annotation. [23] tackles video summarization as a sequence labeling problem and performs key-frame-based video summarization using fully convolutional sequence models. [6] combines a soft, self-attention network with a 2-layer fully connected network to process the CNN features of the video frames and compute frame-level importance scores that are used for key-fragment selection. [18] uses deep video features for encoding various levels of content semantics and a deep neural network that maps videos and their descriptions to a common semantic space. The latter is jointly trained with associated pairs of videos and descriptions, and a summary is created by clustering the deep features extracted from the video segments. The effectiveness of Recurrent Neural Networks (RNN) (e.g. Long Short-Term Memory (LSTM) units [11] and Gated Recurrent Units (GRU) [4]) in capturing the temporal dependency over sequential data led to several RNN-based supervised techniques for video summarization that represent the current state of the art. [29] introduces the use of LSTMs to model temporal dependency among frames and compute frame-level importance scores. [32] proposes a 2-layer LSTM architecture where the first layer extracts and encodes data about the video structure and the second layer uses this data to define the key-fragments of the video.
This work is extended in [33] to exploit the shot-level temporal structure of the video and compute shot-level confidence scores for producing a key-shot-based summary of the video. [30] describes a Dilated Temporal Relational (DTR) Generative Adversarial Network (GAN), where the generator contains LSTM and DTR units to exploit long-range temporal dependencies at different temporal windows, and the discriminator is trained via a 3-player loss to distinguish between the learned summary and a trivial summary consisting of randomly selected frames. Finally, a number of works focus on introducing attention mechanisms in the network's architecture, to identify the most suitable parts and build the summary e.g. [7,8,12].
Besides the aforementioned supervised approaches, a few unsupervised methods were proposed as well. [16] addresses video summarization by training a deep network to minimize the distance between videos and a distribution of their keyframe-based summarizations, through a generative adversarial framework. [27] follows a similar approach, and aims to maximize the mutual information between the summary and the video using an information-preserving metric, a trainable couple of discriminators and a cycle-consistent adversarial learning objective. [34] formulates video summarization as a sequential decision-making process and develops a deep summarization network that learns to produce diverse and representative video summaries via reinforcement learning and a novel reward function. [31] suggests an approach that extracts key motions of appearing objects in the video, and learns to produce a fine-grained object-level video summarization in an unsupervised manner. The authors of [23] describe an unsupervised variation of their model, that aims to increase the visual diversity of the selected key-frames. Finally, [22] introduces a new formulation to learn video summarization from unpaired data. Sports highlights, movie trailers and other professionally-edited summary videos available online are collected and used to guide an adversarial process that learns a mapping function of a raw video to a human-like summary.

PROPOSED APPROACH
The starting point of our work was the unsupervised method of [16]. The core idea of Mahasseni et al. was to build a keyframe selection mechanism (to generate static video summaries) by minimizing the distance between features extracted from the selected key-frames and the entire video. For this, a deep representation of the entire video frame sequence is created with the help of a bi-directional LSTM, which assigns a weight to each frame, and a variational auto-encoder (VAE). The former is used to capture the long-term dependencies over sequences of frames in both forward and backward direction. The latter is used to reveal the underlying structure of the frame/keyframe features (in its encoding part) and produce another representation of the video by drawing samples from the computed latent space (in its decoding part). The difficulty in defining a suitable threshold regarding the similarity between the reconstructed and the original video directed Mahasseni et al. to the adversarial framework and the integration of a trainable discriminator network. The ultimate goal of this approach was to jointly train the frame selector and the variational auto-encoder in order to maximally confuse the discriminator, i.e. decrease the discriminator's confidence in distinguishing the original from a reconstructed video, a condition that denotes a highly representative keyframe collection.

Figure 1: The SUM-GAN architecture has been extended by a linear compression layer that reduces the size of the feature vectors. In addition, the model's components are trained incrementally, and the GAN part of the architecture is trained in a stepwise and label-based manner.
Building on this method, we gained deeper knowledge about the components of the SUM-GAN model and explored the possibility of improving its performance by fine-tuning specific parts of the architecture and the training process. For this, we relied on a publicly available PyTorch implementation [3] that was used for evaluating the performance of a variation of SUM-GAN on the summarization of 360° videos (see [15]). This variation contains a linear compression layer right before the summarizer of the architecture. In the updated model (see Fig. 1), given a video of $M$ frames and focusing on the $t$-th frame of this video, $x_t$ represents the CNN feature vector, $x'_t$ denotes the compressed feature vector, $s_t$ refers to the importance score computed by the frame selector, $w_t$ corresponds to the weighted feature vector ($s_t \otimes x'_t$, where $\otimes$ denotes element-wise multiplication), and $\hat{x}_t$ is the feature vector reconstructed by the variational auto-encoder. In addition to the added linear layer, this variation follows a 3-step incremental training approach that updates specific parts of the network in each step. In particular, in contrast to the immediate update of the entire model based on the losses computed after a single forward pass of the architecture (see Alg. 1 in [16]), the implemented process:
• performs a 1st forward pass over the entire model, computes the $L_{reconst}$, $L_{prior}$ and $L_{sparsity}$ losses, and updates the frame selector, the encoder and the linear compression layer (top part of Fig. 2);
• performs a 2nd forward pass of the partially updated model, computes the $L_{reconst}$ and $L_{GAN}$ losses, and updates the decoder and the linear compression layer (middle part of Fig. 2);
• performs a 3rd forward pass of the partially updated model, computes the $L_{GAN}$ loss, and updates the discriminator and the linear compression layer (bottom part of Fig. 2).

The aforementioned losses are computed similarly to [16]:

$$L_{reconst} = \left\| \phi(x') - \phi(\hat{x}) \right\|_2$$

where $\phi(x')$ is the output of the last hidden layer of cLSTM for the compressed feature vectors of the original video ($x' = \{x'_t\}_{t=1}^{M}$) and $\phi(\hat{x})$ is the output of the last hidden layer of cLSTM for the feature vectors of the summary-based reconstructed video ($\hat{x} = \{\hat{x}_t\}_{t=1}^{M}$);

$$L_{prior} = D_{KL}\left( q(e|x) \,\|\, p(e) \right)$$

where $e$ is the hidden (latent) representation of $x$, $p(e)$ is used to fit the values of the latent variable $e$ to the values of a standard Normal distribution with mean zero and variance one, $x$ is the observed data, $q(e|x)$ is the probability of observing $e$ given $x$, and $D_{KL}$ denotes the Kullback-Leibler divergence. For efficient training we employ the re-parameterization trick proposed in [14];

$$L_{sparsity} = \left\| \frac{1}{M} \sum_{t=1}^{M} s_t - \sigma \right\|_2$$

where $M$ is the total number of video frames and $\sigma$ is the regularization factor, a tunable hyper-parameter of the model;

$$L_{GAN} = \log\left(cLSTM(x')\right) + \log\left(1 - cLSTM(\hat{x})\right) + \log\left(1 - cLSTM(\hat{x}_p)\right)$$

where $cLSTM(x')$, $cLSTM(\hat{x})$ and $cLSTM(\hat{x}_p)$ are probability scores (computed at the soft-max output of the discriminator) representing the discriminator's confidence when classifying the original video, the generated summary and the uniform summary respectively.
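To make the non-adversarial loss terms described above more concrete, the following minimal Python sketch computes them on plain lists of numbers. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the reconstruction and sparsity terms are assumed to be Euclidean distances as in [16], and the prior term assumes that q(e|x) is a diagonal Gaussian parameterized by a mean vector and a log-variance vector, so that the KL divergence to a standard Normal has a closed form.

```python
import math


def reconstruction_loss(phi_orig, phi_recon):
    """Euclidean distance between the discriminator's last hidden-layer outputs
    for the original video and for its summary-based reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_orig, phi_recon)))


def prior_loss(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    Assumption: q(e|x) is a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def sparsity_loss(scores, sigma):
    """Distance between the mean frame importance score and the
    regularization factor sigma (a tunable hyper-parameter)."""
    mean_score = sum(scores) / len(scores)
    return abs(mean_score - sigma)
```

For example, `sparsity_loss([0.1, 0.2, 0.3], 0.2)` is (numerically) zero, since the mean importance score already matches the regularization factor, while larger or smaller mean scores are penalized symmetrically.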
Given the above, we examined a different training strategy for the adversarial part of the model. The introduced learning approach was utilized in [21] for unsupervised representation learning with deep convolutional GANs, a method used for image generation. Driven by the effectiveness of this approach on training a network to generate realistic images from white noise, we transfer this methodology to our context. Our aim is to find a better equilibrium point between the generator and the discriminator, which means a better reconstruction of the video from the combination of the weighted frames and the distribution of data learned by the variational auto-encoder of the architecture. So, instead of using the $L_{GAN}$ loss of the original SUM-GAN model, we follow a label-based approach, where label "1" is assigned to the original video and label "0" to the video summary. Given these labels, we introduce the following two losses:

$$L_{ORIG} = \left(1 - cLSTM(x')\right)^2$$

$$L_{SUM} = \left(cLSTM(\hat{x})\right)^2$$

$L_{ORIG}$ is used to minimize the Mean Square Error (MSE) between the original video label and the computed probability when the discriminator is fed with the original video. Similarly, $L_{SUM}$ is used to minimize the MSE between the summary label and the computed probability when the discriminator is fed with the summary-based reconstruction of the video. Based on these losses, the training of the discriminator is performed in a stepwise manner, as depicted in Fig. 3 (top part). First, we pass the compressed feature vectors of the original video ($x'_t$, $t \in [1, M]$) through the discriminator (forward pass), calculate $L_{ORIG}$ and then calculate the gradients (backward pass). Secondly, we pass the original video through the summarizer to create the reconstructed video ($\hat{x}_t$, $t \in [1, M]$), forward the latter to the discriminator, calculate $L_{SUM}$ and then accumulate the gradients from both the original video and the summary-based reconstructed one, with another backward pass.
With the gradients accumulated, we call a step of the discriminator's optimizer. This incremental process enables a more fine-grained computation of the discriminator's gradients (compared with the training policy used in SUM-GAN), and helps the discriminator develop higher discrimination efficiency, thus performing better during the classification.
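The stepwise update described above (two forward/backward passes whose gradients are accumulated, followed by a single optimizer step) can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions: the discriminator is replaced by a single logistic unit over a fixed-size feature vector instead of the LSTM-based classifier of the paper, gradients are derived by hand, plain gradient descent stands in for the Adam optimizer, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def label_loss_and_grads(w, b, feats, label):
    """Return (label - p)^2 and its gradients w.r.t. w and b,
    where p = sigmoid(w . feats + b) is the toy discriminator's output."""
    p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b)
    loss = (label - p) ** 2
    dz = -2.0 * (label - p) * p * (1.0 - p)  # chain rule through the sigmoid
    return loss, [dz * fi for fi in feats], dz


def discriminator_step(w, b, orig_feats, recon_feats, lr=0.5):
    # Step 1: forward the original video (label 1) and back-propagate L_ORIG.
    l_orig, grad_w, grad_b = label_loss_and_grads(w, b, orig_feats, 1.0)
    # Step 2: forward the reconstruction (label 0), back-propagate L_SUM and
    # accumulate its gradients on top of those from step 1.
    l_sum, gw_sum, gb_sum = label_loss_and_grads(w, b, recon_feats, 0.0)
    grad_w = [g1 + g2 for g1, g2 in zip(grad_w, gw_sum)]
    grad_b += gb_sum
    # Step 3: one optimizer step using the accumulated gradients
    # (plain gradient descent here; an assumption made for brevity).
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b, l_orig, l_sum
```

Calling `discriminator_step` repeatedly on a fixed pair of "original" and "reconstructed" feature vectors drives both label losses down from their initial value, which mirrors, in miniature, the discriminator learning to separate the two inputs.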
For training the generator, we introduce the following loss:

$$L_{GEN} = \left(1 - cLSTM(\hat{x})\right)^2$$

$L_{GEN}$ is used to minimize the MSE between the original video label and the computed probability when the discriminator is fed with the summary-based reconstruction of the video. By constantly trying to reduce the sum of $L_{reconst}$ and $L_{GEN}$, the generator aims to confuse the discriminator and make the summary-based reconstruction of the video indistinguishable from the original one. The reasoning behind choosing the MSE loss instead of the commonly used Binary Cross Entropy (BCE) loss for training the GAN module of the architecture resides in the fact that in vanilla GANs, the latent vector (random noise) is sampled independently of the training data. The original GAN has shown better performance with the BCE loss, since it does not force the network to learn a non-meaningful representation between the noise vector and a ground-truth image. Instead, it helps the generator to produce more versatile outputs, taking into account only whether the output is classified as real or fake. In our method, differently from typical GANs, the introduction of a variational auto-encoder alters the above approach, because the input is no longer a random noise latent vector but an original video to be reconstructed and fed to the discriminator for comparison. Therefore, we choose the MSE as the loss function, since our method attempts to reconstruct the input rather than to generate new samples. To validate this choice we performed a set of experiments and their findings are reported in Section 4.
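As a small illustration of the two candidate criteria discussed above, the sketch below scores a discriminator probability against a target label with either MSE or BCE, and combines the reconstruction loss with the generator's label loss. The function names and the clipping constant `eps` are assumptions introduced for this example; the only thing taken from the text is that the generator's label loss compares the probability assigned to the summary-based reconstruction with the label of the original video ("1").

```python
import math


def mse_label_loss(prob, label):
    """Squared error between the discriminator's probability and the target label."""
    return (label - prob) ** 2


def bce_label_loss(prob, label, eps=1e-12):
    """Binary cross entropy; prob is clipped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def generator_loss(l_reconst, prob_recon, criterion=mse_label_loss):
    """Sum of the reconstruction loss and the generator's label loss, where the
    reconstruction is scored against the label of the original video (1)."""
    return l_reconst + criterion(prob_recon, 1.0)
```

With MSE the penalty is bounded by 1 for any probability, whereas BCE grows without bound as the discriminator becomes confident that the reconstruction is fake; passing `criterion=bce_label_loss` switches between the two for comparison.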
Given the above described training strategy, the randomly generated summary used in the original SUM-GAN model to regularize learning of the discriminator is not needed any more in our variation. The authors of [16] claim that the use of the randomly generated summary enhances the discriminator's ability to distinguish between the original video and a summary-based reconstruction of it. Nevertheless, through this approach the discriminator learns to classify the random summary in the same class with the generated summary, thus restricting the discriminator's ability to make the distinction between an actual video summary and a randomly generated one. Based on this reasoning, we omit the use of a random summary for training our model.
Full Paper AI4TV '19, October 21, 2019, Nice, France

After training, the components responsible for generating a summary for an unseen video are the linear compression layer and the frame selector. In particular, the CNN features of the video frames pass through the aforementioned components and an importance score is computed for each frame. Then, having the video fragmented using the KTS algorithm [20] (other approaches for video shot (e.g. [1]) or subshot (e.g. [2]) segmentation could be used too), fragment-level importance scores are calculated by averaging the importance scores of each fragment's frames. Finally, the summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the original video duration, a convention adopted by several video summarization approaches (e.g. [12,24,29,34]). This latter step is performed by solving the following optimization problem:

$$\max \sum_{i=1}^{N} a_i b_i, \quad \text{subject to} \quad \sum_{i=1}^{N} a_i l_i \leq 0.15 \cdot L, \quad a_i \in \{0, 1\}$$

where $N$ is the number of fragments, $L$ is the length of the original video, 0.15 defines the upper limit for the summary duration, and, given the $i$-th fragment of the video, $a_i$ is a binary value that indicates whether the fragment is selected or not, $b_i$ is the computed fragment-level importance score, and $l_i$ is the length of the fragment. The latter is the 0/1 Knapsack problem.
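The fragment selection step described above can be sketched with the textbook dynamic-programming solution of the 0/1 Knapsack problem. The snippet is a minimal illustration, not the authors' code: the function name is hypothetical, and fragment lengths are assumed to be integers (e.g. numbers of frames) so that they can index the table.

```python
def select_fragments(scores, lengths, video_length, budget_ratio=0.15):
    """Pick the fragments that maximize the total importance score while the
    total selected length stays within budget_ratio of the video length."""
    capacity = int(video_length * budget_ratio)
    n = len(scores)
    # best[i][c]: best total score using the first i fragments within capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip fragment i-1
            if length <= c and best[i - 1][c - length] + score > best[i][c]:
                best[i][c] = best[i - 1][c - length] + score  # option 2: take it
    # Backtrack through the table to recover the selected fragment indices.
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)
```

For instance, with fragment scores `[0.9, 0.2, 0.8, 0.4]`, lengths `[10, 5, 10, 5]` and a video of 110 frames (a budget of 16 frames), the two highest-scoring fragments cannot both fit, so the function returns the first and the last fragment.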

EXPERIMENTS

Datasets
We evaluate the performance of our method on the SumMe [10] and TVSum [24] datasets. SumMe includes 25 videos covering multiple events from both first-person and third-person view, while the video duration ranges from 1 to 6 minutes. TVSum contains 50 videos capturing 10 categories of the TRECVid Multimedia Event Detection dataset and the length of each video ranges from 1 to 5 minutes. In terms of ground-truth annotation, each video of SumMe has been annotated by 15 to 18 viewers/users in the form of key-fragments, and thus it is associated with multiple fragment-level user summaries. Moreover, besides the aforementioned user summaries, a single ground-truth summary in the form of frame-level importance scores (calculated as an average of the key-fragment user summaries per frame) is also provided. In the case of TVSum, videos have been annotated by 20 viewers/users in the form of frame-level importance scores. Similar to SumMe, a single ground-truth summary in the form of frame-level importance scores (computed after averaging all users' scores) is provided for each video of the dataset.

Evaluation Approach
For fair comparison with other video summarization algorithms, we adopt the evaluation protocol proposed in [29]. The similarity between an automatically generated (A) and a ground-truth summary (G) is computed by the F-Score (as percentage), where (P)recision and (R)ecall measure the temporal overlap ($\cap$) between the summaries ($\| \cdot \|$ denotes duration):

$$F = 2 \times \frac{P \times R}{P + R} \times 100, \quad \text{with} \quad P = \frac{\|A \cap G\|}{\|A\|} \quad \text{and} \quad R = \frac{\|A \cap G\|}{\|G\|}$$

A thorough study of the relevant literature indicated that most works evaluate the performance of video summarization based on the key-fragment protocol introduced in [29]. As stated before, the ground-truth annotations for the SumMe dataset are already available in the form of key-fragments, and thus can be used directly for evaluation. Nevertheless, the annotation of the TVSum videos is available only in the form of frame-level importance scores. To tackle this, the frame-level ground-truth annotations of the TVSum videos are converted to key-fragment-based summaries following the approach presented in [24,29]. In particular, the videos are temporally segmented into non-overlapping fragments using the KTS method [20]. Then, fragment-level importance scores are computed by averaging the importance scores of the frames of each fragment, and the calculated scores are used for ranking the fragments. Finally, a subset of fragments is selected to form the video summary, such that the summary duration does not exceed 15% of the video's length. In most cases, the latter is performed using the Knapsack algorithm, as proposed in [24,29]. Given the above technical background, we found that there is a subtle but significant distinction with respect to what is eventually used as ground-truth summary for evaluating the performance of a video summarization algorithm. In particular, a number of works (see Table 7) compare the generated summary for a given video against the single ground-truth summary that is available for that video (mainly for supervised training).
Differently to this approach, the majority of works (see Tables 4 and 5) evaluate the efficiency of the generated summary for a given video by assessing its similarity with all the available human-generated (a.k.a. user) summaries for that video. Driven by the fact that video summarization is a highly subjective task, we argue that exploiting existing knowledge from many summaries of the same video can lead to more concrete and reliable results. Hence, in our assessments we follow the evaluation protocol that involves all human-generated summaries. More precisely, given a video, we compare the generated summary with the available user summaries and compute an F-Score for each pair of generated and user summary. Then, we average the computed F-Scores (in the case of TVSum) or keep the maximum of them (in the case of SumMe, following the recommendation of the authors of this dataset (see [9])) and end up with the final F-Score for this video. The computed F-Scores for the entire set of testing samples are finally averaged to capture the final outcome about the algorithm's performance. For fair comparison with methods that adopt the single ground-truth summary evaluation approach, we report our model's performance based on this approach too.
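The multi-user evaluation protocol described above can be sketched as follows. The snippet is a minimal illustration with hypothetical function names; it assumes that each summary is represented as a binary per-frame selection vector of equal length, so that duration is measured as a number of selected frames.

```python
def f_score(generated, user):
    """F-Score (%) based on the temporal overlap of two binary
    per-frame selection vectors (1 = frame included in the summary)."""
    overlap = sum(g * u for g, u in zip(generated, user))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(generated)
    recall = overlap / sum(user)
    return 2 * precision * recall / (precision + recall) * 100


def video_f_score(generated, user_summaries, criterion="avg"):
    """Compare a generated summary with every user summary of the video and
    keep the average ("avg") or the maximum ("max") of the F-Scores."""
    scores = [f_score(generated, u) for u in user_summaries]
    return max(scores) if criterion == "max" else sum(scores) / len(scores)
```

For a generated summary that coincides with one of two user summaries and has no overlap with the other, the "max" criterion yields 100 while the "avg" criterion yields 50, which illustrates how strongly the choice of criterion affects the reported score.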

Preliminary Study on Datasets
Aiming to get some insights about the used datasets, we examined the following aspects:
• the efficiency of a randomly generated summary (frames' importance scores were defined based on a uniform distribution of probabilities, and the experiment was performed 100 times);
• the human performance, i.e. how well a human annotator would perform based on the preferences of the remaining annotators; this is a measure regarding the compatibility/agreement between the human-defined summaries;
• an estimate about the highest performance on TVSum according to the best human-generated summary (with the highest overlap) for each video of the dataset.
For completeness, in Table 1 we report the outcomes of our study using both criteria for calculating the video-level F-Scores, i.e. the maximum of the computed F-Scores in the case of SumMe, and the average of these scores in the case of TVSum.

Table 1: Performance (F-Score (%)) of different types of summaries and the theoretical upper-bound of the SumMe and TVSum dataset, based on the "average" and "max" criteria.

The results, which are consistent with the findings of a recently published study on these datasets [17], clearly indicate that video summarization is a highly subjective task, as there is no ideal summary that exhibits significant overlap with all annotators' preferences, in both datasets. Moreover, the "average" metric in the case of TVSum shows that human performance is comparable with the efficiency of a randomly generated summary, and thus limits the available space for improvement. In particular, the best possible summary (i.e. a summary that matches the best human-generated summary for each different video of the dataset) results in a score that is approximately 10 units higher than the score of a random summary. Given the understandable lack of an objective summary for a video, we argue that the "max" criterion is more suitable for assessing the performance of video summarization approaches.
In this sense, the upper-bound with respect to video summarization efficiency will be 100% in both datasets, denoting that machine-generated summaries are indistinguishable from human-generated ones.
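The "human performance" aspect examined above can be sketched as a leave-one-out computation over the user summaries of a video. This is only an illustrative reading of the description: the function names are hypothetical, summaries are assumed to be binary per-frame selection vectors, and averaging the per-annotator scores into a single video-level number is an assumption.

```python
def f_score(a, b):
    """F-Score (%) based on the overlap of two binary per-frame selections."""
    overlap = sum(x * y for x, y in zip(a, b))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(a), overlap / sum(b)
    return 2 * precision * recall / (precision + recall) * 100


def human_performance(user_summaries, criterion="avg"):
    """Score each annotator's summary against the summaries of the remaining
    annotators, then average the per-annotator scores."""
    per_user = []
    for i, summary in enumerate(user_summaries):
        others = user_summaries[:i] + user_summaries[i + 1:]
        scores = [f_score(summary, other) for other in others]
        per_user.append(max(scores) if criterion == "max" else sum(scores) / len(scores))
    return sum(per_user) / len(per_user)
```

When annotators disagree strongly, the "avg" variant stays low even though each individual summary is a valid one, which is the effect the study above attributes to the subjectivity of the task.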

Implementation Details
We downsampled all videos to 2 fps. For fair comparison with several works (including [16]), we used the output of the pool5 layer of GoogleNet [25] trained on ImageNet for representing the visual content of the video frames. The linear compression layer reduces the size of these feature vectors from 1024 to 500. Each component of the architecture consists of a 2-layer LSTM with 500 hidden units in each layer, while, as in [16], the frame selector is a bi-directional LSTM. Training is based on the Adam optimizer; the learning rate for all components but the discriminator is $10^{-4}$, while for the discriminator it equals $10^{-5}$. For evaluation, we followed the standard 5-fold cross validation approach (i.e. 80% of videos used for training and the remaining 20% for testing) and, in the next sections, we report the average performance over the 5 runs. Finally, we implemented our method in PyTorch.
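The 5-fold cross validation setup mentioned above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name is hypothetical, and shuffling the videos with a fixed random seed before splitting is an assumption, since the text does not state how the folds were formed.

```python
import random


def five_fold_splits(video_ids, seed=0):
    """Return 5 (train_ids, test_ids) pairs; each video is tested exactly once.
    Assumption: videos are shuffled with a fixed seed before splitting."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::5] for i in range(5)]  # 5 disjoint, near-equal folds
    splits = []
    for k in range(5):
        test_ids = sorted(folds[k])
        train_ids = sorted(v for j, fold in enumerate(folds) if j != k for v in fold)
        splits.append((train_ids, test_ids))
    return splits
```

For the 25 videos of SumMe this gives 20 training and 5 testing videos per split, and for the 50 videos of TVSum 40 and 10 respectively, matching the 80%/20% proportions stated above.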

Performance Evaluation
The performance of the proposed variation of the SUM-GAN model was initially evaluated for several values of the regularization factor σ, ranging between 0.05 and 0.5. Experiments for greater values were omitted as the method's performance was reduced for the highest tested value. The results reported in Table 2 indicate that: i) the regularization factor clearly affects the performance (as also reported in [16]) and thus needs fine-tuning; ii) too small and too large values lead to reduced efficiency, and only a specific range of values results in good performance; iii) fine-tuning of σ seems to be dataset-dependent, as the highest performance is achieved for different values of σ in each dataset. For fair comparison with other video summarization methods that rely on a strictly-defined set of (hyper-)parameters, in the following we refer to our model with σ = 0.1, since its gain on SumMe over the model's performance for σ = 0.3 is higher than the reduction observed on TVSum for σ = 0.1. The training curves of this model for 100 epochs of training on SumMe and TVSum are illustrated in Figs. 4 and 5 respectively. In both cases the model starts from approximately the performance of a randomly-generated summary and develops knowledge about the task (the fluctuation in the case of SumMe is reasonable due to the adversarial nature of the training), which results in a noticeable improvement of its summarization efficiency. The peak value was observed in epoch 93 for SumMe and in epoch 98 for TVSum.
Before delving into more details with respect to the conducted comparisons with the current state of the art, in Table 3 we present our findings regarding the effect of the selected criterion for training the GAN part of the architecture on the model's performance. The replacement of the MSE by the BCE loss led to a noticeable decrease in the algorithm's efficiency on SumMe, while it maintained its performance on TVSum. Therefore, it seems that the use of the MSE loss can be beneficial in the case of limited training data (for SumMe we used 20 training samples), enabling the model to converge to a state that achieves higher performance. The results for TVSum indicate that both criteria result in similar efficiency on larger sets of training samples (for TVSum we used 40 training samples), which allow the GAN to be updated with similar effectiveness over the training epochs in both cases.
Our model was compared against the performance of a randomly generated summary and of other state-of-the-art unsupervised approaches on SumMe and TVSum. The original SUM-GAN method is not listed in Table 4 as it follows a different evaluation protocol, and the comparison with it is reported in the sequel (see Tables 6 and 7). The data reported in Table 4 show that: i) the performance of a few SoA methods is comparable to (or even worse than) that of a random summary generator; ii) the best method on SumMe (UnpairedVSN) performs slightly better than our method, while it is clearly less competitive on TVSum; iii) the best algorithm on TVSum (Tessellation) achieves random-level performance on SumMe, a fact that indicates it is a dataset-tailored technique. Contrary to the above, our approach performs consistently well in both datasets, thus being the most competitive one among the compared techniques.
Furthermore, the efficiency of our unsupervised method was compared against the performance of supervised approaches for video summarization (a comparison that is rather unfair for the proposed unsupervised model). The data presented in Table 5 show that: i) the two best methods on TVSum (MAVS and Tessellation_sup respectively) are highly adapted to this dataset, as they exhibit random-level performance on SumMe; ii) only a few supervised methods clearly surpass the performance of a randomly-generated summary on both datasets, with VASNet being the best

Table 4: Comparison (F-Score (%)) with different unsupervised video summarization approaches on SumMe and TVSum, taking into consideration all human-generated summaries for each video. +/− indicate better/worse performance compared to SUM-GAN-sl. The scores for each method are from the corresponding paper.

Ablation Study
To see how each introduced change influences the performance of the proposed model, we conducted an ablation study. The variations taken into consideration, as well as their performance on SumMe and TVSum, are reported in Table 8. From these values it seems that: i) replacing the incremental training of the architecture with the sequential one described in [16] leads to a significant performance reduction on SumMe and a slight decrease on TVSum (see Var. 3); ii) a similar effect is observed with respect to the linear compression layer (see Var. 2), as its removal results in slightly lower performance (compared to Var. 3) on both datasets; iii) the addition of the linear compression layer and the application of the incremental training for the model's components (see Var. 1) led to a clear performance improvement on SumMe (more than 2%) and a slight improvement on TVSum (reaching 0.5% compared to Var. 2); iv) the introduction of the stepwise, label-based training strategy for the GAN module of the architecture further advanced the model's performance on SumMe (by 0.8%) and maintained the same performance on TVSum. These observations indicate that the incremental training approach is beneficial in the case of small training datasets, while its contribution is less pronounced in the case of larger datasets. Similarly, the addition of a linear layer that significantly reduces the number of trained parameters advances the model's training capacity in the case of small training sets (as for SumMe), while a lower impact is observed in the case of larger training sets (as for TVSum). A possible justification for these findings is that the number of training samples in the case of TVSum is adequate for learning a larger set of parameters even in a 1-step training.
The application of the stepwise, label-based learning approach enables the adversarial part of the model to converge to a better state through a more fine-grained update of the discriminator's gradients and the use of a more strictly defined learning task for the generator. This strategy seems to be advantageous in the case of small training sets, while it maintains the same levels of (state-of-the-art) performance when larger groups of training samples are used. To sum up, the applied changes contributed to a significant improvement in the performance of the original SUM-GAN model, and the introduced GAN-training approach allowed the model to reach higher levels of performance on SumMe, making it comparable with the best-performing unsupervised method.
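This section does not spell out the exact update rule, so the following is only a minimal sketch of one plausible reading of a stepwise, label-based update, under assumptions that are ours and not confirmed by the text: a toy one-dimensional logistic discriminator (parameters `w` and `b`), an MSE criterion, and hypothetical helper names. The discriminator gradients are computed in separate steps for original features (label 1) and reconstructed features (label 0), accumulated, and applied in a single update; the generator's task is expressed as making its reconstructions be labeled as original (label 1).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grads(w, b, xs, label):
    """Gradients of mean((D(x) - label)^2) w.r.t. w and b, with D(x) = sigmoid(w*x + b)."""
    gw = gb = 0.0
    for x in xs:
        p = sigmoid(w * x + b)
        common = 2.0 * (p - label) * p * (1.0 - p)  # chain rule through the sigmoid
        gw += common * x
        gb += common
    return gw / len(xs), gb / len(xs)

def stepwise_discriminator_update(w, b, original, reconstructed, lr=0.5):
    # Step 1 (assumed): gradients from original samples, target label 1.
    gw_o, gb_o = mse_grads(w, b, original, 1.0)
    # Step 2 (assumed): gradients from reconstructed samples, target label 0.
    gw_r, gb_r = mse_grads(w, b, reconstructed, 0.0)
    # Step 3 (assumed): a single parameter update with the accumulated gradients.
    return w - lr * (gw_o + gw_r), b - lr * (gb_o + gb_r)

def generator_label_loss(w, b, reconstructed):
    """Generator's label-based objective: push D(reconstructed) towards label 1."""
    return sum((sigmoid(w * x + b) - 1.0) ** 2 for x in reconstructed) / len(reconstructed)

# Toy scalar "features": originals are positive, reconstructions negative.
w, b = stepwise_discriminator_update(0.0, 0.0, [1.0, 2.0], [-1.0, -2.0])
print(w, b, generator_label_loss(w, b, [-1.0, -2.0]))
```

In a real model the same three steps would be applied to the parameters of a neural discriminator through automatic differentiation; the scalar version above only illustrates the order of operations and the role of the labels.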

CONCLUSIONS AND NEXT STEPS
This paper reported our study on assessing and advancing the effectiveness of an unsupervised video summarization method that is based on adversarial learning. Focusing on the SUM-GAN model, and after assessing the performance of a variation of it, we proposed a new training approach to advance the learning efficiency of the adversarial module of the architecture. A thorough study of the evaluation protocols and metrics, together with experiments on two datasets, allowed us to estimate the possible performance on these datasets and the suitability of the used metrics. Comparative evaluations showed that our model performs consistently well on both datasets and is among the best unsupervised methods, while its performance makes it comparable with supervised algorithms too. An ablation study confirmed the contribution of each applied change and the gain offered by the proposed stepwise, label-based adversarial training strategy. In the future we plan to further improve our model, e.g. by exploiting the efficiency of attention networks and the training capacity of reinforcement learning approaches, and we will investigate approaches for video summarization that are tailored to a specific target audience and distribution channel.

ACKNOWLEDGMENTS
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV.