Published May 27, 2022 | Version v1
Conference paper | Open Access

EXPLOITING CAPTION DIVERSITY FOR UNSUPERVISED VIDEO SUMMARIZATION

  • Aristotle University of Thessaloniki

Description

Most unsupervised Deep Neural Networks (DNNs) for video summarization rely on adversarial learning and autoencoding, training without utilizing any ground-truth summary. In several cases, the Convolutional Neural Network (CNN)-derived video frame representations are sequentially fed to a Long Short-Term Memory (LSTM) network, which selects key-frames and, during training, attempts to reconstruct the original/full video from the summary, while confusing an adversarially optimized Discriminator. Additionally, regularizers aiming at maximizing the summary's visual semantic diversity can be employed, such as the Determinantal Point Process (DPP) loss term. In this paper, a novel DPP-based regularizer is proposed that exploits a pretrained DNN-based image captioner in order to additionally enforce maximal key-frame diversity from the perspective of textual semantic content. Thus, the selected key-frames are encouraged to differ not only with regard to what objects they depict, but also with regard to their textual descriptions, which may additionally capture activities, scene context, etc. Empirical evaluation indicates that the proposed regularizer leads to state-of-the-art performance.
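
For illustration, below is a minimal PyTorch-style sketch of a DPP-style diversity regularizer computed over caption embeddings, in the spirit of the regularizer described above. The function name caption_dpp_loss, the tensor shapes, and the soft score-weighted kernel (used in place of an explicit selected subset) are assumptions made for this example and are not taken from the paper.

import torch

def caption_dpp_loss(caption_embeddings, frame_scores, eps=1e-6):
    # caption_embeddings: (T, d) text embeddings from a pretrained captioner,
    #                     one per video frame (hypothetical input).
    # frame_scores:       (T,) per-frame selection probabilities in [0, 1]
    #                     produced by the summarizer (hypothetical input).
    # Returns a negative DPP log-likelihood-style term; minimizing it
    # encourages key-frames whose captions are mutually dissimilar.

    # Similarity kernel from L2-normalized caption embeddings (Gram matrix, PSD).
    feats = torch.nn.functional.normalize(caption_embeddings, dim=1)
    sim = feats @ feats.t()                          # (T, T)

    # Score-weighted DPP kernel: L_ij = s_i * sim_ij * s_j.
    s = frame_scores.clamp(min=eps)
    L = s.unsqueeze(1) * sim * s.unsqueeze(0)

    T = L.size(0)
    I = torch.eye(T, device=L.device, dtype=L.dtype)

    # Standard DPP normalization uses log det(L_Y) - log det(L + I) for a hard
    # subset Y; here the soft selection is folded into L via the scores, and a
    # small jitter keeps the log-determinant well defined (an assumption of
    # this sketch, not the paper's exact formulation).
    log_p = torch.logdet(L + eps * I) - torch.logdet(L + I)
    return -log_p

# Usage sketch with random tensors standing in for real captioner outputs
# and selector scores.
if __name__ == "__main__":
    T, d = 120, 512
    captions = torch.randn(T, d)
    scores = torch.sigmoid(torch.randn(T))
    print(caption_dpp_loss(captions, scores).item())

In an actual pipeline, caption_embeddings would come from the pretrained image captioner applied to each frame and frame_scores from the LSTM-based key-frame selector; this loss would then be added to the adversarial and reconstruction objectives as a regularization term.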

Files

Kaseris_ICASSP2022_ExploitingCaptioningForUnsupervisedVideoSummarization.pdf

Additional details

Funding

AI4Media – A European Excellence Centre for Media, Society and Democracy (Grant No. 951911)
European Commission

Dates

Accepted
2022-05-27