EXPLOITING CAPTION DIVERSITY FOR UNSUPERVISED VIDEO SUMMARIZATION
Description
Most unsupervised Deep Neural Networks (DNNs) for video summarization rely on adversarial learning and autoencoding, training without any ground-truth summaries. In several cases, the Convolutional Neural Network (CNN)-derived video frame representations are sequentially fed to a Long Short-Term Memory (LSTM) network, which selects key-frames and, during training, attempts to reconstruct the original/full video from the summary while confusing an adversarially optimized Discriminator. Additionally, regularizers aiming to maximize the summary's visual semantic diversity can be employed, such as a Determinantal Point Process (DPP) loss term. In this paper, a novel DPP-based regularizer is proposed that exploits a pretrained DNN-based image captioner in order to additionally enforce maximal key-frame diversity from the perspective of textual semantic content. Thus, the selected key-frames are encouraged to differ not only with regard to the objects they depict, but also with regard to their textual descriptions, which may additionally capture activities, scene context, etc. Empirical evaluation indicates that the proposed regularizer leads to state-of-the-art performance.
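To make the idea of a caption-based DPP regularizer more concrete, the following is a minimal PyTorch sketch of one way such a diversity term could be computed; the `dpp_diversity_loss` name, the cosine-similarity kernel over pooled caption embeddings, and the quality-weighting by frame-selection scores are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpp_diversity_loss(caption_emb: torch.Tensor,
                       frame_scores: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Sketch of a DPP-style diversity regularizer over caption embeddings.

    caption_emb:  (N, D) embeddings of the frames' textual descriptions,
                  e.g. pooled outputs of a pretrained image captioner.
    frame_scores: (N,) frame-selection probabilities in [0, 1] produced by
                  the summarizer (e.g. the LSTM key-frame selector).
    """
    # Cosine-similarity kernel between caption embeddings.
    emb = F.normalize(caption_emb, dim=-1)
    sim = emb @ emb.t()                              # (N, N)

    # Quality-weighted DPP kernel: L_ij = s_i * sim_ij * s_j.
    s = frame_scores.clamp(min=eps)
    kernel = s.unsqueeze(1) * sim * s.unsqueeze(0)

    # Maximizing log det(L + I) rewards confidently selected frames whose
    # captions are mutually dissimilar, so the loss is its negative.
    eye = torch.eye(kernel.size(0), device=kernel.device)
    return -torch.logdet(kernel + eye)
```

In this sketch, the regularizer would simply be added to the summarizer's reconstruction and adversarial losses during training, encouraging key-frames whose captions (and hence described activities and scene context) differ from one another.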
Files
- Kaseris_ICASSP2022_ExploitingCaptioningForUnsupervisedVideoSummarization.pdf (4.8 MB, md5:58176dd6d11a8a7f53a671bd5b6a78b9)
Additional details
Dates
- Accepted: 2022-05-27