SM-MrHiSum and SM-VideoXum Datasets for Script-driven Multimodal Video Summarization

Mylonas, Manolis; Zerva, Charalampia; Apostolidis, Evlampios; Mezaris, Vasileios

doi:10.5281/zenodo.19919955

Published 2026 | Version v2

Dataset Open


1. Centre for Research and Technology Hellas

SM-MrHiSum and SM-VideoXum are two large-scale datasets for training and evaluating methods for script-driven multimodal video summarization.

The original MrHiSum dataset (Sul et al., 2024) was constructed from a curated subset of YouTube-8M videos, with highlight annotations derived from YouTube’s “Most Replayed” statistics. These replay statistics, aggregated from at least 50 unique viewers per video, serve as a reliable indicator of audience engagement. Each video was annotated at the frame level with importance scores that represent highlight intensity. Ground-truth video summaries were generated from a predefined temporal segmentation of each video by solving the Knapsack problem under a given time budget on the summary duration, so that the resulting summaries are concise while covering the key highlights. In total, the dataset contains 31,892 videos with their associated ground-truth annotations, supporting the training and evaluation of methods for video highlight detection and summarization.
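The Knapsack-based selection step mentioned above can be illustrated with a minimal sketch. It assumes that each temporal segment already has a single importance score (for example, an aggregate of its frame-level scores) and an integer length in frames; the function name `knapsack_select`, the toy scores, and the budget are hypothetical and are not taken from the MrHiSum code.

```python
from typing import List, Sequence


def knapsack_select(scores: Sequence[float],
                    lengths: Sequence[int],
                    budget: int) -> List[int]:
    """Pick segment indices that maximize total score with total length <= budget.

    Standard 0/1 knapsack via dynamic programming (illustrative only).
    """
    n = len(scores)
    # best[i][c] = highest score achievable with the first i segments and capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]          # option 1: skip segment i-1
            if length <= c:                      # option 2: take it if it fits
                candidate = best[i - 1][c - length] + score
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Backtrack to recover which segments were taken.
    selected = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)


# Toy example: four segments, summary budget of 7 frames (hypothetical values).
segment_scores = [0.9, 0.2, 0.6, 0.8]
segment_lengths = [4, 3, 2, 5]
print(knapsack_select(segment_scores, segment_lengths, budget=7))
```

In this toy case the first and third segments are selected: together they fit within the budget and have a higher combined score than any other feasible combination.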

To make MrHiSum suitable for script-driven multimodal video summarization, we extended it by producing textual descriptions of the human-annotated summaries and extracting audio transcripts, forming the SM-MrHiSum dataset. For this, the visual content of each ground-truth video summary (sampled at 1 fps) was described by Qwen3-VL-8B-Instruct, which was prompted to "describe the scenery and the main persons and activities shown in the video". Audio transcripts were extracted through a two-step pipeline: first, speech was isolated from background noise using a pretrained Silero model for voice activity detection; then, speech-to-text was performed using a pretrained Whisper model, which outputs a series of timestamped transcripts. The resulting SM-MrHiSum dataset contains 29,917 videos, where each video is associated with: a) a ground-truth summary, b) a textual description of this summary, and c) a set of timestamped audio transcripts.

The SM-VideoXum dataset is an extension of the VideoXum dataset for cross-modal video summarization (and of S-VideoXum) that is suitable for training and evaluating methods for script-driven multimodal video summarization. The multiple ground-truth summaries available for each VideoXum video were associated with textual descriptions of their visual content, generated by prompting Qwen3-VL-8B-Instruct to "describe the scenery and the main persons and activities shown in the video". In addition, audio transcripts were extracted from the full-length videos following the approach described above for the videos of the SM-MrHiSum dataset. The resulting SM-VideoXum dataset contains 11,908 videos, where each video is associated with: a) 10 ground-truth summaries, b) 10 textual descriptions of its summaries (one description per summary), and c) a set of timestamped audio transcripts.

In our implementations and experiments, all visual, textual, and transcript data of the SM-MrHiSum and SM-VideoXum datasets are represented using CLIP-based embeddings. The scripts, embeddings, and all other data released as part of this repository are detailed in SD-MVSum_Datasets_readme.md.

More information on the released datasets, along with technical details of the SD-MVSum script-driven multimodal video summarization method that we developed, can be found in the following preprint: https://arxiv.org/abs/2510.05652

Files (21.0 GB)

Name	MD5	Size
SD-MVSum_Datasets_readme.md	56eaeb5ea32c217ac1dba360f1839b1a	6.7 kB
SM-MrHiSum-Text-Annotations.zip	dcf7d3babc54d5f9cd76f73ff0fa0002	56.3 MB
SM-MrHiSum-Trained-Model.zip	b200481acd62054b789a82699e67247c	177.0 MB
SM-MrHiSum-Training-Data.zip	ddd34cbd313cbc2ce6e075e79e48e0dc	15.4 GB
SM-VideoXum-Text-Annotations.zip	e8fff231b15e86d4eda45bda3116031c	93.1 MB
SM-VideoXum-Trained-Model.zip	c195fb8872612f39578f4c232144da47	170.2 MB
SM-VideoXum-Training-Data.zip	0432ccd898fd6f27da2b5a5d4fb9d355	5.1 GB
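Since several of the archives are multiple gigabytes in size, it may be worth checking a download against the MD5 value listed above before unpacking it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path is assumed to be wherever the archive was saved.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download("SD-MVSum_Datasets_readme.md", "56eaeb5ea32c217ac1dba360f1839b1a")` should return `True` for an intact copy of the readme, assuming the file sits in the current working directory.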

	All versions	This version
Views	225	74
Downloads	111	19
Data volume	1.2 TB	52.4 GB
