SM-MrHiSum and SM-VideoXum Datasets for Script-driven Multimodal Video Summarization
Description
SM-MrHiSum and SM-VideoXum are two large-scale datasets for training and evaluating methods for script-driven multimodal video summarization.
The original MrHiSum dataset (Sul et al., 2024) was constructed from a curated subset of YouTube-8M videos, with highlight annotations derived from YouTube’s “Most Replayed” statistics. These replay statistics, aggregated from at least 50 unique viewers per video, serve as a reliable indicator of audience engagement. Each video was annotated at the frame level with importance scores that represent highlight intensity. Ground-truth video summaries were generated from a predefined temporal segmentation of each video by solving the Knapsack problem under a given time budget on the summary duration, so that the resulting summaries are concise while covering the key highlights. In total, the dataset contains 31,892 videos and the associated ground-truth annotations, supporting the training and evaluation of methods for video highlight detection and summarization.
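The Knapsack-based selection described above can be illustrated with a minimal sketch. It assumes that each temporal segment has already been given a single importance score and an integer length in frames, and that the time budget is expressed in frames; the function name `select_segments` and the example values are hypothetical and are not taken from the MrHiSum code.

```python
from typing import List


def select_segments(scores: List[float], lengths: List[int], budget: int) -> List[int]:
    """Pick segments maximizing total score with total length <= budget (0/1 knapsack).

    scores  : one importance score per segment (assumed precomputed)
    lengths : segment lengths in frames (positive integers)
    budget  : maximum summary length in frames
    Returns the sorted indices of the selected segments.
    """
    n = len(scores)
    # best[i][c] = best total score using the first i segments with capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip segment i-1
            if length <= c:
                candidate = best[i - 1][c - length] + score  # option 2: take it
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Backtrack through the table to recover which segments were taken.
    chosen = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


if __name__ == "__main__":
    # Hypothetical example: four segments and a budget of 7 frames.
    print(select_segments([0.9, 0.2, 0.8, 0.5], [4, 3, 5, 2], 7))
```

How per-segment scores are derived from the frame-level importance scores is left out of this sketch; the original MrHiSum paper should be consulted for that detail.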
To make MrHiSum suitable for script-driven multimodal video summarization, we extended it by producing textual descriptions of the human-annotated summaries and extracting audio transcripts, forming the SM-MrHiSum dataset. For this, the visual content of each ground-truth video summary (sampled at 1 fps) was described by Qwen3-VL-8B-Instruct, which was prompted to "describe the scenery and the main persons and activities shown in the video". Audio transcripts were extracted through a two-step pipeline: first, speech was isolated from background noise using a pretrained Silero model for voice activity detection; then, speech-to-text was performed using a pretrained Whisper model, which outputs a series of timestamped transcripts. The resulting SM-MrHiSum dataset contains 29,917 videos, where each video is associated with: a) a ground-truth summary, b) a textual description of this summary, and c) a set of timestamped audio transcripts.
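As an illustration of the 1 fps sampling mentioned above, the following sketch computes which frame indices would be kept when a video is subsampled to a target rate. The function name `sample_frame_indices` and its parameters are hypothetical; the actual sampling code used to build the dataset is not part of this description.

```python
from typing import List


def sample_frame_indices(num_frames: int, native_fps: float, target_fps: float = 1.0) -> List[int]:
    """Return indices of the frames kept when subsampling to target_fps.

    Assumes target_fps <= native_fps, so consecutive samples never map
    to the same frame index.
    """
    if num_frames <= 0:
        return []
    if native_fps <= 0 or target_fps <= 0 or target_fps > native_fps:
        raise ValueError("expected 0 < target_fps <= native_fps")

    step = native_fps / target_fps  # native frames between two kept frames
    indices = []
    k = 0
    while True:
        idx = int(round(k * step))
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices


if __name__ == "__main__":
    # A 3-second clip at 30 fps keeps one frame per second.
    print(sample_frame_indices(90, 30.0))
```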
The SM-VideoXum dataset extends the VideoXum dataset for cross-modal video summarization (and S-VideoXum), making it suitable for training and evaluating methods for script-driven multimodal video summarization. The multiple ground-truth summaries available for each VideoXum video were associated with textual descriptions of their visual content, generated by prompting Qwen3-VL-8B-Instruct to "describe the scenery and the main persons and activities shown in the video". Moreover, audio transcripts were extracted from the full-length videos following the approach described above for the videos of the SM-MrHiSum dataset. The resulting SM-VideoXum dataset contains 11,908 videos, where each video is associated with: a) 10 ground-truth summaries, b) 10 textual descriptions of its summaries (one description per summary), and c) a set of timestamped audio transcripts.
In our implementations and experiments, all the visual, textual, and transcript data of the SM-MrHiSum and SM-VideoXum datasets are represented using CLIP-based embeddings. The details of the scripts, embeddings, and all other data released as part of this repository are reported in SD-MVSum_Datasets_readme.md.
More information on the released datasets, along with technical details of the SD-MVSum script-driven multimodal video summarization method that we developed, can be found in the following preprint: https://arxiv.org/abs/2510.05652
Files
SD-MVSum_Datasets_readme.md
Total size: 21.0 GB

| MD5 checksum | Size |
|---|---|
| 56eaeb5ea32c217ac1dba360f1839b1a | 6.7 kB |
| dcf7d3babc54d5f9cd76f73ff0fa0002 | 56.3 MB |
| b200481acd62054b789a82699e67247c | 177.0 MB |
| ddd34cbd313cbc2ce6e075e79e48e0dc | 15.4 GB |
| e8fff231b15e86d4eda45bda3116031c | 93.1 MB |
| c195fb8872612f39578f4c232144da47 | 170.2 MB |
| 0432ccd898fd6f27da2b5a5d4fb9d355 | 5.1 GB |
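
The MD5 values listed above can be used to check that a download completed without corruption. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so that the multi-gigabyte archives do not need to fit in memory. The helper names `md5_of_file` and `verify_download`, and the file path in the usage example, are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 against a listed checksum.

    Accepts the checksum with or without the leading 'md5:' prefix.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python verify.py <file> <md5>
    if len(sys.argv) == 3:
        ok = verify_download(sys.argv[1], sys.argv[2])
        print("checksum OK" if ok else "checksum MISMATCH")
```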