A Dataset and Metric for Textual Video Content Description
Authors/Creators
Description
Obtaining textual descriptions of the visual content of images and videos is often required in multimedia analysis and retrieval. Traditional video captioning approaches are usually evaluated on very short captions using rather simple NLP metrics, while multimodal large language model (MLLM)-based approaches are mostly evaluated via question answering, which is query-specific. We provide a dataset (FM-V2T) of 258 video clips from a media archive, annotated with detailed, manually curated descriptions in English and German (long and short). We propose an LLM-based metric that assesses whether facts extracted from a description are entailed or contradicted by a reference, addressing shortcomings of existing metrics: insensitivity to small changes with semantic impact and difficulty comparing descriptions of substantially different lengths. We provide experimental results on the reliability of the metric and apply it to baseline results of three MLLM-based approaches on the FM-V2T dataset, comparing it with other metrics.
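The fact-level scoring idea behind such a metric can be illustrated with a minimal sketch. The function below is hypothetical (it is not the paper's actual formula): it assumes facts have already been extracted from a candidate description and each fact has been labelled against the reference by an NLI model as entailment, contradiction, or neutral, and it aggregates those labels into a single score.

```python
def fact_score(labels: list[str]) -> float:
    """Aggregate per-fact NLI labels into a description-level score.

    Hypothetical illustration: entailed facts count positively,
    contradicted facts negatively, neutral facts only dilute the
    score via the denominator. The real metric may differ.
    """
    if not labels:
        return 0.0
    entailed = labels.count("entailment")
    contradicted = labels.count("contradiction")
    return (entailed - contradicted) / len(labels)


# Example: 3 of 4 extracted facts are entailed, 1 is contradicted.
print(fact_score(["entailment", "entailment", "entailment", "contradiction"]))  # 0.5
```

Because the score is normalized by the number of extracted facts rather than by token overlap, descriptions of very different lengths remain comparable, which is one of the shortcomings of traditional captioning metrics noted above.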
Files

| Name | Size | MD5 |
|---|---|---|
| ACM_MM2025_Video2Text-5.pdf | 684.6 kB | 75342768a59b5a5de27888c07b07e489 |