Published October 27, 2025 | Version v1
Publication · Open

A Dataset and Metric for Textual Video Content Description

Description

Obtaining textual descriptions of the visual content of images and videos is often required in multimedia analysis and retrieval. Traditional video captioning approaches are usually evaluated on very short captions using rather simple metrics from NLP, while multimodal large language model (MLLM)-based approaches are mostly evaluated with question answering, which is query specific. We provide a dataset (FM-V2T) with 258 video clips from a media archive, annotated with detailed, manually curated descriptions in English and German (long and short). We propose an LLM-based metric that assesses the entailment and contradiction of facts extracted from a description against a reference, addressing shortcomings of existing metrics, such as sensitivity to small changes with semantic impact and comparing descriptions of substantially different lengths. We provide experimental results on the reliability of the metric and apply it to baseline results of three MLLM-based approaches on the FM-V2T dataset, comparing it with other metrics.
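The abstract does not specify the metric's scoring formula. As a rough illustration of the general idea (extract facts from a candidate description, judge each against the reference, and reward entailment while penalizing contradiction), consider the hedged sketch below. The `judge` function, the fact lists, and the scoring formula are hypothetical placeholders, not the actual FM-V2T metric; in practice the judge would be an LLM or NLI model rather than string matching.

```python
# Hypothetical sketch of an entailment-based description metric.
# The real metric uses an LLM judge; this stand-in uses simple
# string containment purely to make the scoring logic concrete.

def judge(fact: str, reference: str) -> str:
    """Return 'entailed', 'contradicted', or 'neutral' (toy stand-in)."""
    if fact.lower() in reference.lower():
        return "entailed"
    if fact.lower().startswith("no "):  # crude placeholder heuristic
        return "contradicted"
    return "neutral"

def fact_score(facts: list[str], reference: str) -> float:
    """Fraction of facts entailed by the reference, minus a penalty
    for contradicted facts (an assumed formula, not the paper's)."""
    if not facts:
        return 0.0
    labels = [judge(f, reference) for f in facts]
    entailed = labels.count("entailed")
    contradicted = labels.count("contradicted")
    return max(0.0, (entailed - contradicted) / len(facts))

facts = ["a dog runs on grass", "the sky is blue"]
reference = "A dog runs on grass under a cloudy sky."
print(fact_score(facts, reference))  # one of two facts entailed -> 0.5
```

Because the score is computed per extracted fact rather than over the surface string, it is insensitive to description length and sensitive to small wording changes that flip a fact's truth value, which is the shortcoming of n-gram metrics the abstract mentions.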

Files (684.6 kB)

ACM_MM2025_Video2Text-5.pdf
md5:75342768a59b5a5de27888c07b07e489

Additional details

Funding

European Commission: XRECO - XR mEdia eCOsystem (101070250)
Austrian Research Promotion Agency: FAIRmedia