Published June 2025 | Version v1
Conference paper (Open Access)

An LLM Framework for Long-form Video Retrieval and Audio-Visual Question Answering Using Qwen2/2.5

  • 1. Centre for Research and Technology Hellas
  • 2. Queen Mary University of London

Description

This paper presents our approach to the tasks of Known-Item Search (KIS) and Video Question Answering (Video QA), combining state-of-the-art LLMs with cross-modal video retrieval methods. For the KIS task, we use an LLM to analyze and decompose input queries into meaningful, easy-to-handle single-sentence sub-queries, and for each sub-query we retrieve the relevant video shots using a learnable cross-modal network. An aggregation module then combines the results of all sub-queries into a single ranked list of retrieved shots. For the Video QA task, after retrieving the relevant videos with the aforementioned approach, we propose a methodology suited to audio-visual question answering on long videos. Specifically, we adopt a caption-based LLM framework, which we augment with an audio processing component. To apply this efficiently to long videos, we design a keyword-based frame and audio segment selection mechanism that uses multimodal LLMs for filtering, enabling our framework to focus on the salient segments of the video. In addition, we implement an LLM-based self-feedback mechanism that checks whether the candidate responses answer the original question, making our Video QA approach more robust to imperfect retrieval results.
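As a rough illustration of the KIS pipeline described above (decompose the query into sub-queries, retrieve shots per sub-query, aggregate into one ranked list), the sketch below uses a trivial rule-based decomposition stub and mean-score fusion. Both are our own assumptions for illustration only; the paper's actual LLM-based decomposition and learnable cross-modal retrieval network are not reproduced here, and the shot similarity scores are synthetic.

```python
from collections import defaultdict

def decompose_query(query: str) -> list[str]:
    # Hypothetical stand-in for the paper's LLM-based query
    # decomposition: split a compound query on "and" into
    # single-sentence sub-queries.
    return [s.strip() for s in query.split(" and ") if s.strip()]

def retrieve(sub_query: str, shot_scores: dict) -> dict:
    # Stand-in for the learnable cross-modal network: look up
    # precomputed shot-similarity scores for this sub-query.
    return shot_scores.get(sub_query, {})

def aggregate(per_subquery: list) -> list:
    # Fuse the per-sub-query results into a single ranked list,
    # here by averaging each shot's score over all sub-queries
    # (an assumed fusion rule, not necessarily the paper's).
    totals = defaultdict(float)
    for scores in per_subquery:
        for shot, score in scores.items():
            totals[shot] += score
    n = max(len(per_subquery), 1)
    return sorted(((shot, total / n) for shot, total in totals.items()),
                  key=lambda kv: kv[1], reverse=True)

# Toy example with synthetic similarity scores per sub-query.
scores = {
    "a man opens a red door": {"shot_1": 0.9, "shot_2": 0.4},
    "he walks down a staircase": {"shot_1": 0.7, "shot_3": 0.8},
}
subs = decompose_query("a man opens a red door and he walks down a staircase")
ranked = aggregate([retrieve(q, scores) for q in subs])
print(ranked[0][0])  # the shot ranked highest across both sub-queries
```

Shots supported by several sub-queries (here `shot_1`) rise to the top of the fused list, which is the behavior the aggregation module is designed to produce.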

Files

ivise2025_cr.pdf

Size: 3.5 MB
md5: 695efd6f98ad683e1577b0f5e168ab12

Additional details

Funding

European Commission
TransMIXR - Ignite the Immersive Media Sector by Enabling New Narrative Visions 101070109
European Commission
AI4TRUST - AI-based-technologies for trustworthy solutions against disinformation 101070190