Impact of Temporal Alignment Noise in Synthetic Video-Text Pairs on Zero-Shot Video Moment Retrieval Accuracy
Description
Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval. The primary approaches involve transferring text-video pairs to a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modal
Research goal: How does temporal alignment noise in synthetic video-text pairs affect zero-shot Video Moment Retrieval accuracy on the Charades-STA benchmark?
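As a rough illustration of the research goal, the sketch below shows one way temporal alignment noise could be simulated and its effect measured. It assumes that noise is modeled as independent Gaussian jitter on the start and end timestamps of a segment, and that accuracy is scored as recall at a temporal IoU threshold (such as 0.5 or 0.7). The function names `temporal_iou`, `jitter_segment`, and `recall_at_iou` are hypothetical and are not taken from the report.

```python
import random


def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def jitter_segment(seg, sigma, duration, rng):
    """Perturb a segment's boundaries with Gaussian noise (assumed noise model).

    sigma is the noise standard deviation in seconds; the result is
    re-ordered and clipped to the range [0, duration].
    """
    start = seg[0] + rng.gauss(0.0, sigma)
    end = seg[1] + rng.gauss(0.0, sigma)
    start, end = sorted((start, end))
    start = min(max(start, 0.0), duration)
    end = min(max(end, 0.0), duration)
    return (start, end)


def recall_at_iou(preds, gts, threshold):
    """Fraction of top-1 predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy ground-truth segments in a 30-second video (illustrative values only).
    gts = [(2.0, 8.0), (10.0, 18.0), (20.0, 27.0)]
    for sigma in (0.0, 1.0, 3.0):
        noisy = [jitter_segment(g, sigma, 30.0, rng) for g in gts]
        print(sigma, recall_at_iou(noisy, gts, 0.5))
```

Here the jittered segments stand in for noisy synthetic annotations. In an actual study the noise would be applied to the training pairs and recall measured on model predictions, so this shows only the noise model and the metric, not the full pipeline.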
Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.8/10.
Notes
Files
paper.pdf (77.7 kB), md5:fea11cb59a27a5aee7b2953d00ca6ab8