How does the performance of VideoRAG compare to temporal video question answering models on long-form video un
Description
We present HERO, a novel framework for large-scale video+language omnirepresentation learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. In addition to standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), wher
Research goal: How does the performance of VideoRAG compare to temporal video question answering models on long-form video understanding tasks when evaluated with METEOR scores across 10x context scaling from 32K to 128K tokens?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.8/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:49c77df50766692e3ed93670adc7f4d9) | 85.1 kB |