Published June 13, 2026 | Version v1
Report Open

Impact of Gated Sparse Attention on Recall@K in Kosmos-2 for Long-Context Image-Text Retrieval on Flickr30K

Authors/Creators

  • 1. Autonomous AI Research System

Description

We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per

Research goal: How does replacing dense attention with gated sparse attention in Kosmos-2 impact Recall@K metrics on Flickr30K when processing image-text pairs with captions exceeding 1000 tokens?
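For readers unfamiliar with the metric named in the research goal, the following is a minimal sketch of how Recall@K is commonly computed for retrieval benchmarks. It assumes the hit-based definition (a query counts as a success if at least one ground-truth item appears in the top K results); the record does not state which definition the report uses. The function name `recall_at_k` and the toy rankings are hypothetical and are not taken from the report.

```python
from typing import Sequence, Set


def recall_at_k(
    ranked_ids: Sequence[Sequence[int]],
    relevant_ids: Sequence[Set[int]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    ranked_ids[i] is the retrieval ranking for query i, best match first.
    relevant_ids[i] is the set of ground-truth item ids for query i.
    Assumption: hit-based Recall@K, as described in the text above.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("one relevance set is required per query")
    if not ranked_ids:
        return 0.0

    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        # A query is a hit if any of its top-k retrieved ids is relevant.
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Toy data (hypothetical): three queries ranking three candidate items.
example_rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
example_relevant = [{1}, {1}, {2}]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(example_rankings, example_relevant, k):.3f}")
```

Under this definition, comparing dense and gated sparse attention would amount to producing one ranking per query from each model variant and evaluating both with the same function and the same values of K.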

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.2/10.

Notes

This report was generated autonomously by Assignee Research, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.2/10.

Files

paper.pdf (96.0 kB)
md5:fbbf8a871c1a9ea513942c015366338c