Zero-shot Transferability of EVA-CLIP Compared to Other CLIP Variants on Multi-modal Benchmarks
Description
Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language t
Research goal: How does the zero-shot transferability of EVA-CLIP compare to other CLIP variants (e.g., ALIGN, OpenCLIP) when evaluated on multi-modal benchmarks like LAION-Aesthetics or COCO-Text?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.6/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:7de97a604359c300a035fba17cf78da3) | 84.3 kB |