Zero-shot Transferability of EVA-CLIP Compared to Other CLIP Variants on Multi-modal Benchmarks

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20644599

Published June 11, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language t

Research goal: How does the zero-shot transferability of EVA-CLIP compare to other CLIP variants (e.g., ALIGN, OpenCLIP) when evaluated on multi-modal benchmarks like LAION-Aesthetics or COCO-Text?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.6/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.6/10.

Files

Name	Size
paper.pdf (md5:7de97a604359c300a035fba17cf78da3)	84.3 kB

	All versions	This version
Views	3	3
Downloads	1	1
Data volume	84.3 kB	84.3 kB
