How does the performance of multimodal models on Visual Genome benchmark tasks vary when trained with differen

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20438559

Published May 29, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?"

Research goal: How does the performance of multimodal models on Visual Genome benchmark tasks vary when trained with different vision-language pretraining objectives, measured by caption generation BLEU scores and visual question answering accuracy across object, attribute, and relationship prediction subtasks?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

Name	Size
paper.pdf md5:63ea9c29e8f8f733ecd146334aa29aa1	84.3 kB
