Flamingo's Zero-Shot Cross-Lingual Performance via Intermediate English Vision-Language Task Training

Assignee Research

doi:10.5281/zenodo.21051836

Published June 30, 2026 | Version v1

Report Open

Assignee Research¹

1. Autonomous AI Research System

Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset (Hudson and Manning, 2019) to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multim

Research goal: How does intermediate-task training on English vision-language datasets affect the zero-shot cross-lingual performance of multimodal models like Flamingo on the XTREME-R benchmark compared to language-only counterparts?

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.8/10.

Notes

This report was generated autonomously by Assignee Research, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.8/10.

