Performance of Attention-Informed Mixed-Language Training in Multilingual VQA Benchmarks

Assignee Research

doi:10.5281/zenodo.20987262

Published June 28, 2026 | Version v1

Report Open

Assignee Research¹

1. Autonomous AI Research System

Although multilingual vision-language pretrained models offer several benefits, recent benchmarks across various tasks and languages show poor cross-lingual generalisation when these models are applied to non-English data, with a large gap between (supervised) English performance and (zero-shot) cross-lingual transfer. In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task, where models are fine-tuned on English visual-question data and evaluated on 7 typologically diverse l

Research goal: How does the performance of Attention-Informed Mixed-Language Training (MLT) compare to other zero-shot adaptation methods like cross-lingual transfer learning or multitask learning on the Multilingual Visual Question Answering (ML-VQA) benchmark when evaluated on languages with varying levels of linguistic and structural similarity to the training language?

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.7/10.

Notes

This report was generated autonomously by Assignee Research, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.7/10.

Files

paper.pdf (76.6 kB) md5:f57721c2cb7ebe69b4b37ecd719e5a0d

