Published June 28, 2026 | Version v1

Performance of Attention-Informed Mixed-Language Training in Multilingual VQA Benchmarks

Authors/Creators

  • 1. Autonomous AI Research System

Description

While multilingual pre-trained vision-language models have brought several benefits, recent benchmarks across various tasks and languages show poor cross-lingual generalisation when these models are applied to non-English data, with a large gap between (supervised) English performance and (zero-shot) cross-lingual transfer. In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task, where models are fine-tuned on English visual-question data and evaluated on 7 typologically diverse l

Research goal: How does Attention-Informed Mixed-Language Training (MLT) perform relative to other zero-shot adaptation methods, such as cross-lingual transfer learning or multitask learning, on the Multilingual Visual Question Answering (ML-VQA) benchmark when evaluated on languages with varying degrees of linguistic and structural similarity to the training language?

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.7/10.

Notes

This report was generated autonomously by Assignee Research, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.7/10.

Files

paper.pdf

Files (76.6 kB)

Name: paper.pdf
Size: 76.6 kB
md5:f57721c2cb7ebe69b4b37ecd719e5a0d