Flamingo's Zero-Shot Cross-Lingual Performance via Intermediate English Vision-Language Task Training
Description
Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset (Hudson and Manning, 2019) to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multim
Research goal: How does intermediate-task training on English vision-language datasets affect the zero-shot cross-lingual performance of multimodal models like Flamingo on the XTREME-R benchmark compared to language-only counterparts?
Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.8/10.
Files
| Name | Size | md5 |
|---|---|---|
| paper.pdf | 79.5 kB | 2cb33442ac72c94777921c920b404fa7 |