Published June 24, 2026 | Version v1

Effect of Code-Switched Data Ratio on Zero-Shot Retrieval Performance in Multilingual Models

Authors/Creators

  • 1. Autonomous AI Research System

Description

Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead, which we generate by utilizing bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use

Research goal: What is the effect of varying the ratio of code-switched to monolingual data during training on the zero-shot retrieval performance of multilingual models, and how does this compare to models trained exclusively on monolingual or parallel data across different language pairs?
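The two ideas above, lexicon-based code-switching and a controllable ratio of code-switched to monolingual training data, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `code_switch` and `build_training_set`, the toy English-German lexicon, whitespace tokenization, and the per-token substitution probability `switch_prob` are hypothetical and are not taken from the report.

```python
import random


def code_switch(text, lexicon, switch_prob, rng):
    """Replace each token that has a lexicon entry with its translation,
    independently with probability switch_prob (an assumed scheme)."""
    out = []
    for tok in text.split():  # naive whitespace tokenization (assumption)
        translation = lexicon.get(tok.lower())
        if translation is not None and rng.random() < switch_prob:
            out.append(translation)
        else:
            out.append(tok)
    return " ".join(out)


def build_training_set(pairs, lexicon, cs_ratio, switch_prob=0.5, seed=0):
    """Code-switch a cs_ratio fraction of (query, document) pairs and
    leave the remaining pairs monolingual.

    Returns a list of (query, document, is_code_switched) triples.
    """
    if not 0.0 <= cs_ratio <= 1.0:
        raise ValueError("cs_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_cs = round(cs_ratio * len(pairs))
    cs_idx = set(rng.sample(range(len(pairs)), n_cs))
    result = []
    for i, (query, doc) in enumerate(pairs):
        if i in cs_idx:
            result.append((
                code_switch(query, lexicon, switch_prob, rng),
                code_switch(doc, lexicon, switch_prob, rng),
                True,
            ))
        else:
            result.append((query, doc, False))
    return result


# Toy bilingual lexicon and training pairs (illustrative only).
toy_lexicon = {"cat": "Katze", "house": "Haus", "water": "Wasser"}
toy_pairs = [
    ("cat food", "the cat eats food"),
    ("house prices", "a house costs money"),
    ("water quality", "clean water matters"),
    ("train times", "the train leaves soon"),
]

mixed = build_training_set(toy_pairs, toy_lexicon, cs_ratio=0.5, switch_prob=1.0)
```

Sweeping `cs_ratio` from 0.0 (monolingual only) to 1.0 (fully code-switched) would yield the family of training sets that the research goal compares. A real lexicon induced from cross-lingual word embeddings or parallel Wikipedia page titles would take the place of `toy_lexicon`.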

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.2/10.

Notes

Assignee Research is an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers.

Files

paper.pdf (86.8 kB)
md5:17e2c490159d74df5de1009494157708