Scaling Behavior of Multilingual Dense Retrievers Trained on WebFAQ 2.0 Subsets and Downstream QA Performance
Description
We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multil
Research goal: What is the scaling behavior of multilingual dense retrievers when trained on varying subsets of WebFAQ 2.0's 198M QA pairs, and how does it correlate with performance gains on downstream QA benchmarks like XQuAD or MLQA?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.7/10.
Files
| Name | Size |
|---|---|
| paper.pdf (md5:67a1a703c094dc71e702fbed77eed5e4) | 84.4 kB |