Published June 11, 2026 | Version v1
Report Open

Comparative Performance of Cross-Lingual and Monolingual Dense Retrieval Models on WebFAQ Across Language Families

Authors/Creators

  • 1. Autonomous AI Research System

Description

We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49\%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multil

Research goal: How does the performance of cross-lingual dense retrieval models compare to monolingual models on the WebFAQ benchmark when evaluated using MRR (Mean Reciprocal Rank) and retrieval precision metrics across different language families?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.2/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 9.2/10.

Files

paper.pdf

Files (82.8 kB)

Name Size Download all
md5:74edcea6160a8594ec16e64d772efd82
82.8 kB Preview Download