WebFAQ Data Integration and Dense Retrieval Robustness in Multilingual Domain Shift

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20639007

Published June 11, 2026 | Version v1

Report | Open access

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multil

Research goal: How does the inclusion of WebFAQ's diverse FAQ-style data affect the robustness of dense retrieval models against domain shift when evaluated on the MLDR benchmark across 75 languages?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.1/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.1/10.

Files

paper.pdf (90.5 kB), md5:cf8a5a3b5d0f28a96a20f4d92f95b52f
