Published April 21, 2026 | Version 1.0.0 | Preprint | Open
BM25 and Dense Retrieval Are Complementary for Portuguese Clinical Text: An Empirical Study of Hybrid RAG Across 500 Clinical Queries
Description
How should retrieval-augmented generation systems be configured for clinical decision support in Portuguese? We evaluate 500 clinical queries across 6 medical specialties comparing BM25, dense, and hybrid retrieval. Four findings: (1) BM25 and hybrid retrieval surface statistically distinct document sets (McNemar p<0.001), confirming complementarity; (2) dense-only retrieval fails for 22.2% of queries; (3) authority-weighted scoring affects ranking but not recall; (4) inter-annotator agreement reaches kappa=0.954, validating LLM-as-judge for Portuguese clinical text. Deterministic citation verification eliminates hallucinations entirely (461/500 vs 1/500, Fisher p<0.001).
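The complementarity of BM25 and dense retrieval is typically exploited by fusing the two ranked lists; the references below cite Cormack et al. (2009) on reciprocal rank fusion (RRF), so a minimal RRF sketch in TypeScript (the repository's language) illustrates the idea. The document ids, the constant k = 60, and the function name `rrfFuse` are illustrative assumptions, not details taken from the pipeline.

```typescript
// Reciprocal rank fusion (Cormack et al., 2009):
// score(d) = sum over rankers r of 1 / (k + rank_r(d)), with 1-based ranks.
function rrfFuse(rankings: string[][], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      // i is 0-based, so the 1-based rank contribution is 1 / (k + i + 1).
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1));
    });
  }
  // Highest fused score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical example: the two retrievers partially disagree, and RRF
// promotes "note-7", which sits near the top of both lists.
const bm25 = ["guideline-42", "note-7", "protocol-3"];
const dense = ["note-7", "case-19", "guideline-42"];
const fused = rrfFuse([bm25, dense]);
console.log(fused.map(([doc]) => doc)); // "note-7" ranks first
```

RRF needs no score calibration between the lexical and dense scorers, which is why it is a common default for hybrid retrieval.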
Files
- Paper_BM25_Dense_Complementary_Portuguese_Clinical.pdf (156.3 kB)
  - md5:42b3cb38c8b9fb18408ed68e6034b7fc
Additional details
Related works
- Is supplement to
  - Dataset: https://huggingface.co/datasets/igor-eduardo-research/mirage-pt
- Is supplemented by
  - Software: https://github.com/nomad-link-id/hybrid-rag-pipeline
  - Software: https://github.com/nomad-link-id/citation-guard
  - Software: https://github.com/nomad-link-id/regulated-ai-audit
Dates
- Created: 2026-04-24 (initial preprint version)
Software
- Repository URL
- https://github.com/nomad-link-id
- Programming language
- TypeScript
- Development Status
- Active
References
- Singhal, K. et al. (2023). Large language models encode clinical knowledge. Nature, 620:172-180.
- Nori, H. et al. (2023). Capabilities of GPT-4 on medical challenge problems. arXiv:2303.13375.
- Cormack, G., Clarke, C., Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. SIGIR, 758-759.
- Gao, T. et al. (2023). Enabling large language models to generate text with citations. EMNLP 2023.
- Ma, X. et al. (2023). Zero-shot listwise document reranking with a large language model. arXiv:2305.02156.
- Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
- Zakka, C. et al. (2024). Almanac: Retrieval-augmented language models for clinical medicine. NEJM AI, 1(2).
- Xiong, G. et al. (2024). Benchmarking retrieval-augmented generation for medicine. ACL Findings 2024.
- Guyatt, G. et al. (2008). GRADE: An emerging consensus on rating quality of evidence. BMJ, 336:924-926.
- Landis, J.R., Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174.
- Ji, Z. et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1-38.
- Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 33:9459-9474.
- Min, S. et al. (2023). FActScore: Fine-grained atomic evaluation of factual precision. EMNLP 2023.
- Karpukhin, V. et al. (2020). Dense passage retrieval for open-domain question answering. EMNLP, 6769-6781.
- Thakur, N. et al. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS Datasets and Benchmarks.
- McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions. Psychometrika, 12(2):153-157.
- Chen, J. et al. (2024). Hybrid retrieval with reciprocal rank fusion for domain-specific QA. ACL 2024.
- Thomas, P. et al. (2024). Large language models can accurately predict searcher preferences. SIGIR 2024.