Published April 21, 2026 | Version 1.0.0
Preprint | Open

BM25 and Dense Retrieval Are Complementary for Portuguese Clinical Text: An Empirical Study of Hybrid RAG Across 500 Clinical Queries

Description

How should retrieval-augmented generation (RAG) systems be configured for clinical decision support in Portuguese? We evaluate 500 clinical queries across six medical specialties, comparing BM25, dense, and hybrid retrieval. Four main findings: (1) BM25 and hybrid retrieval surface statistically distinct document sets (McNemar p<0.001), confirming that lexical and dense signals are complementary; (2) dense-only retrieval fails on 22.2% of queries; (3) authority-weighted scoring affects ranking but not recall; (4) inter-annotator agreement reaches kappa=0.954, validating LLM-as-judge evaluation for Portuguese clinical text. Deterministic citation verification eliminates hallucinations entirely (461/500 vs 1/500, Fisher p<0.001).
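The hybrid retrieval the abstract compares is typically built by fusing the BM25 and dense rankings with reciprocal rank fusion (RRF; Cormack et al., 2009, cited below). A minimal sketch in TypeScript, the record's stated language — function and type names here are illustrative, not taken from the paper's repository:

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
// per document; documents surfaced by both BM25 and dense retrieval
// accumulate score from both lists and rise toward the top.
type RankedList = string[]; // document IDs, best first

function reciprocalRankFusion(lists: RankedList[], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((docId, rank) => {
      // rank is 0-based here, so the best document scores 1 / (k + 1).
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Toy example: the two retrievers surface partly disjoint documents,
// and fusion promotes d1 and d3, which appear in both lists.
const bm25 = ["d1", "d2", "d3"];
const dense = ["d3", "d4", "d1"];
const fused = reciprocalRankFusion([bm25, dense]).map(([id]) => id);
console.log(fused);
```

The constant k = 60 is the conventional default from the RRF paper; larger values flatten the influence of top ranks in each list.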

Files (156.3 kB)

Paper_BM25_Dense_Complementary_Portuguese_Clinical.pdf

Additional details

Dates

Created
2026-04-24
Initial preprint version

Software

Repository URL
https://github.com/nomad-link-id
Programming language
TypeScript
Development Status
Active

References

  • Singhal, K. et al. (2023). Large language models encode clinical knowledge. Nature, 620:172-180.
  • Nori, H. et al. (2023). Capabilities of GPT-4 on medical competency examinations. arXiv:2303.13375.
  • Cormack, G., Clarke, C., Buettcher, S. (2009). Reciprocal rank fusion outperforms condorcet and individual rank learning methods. SIGIR, 758-759.
  • Gao, T. et al. (2023). Enabling large language models to generate text with citations. EMNLP 2023.
  • Ma, X. et al. (2023). Zero-shot listwise document reranking with a large language model. arXiv:2305.02156.
  • Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
  • Zakka, C. et al. (2024). Almanac: Retrieval-augmented language models for clinical medicine. NEJM AI, 1(2).
  • Xiong, G. et al. (2024). Benchmarking retrieval-augmented generation for medicine. ACL Findings 2024.
  • Guyatt, G. et al. (2008). GRADE: An emerging consensus on rating quality of evidence. BMJ, 336:924-926.
  • Landis, J.R., Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174.
  • Ji, Z. et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1-38.
  • Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 33:9459-9474.
  • Min, S. et al. (2023). FActScore: Fine-grained atomic evaluation of factual precision. EMNLP 2023.
  • Karpukhin, V. et al. (2020). Dense passage retrieval for open-domain question answering. EMNLP, 6769-6781.
  • Thakur, N. et al. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS Datasets and Benchmarks.
  • McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions. Psychometrika, 12(2):153-157.
  • Chen, J. et al. (2024). Hybrid retrieval with reciprocal rank fusion for domain-specific QA. ACL 2024.
  • Thomas, P. et al. (2024). Large language models can accurately predict searcher preferences. SIGIR 2024.