Published April 21, 2026 | Version 1.0.0
Preprint | Open

BM25 and Dense Retrieval Are Complementary for Portuguese Clinical Text: An Empirical Study of Hybrid RAG Across 500 Clinical Queries

Description

How should retrieval-augmented generation (RAG) systems be configured for clinical decision support in Portuguese? We evaluate 500 clinical queries across six medical specialties, comparing BM25, dense, and hybrid retrieval. Four main findings: (1) BM25 and hybrid retrieval surface statistically distinct document sets (McNemar p<0.001), confirming that lexical and dense signals are complementary; (2) dense-only retrieval fails on 22.2% of queries; (3) authority-weighted scoring affects ranking but not recall; (4) inter-annotator agreement reaches kappa=0.954, validating LLM-as-judge evaluation for Portuguese clinical text. Deterministic citation verification eliminates hallucinations entirely (461/500 vs 1/500, Fisher p<0.001).
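The hybrid retrieval the abstract compares is typically built by fusing the BM25 and dense rankings with reciprocal rank fusion (RRF; Cormack et al., 2009, cited below). A minimal sketch in TypeScript, the record's stated language — function and type names here are illustrative, not taken from the paper's repository:

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
// per document; documents surfaced by both BM25 and dense retrieval
// accumulate score from both lists and rise toward the top.
type RankedList = string[]; // document IDs, best first

function reciprocalRankFusion(lists: RankedList[], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((docId, rank) => {
      // rank is 0-based here, so the best document scores 1 / (k + 1).
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Toy example: the two retrievers surface partly disjoint documents,
// and fusion promotes d1 and d3, which appear in both lists.
const bm25 = ["d1", "d2", "d3"];
const dense = ["d3", "d4", "d1"];
const fused = reciprocalRankFusion([bm25, dense]).map(([id]) => id);
console.log(fused);
```

The constant k = 60 is the conventional default from the RRF paper; larger values flatten the influence of top ranks in each list.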

Files (156.3 kB)

Paper_BM25_Dense_Complementary_Portuguese_Clinical.pdf

Additional details

Dates

Created
2026-04-24
Initial preprint version

Software

Repository URL
https://github.com/nomad-link-id
Programming language
TypeScript
Development Status
Active

References

  • Singhal, K. et al. (2023). Large language models encode clinical knowledge. Nature, 620:172-180.
  • Nori, H. et al. (2023). Capabilities of GPT-4 on medical competency examinations. arXiv:2303.13375.
  • Cormack, G., Clarke, C., Buettcher, S. (2009). Reciprocal rank fusion outperforms condorcet and individual rank learning methods. SIGIR, 758-759.
  • Gao, T. et al. (2023). Enabling large language models to generate text with citations. EMNLP 2023.
  • Ma, X. et al. (2023). Zero-shot listwise document reranking with a large language model. arXiv:2305.02156.
  • Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
  • Zakka, C. et al. (2024). Almanac: Retrieval-augmented language models for clinical medicine. NEJM AI, 1(2).
  • Xiong, G. et al. (2024). Benchmarking retrieval-augmented generation for medicine. ACL Findings 2024.
  • Guyatt, G. et al. (2008). GRADE: An emerging consensus on rating quality of evidence. BMJ, 336:924-926.
  • Landis, J.R., Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174.
  • Ji, Z. et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1-38.
  • Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 33:9459-9474.
  • Min, S. et al. (2023). FActScore: Fine-grained atomic evaluation of factual precision. EMNLP 2023.
  • Karpukhin, V. et al. (2020). Dense passage retrieval for open-domain question answering. EMNLP, 6769-6781.
  • Thakur, N. et al. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. NeurIPS Datasets and Benchmarks.
  • McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions. Psychometrika, 12(2):153-157.
  • Chen, J. et al. (2024). Hybrid retrieval with reciprocal rank fusion for domain-specific QA. ACL 2024.
  • Thomas, P. et al. (2024). Large language models can accurately predict searcher preferences. SIGIR 2024.