Published August 14, 2025 | Version v1
Conference proceeding Open

DMIS Lab at MedHopQA-2025: Ensemble Multi-Retrieval Methodologies with Reasoning Language Model Decision

  • 1. Department of Computer Science and Engineering, Korea University
  • 2. College of Medicine, Hanyang University
  • 3. AIGEN Sciences
  • 4. Department of Biosystems Science and Engineering, ETH Zurich

Description

Abstract

Robust and trustworthy biomedical question answering (QA) remains a critical challenge for large language models (LLMs), especially in complex domains such as rare diseases, where information is fragmented across multiple sources. The MedHopQA 2025 benchmark introduces 10,000 multi-step reasoning questions curated from Wikipedia, requiring systems to extract, connect, and synthesize biomedical knowledge across interlinked documents. In this work, we present a retrieval-augmented generation (RAG) and decision-making framework that integrates diverse retrieval strategies, including Query2Doc-based, Rationale-based, and Web-augmented retrieval, and employs a dedicated decision-maker model to select or directly generate the most accurate and well-reasoned answers. Our system not only leverages evidence from both Wikipedia and the web but also explicitly evaluates and compares candidate answers to ensure answer reliability and reasoning transparency. Experiments on the test set demonstrate that our ensemble approach achieves state-of-the-art performance, highlighting the importance of hybrid retrieval and robust decision-making in advancing biomedical multi-step QA.
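
The overall control flow described in the abstract (several retrieval strategies each yielding a candidate answer, followed by a decision step that either selects a candidate or generates an answer directly) might be sketched as below. This is only an illustrative sketch and not the authors' implementation: every name (`Candidate`, `collect_candidates`, `decide`) is hypothetical, and the decision step here is a simple majority vote standing in for the reasoning language model the abstract mentions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical type aliases, introduced only for this sketch.
Retriever = Callable[[str], List[str]]      # question -> evidence passages
Reader = Callable[[str, List[str]], str]    # (question, evidence) -> answer


@dataclass
class Candidate:
    strategy: str        # which retrieval strategy produced this answer
    answer: str
    evidence: List[str]


def collect_candidates(
    question: str, retrievers: Dict[str, Retriever], reader: Reader
) -> List[Candidate]:
    """Run every retrieval strategy and answer once per evidence set."""
    candidates = []
    for name, retrieve in retrievers.items():
        evidence = retrieve(question)
        candidates.append(Candidate(name, reader(question, evidence), evidence))
    return candidates


def decide(
    question: str,
    candidates: List[Candidate],
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Stand-in decision-maker (assumption: majority vote, not an LLM).

    Selects the answer that at least two strategies agree on; otherwise
    calls `fallback` to generate an answer directly, if one is given.
    """
    votes = Counter(c.answer.strip().lower() for c in candidates if c.answer.strip())
    if votes:
        top, count = votes.most_common(1)[0]
        if count > 1 or fallback is None:
            # Return the original spelling of the winning answer.
            for c in candidates:
                if c.answer.strip().lower() == top:
                    return c.answer.strip()
    return fallback(question) if fallback is not None else ""


if __name__ == "__main__":
    # Toy retrievers and reader, purely for demonstration.
    toy_retrievers = {
        "query2doc": lambda q: ["passage A"],
        "rationale": lambda q: ["passage A"],
        "web": lambda q: ["passage B"],
    }
    toy_reader = lambda q, ev: "answer-1" if ev == ["passage A"] else "answer-2"
    cands = collect_candidates("toy question", toy_retrievers, toy_reader)
    print(decide("toy question", cands))
```

In a real system the `decide` step would presumably pass the question, the candidates, and their evidence to a language model; the voting rule above only illustrates the select-or-generate branching.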


This article is part of the Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI).

Files

BC9_paper10.pdf (221.4 kB)
md5:56f6225da26f7538c67a105a8186f4fa