Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
Abstract
The advent of large language models (LLMs) has revolutionized Natural Language Processing (NLP), particularly question answering (QA) across domains such as biomedicine and healthcare. The MedHopQA track at BioCreative IX calls on participants to develop QA systems capable of multi‑hop reasoning for complex biomedical queries. With the growing use of LLMs in these fields, it is crucial to evaluate their reasoning abilities: we must understand what knowledge LLMs possess and what techniques can improve their responses on complex, multi‑hop questions rather than on simple, single‑hop questions. To increase the diversity of question types, we developed a dataset of 1,000 question–answer pairs based on publicly available Wikipedia data, focused on diseases, genes, and chemicals and primarily related to rare diseases, enabling evaluation of system performance across a wide range of biomedical QA tasks. Each question is curated by referencing two interconnected Wikipedia pages, requiring models to integrate knowledge across sources to generate accurate answers.
Participants in the MedHopQA track received each question alongside its identifier and were asked to provide: (1) a short answer of no more than three words, and (2) an optional long answer detailing the system’s reasoning. We received 48 submissions from 13 teams worldwide for the official test phase, and a further 19 submissions in the unofficial round. Each submission underwent evaluation using Exact Match (EM), which measures strict string‑to‑string correspondence of the short answer, and the MedCPT score, which assesses retrieval of the correct biomedical concepts regardless of exact wording. These metrics together allow us to capture both precise answer accuracy and deeper conceptual understanding. The highest performance achieved was an 89.30% F1 score on the MedCPT metric and an 87.30% Exact Match. This demonstrates that system‑level enhancements can boost LLM performance well beyond the zero‑shot baseline of 67.40% MedCPT and 60.20% EM. The MedHopQA track dataset and other challenge materials are available at https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa and https://www.codabench.org/competitions/7609/.
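To make the Exact Match criterion concrete, the following is a minimal sketch of a strict short‑answer comparison in Python. It is not the official MedHopQA scorer; the specific normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) are assumptions modeled on common QA evaluation practice.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace.

    These normalization choices are illustrative assumptions, not the
    track's official preprocessing.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Strict string-to-string comparison of the normalized short answers."""
    return normalize(prediction) == normalize(gold)


print(exact_match("The BRCA1 gene", "BRCA1 gene"))   # True
print(exact_match("chromosome 17", "chromosome 13"))  # False
```

Unlike this strict check, the MedCPT score credits answers that name the correct biomedical concept in different surface wording.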
This challenge demonstrated that substantial improvements in LLM short‑answer accuracy require the integration of techniques such as Retrieval‑Augmented Generation (RAG) to inject additional knowledge. We look forward to further advancements in question‑answering systems, especially for complex, multi‑hop queries, and invite the community to utilize the 1,000 QA pairs developed for the MedHopQA shared task at BioCreative IX.
This article is part of the Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI).
Files
BC9_paper08.pdf (931.6 kB, md5:62835c87de5aa819bfb2b7d02bf1ac11)