Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriev

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20416269

Published May 27, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited: most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Awar

Research goal: How does retrieval latency scale with the number of hops in multi-hop RAG queries when comparing a dense retriever (e.g., DPR) with a sparse retriever (e.g., BM25) on HotPotQA and MuSiQue under adversarial context perturbations?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.7/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.7/10.

Files

Files (84.8 kB)

Name	Checksum	Size
paper.pdf	md5:3c49ef34dd932e9c2749c470686fb598	84.8 kB

