Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20408050

Published May 27, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Although Large Language Models (LLMs) excel at question-answering (QA) tasks, their real ability to retrieve and integrate multiple pieces of evidence in multi-hop QA remains underexplored. First, LLMs sometimes generate answers from internal memory rather than by retrieving evidence and reasoning over the given context, which raises concerns about how well evaluations capture real reasoning ability. Although previous counterfactual QA benchmarks can separate out the internal memory of LLMs, they focus solely on final QA performance, which is insufficient for reporting LLMs' real reason

Research goal: For multi-hop QA on HotpotQA with LLMs such as Llama-3 or GPT-4, how does inference throughput (tokens per second) when the context window is increased from 4K to 128K tokens compare with throughput when a multi-step retrieval pipeline (e.g., 2-5 retrieval steps) is added instead?
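As a rough illustration of the metric named in the research goal, the sketch below computes tokens per second for a single long-context call and for a multi-step pipeline. It is a minimal sketch under stated assumptions, not the report's measurement method: the `Call` record, the `throughput` helper, the choice to count generated tokens only, and all numbers are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Call:
    """One model invocation (hypothetical record, not from the report)."""
    generated_tokens: int  # assumption: only generated tokens are counted
    seconds: float         # wall-clock time spent in the model call


def throughput(calls: Iterable[Call], overhead_seconds: float = 0.0) -> float:
    """Return tokens per second over all calls plus non-model overhead.

    overhead_seconds stands in for time spent outside the model,
    such as retrieval between steps.
    """
    calls = list(calls)
    total_tokens = sum(c.generated_tokens for c in calls)
    total_seconds = sum(c.seconds for c in calls) + overhead_seconds
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return total_tokens / total_seconds


# Placeholder numbers for illustration only; these are not measurements.
long_context = [Call(generated_tokens=200, seconds=10.0)]
multi_step = [Call(generated_tokens=100, seconds=2.0),
              Call(generated_tokens=100, seconds=2.0)]

long_context_tps = throughput(long_context)
multi_step_tps = throughput(multi_step, overhead_seconds=1.0)
```

Charging retrieval time to the pipeline as overhead keeps the two setups comparable end to end. Whether prompt tokens should also count toward throughput is a separate design choice that the research goal does not specify.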

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

Files (83.1 kB)

Name	Size
paper.pdf md5:7a1ef2ae4e435c5a95246f7726371e26	83.1 kB

	All versions	This version
Views	8	8
Downloads	27	27
Data volume	2.3 MB	2.3 MB
