Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark
Description
Although Large Language Models (LLMs) excel at question-answering (QA) tasks, their actual ability to retrieve and integrate multiple pieces of evidence in multi-hop QA remains underexplored. First, LLMs sometimes generate answers from internal memory rather than by retrieving evidence and reasoning over the given context, which raises concerns about how well such evaluations reflect real reasoning ability. Although previous counterfactual QA benchmarks can separate out the internal memory of LLMs, they focus solely on final QA performance, which is insufficient for reporting LLMs' real reason
Research goal: For multi-hop QA on HotpotQA, how does inference throughput (tokens per second) when the context window is increased from 4K to 128K tokens compare with throughput when a multi-step retrieval pipeline (e.g., 2-5 retrieval steps) is added instead, measured with LLMs such as Llama-3 or GPT-4?
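The comparison in the research goal reduces to a ratio of two throughput figures. A minimal sketch of that arithmetic follows, assuming throughput is defined as generated tokens divided by wall-clock time with retrieval latency included; the names `RunTiming`, `tokens_per_second`, and `throughput_ratio` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timing for one QA setup (hypothetical structure, for illustration only)."""
    generated_tokens: int    # tokens produced by the model across all calls
    elapsed_seconds: float   # wall-clock time, assumed to include retrieval steps


def tokens_per_second(run: RunTiming) -> float:
    """Throughput as generated tokens divided by elapsed wall-clock time."""
    if run.elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return run.generated_tokens / run.elapsed_seconds


def throughput_ratio(long_context: RunTiming, multi_step: RunTiming) -> float:
    """Ratio > 1 means the long-context setup is faster than the retrieval pipeline."""
    return tokens_per_second(long_context) / tokens_per_second(multi_step)
```

Whether retrieval latency is counted in the elapsed time is a design choice that changes the outcome of the comparison, so it should be stated alongside any reported figure.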
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:7a1ef2ae4e435c5a95246f7726371e26) | 83.1 kB |