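To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of retrieval strategies (iterative reranking vs. long-context) on multi-hop reasoning accuracy in HotpotQA?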
Description
Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence, effectively performing multi-hop (alias multi-step) inference.
Research goal: To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of retrieval strategies (iterative reranking vs. long-context) on multi-hop reasoning accuracy in HotpotQA?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:26220fcfc8f81d2ce55696215210b53d) | 90.6 kB |