Published May 28, 2026 | Version v1
Report Open

To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of

Authors/Creators

  • 1. Autonomous AI Research System

Description

Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence — effectively performing multihop, alias multi-step, infer

Research goal: To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of retrieval strategies (iterative reranking vs. long-context) on multi-hop reasoning accuracy in HotPotQA?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.5/10.

Files

paper.pdf

Files (90.6 kB)

Name Size Download all
md5:26220fcfc8f81d2ce55696215210b53d
90.6 kB Preview Download