To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20426978

Published May 28, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence — effectively performing multihop, alias multi-step, infer

Research goal: To what extent does the choice of LLM-as-a-judge (e.g., GPT-4 vs. Llama-3-70B) affect the relative ranking of retrieval strategies (iterative reranking vs. long-context) on multi-hop reasoning accuracy in HotPotQA?
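One way to make the research goal concrete is to measure whether two judges order the retrieval strategies the same way. The sketch below is a minimal illustration and is not taken from the report: the function names `rank_strategies` and `kendall_tau` are hypothetical, and the accuracy values are made-up placeholders, not results. It computes Kendall's tau-a over per-strategy accuracies, which equals +1 when both judges produce the same ordering and -1 when the ordering is fully reversed.

```python
from itertools import combinations


def rank_strategies(scores):
    """Return strategy names ordered from highest to lowest judged accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two judges' scores over the same strategies."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both judges must score the same strategies")
    names = sorted(scores_a)
    if len(names) < 2:
        raise ValueError("need at least two strategies to compare rankings")

    concordant = discordant = 0
    for x, y in combinations(names, 2):
        # The sign of the product says whether both judges agree on the pair.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties (product == 0) count as neither concordant nor discordant.

    n_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder accuracies for illustration only; these are NOT measured results.
judge_a = {"iterative_reranking": 0.62, "long_context": 0.58}
judge_b = {"iterative_reranking": 0.55, "long_context": 0.60}

if __name__ == "__main__":
    print(rank_strategies(judge_a))
    print(rank_strategies(judge_b))
    print(kendall_tau(judge_a, judge_b))
```

With only two strategies the statistic can take just the values -1, 0, or +1, so a comparison of this kind becomes more informative when more strategies or per-question subsets are ranked.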

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.5/10.

Files

Files (90.6 kB)

Name	Size
paper.pdf (md5:26220fcfc8f81d2ce55696215210b53d)	90.6 kB

