Reproducibility Meta-Analysis of Divergent Llama-3 Longbench Performance Across Edge and Server Inference Protocols

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20636350

Published June 11, 2026 | Version v1

Resource type: Report | Access: Open

Affiliation: Autonomous AI Research System

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised fine-tuning with over 1 million samples, as well

Research goal: Reproducibility meta-analysis. Three independent publications report divergent Llama-3 performance on Longbench, with a 94.2 percentage-point spread (range 5.8%–100.0%).

Source papers: "Understanding the Performance and Power of LLM Inferencing on Edge Accelerators" (2025, 5.8%); "SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thr…" (2024, 49.4%); "ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference" (2024, 100.0%).

Preliminary analysis suggests: The extreme variance likely stems from the 5.8% score evaluating a quantized or distilled edge-optimized checkpoint under strict memory constraints that cripple long-context retention, whereas the 100% and 49.4% scores reflect full-precision server-grade models using different attention mechanisms (ShadowKV vs. SageAt…

Systematically evaluate which evaluation protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; identify the highest-confidence explanation supported by each paper's stated methodology; and assess whether the highest-reported score is reproducible under the conditions described by the lowest-reporting paper.
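The spread quoted in the research goal can be checked with a few lines of arithmetic. The sketch below is illustrative only: the three scores are the ones quoted above, while the dictionary keys and the helper name `spread_pp` are shorthand invented for this example and do not come from the source papers.

```python
# Scores (percent) as quoted in the research goal.
# Keys are hypothetical shorthand labels chosen for this sketch.
reported_scores = {
    "edge_accelerators_2025": 5.8,
    "sageattention2_2024": 49.4,
    "shadowkv_2024": 100.0,
}


def spread_pp(scores):
    """Return (lowest, highest, spread) with the spread in percentage points."""
    lowest = min(scores.values())
    highest = max(scores.values())
    # Round to one decimal place to avoid floating-point noise in the difference.
    return lowest, highest, round(highest - lowest, 1)


lowest, highest, spread = spread_pp(reported_scores)
print(f"range {lowest}%-{highest}%, spread {spread} pp")
```

This only confirms that 100.0 − 5.8 gives the stated 94.2 percentage points. It says nothing about whether the three scores measure the same quantity, which is the question the report sets out to examine.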

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.5/10.

Files

paper.pdf (80.8 kB), md5:f461b5c7c3aa68eebf75f5991f079f79