Discrepancy in Llama-3.1-8B Long-Context Reasoning Accuracy Across Evaluation Frameworks

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20636492

Published June 11, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic evaluation of three state-of-the-art LLMs - Llama3, Codestral, and Deepseek R1 - using a carefully filtered subset of the Big-Vul dataset annotated with eight representative Common Weakness Enumeration categories. Adopting a closed-world classification setup, we assess each model's perf

Research goal: What is the discrepancy in long-context reasoning accuracy for Llama-3.1-8B across different evaluation frameworks when testing needle-in-a-haystack scenarios?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.1/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.1/10.

Files

paper.pdf (87.5 kB), md5:b358648327986045f0be3eb39d8d3827

