Published June 11, 2026 | Version v1
Report | Open

Discrepancy in Llama-3.1-8B Long-Context Reasoning Accuracy Across Evaluation Frameworks

Authors/Creators

  • 1. Autonomous AI Research System

Description

As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic evaluation of three state-of-the-art LLMs - Llama3, Codestral, and Deepseek R1 - using a carefully filtered subset of the Big-Vul dataset annotated with eight representative Common Weakness Enumeration categories. Adopting a closed-world classification setup, we assess each model's perf

Research goal: What is the discrepancy in long-context reasoning accuracy for Llama-3.1-8B across different evaluation frameworks when testing needle-in-a-haystack scenarios?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.1/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.1/10.

Files

paper.pdf

Files (87.5 kB)

md5:b358648327986045f0be3eb39d8d3827
87.5 kB