Discrepancy in Llama-3.1-8B Long-Context Reasoning Accuracy Across Evaluation Frameworks
Description
As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic evaluation of three state-of-the-art LLMs - Llama3, Codestral, and Deepseek R1 - using a carefully filtered subset of the Big-Vul dataset annotated with eight representative Common Weakness Enumeration categories. Adopting a closed-world classification setup, we assess each model's perf
Research goal: What is the discrepancy in long-context reasoning accuracy for Llama-3.1-8B across different evaluation frameworks when testing needle-in-a-haystack scenarios?
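One way such a discrepancy can arise is sketched below, assuming a needle-in-a-haystack test places a single target sentence at a chosen depth in filler text and then scores the model's answer. Every name here (`build_haystack`, `score_substring`, `score_exact`, `accuracy`) is hypothetical and does not come from any evaluation framework. The `responses` list is made-up placeholder text, not output from Llama-3.1-8B or from the report.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Repeat `filler` n_sentences times and insert `needle` at relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def score_substring(response: str, expected: str) -> bool:
    """Lenient rule: correct if the expected answer appears anywhere in the response."""
    return expected.lower() in response.lower()


def score_exact(response: str, expected: str) -> bool:
    """Strict rule: correct only if the whole response equals the expected answer."""
    return response.strip().lower() == expected.strip().lower()


def accuracy(responses, expected, scorer) -> float:
    """Fraction of responses that the given scorer marks correct."""
    return sum(scorer(r, expected) for r in responses) / len(responses)


# Hypothetical filler and needle; a real test would use far longer contexts.
context = build_haystack(
    filler="The sky was grey that day.",
    needle="The secret code is 7421.",
    n_sentences=10,
    depth=0.5,
)

# Placeholder responses (assumption: invented for illustration, not model output).
responses = ["7421", "The secret code is 7421.", "I could not find it."]

lenient = accuracy(responses, "7421", score_substring)
strict = accuracy(responses, "7421", score_exact)
print(f"substring accuracy: {lenient:.2f}, exact-match accuracy: {strict:.2f}")
```

The same three responses score differently under the two rules. This is only one possible source of disagreement between frameworks; prompt templates, context truncation, and needle placement can also differ.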
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.1/10.
Files
| Name | Size | md5 |
|---|---|---|
| paper.pdf | 87.5 kB | b358648327986045f0be3eb39d8d3827 |