LLaMA 3.2 Bug Detection Recall Under Context Truncation Versus Sliding Windows Across Code Complexity
Description
Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, which limits its practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level.
Research goal: How does context window truncation affect LLaMA 3.2's bug detection recall on BugsInPy compared to sliding window strategies across varying code complexity levels?
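The two context strategies named in the research goal can be contrasted with a minimal sketch. It makes several assumptions that do not come from the paper:

- The context budget is counted in source lines rather than LLaMA 3.2 tokens.
- A bug is considered reachable only if its line falls inside at least one span passed to the model.
- Recall is computed per complexity bucket as detected bugs divided by total bugs.

All function names, the `stride` parameter, and the bucket labels are hypothetical.

```python
from __future__ import annotations


def truncation_spans(n: int, budget: int) -> list[tuple[int, int]]:
    """Context truncation (assumed line-based): keep only the first `budget`
    lines of an n-line function; everything after is dropped."""
    return [(0, min(n, budget))]


def sliding_window_spans(n: int, budget: int, stride: int) -> list[tuple[int, int]]:
    """Sliding windows (assumed line-based): overlapping [start, end) spans of at
    most `budget` lines, advancing by `stride`, until the function end is covered."""
    if not 0 < stride <= budget:
        raise ValueError("stride must satisfy 0 < stride <= budget")
    spans: list[tuple[int, int]] = []
    start = 0
    while True:
        end = min(n, start + budget)
        spans.append((start, end))
        if end >= n:  # last window reaches the end of the function
            break
        start += stride
    return spans


def bug_in_context(spans: list[tuple[int, int]], bug_line: int) -> bool:
    """True if the 0-based buggy line is visible in at least one span."""
    return any(start <= bug_line < end for start, end in spans)


def recall_by_bucket(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Recall per complexity bucket, where each result is (bucket, detected).
    Recall = detected bugs / total bugs in that bucket."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for bucket, detected in results:
        totals[bucket] = totals.get(bucket, 0) + 1
        hits[bucket] = hits.get(bucket, 0) + int(detected)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


if __name__ == "__main__":
    # Hypothetical 10-line function with the bug on line 7 (0-based).
    n_lines, bug = 10, 7
    trunc = truncation_spans(n_lines, budget=4)
    slide = sliding_window_spans(n_lines, budget=4, stride=2)
    print(trunc, bug_in_context(trunc, bug))
    print(slide, bug_in_context(slide, bug))
```

In this toy case, truncation keeps only lines 0 to 3, so the bug on line 7 never reaches the model. The sliding windows `(0, 4), (2, 6), (4, 8), (6, 10)` do cover it. Visibility only bounds what the model could flag. The recall function takes the per-bug detection outcome as given, so it can be applied separately to each strategy and complexity level.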
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.7/10.
Files
| Name | Size | MD5 checksum |
|---|---|---|
| paper.pdf | 91.5 kB | 44831f78ffcb0ef767dc5fbcd0082f15 |