LLaMA 3.2's False Positive Rate in Bug Detection: Chain-of-Thought Prompt Engineering Analysis

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20651428

Published June 12, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level…

Research goal: How does prompt engineering for chain-of-thought reasoning influence LLaMA 3.2's false positive rate in bug detection tasks on the BugsInPy benchmark?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.8/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.8/10.

Files

Files (91.2 kB)

Name	Size
paper.pdf md5:e926fa03eadd66dec20db487dbd0579c	91.2 kB