Published January 6, 2026 | Version v1
Conference paper Open

A Localization Framework for Reasoning Faults in LLM-based Code Agents

Description

 

Autonomous LLM-based code agents are rapidly advancing, yet their practical utility is hindered by a critical gap in our understanding of their failures. When an agent fails, the root cause is often not a simple code bug but a complex flaw in its internal reasoning process, which is captured in unstructured "hypothesis" logs. To date, there has been no annotated dataset to systematically study this phenomenon, especially on complex, competitive programming-style tasks. To address this gap, we introduce APPS-Failure-DB, a new benchmark dataset of 2450 unique, validated reasoning failures from an LLM agent (Qwen 7B Coder) tasked with solving problems from the APPS dataset. We generated this dataset using a novel "Oracle-Annotated Pipeline," where a blind Oracle Agent (GPT-4o) validates and provides ground-truth annotations for each logical failure. Building on this dataset, we propose and evaluate a new, tool-assisted localization framework that is the first to link the unstructured text of an agent’s hypothesis log to specific code-level faults. Our framework combines dynamic analysis (coverage tracing) with a semantic "Tracer Agent" to build a traceability map, enabling a final "Debugger Agent" to perform precise localization. Our results demonstrate that we can characterize failure patterns in agent reasoning, with 73.82% of failures occurring in the Execution-Phase, and achieve a reasoning-phase localization accuracy of 67.48%. While code-line-level localization remains challenging with our 7B parameter model (F1-Score of 0.0714), our framework establishes a foundation for future work with larger models. This work contributes both a new dataset for the research community and a novel framework for diagnosing agent reasoning failures.
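The description reports a code-line-level F1-Score but does not say how it is computed. As a rough illustration only, the sketch below uses one common set-based definition: precision and recall over the set of line numbers flagged as faulty versus the annotated faulty lines. The function name `line_localization_f1` and the example line numbers are hypothetical and are not taken from the paper or its artifact.

```python
def line_localization_f1(predicted_lines, faulty_lines):
    """Set-based precision, recall, and F1 for line-level fault localization.

    Assumption: both arguments are iterables of integer line numbers for a
    single failing solution. The paper's own metric may differ (for example,
    in how scores are averaged across failures).
    """
    predicted = set(predicted_lines)
    faulty = set(faulty_lines)

    # Lines that were both predicted and annotated as faulty.
    true_positives = len(predicted & faulty)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(faulty) if faulty else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three lines flagged, two lines actually faulty,
# one line in common.
precision, recall, f1 = line_localization_f1([12, 15, 20], [15, 16])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this definition a prediction is only rewarded for exact line matches, which is one reason line-level scores can be far lower than phase-level accuracy.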

Files

Submission_ICSE_2026-65BE.zip (37.3 MB)
md5:7826c243b0476e1eb1d60646c57079ac

Additional details

Software

Development Status
Active