Replication package for ASE
Authors/Creators
Description
Code review is a critical quality assurance practice in software development, and AI coding agents increasingly generate review comments on pull requests. However, little is known about how developers actually respond to such agent-generated feedback. In this paper, we present the first large-scale empirical study on the resolution of agent-generated code review comments. We analyze 54,713 comments generated by three widely used coding agents (Copilot, Cursor, and Codex) across 341 Python repositories on GitHub. We examine (1) resolution rates across agents and comment types, (2) the role of developer experience, and (3) the characteristics that influence comment usefulness. Our results show that resolution rates vary considerably across agents, with Copilot accounting for the majority of resolved comments (72.9%). Core developers resolve most agent-generated feedback, particularly design- and evolvability-related comments, while peripheral developers are more involved in resolving functional defect comments. Through open card sorting of 470 unresolved comment discussions, we identify ten discussion patterns that explain why comments remain unresolved, with incorrect suggestions and intentional design decisions being the most prevalent. Finally, our analysis reveals that the presence of an inline code suggestion is the strongest predictor of comment usefulness, while providing rules, examples, and benefit-based explanations also modestly increases the likelihood of adoption. Our findings provide insights for improving agent-generated code review feedback and its integration into development workflows.
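Two related figures are easy to conflate here: an agent's resolution rate (resolved comments divided by that agent's own comments) and an agent's share of all resolved comments, which is how the 72.9% figure for Copilot is phrased. The sketch below illustrates the difference on a small, entirely hypothetical set of records; the record layout and the helper names `resolution_rate_by_agent` and `share_of_resolved` are illustrative assumptions, not the study's data schema or code.

```python
from collections import Counter

# Hypothetical toy records as (agent, resolved) pairs.
# These are illustrative only and are NOT data from the study.
comments = [
    ("Copilot", True), ("Copilot", True), ("Copilot", False),
    ("Cursor", True), ("Cursor", False),
    ("Codex", False), ("Codex", True), ("Codex", False),
]


def resolution_rate_by_agent(records):
    """Fraction of each agent's own comments that were resolved."""
    total = Counter(agent for agent, _ in records)
    resolved = Counter(agent for agent, is_resolved in records if is_resolved)
    return {agent: resolved[agent] / total[agent] for agent in total}


def share_of_resolved(records):
    """Each agent's share of all resolved comments (shares sum to 1)."""
    resolved = Counter(agent for agent, is_resolved in records if is_resolved)
    n_resolved = sum(resolved.values())
    return {agent: count / n_resolved for agent, count in resolved.items()}


rates = resolution_rate_by_agent(comments)
shares = share_of_resolved(comments)
print(rates)   # per-agent resolution rates
print(shares)  # per-agent share of all resolved comments
```

On these toy records, Copilot's resolution rate is 2/3 while its share of resolved comments is 1/2, which shows why a high share of resolved comments can partly reflect comment volume rather than a higher per-comment resolution rate.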
Files
| Name | Size | Checksum |
|---|---|---|
| agent-review-repos.csv | 22.9 kB | md5:4c5c9d855addceb63ed996bb668fa63f |
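The record lists an MD5 checksum for `agent-review-repos.csv`. A minimal sketch for checking a downloaded copy with Python's standard `hashlib` module follows; it assumes the file was saved in the current working directory under its listed name, and the helper names `md5_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the record page.
EXPECTED_MD5 = "4c5c9d855addceb63ed996bb668fa63f"


def md5_of(path, chunk_size=65536):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the file saved under its listed name in the working directory.
    csv_path = Path("agent-review-repos.csv")
    if csv_path.exists():
        print("OK" if verify(csv_path) else "Checksum mismatch")
    else:
        print(f"{csv_path} not found; download it first")
```

Reading in fixed-size chunks keeps memory use constant, which matters little for a 22.9 kB file but lets the same helper work for larger artifacts.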
Additional details
Dates
- Submitted: 2026-03-27