Published June 3, 2026 | Version v1
Preprint Open

A Single LLM Is an Incomplete Code Reviewer: Evidence that Independent Review by Multiple LLM Families Recovers Code Defects Any One Model Misses

Authors/Creators

Description

Teams increasingly route code review through a single large language model (LLM). We test whether one model's code review is complete relative to independent reviews by several different model families, using a live software team's corpus with a human-reconciled answer key: 18 code/mixed artifacts, 154 confirmed issues, reviewed by eight model versions across five providers (April to June 2026). For each artifact we measure per-model recall against the reconciled confirmed-issue set (a deliberately generous denominator) with Wilson 95% confidence intervals, together with pairwise cross-family overlap (Jaccard), per-model unique contributions, and a provider-coverage curve. No single model exceeded about 64% recall on code; a typical model caught roughly half of confirmed defects. Of the confirmed defects, 56.5% (87 of 154) were found by exactly one model, cross-family overlap was low (median Jaccard about 0.37), and the coverage curve shows the largest marginal gain from adding a second, different-provider model (33.6% to 57.1%), with diminishing returns thereafter and no provider redundant. We could not establish that repeated passes vary, that newer versions detect more, or that the seven non-weakest versions can be finely ranked, so we assert none of these. Within this single-organization case study, a single LLM pass is an incomplete code review, and independent, different-family review recovers the gap. We recommend running a small panel of different-provider models independently, reconciling with a human who verifies findings against source, and expecting roughly half to two-thirds single-model code recall. Conclusions are directional, not a universal benchmark.
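The metrics named in the description can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the reviewer names, the toy issue IDs, and the choice to average coverage over every subset of k reviewers are assumptions made here for illustration, and the preprint may construct its provider-coverage curve differently. The only figure taken from the description is the 87-of-154 count used in the final Wilson interval call.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def wilson_interval(hits, total, z=1.96):
    """Wilson score interval for a binomial proportion hits/total (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = hits / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two sets of issue IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def found_by_exactly_one(findings):
    """Issue IDs reported by exactly one reviewer."""
    counts = {}
    for issues in findings.values():
        for issue in issues:
            counts[issue] = counts.get(issue, 0) + 1
    return {issue for issue, n in counts.items() if n == 1}


def mean_coverage_curve(findings, confirmed):
    """Mean share of confirmed issues covered by k reviewers, for k = 1..n.

    Assumption: the mean is taken over all subsets of size k.
    """
    names = sorted(findings)
    curve = []
    for k in range(1, len(names) + 1):
        shares = [
            len(set().union(*(findings[n] for n in combo)) & confirmed) / len(confirmed)
            for combo in combinations(names, k)
        ]
        curve.append(sum(shares) / len(shares))
    return curve


# Hypothetical toy data: nine confirmed issues, three reviewers.
confirmed = set(range(1, 10))
findings = {
    "model_a": {1, 2, 3, 4, 5},
    "model_b": {4, 5, 6, 7},
    "model_c": {1, 8, 9},
}

for name, issues in sorted(findings.items()):
    hits = len(issues & confirmed)
    low, high = wilson_interval(hits, len(confirmed))
    print(f"{name}: recall {hits / len(confirmed):.2f} (95% CI {low:.2f} to {high:.2f})")

overlaps = [jaccard(findings[a], findings[b]) for a, b in combinations(sorted(findings), 2)]
print("median pairwise Jaccard:", round(median(overlaps), 3))
print("found by exactly one reviewer:", sorted(found_by_exactly_one(findings)))

curve = mean_coverage_curve(findings, confirmed)
print("mean coverage by panel size:", [round(c, 3) for c in curve])

# Wilson interval for the reported share of defects found by exactly one model.
single_low, single_high = wilson_interval(87, 154)
print(f"87 of 154: 95% CI {single_low:.3f} to {single_high:.3f}")
```

In the toy data the coverage curve rises most steeply between one and two reviewers, which mirrors the shape of the result described above without reproducing its numbers.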

Files

LLM_Code_Reviews.pdf (230.5 kB)
md5:75359ecd4c9007938345d9c8d8eb873d