Reproducibility Meta-Analysis of Divergent GPT-4o SWE-bench Performance Driven by Evaluation Protocol Discrepancies

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20636331

Published June 11, 2026 | Version v1

Report | Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic evaluation of three state-of-the-art LLMs - Llama3, Codestral, and Deepseek R1 - using a carefully filtered subset of the Big-Vul dataset annotated with eight representative Common Weakness Enumeration categories. Adopting a closed-world classification setup, we assess each model's perf

Research goal: Reproducibility meta-analysis. Two independent publications report divergent GPT-4o performance on SWE-bench, with a 76.4 percentage-point spread (range 7.0%–83.4%). Source papers: "SWE-bench Goes Live!" (2025, 7.0%); "FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driv…" (2025, 83.4%).

Preliminary analysis suggests that the extreme discrepancy likely stems from the 83.4% score reflecting a fine-tuned or agentic variant of GPT-4o evaluated under a permissive, multi-turn feedback loop with access to external tools, whereas the 7.0% figure represents the base model's performance in a strict, zero-shot, single-turn setting without execut…

The goals are to systematically evaluate which evaluation-protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; to identify the highest-confidence explanation supported by each paper's stated methodology; and to assess whether the highest reported score is reproducible under the conditions described by the lowest-reporting paper.
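The spread quoted in the research goal is plain arithmetic on the two reported scores. A minimal Python sketch of that calculation follows. It also lists the seven protocol factors as a checklist. The dictionary keys, the snake_case factor labels, and the function name are illustrative assumptions and are not taken from the report or from either source paper.

```python
# Reported GPT-4o scores in percent, as quoted in the research goal.
# The short keys are illustrative labels for the two source papers.
reported_scores = {
    "SWE-bench Goes Live! (2025)": 7.0,
    "FeedbackEval (2025)": 83.4,
}

# Protocol factors named in the research goal (labels are hypothetical).
PROTOCOL_FACTORS = [
    "model_configuration",
    "inference_setup",
    "quantization",
    "tokenization",
    "few_shot_count",
    "metric_interpretation",
    "data_split_selection",
]


def spread_in_points(scores: dict[str, float]) -> float:
    """Return the max-minus-min gap in percentage points, rounded to 0.1."""
    values = list(scores.values())
    return round(max(values) - min(values), 1)


if __name__ == "__main__":
    print(f"Spread: {spread_in_points(reported_scores)} percentage points")
    for factor in PROTOCOL_FACTORS:
        # Placeholder: each factor would be compared across the two papers.
        print(f"- to compare: {factor}")
```

The gap is expressed in percentage points (a difference of two percentages), not as a relative percentage change, which matches the wording of the research goal.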

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.0/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.0/10.

Files

paper.pdf (91.8 kB)
md5:ef9e20e18eac3666067767b883536dbf

