How does GPT-4o's performance on HumanEval compare to other state-of-the-art LLMs like Claude 3 Opus or Llama

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20440459

Published May 29, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Large Language Models (LLMs) have made remarkable advances across diverse code-related tasks. Models specialized for these tasks are known as Code LLMs, and progress has been especially notable in code generation, the task of producing source code from natural language descriptions. This burgeoning field has attracted significant interest from both academic researchers and industry professionals because of its practical significance in software development, e.g., GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, from the perspective of natural language processing (NLP), software engineering (SE), or both, there is

Research goal: How does GPT-4o's performance on HumanEval compare to other state-of-the-art LLMs like Claude 3 Opus or Llama 3 in terms of pass@1 and pass@k metrics?
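For readers unfamiliar with the metrics named in the research goal, the following is a minimal sketch of how pass@k is commonly computed, assuming the unbiased estimator introduced alongside the HumanEval benchmark: for a problem with n sampled completions of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names and the sample counts in the usage lines are illustrative assumptions and are not taken from this report.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that passed all unit tests
    k: budget of attempts being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts (n samples, c passing), for illustration only.
example_results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(example_results, k=1))
print(mean_pass_at_k(example_results, k=5))
```

With k = 1 the estimator reduces to the fraction of passing samples, c / n, which is why pass@1 is often reported from a single greedy or low-temperature completion per problem.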

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.8/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.8/10.

Files

paper.pdf (86.8 kB)
md5:0c595536176508d6a2e930fa42b17ccf

