How does GPT-4o's performance on HumanEval compare to other state-of-the-art LLMs like Claude 3 Opus or Llama
Description
Large Language Models (LLMs) specialized for code, known as Code LLMs, have made remarkable advances across diverse code-related tasks, particularly code generation, in which the model produces source code from natural language descriptions. This field has attracted significant interest from both academic researchers and industry professionals because of its practical value in software development, as exemplified by GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, from the perspective of natural language processing (NLP), software engineering (SE), or both, there is
Research goal: How does GPT-4o's performance on HumanEval compare to other state-of-the-art LLMs like Claude 3 Opus or Llama 3 in terms of pass@1 and pass@k metrics?
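For readers unfamiliar with the metrics named in the research goal, the following is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts in the usage lines are illustrative assumptions; they are not taken from the report and are not measurements of any model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed all unit tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random subset of k samples contains no passing sample
    # is C(n - c, k) / C(n, k); math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical (n, c) counts for three problems, for illustration only.
    illustrative_counts = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(illustrative_counts, k=1))
    print(mean_pass_at_k(illustrative_counts, k=5))
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, which is why pass@1 is usually the headline number when models are compared.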
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.8/10.
Notes
Files
| Name | Size | md5 |
|---|---|---|
| paper.pdf | 86.8 kB | 0c595536176508d6a2e930fa42b17ccf |