ShahinHonarvar/Turbulence-Benchmark: Version 1.0
Authors/Creators
Description
Turbulence is an innovative benchmark designed to systematically evaluate the correctness and robustness of instruction-tuned large language models (LLMs) for code generation. It features a comprehensive set of natural language question templates, each representing a programming problem, parameterised to produce numerous variations. These variations form a "neighbourhood" of closely related programming questions, enabling the evaluation of an LLM's ability to generalise across semantically similar but non-equivalent tasks. Each template is equipped with a test oracle that automatically verifies the correctness of the code generated by the LLM.
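To make the template/neighbourhood/oracle structure concrete, here is a minimal Python sketch. The class name, the `neighbourhood` method, and the example problem are illustrative assumptions, not the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical sketch of a Turbulence-style parameterised question template.
# All names here are illustrative, not the benchmark's real API.

@dataclass
class QuestionTemplate:
    prompt: str                              # natural-language problem with a {n} placeholder
    oracle: Callable[[Callable, int], bool]  # verifies a candidate solution for one parameter

    def neighbourhood(self, params: Iterable[int]):
        """Instantiate closely related but non-equivalent question variants."""
        return [(n, self.prompt.format(n=n)) for n in params]

# Example template: sum of the first n multiples of x.
template = QuestionTemplate(
    prompt="Write a Python function f(x) that returns the sum of the first {n} multiples of x.",
    oracle=lambda f, n: all(f(x) == sum(x * k for k in range(1, n + 1)) for x in range(1, 10)),
)

# Each parameter value yields one variant in the neighbourhood.
for n, question in template.neighbourhood([3, 5, 8]):
    print(n, question)

# A correct candidate for the n=3 variant (as an LLM might produce it)
# passes that variant's oracle: the first 3 multiples of x sum to 6x.
print(template.oracle(lambda x: 6 * x, 3))  # True
```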
The benchmark identifies robustness issues by detecting cases where an LLM successfully solves some variations in a neighbourhood but fails to generalise to others. This approach provides a detailed and systematic analysis of model performance.
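A minimal sketch of that detection logic, under the assumption that per-variant oracle outcomes are recorded as booleans; the function name and data layout below are hypothetical:

```python
# Hypothetical layout: {template_id: {parameter: oracle_passed}}.
def robustness_issues(results: dict) -> dict:
    """Flag neighbourhoods where an LLM solves some variants but not all,
    i.e. it fails to generalise across semantically similar questions."""
    return {
        tid: outcomes
        for tid, outcomes in results.items()
        if any(outcomes.values()) and not all(outcomes.values())
    }

# Example: the model solves the n=3 and n=5 variants but fails on n=8.
results = {"sum_of_multiples": {3: True, 5: True, 8: False}}
print(robustness_issues(results))  # {'sum_of_multiples': {3: True, 5: True, 8: False}}
```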
In this release, five prominent LLMs were evaluated with the Turbulence benchmark: GPT-4, GPT-3.5-turbo, Command, CodeLlama:7B:4-bit-quantised, and CodeLlama:13B:4-bit-quantised.
These models were assessed on their ability to generate correct and robust code solutions across a large set of neighbourhoods of programming questions.
Files (7.2 GB)

| Name | Checksum | Size |
|---|---|---|
| ShahinHonarvar/Turbulence-Benchmark-v1.0.zip | md5:2f0b6c58320d24aee4190986b55190ae | 7.2 GB |
Additional details
Related works
- Is supplement to: https://github.com/ShahinHonarvar/Turbulence-Benchmark/tree/v1.0 (Software)
Software
- Repository URL: https://github.com/ShahinHonarvar/Turbulence-Benchmark