Published January 24, 2025 | Version v1.0
Software | Open Access

ShahinHonarvar/Turbulence-Benchmark: Version 1.0

Description

Turbulence is an innovative benchmark designed to systematically evaluate the correctness and robustness of instruction-tuned large language models (LLMs) for code generation. It features a comprehensive set of natural language question templates, each representing a programming problem, parameterised to produce numerous variations. These variations form a "neighbourhood" of closely related programming questions, enabling the evaluation of an LLM's ability to generalise across semantically similar but non-equivalent tasks. Each template is equipped with a test oracle that automatically verifies the correctness of the code generated by the LLM.
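For illustration, a parameterised question template and its test oracle could be sketched as follows. The template text, function names, and test cases here are hypothetical assumptions for exposition, not taken from the benchmark itself:

```python
# Hypothetical sketch of a parameterised question template and its test
# oracle, in the spirit of the Turbulence benchmark. The template wording,
# parameter choice, and oracle cases are illustrative assumptions.

def render_template(n: int) -> str:
    """Instantiate one variation of the question template for parameter n."""
    return (f"Write a Python function `solve(xs)` that returns the sum "
            f"of the {n} largest elements of the list `xs`.")

def oracle(solve, n: int) -> bool:
    """Test oracle: check an LLM-generated `solve` against known answers."""
    inputs = [[1, 2, 3, 4, 5], [7, 7, 1], [0]]
    return all(solve(xs) == sum(sorted(xs)[-n:]) for xs in inputs
               if len(xs) >= n)

# A "neighbourhood" is the set of closely related variations of one template.
neighbourhood = [render_template(n) for n in range(1, 4)]
```

Each variation in the neighbourhood is semantically similar but not equivalent: a solution correct for `n = 1` need not be correct for `n = 3`, which is exactly what the oracle checks per variation.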

The benchmark identifies robustness issues by detecting cases where an LLM successfully solves some variations in a neighbourhood but fails to generalise to others. This approach provides a detailed and systematic analysis of model performance.
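This detection step can be sketched as a simple filter over per-neighbourhood pass/fail records. The neighbourhood names and result values below are illustrative assumptions, not results from the benchmark:

```python
# Hypothetical sketch: flag robustness issues as neighbourhoods where the
# model solves some variations but fails others. All data is illustrative.
results = {
    "template_A": [True, True, True],     # solved every variation: robust
    "template_B": [True, False, True],    # mixed results: robustness issue
    "template_C": [False, False, False],  # consistently wrong: not a
                                          # robustness issue, just incorrect
}

def robustness_issues(results: dict[str, list[bool]]) -> list[str]:
    """Return neighbourhoods with partial success across variations."""
    return [name for name, passes in results.items()
            if any(passes) and not all(passes)]

print(robustness_issues(results))  # → ['template_B']
```

The key distinction is between a model that fails a problem outright and one that solves it under some phrasings but not others; only the latter indicates a generalisation failure within the neighbourhood.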

In this release, five prominent LLMs were evaluated using the Turbulence benchmark: GPT-4, GPT-3.5-turbo, Command, CodeLlama:7B (4-bit quantised), and CodeLlama:13B (4-bit quantised).

These models were assessed on their ability to generate correct and robust code solutions across a large neighbourhood of programming questions.

Files (7.2 GB)

ShahinHonarvar/Turbulence-Benchmark-v1.0.zip
Size: 7.2 GB
md5: 2f0b6c58320d24aee4190986b55190ae
