Algorithmic Reasoning Fine-Tuning for Self-Invoking Code Generation Generalization
Description
We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are given a base problem and a related, more complex problem. They must solve the base problem and then use its solution to solve the more complex one. This work makes three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs
Research goal: Do 7B-parameter models fine-tuned on algorithmic reasoning datasets show improved generalization on self-invoking code generation tasks compared to models fine-tuned solely on natural language reasoning benchmarks like DROP?
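The self-invoking setup described above can be illustrated with a minimal sketch. The two problems below are hypothetical and are not taken from HumanEval Pro, MBPP Pro, or BigCodeBench-Lite Pro. They only show the shape of the task: a base function, and a harder function whose solution is expected to call it.

```python
# Hypothetical problem pair, invented for illustration only;
# neither function comes from the benchmarks named in the description.

def is_palindrome(text: str) -> bool:
    """Base problem: check whether `text` reads the same in both directions, ignoring case."""
    normalized = text.lower()
    return normalized == normalized[::-1]


def longest_palindrome_word(words: list[str]) -> str:
    """Self-invoking problem: return the longest palindromic word.

    The first such word wins on ties; an empty string is returned if none qualifies.
    The solution reuses the base solution instead of re-deriving it.
    """
    best = ""
    for word in words:
        # Invoke the base solution as a building block.
        if is_palindrome(word) and len(word) > len(best):
            best = word
    return best
```

Under this framing, a model can answer the base problem correctly and still fail the second one if it does not compose its own solution properly, which is the gap the task is meant to expose.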
Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 7.6/10.
Files
paper.pdf (85.4 kB)
md5:b783d453cbec55515a2ac503a1ad2153