On-Policy Oracle Injection for Fine-Tuning Tool-Using Agents
Description
Fine-tuning a tool-using AI agent is fundamentally different from fine-tuning a general-purpose LLM, and it is identified as an open problem in the on-policy distillation literature (Song & Zheng, 2026). The agent operates within a narrow distribution defined by its system prompt, available tools, and data access, a distribution the base model was never trained on. When a human expert corrects the agent’s decision, the standard approach of having a corrector LLM generate the corrected output fails to teach the agent’s distribution, because the corrector is not the agent. For a tool-using agent, even the same underlying LLM is off-policy if the correction is not generated by the same agent, that is, under the same system prompt, tools, and data access.
We introduce On-Policy Oracle Injection (OPOI): inject the expert’s verdict into the agent’s prompt as a minimal steering signal, run the actual agent end-to-end with its real tools, capture the oracle-guided trace, and remove the oracle references from that trace before training. We identify three failure modes of the off-policy alternative that are specific to tool-using agents (vocabulary mismatch, information leakage, and conflated optimisation targets) and show how OPOI addresses each.
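The four OPOI steps described above (inject, run, capture, clean) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Step` record, the `[ORACLE]` marker, the "expert verdict" phrase filter, and the `toy_agent` stand-in are all hypothetical names introduced here, and a real agent would call its actual tools rather than return canned output.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker used to tag the injected steering signal.
ORACLE_TAG = "[ORACLE]"

# Hypothetical filter: any sentence that mentions the "expert verdict".
ORACLE_MENTION = re.compile(
    r"[^.!?]*\bexpert verdict\b[^.!?]*[.!?]\s*", re.IGNORECASE
)


@dataclass
class Step:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


# Assumption: an agent maps a prompt to a full trace that includes the prompt.
Agent = Callable[[List[Step]], List[Step]]


def inject_oracle(prompt: List[Step], verdict: str) -> List[Step]:
    """Append the expert's verdict as a minimal, clearly tagged hint."""
    hint = Step("user", f"{ORACLE_TAG} Expert verdict: {verdict}")
    return prompt + [hint]


def clean_trace(trace: List[Step]) -> List[Step]:
    """Drop the injected hint and strip sentences that refer to it."""
    cleaned = []
    for step in trace:
        if step.content.startswith(ORACLE_TAG):
            continue  # remove the steering signal itself
        content = step.content
        if step.role == "assistant":
            content = ORACLE_MENTION.sub("", content).strip()
        cleaned.append(Step(step.role, content))
    return cleaned


def make_opoi_example(agent: Agent, prompt: List[Step], verdict: str) -> List[Step]:
    """Inject the verdict, run the agent itself, and clean the captured trace."""
    guided_trace = agent(inject_oracle(prompt, verdict))
    return clean_trace(guided_trace)


def toy_agent(prompt: List[Step]) -> List[Step]:
    """Stand-in for the real agent; returns a canned tool call and decision."""
    return list(prompt) + [
        Step("assistant", "call lookup_invoice(id=17)"),
        Step("tool", "total=120.00 po_total=120.00"),
        Step("assistant", "The expert verdict says approve. Totals match, so I approve."),
    ]
```

The point of the sketch is that the training trace is produced by the agent under its own prompt and tools, so its vocabulary and tool calls stay on-policy; only the steering hint and direct references to it are removed afterwards. A regex filter is the simplest possible cleaner and is used here only for illustration.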
Empirically, we compare two production models fine-tuned on the same task: one trained on OPOI data, one on Off-Policy Corrector Distillation (OPCD) data. The OPOI model generates with 16% higher confidence on identical validation prompts, closing 49% of the base → perfect confidence gap versus 37% for OPCD. Both achieve comparable task accuracy (+3% over baseline), but the OPOI model is significantly more consistent across runs (consensus 0.894 vs 0.854). Both methods teach the right answers; only on-policy training teaches the agent’s distribution.
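One plausible reading of "closing X% of the base → perfect confidence gap" is the fraction of the distance between the base model's confidence and a perfect score that the fine-tuned model recovers. The sketch below assumes that definition and assumes confidence is a score with a maximum of 1.0; neither is confirmed by the description, and the numbers in the usage lines are purely illustrative, not the reported results.

```python
def gap_closed(base: float, tuned: float, perfect: float = 1.0) -> float:
    """Fraction of the base-to-perfect gap recovered by the tuned model.

    Assumed definition: (tuned - base) / (perfect - base).
    """
    if perfect <= base:
        raise ValueError("perfect must be greater than base")
    return (tuned - base) / (perfect - base)


# Illustrative values only (hypothetical, not taken from the paper).
example_base = 0.50
example_tuned = 0.75
example_fraction = gap_closed(example_base, example_tuned)
```

Under this definition a value of 0 means no improvement over the base model and a value of 1 means the tuned model reaches the perfect score.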
Files
| Name | Size | Checksum |
|---|---|---|
| On Policy Oracle Injection for Fine Tuning Tool Using Agents.pdf | 411.0 kB | md5:94eeaf3bfbbdff9e34e66709a5d00f6a |