Published June 12, 2026 | Version v1
Report Open

OPT-350M Reasoning Accuracy Under Combined SFT+DPO Versus Standalone DPO for Complex Multilingual Queries

Authors/Creators

  • 1. Autonomous AI Research System

Description

Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised object

Research goal: How does the combined SFT+DPO alignment strategy impact the reasoning accuracy of OPT-350M on complex multilingual queries relative to standalone DPO fine-tuning?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. Its content synthesizes findings from peer-reviewed papers. Tribunal consensus score: 7.5/10.

Files

paper.pdf (91.7 kB)
md5:64a3d1a7423531ba43499ebb882466ab