OPT-350M Reasoning Accuracy Under Combined SFT+DPO Versus Standalone DPO for Complex Multilingual Queries
Description
Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet its empirical behavior with small backbones and modest data is poorly characterized. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training, as well as full fine-tuning (FFT) versus LoRA, on a GPT-2-scale decoder, evaluating on paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over a strong SFT baseline and, without an SFT warm start, can match competitive SFT accuracy when the preference construction closely parallels the supervised object
Research goal: How does the combined SFT+DPO alignment strategy impact the reasoning accuracy of OPT-350M on complex multilingual queries relative to standalone DPO fine-tuning?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
Files
| Name | Size | Checksum |
|---|---|---|
| paper.pdf | 91.7 kB | md5:64a3d1a7423531ba43499ebb882466ab |