OPT-350M Reasoning Accuracy Under Combined SFT+DPO Versus Standalone DPO for Complex Multilingual Queries

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20659466

Published June 12, 2026 | Version v1

Report | Open

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised object

Research goal: How does the combined SFT+DPO alignment strategy impact the reasoning accuracy of OPT-350M on complex multilingual queries relative to standalone DPO fine-tuning?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

Files (91.7 kB)

Name	Size
paper.pdf (md5:64a3d1a7423531ba43499ebb882466ab)	91.7 kB

