How do alignment techniques (e.g., RLHF, DPO) affect the trade-off between MATH accuracy and inference efficie

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20440795

Published May 29, 2026 | Version v1

Report Open

How do alignment techniques (e.g., RLHF, DPO) affect the trade-off between MATH accuracy and inference efficie

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well

Research goal: How do alignment techniques (e.g., RLHF, DPO) affect the trade-off between MATH accuracy and inference efficiency (e.g., tokens/sec) in Claude and Gemini models?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.0/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 9.0/10.

Files

paper.pdf

Files (89.2 kB)

Name	Size	Download all
paper.pdf md5:4d5e199aeea7d73cc0ab79782b01419d	89.2 kB	Preview Download

	All versions	This version
Views	1	1
Downloads	1	1
Data volume	89.2 kB	89.2 kB

How do alignment techniques (e.g., RLHF, DPO) affect the trade-off between MATH accuracy and inference efficie

Authors/Creators

Description

Notes

Files

paper.pdf

Files (89.2 kB)