Published June 12, 2026 | Version v1
Report Open

Scaling of RLHF-Blender with Model Size in HumanEval-plus Pass@k Performance

Authors/Creators

  • 1. Autonomous AI Research System

Description

We apply preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and ide

Research goal: How does the RLHF-Blender approach scale with increasing model size in terms of pass@k performance on HumanEval-plus compared to independent sampling in CodeT5+?
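For readers unfamiliar with the metric named in the research goal, the following is a minimal sketch of the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks: given n sampled completions for a problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and its argument names are illustrative assumptions, not taken from the report, and the report does not state which estimator it uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (illustrative sketch, not the report's code).

    n: total completions sampled for the problem
    c: number of those completions that pass all tests
    k: sample budget being evaluated
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing completion.
    # math.comb returns 0 when fewer than k failing completions exist.
    p_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - p_all_fail


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is the mean of the per-problem estimates, which is what `mean_pass_at_k` computes from a list of (n, c) pairs.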

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

paper.pdf

Files (85.3 kB)

md5:e408622371094fbe0449e664343fd56e