Scaling of RLHF-Blender with Model Size in HumanEval-plus Pass@k Performance
Description
We apply preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and ide
Research goal: How does the RLHF-Blender approach scale with increasing model size in terms of pass@k performance on HumanEval-plus compared to independent sampling in CodeT5+?
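For readers unfamiliar with the metric in the research goal, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below assumes that estimator; this record does not state which variant the report uses, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed unbiased estimator).

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)`, where the (n, c) pairs are made-up inputs rather than results from the report.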
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
Notes
Files
| Name | Size | Checksum |
|---|---|---|
| paper.pdf | 85.3 kB | md5:e408622371094fbe0449e664343fd56e |