Foundation-Sec-8B-Reasoning Accuracy Under RLVR Across Programming Languages in Big-Vul
Description
Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-tho
Research goal: What is the impact of reinforcement learning from verifiable rewards (RLVR) on the accuracy of Foundation-Sec-8B-Reasoning in reasoning-based security tasks across different programming languages in the Big-Vul benchmark?
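To make the outcome-only reward setting mentioned in the description concrete, the following is a minimal sketch of a binary verifiable reward and the group-relative advantage normalization commonly associated with GRPO. It is an illustration under stated assumptions, not the procedure used in the report: the function names, the "vulnerable"/"safe" labels, and the `eps` value are hypothetical, and the population standard deviation is one of several normalization choices.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted: str, gold: str) -> float:
    """Binary outcome reward: 1.0 if the final label matches, else 0.0.

    Hypothetical checker; a real task would parse the model's final answer.
    """
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its sampling group.

    Every token of a response shares that response's single advantage,
    which is why outcome-only rewards cannot localize a late mistake.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers for one prompt.
group = ["vulnerable", "safe", "vulnerable", "safe"]
rewards = [verifiable_reward(p, "vulnerable") for p in group]
advantages = group_relative_advantages(rewards)
```

When every sampled answer in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one face of the sparsity problem the description refers to.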
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.
Files
| Name | Size | Checksum |
|---|---|---|
| paper.pdf | 78.2 kB | md5:914f9ebaf4172083c7645c6360adec19 |