Scaling Effects on Adversarial Robustness in Contrastive vs. MLM Pretraining for Code Generation
Description
Retrieval-augmented generation (RAG) improves language model (LM) performance on knowledge-intensive tasks by providing relevant context at test time. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM
Research goal: How does the scaling of model size affect the adversarial robustness gap between contrastive pretraining and MLM pretraining for code generation, as measured by accuracy on the HumanEvalFix benchmark under increasing perturbation magnitudes?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.7/10.
Files
paper.pdf (91.0 kB)
md5:e3d2ac9a616acfaefe9c912605b8b238