Published March 7, 2026 | Version v11 | Preprint | Open
Contrastive Pretraining Teaches Format Generation, Not Behavioral Knowledge
Description
A 7M-parameter language model trained on OpenWebText scores rho = 0 on bias and sycophancy, behaviors that only emerge at 18M-34M parameters under vanilla training. Injecting contrastive behavioral pairs into just 5% of training blocks breaks this wall: bias rho reaches 0.431 and sycophancy rho reaches 0.513, exceeding the vanilla 34M model's sycophancy with 5x fewer parameters. The dose-response is non-monotonic: 5% is optimal, while 10% triples factual regression. The effect replicates at 12M, 34M, and 64M parameters. Logit-level analysis reveals that every vanilla model from 3M to 64M already achieves exactly 41.0% accuracy when its logits are constrained to the answer tokens, so contrastive injection primarily teaches format generation, not behavioral knowledge. Cross-dimensional transfer is asymmetric: sycophancy-only injection lifts bias at d >= 96, but bias-only injection does not lift sycophancy. A deconcentration score computed from a single SVD separates productive from null injection.
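The following is a minimal, hypothetical Python sketch of three ideas the abstract leans on: mixing contrastive pairs into a small fraction of pretraining blocks, scoring accuracy with logits constrained to a fixed answer-token set, and a deconcentration score from a single SVD. All names here (`build_training_stream`, `constrained_accuracy`, `answer_token_ids`, and the specific deconcentration definition as squared singular-value mass outside the top singular value) are assumptions for illustration, not the paper's exact method; the linked repository holds the real implementation.

```python
import random

import torch


def build_training_stream(web_blocks, contrastive_pairs, inject_rate=0.05):
    """Mix contrastive behavioral pairs into a pretraining stream.

    Hypothetical sketch: with probability inject_rate (5% was the
    reported optimum), a web block is replaced by a contrastive pair
    rendered as a single text block.
    """
    for block in web_blocks:
        if random.random() < inject_rate:
            yield random.choice(contrastive_pairs)
        else:
            yield block


def constrained_accuracy(model, input_ids, answer_token_ids, answer_idx):
    """Accuracy with next-token logits restricted to a fixed answer set.

    Assumes a HuggingFace-style model whose output exposes .logits; only
    the logit slice over answer_token_ids is scored, which is one way to
    read "constrained to answer tokens".
    """
    with torch.no_grad():
        logits = model(input_ids).logits[:, -1, :]   # next-token logits
    restricted = logits[:, answer_token_ids]         # keep answer tokens only
    preds = restricted.argmax(dim=-1)                # index into the answer set
    return (preds == answer_idx).float().mean().item()


def deconcentration_score(weight):
    """Separation statistic from a single SVD (assumed definition).

    Treats the fraction of squared singular-value mass lying outside the
    top singular value as "deconcentration": near 0 when one direction
    dominates the weight matrix, near 1 when energy is spread out.
    """
    s = torch.linalg.svdvals(weight)  # one SVD of the weight matrix
    energy = s.pow(2)
    return float(1.0 - energy[0] / energy.sum())
```

Under these assumptions, a constrained accuracy that stays flat across model sizes while generated-answer scores rise after injection would support the format-generation reading, and a threshold on the deconcentration score would separate productive from null injection runs.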
Files (225.3 kB)

| Name | Size | Checksum |
|---|---|---|
| contrastive_pretraining.pdf | 225.3 kB | md5:6e982d37b88a571a85d96fa5e7bbd215 |
Additional details
Related works
- Is supplemented by: https://github.com/SolomonB14D3/knowledge-fidelity/tree/v2.3.1 (URL)