Published March 7, 2026 | Version v11
Preprint | Open Access

Contrastive Pretraining Teaches Format Generation, Not Behavioral Knowledge

Authors/Creators

Description

A 7M-parameter language model trained on OpenWebText scores rho = 0 on bias and sycophancy, behaviors that only emerge at 18M-34M parameters under vanilla training. Injecting contrastive behavioral pairs into just 5% of training blocks breaks this wall: bias rho reaches 0.431 and sycophancy rho reaches 0.513, exceeding the vanilla 34M model's sycophancy score with 5x fewer parameters. The dose-response is non-monotonic (5% is optimal; 10% triples factual regression). The effect replicates at 12M, 34M, and 64M parameters. Logit-level analysis reveals that every vanilla model from 3M to 64M already achieves exactly 41.0% accuracy when constrained to choose among the answer tokens, indicating that contrastive injection primarily teaches format generation, not behavioral knowledge. Cross-dimensional transfer is asymmetric: sycophancy-only injection lifts bias at d >= 96, but bias-only injection does not lift sycophancy. A deconcentration score separates productive from null injection with a single SVD.
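
The injection procedure is a data-mixing choice at block construction time. As a minimal sketch of what "5% of training blocks" could mean in code, assuming a block-level sampler with hypothetical iterators webtext_blocks and pair_blocks (neither name comes from the paper):

    import random

    def next_training_block(webtext_blocks, pair_blocks, inject_rate=0.05):
        # With probability inject_rate, emit a contrastive behavioral
        # pair block; otherwise emit an ordinary OpenWebText block.
        if random.random() < inject_rate:
            return next(pair_blocks)
        return next(webtext_blocks)

The constant 41.0% figure comes from scoring models with output restricted to the answer tokens rather than free generation. The paper's exact harness is not reproduced here; the sketch below shows the standard way to compute such a constrained accuracy, assuming per-example final-position logits and a fixed candidate set (all names are illustrative):

    import numpy as np

    def constrained_accuracy(logits, answer_ids, gold_ids):
        # logits:     (n_examples, vocab_size) final-position logits
        # answer_ids: vocab ids of the candidate answer tokens
        # gold_ids:   (n_examples,) vocab id of the correct answer
        restricted = logits[:, answer_ids]              # score candidates only
        picks = np.asarray(answer_ids)[restricted.argmax(axis=1)]
        return float((picks == np.asarray(gold_ids)).mean())

The deconcentration score is described only as a single-SVD statistic, so its formula is an assumption here. One plausible instantiation measures how evenly spectral energy is spread across directions:

    import numpy as np

    def deconcentration_score(M):
        # Hypothetical definition: 1 minus the top singular value's share
        # of total spectral energy in a representation (or weight-update)
        # matrix M. Near 0 when one direction dominates; approaches 1 as
        # variance spreads across many directions.
        s = np.linalg.svd(M, compute_uv=False)
        energy = s**2 / (s**2).sum()
        return float(1.0 - energy[0])

Under this hypothetical reading, productive injection would raise the score by spreading behavioral signal across several directions, while null injection would leave the spectrum concentrated.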

Files

contrastive_pretraining.pdf (225.3 kB)
md5:6e982d37b88a571a85d96fa5e7bbd215
