Architecture Determines Observability in Transformers
Description
This is an early development version. Results, citations, and analysis may be incomplete or incorrect. For the current version, see: https://doi.org/10.5281/zenodo.19435674
...
At a 20% flag rate on MedQA-USMLE, a linear probe on frozen mid-layer activations catches 13.4% of confidently wrong answers, roughly one in seven of the mistakes the model makes on medical licensing questions. Output confidence and a trained predictor on the full last-layer representation mark those same answers as correct. The probe was trained on Wikipedia next-token prediction and applied zero-shot, with no task-specific data. The flag itself is a single dot product per token.
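The mechanism is cheap enough to sketch in a few lines. The fragment below is an illustration under stated assumptions, not the released code: probe_w stands in for the WikiText-trained probe direction, layer_idx for the mid layer chosen per model, and the 20% cut is applied per token here rather than per answer as in the downstream evaluation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # any HuggingFace causal LM
layer_idx = 6      # illustrative mid layer; the paper picks this per model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True).eval()

probe_w = torch.randn(model.config.hidden_size)  # placeholder for the trained probe vector

@torch.no_grad()
def observer_scores(text):
    ids = tok(text, return_tensors="pt")
    hidden = model(**ids).hidden_states[layer_idx][0]  # (seq_len, hidden_size), frozen activations
    return hidden @ probe_w                            # one dot product per token

scores = observer_scores("Example answer to score.")
flagged = scores < scores.quantile(0.20)  # flag the lowest-scoring 20%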
Whether this signal exists depends on architecture. Under Pythia's controlled training, three configurations sharing 24 layers and 16 heads collapse to a partial correlation near +0.10, despite spanning a 3.5x parameter gap, a 2x difference in hidden dimension, a 2x difference in head dimension, and two Pile variants. Six other Pythia sizes, spanning depths 6 to 36 and head counts 8 to 40, produce values between +0.21 and +0.38, with no intermediate points observed. The pattern replicates under Llama's recipe at a different configuration: 1B sits at +0.286, while 3B and 8B collapse. Across 13 models in 6 families, family membership captures eta-squared = 0.92 (permutation p = 0.006); log parameter count has no detectable effect.
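The family-membership figure is a one-way ANOVA eta-squared with a label-permutation null. A minimal numpy sketch of the general form of that test, run here on synthetic placeholder values rather than the committed per-model results:

import numpy as np

def eta_squared(values, groups):
    # between-family sum of squares over total sum of squares
    values, groups = np.asarray(values, dtype=float), np.asarray(groups)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum((groups == g).sum() * (values[groups == g].mean() - grand) ** 2
                     for g in np.unique(groups))
    return ss_between / ss_total

rng = np.random.default_rng(0)
scores = rng.normal(0.2, 0.1, size=13)                        # placeholder per-model partial correlations
families = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5])  # 13 models grouped into 6 families
observed = eta_squared(scores, families)
null = np.array([eta_squared(scores, rng.permutation(families)) for _ in range(9999)])
p_value = (1 + (null >= observed).sum()) / (1 + len(null))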
Key results
- Pythia (24L, 16H) collapse replicates at three points: 410M +0.105, 1.4B +0.106, 1.4B-deduped +0.100 (range 0.006 across a 3.5x parameter gap and the Pile deduplication boundary)
- Within-Pythia exact permutation: 2-vs-6 p = 0.036 (1/28); tightens to p = 0.012 (1/84) with 1.4B-deduped as a third collapsed point
- Shuffle test on Pythia 1.4B at layer 17: real probe +0.106 exceeds all 10 shuffled-label permutations (null -0.002 +/- 0.036, 3.0 sigma)
- Cross-family permutation: F = 15.77, p = 0.006, eta-squared = 0.92 (13 models, 6 families)
- Llama cliff at a different configuration: 1B (16L/32H) +0.286, 3B (28L/24H) +0.091, 8B (32L/32H) +0.093
- Matched 3B scale: Qwen 2.5 +0.263 vs Llama 3.2 +0.091, 2.9x gap with non-overlapping per-seed distributions
- 48.6-75.8% of raw probe-loss correlation redundant with max softmax and activation norm across 12 of 13 models
- 20-seed hardening on GPT-2 124M: +0.282 +/- 0.001, seed agreement +0.993
- 512-unit MLP on the last-layer representation absorbs no more signal than a 64-unit bottleneck at any tested scale
- Nonlinear probes at the three collapse points (Llama 3B, Pythia 410M, Pythia 1.4B) stay below the +0.21 healthy floor
- Output-controlled residual r_OC follows the same architecture split: healthy Pythia configurations run +0.088 to +0.169, while the three collapse points sit at zero or negative r_OC (a partial-correlation sketch of r_OC follows this list)
- Zero-shot downstream: 7 of 9 model-task cells at 20% flag rate fall between 10.9% and 13.4% on SQuAD 2.0 RAG, MedQA-USMLE, and TruthfulQA (Qwen 2.5 7B-I, Mistral 7B-I, Phi-3 Mini-I)
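The output-controlled residual r_OC is, in spirit, a partial correlation: correlate probe score with token loss after regressing both onto output-side signals (max softmax and activation norm, the same pair used in the redundancy estimate above). A minimal sketch with synthetic stand-in arrays; the committed analysis may differ in detail:

import numpy as np

def partial_corr(x, y, controls):
    # correlate the parts of x and y not explained by the control variables
    X = np.column_stack([np.ones(len(x))] + list(controls))
    rx = x - X @ np.linalg.lstsq(X, x, rcond=None)[0]
    ry = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n = 1000
max_softmax = rng.random(n)   # per-token output confidence (stand-in)
act_norm = rng.random(n)      # per-token activation norm (stand-in)
probe_score = 0.5 * max_softmax + rng.normal(size=n)
token_loss = -0.4 * max_softmax + rng.normal(size=n)
r_oc = partial_corr(probe_score, token_loss, [max_softmax, act_norm])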
What changed since v2.4.0
- Pythia controlled-training suite: 8 sizes (70M to 12B), plus 1.4B-deduped (corpus control) and 1.4B shuffled-label run (probe-floor control). Within-recipe causation at the configuration-class level
- Three-model downstream evaluation on SQuAD 2.0, MedQA-USMLE, and TruthfulQA. Each model uses its own WikiText-trained probe with no task-specific retraining
- Nonlinear probe comparison at three collapse configurations with held-out train/validation/test splits: best-achievable MLP scores stay below the +0.21 healthy floor
- Mean-ablation patching on Llama 3.2 1B vs 3B: layer-1 MLP flips sign on the observer score across the cliff (sign pattern only; magnitudes unreliable under early-layer loss damage)
- Output-predictor width sweep (64 to 512 units on Qwen 7B) and a 30-resample document-level bootstrap (see the sketch after this list)
- v3 protocol across 13 models: 7-seed evaluation, token budgets matched by hidden dimension
- Title changed from "Architecture Predicts Linear Readability of Decision Quality in Transformers" to "Architecture Determines Observability in Transformers"
- Pipeline: auto-generated macros and tables from results JSONs; 10-layer validation (just check) runs before each release
- 21 committed results JSONs (was 13)
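The document-level bootstrap resamples whole documents with replacement, so tokens from the same document stay together, and recomputes the statistic on each resample. A generic sketch with placeholder per-document values; n_resamples matches the 30 used above:

import numpy as np

def doc_bootstrap(per_doc_values, stat=np.mean, n_resamples=30, seed=0):
    # resample document indices with replacement; recompute the statistic per resample
    rng = np.random.default_rng(seed)
    vals = np.asarray(per_doc_values, dtype=float)
    idx = rng.integers(0, len(vals), size=(n_resamples, len(vals)))
    return np.array([stat(vals[i]) for i in idx])

samples = doc_bootstrap(np.random.default_rng(1).normal(0.25, 0.05, size=200))  # placeholder data
low, high = np.percentile(samples, [2.5, 97.5])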
Reproducibility
All results are committed as JSON in results/. The statistical analysis reproduces from the committed data without a GPU:
cd nn-observability && python analysis/run_all.py
New model evaluation (any HuggingFace causal LM):
python scripts/run_model.py --model <model-id> --output <name>_results.json
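For example, to rerun a Pythia checkpoint (model id chosen here only as an illustration):
python scripts/run_model.py --model EleutherAI/pythia-1.4b --output pythia-1.4b_results.json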
Versioning
v3.0.0 is a new version in the same concept DOI chain (zenodo.19435674) as v2.4.0. Readers who cite the version-specific DOI get this release; readers who cite the concept DOI get whatever is latest. Code repository: https://github.com/tmcarmichael/nn-observability.
Files
architecture-determines-observability-2026-pre-print-v3.0.0.pdf
Total size 2.6 MB across two files: md5:b83364bdcbf850d361437ee91b78748d (1.1 MB) and md5:f26d7f2bcf6028f89d81b143f69fc9ce (1.5 MB).
Additional details
Software
- Repository URL: https://github.com/tmcarmichael/nn-observability
- Programming language: Python
- Development Status: Active
References
- Abdin, M., Aneja, J., Awadalla, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219.
- Afzal, M., Matthes, F., Chechik, G., & Ziser, Y. (2025). Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion. Findings of ACL, 12791-12806.
- Alain, G., & Bengio, Y. (2017). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
- Belinkov, Y. (2022). Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, 48(1), 207-219.
- Biderman, S., Schoelkopf, H., Anthony, Q., et al. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. ICML.
- Bricken, T., Templeton, A., Batson, J., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
- Burger, L., Hamprecht, F. A., & Nadler, B. (2024). Truth is Universal: Robust Detection of Lies in LLMs. NeurIPS.
- Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2023). Discovering Latent Knowledge in Language Models Without Supervision. ICLR.
- Gemma Team. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295.
- Grattafiori, A., Dubey, A., Jauhri, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
- Guan, M. Y., Wang, M., Carroll, M., et al. (2025). Monitoring Monitorability. arXiv:2512.18311.
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML, 1321-1330.
- Hewitt, J., & Liang, P. (2019). Designing and Interpreting Probes with Control Tasks. EMNLP-IJCNLP, 2733-2743.
- Honovich, O., Aharoni, R., Herzig, J., et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. NAACL, 3905-3920.
- Huang, L., et al. (2025). Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations. arXiv:2503.14477.
- Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). Mistral 7B. arXiv:2310.06825.
- Kadavath, S., Conerly, T., Askell, A., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.
- Korbak, T., Balesni, M., Barnes, E., et al. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv:2507.11473.
- Kossen, J., Han, J., Farquhar, S., Gal, Y., & Kuhn, L. (2024). Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs. arXiv:2406.15927.
- Kramar, J., Engels, J., Wang, Z., et al. (2026). Building Production-Ready Probes For Gemini. arXiv:2601.11516.
- Kuhn, L., Gal, Y., & Farquhar, S. (2023). Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. ICLR.
- Li, K., Patel, O., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS.
- Marks, S., & Tegmark, M. (2023). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824.
- McGuinness, M., Serrano, A., Bailey, L., & Emmons, S. (2025). Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors. arXiv:2512.11949.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
- Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2017). Pointer Sentinel Mixture Models. arXiv:1609.07843.
- Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP.
- Oldfield, J., Torr, P., Patras, I., Bibi, A., & Barez, F. (2026). Beyond Linear Probes: Dynamic Safety Monitoring for Language Models. ICLR.
- Rozenfeld, S., Pankajakshan, R., Zloczower, I., Lenga, E., Gressel, G., & Mirsky, Y. (2026). GAVEL: Towards Rule-Based Safety through Activation Monitoring. ICLR.
- Schuster, T., Fisch, A., Gupta, J. P., et al. (2022). Confident Adaptive Language Modeling. NeurIPS.
- Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., et al. (2025). Open Problems in Mechanistic Interpretability. arXiv:2501.16496.
- Wen, B., Yao, Y., Feng, Z., et al. (2025). Know Your Limits: A Survey of Abstention in Large Language Models. TACL, 13, 529-556.
- Yang, A., Yang, B., Hui, B., et al. (2024). Qwen2 Technical Report. arXiv:2407.10671.
- Zhang, A., Chen, Y., Pan, J., et al. (2025). Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification. arXiv:2504.05419.