There is a newer version of the record available.

Published April 23, 2026 | Version v3.0.0
Preprint Open

Architecture Determines Observability in Transformers

Description

This is an early development version. Results, citations, and analysis may be incomplete or incorrect. For the current version, see: https://doi.org/10.5281/zenodo.19435674

...

At a 20% flag rate on MedQA-USMLE, a linear probe on frozen mid-layer activations catches 13.4% of confidently wrong answers, roughly one in seven of the mistakes the model makes on medical licensing questions. Output confidence and a trained predictor on the full last-layer representation both mark those same answers correct. The probe was trained on Wikipedia next-token prediction and applied zero-shot, with no task-specific data. The cost is a single dot product per token.
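The scoring step described above can be sketched in a few lines. Everything here is illustrative: the probe direction, activation matrix, and dimensions are placeholders, not the paper's actual weights or data. The point is the mechanics of a frozen linear probe applied at a fixed flag rate.

```python
import numpy as np

# Hypothetical stand-ins: a probe direction w learned elsewhere (e.g. on
# WikiText next-token prediction) and frozen mid-layer activations, one
# vector per answer. Shapes are arbitrary for illustration.
rng = np.random.default_rng(0)
d_model = 2048
n_answers = 500

w = rng.normal(size=d_model)                  # frozen probe direction
acts = rng.normal(size=(n_answers, d_model))  # mid-layer activations

scores = acts @ w                             # one dot product per item

flag_rate = 0.20
cutoff = np.quantile(scores, 1 - flag_rate)   # threshold at the 80th percentile
flagged = scores >= cutoff                    # top 20% flagged for review

print(f"flagged {flagged.mean():.0%} of answers")
```

No gradients, no task-specific retraining: applying the probe to a new task is just the matrix-vector product and a percentile threshold.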

Whether this signal exists depends on architecture. Under Pythia's controlled training, three configurations sharing the (24-layer, 16-head) shape collapse to a partial correlation near 0.10, across a 3.5x parameter gap, a 2x hidden dimension, a 2x head dimension, and two Pile variants. Six other Pythia sizes, spanning depths 6 to 36 and head counts 8 to 40, produce values between +0.21 and +0.38, with no intermediate points observed. The pattern replicates under Llama's recipe at a different configuration: 1B at +0.286, while 3B and 8B collapse. Across 13 models in 6 families, family membership captures eta-squared = 0.92 (permutation p = 0.006); log parameter count has no effect.
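The cross-family statistic quoted above (eta-squared with a permutation p-value) can be reproduced with a short sketch. The per-model scores and family labels below are placeholders, not the paper's 13-model data; the functions show the computation, not the result.

```python
import numpy as np

def eta_squared(values, groups):
    # Between-group sum of squares divided by total sum of squares:
    # the fraction of variance explained by group membership.
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand) ** 2
        for g in set(groups.tolist())
    )
    return ss_between / ss_total

# Illustrative scores for 13 models in 6 families (placeholder numbers).
scores = np.array([0.28, 0.30, 0.29, 0.10, 0.11, 0.10, 0.26,
                   0.27, 0.09, 0.25, 0.24, 0.31, 0.12])
families = np.array(["A", "A", "A", "B", "B", "B", "C",
                     "C", "D", "E", "E", "F", "F"])

obs = eta_squared(scores, families)

# Permutation null: shuffle family labels, recompute eta-squared,
# and count how often the shuffled value reaches the observed one.
rng = np.random.default_rng(1)
n_perm = 999
null = [eta_squared(scores, rng.permutation(families)) for _ in range(n_perm)]
p = (1 + sum(n >= obs for n in null)) / (1 + n_perm)
```

The add-one correction in the last line keeps the permutation p-value strictly positive, which is standard for Monte Carlo permutation tests.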

Key results

- Pythia (24L, 16H) collapse replicates at three points: 410M +0.105, 1.4B +0.106, 1.4B-deduped +0.100 (range 0.006 across a 3.5x parameter gap and the Pile deduplication boundary)
- Within-Pythia exact permutation: 2-vs-6 p = 0.036 (1/28); tightens to p = 0.012 (1/84) with 1.4B-deduped as a third collapsed point
- Shuffle test on Pythia 1.4B at layer 17: real probe +0.106 exceeds all 10 shuffled-label permutations (null -0.002 +/- 0.036, 3.0 sigma)
- Cross-family permutation: F = 15.77, p = 0.006, eta-squared = 0.92 (13 models, 6 families)
- Llama cliff at a different configuration: 1B (16L/32H) +0.286, 3B (28L/24H) +0.091, 8B (32L/32H) +0.093
- Matched 3B scale: Qwen 2.5 +0.263 vs Llama 3.2 +0.091, 2.9x gap with non-overlapping per-seed distributions
- 48.6-75.8% of the raw probe-loss correlation is redundant with max softmax and activation norm across 12 of 13 models
- 20-seed hardening on GPT-2 124M: +0.282 +/- 0.001, seed agreement +0.993
- 512-unit MLP on the last-layer representation absorbs no more signal than a 64-unit bottleneck at any tested scale
- Nonlinear probes at the three collapse points (Llama 3B, Pythia 410M, Pythia 1.4B) stay below the +0.21 healthy floor
- Output-controlled residual r_OC follows the same architecture split: healthy Pythia configurations +0.088 to +0.169, three collapse points at zero or negative r_OC
- Zero-shot downstream: 7 of 9 model-task cells at 20% flag rate fall between 10.9% and 13.4% on SQuAD 2.0 RAG, MedQA-USMLE, and TruthfulQA (Qwen 2.5 7B-I, Mistral 7B-I, Phi-3 Mini-I)

What changed since v2.4.0

- Pythia controlled-training suite: 8 sizes (70M to 12B), plus 1.4B-deduped (corpus control) and 1.4B shuffled-label run (probe-floor control). Within-recipe causation at the configuration-class level
- Three-model downstream evaluation on SQuAD 2.0, MedQA-USMLE, and TruthfulQA. Each model uses its own WikiText-trained probe with no task-specific retraining
- Nonlinear probe comparison at three collapse configurations with held-out train/validation/test splits: best-achievable MLP scores stay below the +0.21 healthy floor
- Mean-ablation patching on Llama 3.2 1B vs 3B: layer-1 MLP flips sign on the observer score across the cliff (sign pattern only; magnitudes unreliable under early-layer loss damage)
- Output-predictor width sweep (64 to 512 units on Qwen 7B) and 30-resample document-level bootstrap
- v3 protocol across 13 models: 7-seed evaluation, token budgets matched by hidden dimension
- Title changed from "Architecture Predicts Linear Readability of Decision Quality in Transformers" to "Architecture Determines Observability in Transformers"
- Pipeline: auto-generated macros and tables from results JSONs; 10-layer validation (just check) runs before each release
- 21 committed results JSONs (was 13)
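The document-level bootstrap in the list above can be sketched as follows. The per-document scores, sample size, and metric are placeholders; only the resampling structure (documents, not tokens, as the resampling unit, with 30 resamples) comes from the release notes.

```python
import numpy as np

# Placeholder per-document metric values (hypothetical data).
rng = np.random.default_rng(42)
doc_scores = rng.normal(loc=0.26, scale=0.05, size=200)

# Document-level bootstrap: resample whole documents with replacement
# so that within-document token correlations are preserved.
n_resamples = 30
means = []
for _ in range(n_resamples):
    sample = rng.choice(doc_scores, size=len(doc_scores), replace=True)
    means.append(sample.mean())

lo, hi = np.percentile(means, [2.5, 97.5])  # percentile interval on the mean
```

Resampling at the document level rather than the token level is the standard guard against overstated precision when tokens within a document are correlated.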

Reproducibility

All results committed as JSON in results/. Statistical analysis reproduces from committed data without GPU:

cd nn-observability && python analysis/run_all.py

New model evaluation (any HuggingFace causal LM):

python scripts/run_model.py --model <model-id> --output <name>_results.json

Versioning

v3.0.0 is a new version in the same concept DOI chain (zenodo.19435674) as v2.4.0. Readers who cite the version-specific DOI get this release; readers who cite the concept DOI get whatever is latest. Code repository: https://github.com/tmcarmichael/nn-observability.

Notes

If you use this work, please cite it using the metadata from this file.

Files (2.6 MB)

architecture-determines-observability-2026-pre-print-v3.0.0.pdf
md5:b83364bdcbf850d361437ee91b78748d — 1.1 MB
md5:f26d7f2bcf6028f89d81b143f69fc9ce — 1.5 MB

Additional details

Software

Repository URL
https://github.com/tmcarmichael/nn-observability
Programming language
Python
Development Status
Active

References

  • Abdin, M., Aneja, J., Awadalla, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219.
  • Afzal, M., Matthes, F., Chechik, G., & Ziser, Y. (2025). Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion. Findings of ACL, 12791-12806.
  • Alain, G., & Bengio, Y. (2017). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
  • Belinkov, Y. (2022). Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, 48(1), 207-219.
  • Biderman, S., Schoelkopf, H., Anthony, Q., et al. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. ICML.
  • Bricken, T., Templeton, A., Batson, J., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
  • Burger, L., Hamprecht, F. A., & Nadler, B. (2024). Truth is Universal: Robust Detection of Lies in LLMs. NeurIPS.
  • Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2023). Discovering Latent Knowledge in Language Models Without Supervision. ICLR.
  • Gemma Team. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295.
  • Grattafiori, A., Dubey, A., Jauhri, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
  • Guan, M. Y., Wang, M., Carroll, M., et al. (2025). Monitoring Monitorability. arXiv:2512.18311.
  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML, 1321-1330.
  • Hewitt, J., & Liang, P. (2019). Designing and Interpreting Probes with Control Tasks. EMNLP-IJCNLP, 2733-2743.
  • Honovich, O., Aharoni, R., Herzig, J., et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. NAACL, 3905-3920.
  • Huang, L., et al. (2025). Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations. arXiv:2503.14477.
  • Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). Mistral 7B. arXiv:2310.06825.
  • Kadavath, S., Conerly, T., Askell, A., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.
  • Korbak, T., Balesni, M., Barnes, E., et al. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv:2507.11473.
  • Kossen, J., Han, J., Farquhar, S., Gal, Y., & Kuhn, L. (2024). Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs. arXiv:2406.15927.
  • Kramar, J., Engels, J., Wang, Z., et al. (2026). Building Production-Ready Probes For Gemini. arXiv:2601.11516.
  • Kuhn, L., Gal, Y., & Farquhar, S. (2023). Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. ICLR.
  • Li, K., Patel, O., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS.
  • Marks, S., & Tegmark, M. (2023). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824.
  • McGuinness, M., Serrano, A., Bailey, L., & Emmons, S. (2025). Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors. arXiv:2512.11949.
  • Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
  • Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2017). Pointer Sentinel Mixture Models. arXiv:1609.07843.
  • Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP.
  • Oldfield, J., Torr, P., Patras, I., Bibi, A., & Barez, F. (2026). Beyond Linear Probes: Dynamic Safety Monitoring for Language Models. ICLR.
  • Rozenfeld, S., Pankajakshan, R., Zloczower, I., Lenga, E., Gressel, G., & Mirsky, Y. (2026). GAVEL: Towards Rule-Based Safety through Activation Monitoring. ICLR.
  • Schuster, T., Fisch, A., Gupta, J. P., et al. (2022). Confident Adaptive Language Modeling. NeurIPS.
  • Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., et al. (2025). Open Problems in Mechanistic Interpretability. arXiv:2501.16496.
  • Wen, B., Yao, Y., Feng, Z., et al. (2025). Know Your Limits: A Survey of Abstention in Large Language Models. TACL, 13, 529-556.
  • Yang, A., Yang, B., Hui, B., et al. (2024). Qwen2 Technical Report. arXiv:2407.10671.
  • Zhang, A., Chen, Y., Pan, J., et al. (2025). Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification. arXiv:2504.05419.