There is a newer version of the record available.

Published April 23, 2026 | Version v3.0.0
Preprint Open

Architecture Determines Observability in Transformers

Description

This is an early development version. Results, citations, and analysis may be incomplete or incorrect. For the current version, see: https://doi.org/10.5281/zenodo.19435674

...

At a 20% flag rate on MedQA-USMLE, a linear probe on frozen mid-layer activations catches 13.4% of confidently wrong answers, roughly one in seven of the mistakes the model makes on medical licensing questions. Output confidence and a trained predictor on the full last-layer representation both mark those same answers correct. The probe was trained on Wikipedia next-token prediction and applied zero-shot, with no task-specific data. The cost is a single dot product per token.
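The scoring step described above can be sketched in a few lines. Everything here is illustrative: the probe direction, activation matrix, and dimensions are placeholders, not the paper's actual weights or data. The point is the mechanics of a frozen linear probe applied at a fixed flag rate.

```python
import numpy as np

# Hypothetical stand-ins: a probe direction w learned elsewhere (e.g. on
# WikiText next-token prediction) and frozen mid-layer activations, one
# vector per answer. Shapes are arbitrary for illustration.
rng = np.random.default_rng(0)
d_model = 2048
n_answers = 500

w = rng.normal(size=d_model)                  # frozen probe direction
acts = rng.normal(size=(n_answers, d_model))  # mid-layer activations

scores = acts @ w                             # one dot product per item

flag_rate = 0.20
cutoff = np.quantile(scores, 1 - flag_rate)   # threshold at the 80th percentile
flagged = scores >= cutoff                    # top 20% flagged for review

print(f"flagged {flagged.mean():.0%} of answers")
```

No gradients, no task-specific retraining: applying the probe to a new task is just the matrix-vector product and a percentile threshold.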

Whether this signal exists depends on architecture. Under Pythia's controlled training, three configurations sharing the (24-layer, 16-head) shape collapse to a partial correlation near 0.10, across a 3.5x parameter gap, a 2x hidden dimension, a 2x head dimension, and two Pile variants. Six other Pythia sizes, spanning depths 6 to 36 and head counts 8 to 40, produce values between +0.21 and +0.38, with no intermediate points observed. The pattern replicates under Llama's recipe at a different configuration: 1B at +0.286, while 3B and 8B collapse. Across 13 models in 6 families, family membership captures eta-squared = 0.92 (permutation p = 0.006); log parameter count has no effect.
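The cross-family statistic quoted above (eta-squared with a permutation p-value) can be reproduced with a short sketch. The per-model scores and family labels below are placeholders, not the paper's 13-model data; the functions show the computation, not the result.

```python
import numpy as np

def eta_squared(values, groups):
    # Between-group sum of squares divided by total sum of squares:
    # the fraction of variance explained by group membership.
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand) ** 2
        for g in set(groups.tolist())
    )
    return ss_between / ss_total

# Illustrative scores for 13 models in 6 families (placeholder numbers).
scores = np.array([0.28, 0.30, 0.29, 0.10, 0.11, 0.10, 0.26,
                   0.27, 0.09, 0.25, 0.24, 0.31, 0.12])
families = np.array(["A", "A", "A", "B", "B", "B", "C",
                     "C", "D", "E", "E", "F", "F"])

obs = eta_squared(scores, families)

# Permutation null: shuffle family labels, recompute eta-squared,
# and count how often the shuffled value reaches the observed one.
rng = np.random.default_rng(1)
n_perm = 999
null = [eta_squared(scores, rng.permutation(families)) for _ in range(n_perm)]
p = (1 + sum(n >= obs for n in null)) / (1 + n_perm)
```

The add-one correction in the last line keeps the permutation p-value strictly positive, which is standard for Monte Carlo permutation tests.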

Key results

- Pythia (24L, 16H) collapse replicates at three points: 410M +0.105, 1.4B +0.106, 1.4B-deduped +0.100 (range 0.006 across a 3.5x parameter gap and the Pile deduplication boundary)
- Within-Pythia exact permutation: 2-vs-6 p = 0.036 (1/28); tightens to p = 0.012 (1/84) with 1.4B-deduped as a third collapsed point
- Shuffle test on Pythia 1.4B at layer 17: real probe +0.106 exceeds all 10 shuffled-label permutations (null -0.002 +/- 0.036, 3.0 sigma)
- Cross-family permutation: F = 15.77, p = 0.006, eta-squared = 0.92 (13 models, 6 families)
- Llama cliff at a different configuration: 1B (16L/32H) +0.286, 3B (28L/24H) +0.091, 8B (32L/32H) +0.093
- Matched 3B scale: Qwen 2.5 +0.263 vs Llama 3.2 +0.091, 2.9x gap with non-overlapping per-seed distributions
- 48.6-75.8% of the raw probe-loss correlation is redundant with max softmax and activation norm across 12 of 13 models
- 20-seed hardening on GPT-2 124M: +0.282 +/- 0.001, seed agreement +0.993
- 512-unit MLP on the last-layer representation absorbs no more signal than a 64-unit bottleneck at any tested scale
- Nonlinear probes at the three collapse points (Llama 3B, Pythia 410M, Pythia 1.4B) stay below the +0.21 healthy floor
- Output-controlled residual r_OC follows the same architecture split: healthy Pythia configurations +0.088 to +0.169, three collapse points at zero or negative r_OC
- Zero-shot downstream: 7 of 9 model-task cells at 20% flag rate fall between 10.9% and 13.4% on SQuAD 2.0 RAG, MedQA-USMLE, and TruthfulQA (Qwen 2.5 7B-I, Mistral 7B-I, Phi-3 Mini-I)

What changed since v2.4.0

- Pythia controlled-training suite: 8 sizes (70M to 12B), plus 1.4B-deduped (corpus control) and 1.4B shuffled-label run (probe-floor control). Within-recipe causation at the configuration-class level
- Three-model downstream evaluation on SQuAD 2.0, MedQA-USMLE, and TruthfulQA. Each model uses its own WikiText-trained probe with no task-specific retraining
- Nonlinear probe comparison at three collapse configurations with held-out train/validation/test splits: best-achievable MLP scores stay below the +0.21 healthy floor
- Mean-ablation patching on Llama 3.2 1B vs 3B: layer-1 MLP flips sign on the observer score across the cliff (sign pattern only; magnitudes unreliable under early-layer loss damage)
- Output-predictor width sweep (64 to 512 units on Qwen 7B) and 30-resample document-level bootstrap
- v3 protocol across 13 models: 7-seed evaluation, token budgets matched by hidden dimension
- Title changed from "Architecture Predicts Linear Readability of Decision Quality in Transformers" to "Architecture Determines Observability in Transformers"
- Pipeline: auto-generated macros and tables from results JSONs; 10-layer validation (just check) runs before each release
- 21 committed results JSONs (was 13)
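The document-level bootstrap in the list above can be sketched as follows. The per-document scores, sample size, and metric are placeholders; only the resampling structure (documents, not tokens, as the resampling unit, with 30 resamples) comes from the release notes.

```python
import numpy as np

# Placeholder per-document metric values (hypothetical data).
rng = np.random.default_rng(42)
doc_scores = rng.normal(loc=0.26, scale=0.05, size=200)

# Document-level bootstrap: resample whole documents with replacement
# so that within-document token correlations are preserved.
n_resamples = 30
means = []
for _ in range(n_resamples):
    sample = rng.choice(doc_scores, size=len(doc_scores), replace=True)
    means.append(sample.mean())

lo, hi = np.percentile(means, [2.5, 97.5])  # percentile interval on the mean
```

Resampling at the document level rather than the token level is the standard guard against overstated precision when tokens within a document are correlated.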

Reproducibility

All results committed as JSON in results/. Statistical analysis reproduces from committed data without GPU:

cd nn-observability && python analysis/run_all.py

New model evaluation (any HuggingFace causal LM):

python scripts/run_model.py --model <model-id> --output <name>_results.json

Versioning

v3.0.0 is a new version in the same concept DOI chain (zenodo.19435674) as v2.4.0. Readers who cite the version-specific DOI get this release; readers who cite the concept DOI get whatever is latest. Code repository: https://github.com/tmcarmichael/nn-observability.

Notes

If you use this work, please cite it using the metadata from this file.

Files (2.6 MB)

architecture-determines-observability-2026-pre-print-v3.0.0.pdf
md5:b83364bdcbf850d361437ee91b78748d — 1.1 MB
md5:f26d7f2bcf6028f89d81b143f69fc9ce — 1.5 MB

Additional details

Software

Repository URL
https://github.com/tmcarmichael/nn-observability
Programming language
Python
Development Status
Active

References

  • Abdin, M., Aneja, J., Awadalla, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219.
  • Afzal, M., Matthes, F., Chechik, G., & Ziser, Y. (2025). Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion. Findings of ACL, 12791-12806.
  • Alain, G., & Bengio, Y. (2017). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
  • Belinkov, Y. (2022). Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, 48(1), 207-219.
  • Biderman, S., Schoelkopf, H., Anthony, Q., et al. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. ICML.
  • Bricken, T., Templeton, A., Batson, J., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
  • Burger, L., Hamprecht, F. A., & Nadler, B. (2024). Truth is Universal: Robust Detection of Lies in LLMs. NeurIPS.
  • Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2023). Discovering Latent Knowledge in Language Models Without Supervision. ICLR.
  • Gemma Team. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295.
  • Grattafiori, A., Dubey, A., Jauhri, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
  • Guan, M. Y., Wang, M., Carroll, M., et al. (2025). Monitoring Monitorability. arXiv:2512.18311.
  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML, 1321-1330.
  • Hewitt, J., & Liang, P. (2019). Designing and Interpreting Probes with Control Tasks. EMNLP-IJCNLP, 2733-2743.
  • Honovich, O., Aharoni, R., Herzig, J., et al. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. NAACL, 3905-3920.
  • Huang, L., et al. (2025). Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations. arXiv:2503.14477.
  • Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). Mistral 7B. arXiv:2310.06825.
  • Kadavath, S., Conerly, T., Askell, A., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.
  • Korbak, T., Balesni, M., Barnes, E., et al. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv:2507.11473.
  • Kossen, J., Han, J., Farquhar, S., Gal, Y., & Kuhn, L. (2024). Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs. arXiv:2406.15927.
  • Kramar, J., Engels, J., Wang, Z., et al. (2026). Building Production-Ready Probes For Gemini. arXiv:2601.11516.
  • Kuhn, L., Gal, Y., & Farquhar, S. (2023). Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. ICLR.
  • Li, K., Patel, O., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS.
  • Marks, S., & Tegmark, M. (2023). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824.
  • McGuinness, M., Serrano, A., Bailey, L., & Emmons, S. (2025). Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors. arXiv:2512.11949.
  • Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
  • Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2017). Pointer Sentinel Mixture Models. arXiv:1609.07843.
  • Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP.
  • Oldfield, J., Torr, P., Patras, I., Bibi, A., & Barez, F. (2026). Beyond Linear Probes: Dynamic Safety Monitoring for Language Models. ICLR.
  • Rozenfeld, S., Pankajakshan, R., Zloczower, I., Lenga, E., Gressel, G., & Mirsky, Y. (2026). GAVEL: Towards Rule-Based Safety through Activation Monitoring. ICLR.
  • Schuster, T., Fisch, A., Gupta, J. P., et al. (2022). Confident Adaptive Language Modeling. NeurIPS.
  • Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., et al. (2025). Open Problems in Mechanistic Interpretability. arXiv:2501.16496.
  • Wen, B., Yao, Y., Feng, Z., et al. (2025). Know Your Limits: A Survey of Abstention in Large Language Models. TACL, 13, 529-556.
  • Yang, A., Yang, B., Hui, B., et al. (2024). Qwen2 Technical Report. arXiv:2407.10671.
  • Zhang, A., Chen, Y., Pan, J., et al. (2025). Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification. arXiv:2504.05419.