Empirical Evidence Of Interpretation Drift In ARC-Style Reasoning
Authors/Creators
E. Nguyen
Description
This paper provides empirical evidence of interpretation drift in large language models using ARC-style symbolic reasoning tasks. Interpretation drift refers to instability in a system’s internal task representation under fixed inputs and instructions, leading to incompatible task ontologies even in fully observable, non-linguistic settings.
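To make the definition concrete, a task ontology can be thought of as a structured record of what a system takes the task to be: the grid's dimensionality, its object boundaries, and the transformation rule it implies. The minimal Python sketch below is illustrative only; the `Interpretation` class, its fields, and the example values are assumptions of this sketch, not structures or data from the artifact.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interpretation:
    """Minimal stand-in for an inferred task ontology on one ARC-style grid (illustrative)."""
    grid_shape: tuple   # perceived dimensionality, e.g. (3, 3)
    objects: frozenset  # perceived object boundaries: frozensets of (row, col) cells
    rule: str           # inferred transformation rule, reduced to a canonical label

def incompatible(a: Interpretation, b: Interpretation) -> bool:
    """Two readings of the *same* input are incompatible if any structural field differs."""
    return (a.grid_shape != b.grid_shape
            or a.objects != b.objects
            or a.rule != b.rule)

# Same fixed grid, two runs: run 2 merges the cells into one object and infers a different rule.
# The disagreement is structural (ontology-level), not a difference in output wording.
run_1 = Interpretation((3, 3), frozenset({frozenset({(0, 0), (0, 1)}), frozenset({(2, 2)})}), "translate_right")
run_2 = Interpretation((3, 3), frozenset({frozenset({(0, 0), (0, 1), (2, 2)})}), "recolor")
print(incompatible(run_1, run_2))  # True: incompatible task ontologies for identical input
```

Here the two runs see the same fixed input yet commit to structurally different readings; the instability lies in the inferred ontology itself rather than in how an answer is phrased.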
Earlier work introduced interpretation drift as a theoretical explanation for reliability failures that persist despite improvements in model capability. However, governance and safety debates have continued to assume that such failures would resolve as models became more intelligent. The present work tests that assumption directly using ARC-style tasks, which the industry itself treats as a benchmark for abstraction and intelligence.
Under these controlled conditions, multiple frontier models were observed to diverge in inferred task structure, including object boundaries, dimensionality, and transformation rules, prior to symbolic reasoning. These divergences cannot be explained by prompt ambiguity, sampling variance, or output inconsistency.
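As a sketch of how such divergences could be probed separately from sampling variance and output inconsistency, the harness below re-presents one fixed grid under deterministic decoding and compares canonicalized structural readings rather than raw text. The `query_fn` callable, the JSON response format, and the field names are assumptions of this sketch, not the artifact's actual protocol or interface.

```python
import json
from collections import Counter
from typing import Callable

def probe_interpretation_drift(grid: list, query_fn: Callable[[str], str], n_trials: int = 10) -> dict:
    """Ask for a structured reading of the same fixed grid n_trials times and compare readings.

    `query_fn` stands in for a deterministic (temperature-0) model call that returns JSON with
    the keys "grid_shape", "objects", and "rule"; supplying it is left to the caller.
    """
    prompt = ("Describe this grid as JSON with keys grid_shape, objects, rule. "
              f"Grid: {json.dumps(grid)}")
    raw_outputs, readings = [], []
    for _ in range(n_trials):
        raw = query_fn(prompt)   # identical prompt and input on every trial
        parsed = json.loads(raw)
        raw_outputs.append(raw)
        # Canonicalize structural fields so differences in wording alone cannot register as drift.
        readings.append((
            json.dumps(parsed["grid_shape"]),
            json.dumps(sorted(parsed["objects"])),
            parsed["rule"].strip().lower(),
        ))
    return {
        "distinct_raw_outputs": len(set(raw_outputs)),  # surface-level output inconsistency
        "distinct_readings": len(set(readings)),        # structurally different interpretations
        "reading_counts": Counter(readings),
    }
```

In this sketch, runs whose raw outputs differ but whose canonical readings coincide count only as output inconsistency; drift is flagged only when distinct_readings exceeds 1 for an unchanged input. The artifact's own elicitation and scoring procedure may differ.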
This artifact provides empirical grounding for the interpretation drift framework introduced in:
Empirical Evidence Of Interpretation Drift In Large Language Models [https://doi.org/10.5281/zenodo.18219428]
The findings establish a governance-relevant boundary condition: systems that cannot maintain stable mappings between perceptual input and symbolic representation are not reliably evaluable and cannot be assigned autonomous decision-making authority in safety-critical or regulated contexts.
Files
- NguyenE_2026_ARC_Artifact.pdf (863.0 kB, md5:09e8bad9754f01c9488ca99b71afed8f)
Additional details
Related works
- Is supplemented by: 10.5281/zenodo.18219428 (DOI)
References
- Z. Ji et al., "Survey of hallucination in natural language generation," ACM Comput. Surveys, vol. 55, no. 12, pp. 1–38, Dec. 2023, doi: 10.1145/3571730. [Online]. Available: https://arxiv.org/pdf/2202.03629
- Y. Zhang et al., "Siren's song in the AI ocean: A survey on hallucination in large language models," arXiv preprint arXiv:2309.01219, 2023. [Online]. Available: https://arxiv.org/abs/2309.01219
- A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi, "When not to trust language models: Investigating effectiveness of parametric and non-parametric memories," in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), Toronto, Canada, Jul. 2023, pp. 9802–9822, doi: 10.18653/v1/2023.acl-long.546. [Online]. Available: https://arxiv.org/pdf/2212.10511
- S. Kadavath et al., "Language models (mostly) know what they know," arXiv preprint arXiv:2207.05221, 2022. [Online]. Available: https://arxiv.org/abs/2207.05221
- I. R. McKenzie et al., "Inverse scaling: When bigger isn't better," Trans. Mach. Learn. Res., Oct. 2023. [Online]. Available: https://arxiv.org/abs/2306.09479
- P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2020, pp. 9459–9474. [Online]. Available: https://arxiv.org/abs/2005.11401
- Z. Lin, S. Trivedi, and J. Sun, "Generating with confidence: Uncertainty quantification for black-box large language models," Trans. Mach. Learn. Res., 2023. [Online]. Available: https://arxiv.org/abs/2305.19187
- L. Ouyang et al., "Training language models to follow instructions with human feedback," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2022, pp. 27730–27744. [Online]. Available: https://arxiv.org/abs/2203.02155
- M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, "Beyond accuracy: Behavioral testing of NLP models with CheckList," in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, Jul. 2020, pp. 4902–4912, doi: 10.18653/v1/2020.acl-main.442. [Online]. Available: https://arxiv.org/abs/2005.04118
- K. Goel et al., "Robustness Gym: Unifying the NLP evaluation landscape," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol. (Demonstrations), Jun. 2021, pp. 42–55, doi: 10.18653/v1/2021.naacl-demos.6. [Online]. Available: https://arxiv.org/abs/2101.04840
- M. Gardner et al., "Evaluating models' local decision boundaries via contrast sets," in Findings Assoc. Comput. Linguistics: EMNLP 2020, Nov. 2020, pp. 1307–1323, doi: 10.18653/v1/2020.findings-emnlp.117. [Online]. Available: https://arxiv.org/abs/2004.02709
- D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, "Is BERT really robust? A strong baseline for natural language attack on text classification and entailment," in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 05, Apr. 2020, pp. 8018–8025, doi: 10.1609/aaai.v34i05.6281. [Online]. Available: https://arxiv.org/abs/1907.11932
- S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith, "Annotation artifacts in natural language inference data," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., vol. 2, Jun. 2018, pp. 107–112, doi: 10.18653/v1/N18-2017. [Online]. Available: https://arxiv.org/abs/1803.02324
- C. Zhu et al., "FreeLB: Enhanced adversarial training for natural language understanding," in Proc. Int. Conf. Learn. Representations (ICLR), Apr. 2020. [Online]. Available: https://arxiv.org/abs/1909.11764
- D. Hendrycks et al., "Measuring massive multitask language understanding," in Proc. Int. Conf. Learn. Representations (ICLR), May 2021. [Online]. Available: https://arxiv.org/abs/2009.03300
- J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Comput. Surveys, vol. 46, no. 4, pp. 44:1–44:37, Mar. 2014, doi: 10.1145/2523813. [Online]. Available: https://dl.acm.org/doi/10.1145/2523813
- J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, "Learning under concept drift: A review," IEEE Trans. Knowl. Data Eng., vol. 31, no. 12, pp. 2346–2363, Dec. 2019. [Online]. Available: https://arxiv.org/abs/2004.05785
- A. Lazaridou et al., "Mind the gap: Assessing temporal generalization in neural language models," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 29348–29363. [Online]. Available: https://arxiv.org/abs/2102.01951
- V. V. Ramasesh, A. Lewkowycz, and E. Dyer, "Effect of scale on catastrophic forgetting in neural networks," in Proc. Int. Conf. Learn. Representations (ICLR), May 2022. [Online]. Available: https://openreview.net/forum?id=GhVS8_yPeEa
- J. Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks," Proc. Natl. Acad. Sci. U.S.A., vol. 114, no. 13, pp. 3521–3526, Mar. 2017, doi: 10.1073/pnas.1611835114. [Online]. Available: https://arxiv.org/abs/1612.00796
- G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Netw., vol. 113, pp. 54–71, May 2019, doi: 10.1016/j.neunet.2019.01.012. [Online]. Available: https://arxiv.org/abs/1802.07569
- S. Rabanser, S. Günnemann, and Z. C. Lipton, "Failing loudly: An empirical study of methods for detecting dataset shift," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2019, pp. 1396–1408. [Online]. Available: https://arxiv.org/abs/1810.11953
- L. Huang et al., "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions," ACM Trans. Inf. Syst., vol. 43, no. 2, pp. 1–55, 2025, doi: 10.1145/3703155. (arXiv preprint Nov. 2023, updated Nov. 2024). [Online]. Available: https://arxiv.org/abs/2311.05232
- S. T. I. Tonmoy et al., "A comprehensive survey of hallucination mitigation techniques in large language models," in Findings Assoc. Comput. Linguistics: EMNLP 2024, Dec. 2024, pp. 11709–11724. (arXiv preprint Jan. 2024). [Online]. Available: https://arxiv.org/abs/2401.01313
- K. Li et al., "Inference-time intervention: Eliciting truthful answers from a language model," arXiv preprint arXiv:2306.03341, 2023. [Online]. Available: https://arxiv.org/abs/2306.03341
- R. Eldan and M. Russinovich, "Who's Harry Potter? Approximate unlearning in LLMs," arXiv preprint arXiv:2310.02238, 2023. [Online]. Available: https://arxiv.org/abs/2310.02238
- E. Nguyen, "Empirical evidence of interpretation drift in large language models," Zenodo, Dec. 2025, doi: 10.5281/zenodo.18106825. [Online]. Available: https://zenodo.org/records/18106825