Using Diagnostic Probing to Expose the Shallow Syntactic and Semantic Foundations of ChatGPT as a Large Language Model
Authors/Creators
Description
The remarkable conversational fluency of OpenAI's ChatGPT often creates an illusion of deep linguistic understanding, prompting its adoption across diverse sectors. This study critically evaluates that purported understanding by implementing a comprehensive battery of diagnostic probes grounded in theoretical linguistics. We designed a multi-phase series of controlled experiments targeting core syntactic phenomena, including hierarchical agreement, syntactic islands, and binding theory, alongside semantic phenomena such as logical operators, quantifier scope, and presupposition. The study evaluated both GPT-3.5-turbo and GPT-4 via the OpenAI API using forced-choice grammaticality judgments, plausibility assessments, and Chain-of-Thought (CoT) analysis to measure accuracy, stability, and reasoning soundness. Quantitative results revealed significant performance degradation on complex linguistic structures, with GPT-3.5's accuracy falling to 67% on long-range dependencies and 42% on quantifier scope. While GPT-4 demonstrated quantitatively superior performance, it exhibited qualitatively similar failure patterns, indicating that scaling alone does not address fundamental limitations. Qualitative analysis of reasoning chains revealed frequent post-hoc rationalization, associative drift, and a reliance on surface-level pattern matching rather than sound logical deduction. The findings robustly demonstrate that ChatGPT's linguistic knowledge is shallow, statistically driven, and non-causal, failing to reliably implement abstract grammatical rules or compositional semantics. We conclude that a paradigm shift in large language model (LLM) evaluation is necessary, moving from broad, aggregate benchmarks to targeted, causal probes that diagnose specific architectural limitations.
These findings have significant implications for AI safety, reliability, and the future development of genuinely intelligent systems, underscoring the need for architectural innovations beyond mere scaling of parameters and data.
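To make the forced-choice grammaticality-judgment protocol concrete, the following is a minimal Python sketch of how such a probe can be constructed and scored. The minimal-pair items, prompt wording, and parsing rules here are illustrative assumptions, not the study's actual materials; in the real experiments the prompt would be sent to GPT-3.5-turbo or GPT-4 via the OpenAI API, which is omitted here so the scaffolding runs offline.

```python
# Sketch of a two-alternative forced-choice (2AFC) grammaticality probe.
# Items below are hypothetical minimal pairs targeting long-range
# subject-verb agreement: (grammatical sentence, ungrammatical sentence).
AGREEMENT_ITEMS = [
    ("The keys that the man near the cabinets holds are missing.",
     "The keys that the man near the cabinets holds is missing."),
    ("The author that the critics admire writes well.",
     "The author that the critics admire write well."),
]

def build_prompt(grammatical: str, ungrammatical: str, flip: bool = False) -> str:
    """Format one forced-choice trial. `flip` counterbalances which
    position (A or B) holds the grammatical alternative across trials."""
    a, b = (ungrammatical, grammatical) if flip else (grammatical, ungrammatical)
    return (
        "Which sentence is grammatically correct? Answer with 'A' or 'B' only.\n"
        f"A: {a}\nB: {b}"
    )

def score_response(response: str, flip: bool = False) -> bool:
    """Return True if the model's reply selects the grammatical alternative."""
    choice = response.strip().upper()[:1]
    correct = "B" if flip else "A"
    return choice == correct

def accuracy(responses: list[str], flips: list[bool]) -> float:
    """Proportion of trials on which the grammatical option was chosen."""
    hits = sum(score_response(r, f) for r, f in zip(responses, flips))
    return hits / len(responses)
```

Counterbalancing the answer position (the `flip` flag) is what separates a diagnostic probe from a surface-pattern test: a model relying on positional heuristics rather than grammatical knowledge will regress toward chance once the grammatical option's position is randomized.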
Files
Diagnostic Probing of LLMs.pdf (306.5 kB)
md5:39d63b5fa05f39b4afc23204795e745d