Using Diagnostic Probing to Expose the Shallow Syntactic and Semantic Foundations of ChatGPT as a Large Language Model
Authors/Creators
Description
The remarkable conversational fluency of OpenAI's ChatGPT often creates an illusion of deep linguistic understanding, prompting its adoption across diverse sectors. This study critically evaluates that purported understanding by implementing a comprehensive battery of diagnostic probes grounded in theoretical linguistics. We designed a multi-phase series of controlled experiments targeting core syntactic phenomena, including hierarchical agreement, syntactic islands, and binding theory, alongside semantic phenomena such as logical operators, quantifier scope, and presupposition. The study evaluated both GPT-3.5-turbo and GPT-4 via the OpenAI API using forced-choice grammaticality judgments, plausibility assessments, and Chain-of-Thought (CoT) analysis to measure accuracy, stability, and reasoning soundness. Quantitative results revealed significant performance degradation on complex linguistic structures, with GPT-3.5's accuracy falling to 67% on long-range dependencies and 42% on quantifier scope. While GPT-4 demonstrated quantitatively superior performance, it exhibited qualitatively similar failure patterns, indicating that scaling alone does not address fundamental limitations. Qualitative analysis of reasoning chains revealed frequent post-hoc rationalization, associative drift, and a reliance on surface-level pattern matching rather than sound logical deduction. The findings robustly demonstrate that ChatGPT's linguistic knowledge is shallow, statistically driven, and non-causal, failing to reliably implement abstract grammatical rules or compositional semantics. We conclude that a paradigm shift in large language model (LLM) evaluation is necessary, moving from broad, aggregate benchmarks to targeted, causal probes that diagnose specific architectural limitations.
These findings have significant implications for AI safety, reliability, and the future development of genuinely intelligent systems, underscoring the need for architectural innovations beyond mere scaling of parameters and data.
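To make the forced-choice grammaticality-judgment protocol concrete, the following is a minimal Python sketch of how such a probe can be constructed and scored. The minimal-pair items, prompt wording, and parsing rules here are illustrative assumptions, not the study's actual materials; in the real experiments the prompt would be sent to GPT-3.5-turbo or GPT-4 via the OpenAI API, which is omitted here so the scaffolding runs offline.

```python
# Sketch of a two-alternative forced-choice (2AFC) grammaticality probe.
# Items below are hypothetical minimal pairs targeting long-range
# subject-verb agreement: (grammatical sentence, ungrammatical sentence).
AGREEMENT_ITEMS = [
    ("The keys that the man near the cabinets holds are missing.",
     "The keys that the man near the cabinets holds is missing."),
    ("The author that the critics admire writes well.",
     "The author that the critics admire write well."),
]

def build_prompt(grammatical: str, ungrammatical: str, flip: bool = False) -> str:
    """Format one forced-choice trial. `flip` counterbalances which
    position (A or B) holds the grammatical alternative across trials."""
    a, b = (ungrammatical, grammatical) if flip else (grammatical, ungrammatical)
    return (
        "Which sentence is grammatically correct? Answer with 'A' or 'B' only.\n"
        f"A: {a}\nB: {b}"
    )

def score_response(response: str, flip: bool = False) -> bool:
    """Return True if the model's reply selects the grammatical alternative."""
    choice = response.strip().upper()[:1]
    correct = "B" if flip else "A"
    return choice == correct

def accuracy(responses: list[str], flips: list[bool]) -> float:
    """Proportion of trials on which the grammatical option was chosen."""
    hits = sum(score_response(r, f) for r, f in zip(responses, flips))
    return hits / len(responses)
```

Counterbalancing the answer position (the `flip` flag) is what separates a diagnostic probe from a surface-pattern test: a model relying on positional heuristics rather than grammatical knowledge will regress toward chance once the grammatical option's position is randomized.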
Files
Diagnostic Probing of LLMs.pdf (306.5 kB)
md5:39d63b5fa05f39b4afc23204795e745d