From Attention to Generative AI: A Decade of Architectural Innovations in Large Language Models
Authors/Creators
Description
Over the past decade, artificial intelligence has undergone a remarkable transformation, particularly in natural language processing (NLP). The field has progressed from recurrent and convolutional models with limited sequence capacity to the attention-based Transformer architecture, which revolutionized scalability and context modeling. This breakthrough enabled the emergence of foundation models such as BERT, GPT, and T5, which redefined pretraining and transfer learning paradigms. Building on these foundations, the scaling laws formalized in 2020 demonstrated predictable performance gains from larger models, leading to GPT-3 and the era of few-shot and zero-shot learning. The mainstream adoption of Generative AI (GenAI) followed in 2022 with ChatGPT and has since expanded to multimodal and instruction-tuned systems such as GPT-4, GPT-4V, Anthropic’s Claude family, and Google DeepMind’s Gemini series. In parallel, the open-source ecosystem has accelerated innovation through Meta’s LLaMA models, Falcon, Mistral, Mixtral, BLOOM, Hugging Face-led initiatives, Stability AI’s StableLM, and recent entrants such as DeepSeek. This survey provides a comprehensive comparative analysis of these developments across model architecture, parameter growth, context length, benchmarks, efficiency methods, and alignment strategies. More than fifty influential works are consolidated, with tables, charts, and metrics illustrating key milestones. The survey also discusses open challenges including energy efficiency, interpretability, bias, governance, and responsible open-source deployment. By consolidating both proprietary and community-driven contributions, this paper highlights the opportunities and risks that will shape the next generation of AI research and applications.
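As a point of reference, the 2020 scaling-law result mentioned above (Kaplan et al., reference [14] below) expresses test loss as an approximate power law in the non-embedding parameter count N; the exponent and constant shown here are roughly the values reported in that work and are quoted only as an illustration of the form:

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13} \ \text{(non-embedding parameters)} $$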
Files
EJAET-12-9-37-52.pdf
(719.2 kB)
Additional details
References
- [1] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI Technical Report, 2018.
- [2] T. B. Brown et al., "Language models are few-shot learners," in NeurIPS, 2020.
- [3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL, 2019.
- [4] H. Touvron et al., "LLaMA: Open and efficient foundation language models," 2023.
- [5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.
- [6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [7] K. Cho, B. van Merriënboer et al., "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in EMNLP, 2014.
- [8] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
- [9] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NeurIPS, 2014.
- [10] A. Vaswani, N. Shazeer et al., "Attention is all you need," in NeurIPS, 2017.
- [11] A. Radford, J. Wu et al., "Language models are unsupervised multitask learners," OpenAI Technical Report, 2019.
- [12] Z. Yang, Z. Dai et al., "XLNet: Generalized autoregressive pretraining for language understanding," in NeurIPS, 2019.
- [13] C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," JMLR, vol. 21, pp. 1–67, 2020.
- [14] J. Kaplan et al., "Scaling laws for neural language models," 2020.
- [15] M. Shoeybi et al., "Megatron-LM: Training multi-billion parameter language models using model parallelism," in NeurIPS Systems, 2019.
- [16] J. Rasley et al., "DeepSpeed: System optimizations for training deep learning models at scale," in KDD DL Systems, 2020.
- [17] L. Ouyang et al., "Training language models to follow instructions with human feedback," in NeurIPS, 2022.
- [18] A. Radford et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
- [19] J.-B. Alayrac et al., "Flamingo: a visual language model for few-shot learning," in NeurIPS, 2022.
- [20] D. Patterson et al., "Carbon emissions and large neural network training," IEEE Computer, 2021.
- [21] E. M. Bender et al., "On the dangers of stochastic parrots: Can language models be too big?" in FAccT, 2021.
- [22] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP," in NeurIPS, 2020.
- [23] Y. Kim, "Convolutional neural networks for sentence classification," in EMNLP, 2014.
- [24] A. Wang et al., "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in ICLR, 2019.
- [25] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv:1907.11692, 2019.
- [26] Z. Dai, Z. Yang et al., "Transformer-XL: Attentive language models beyond a fixed-length context," in ACL, 2019.
- [27] Z. Lan et al., "ALBERT: A lite BERT for self-supervised learning of language representations," in ICLR, 2020.
- [28] A. Wang et al., "SuperGLUE: A stickier benchmark for general-purpose language understanding systems," in NeurIPS, 2019.
- [29] J. Lee et al., "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, 2020.
- [30] S. Rajbhandari et al., "ZeRO: Memory optimizations toward training trillion parameter models," in SC, 2020.
- [31] I. Beltagy, M. Peters, and A. Cohan, "Longformer: The long-document transformer," arXiv:2004.05150, 2020.
- [32] M. Zaheer et al., "Big Bird: Transformers for longer sequences," in NeurIPS, 2020.
- [33] N. Kitaev, L. Kaiser, and A. Levskaya, "Reformer: The efficient transformer," in ICLR, 2020.
- [34] S. Wang et al., "Linformer: Self-attention with linear complexity," 2020.
- [35] J. Wei et al., "Emergent abilities of large language models," arXiv:2206.07682, 2022.
- [36] C. Raffel, N. Shazeer et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," JMLR, vol. 21, pp. 1–67, 2020.
- [37] T. B. Brown et al., "Language models are few-shot learners," in NeurIPS, 2020.
- [38] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," JMLR, vol. 23, pp. 1–39, 2022.
- [39] D. Lepikhin et al., "GShard: Scaling giant models with conditional computation and automatic sharding," in ICLR, 2021.
- [40] Y. Bai et al., "Constitutional AI: Harmlessness from AI feedback," 2022.
- [41] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," 2023.
- [42] Google Cloud, "Gemini 2.5 Pro — Generative AI on Vertex AI," Product documentation, 2025. [Online]. Available: https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro
- [43] Meta, "Unmatched performance and efficiency — Llama 4," Model site, 2025; includes Scout 10M-token context. [Online]. Available: https://www.llama.com/models/llama-4/
- [44] Google DeepMind, "Gemini 2.5: Our most intelligent AI model," Google Blog, Mar. 25, 2025; model availability and reasoning updates. [Online]. Available: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
- [45] Anthropic, "Claude Opus 4.1," Newsroom, Aug. 5, 2025. [Online]. Available: https://www.anthropic.com/news/claude-opus-4-1
- [46] OpenAI, "GPT-5 is here," Product page, 2025. [Online]. Available: https://openai.com/gpt-5/
- [47] OpenAI, "Introducing gpt-oss," Announcement, Aug. 5, 2025. [Online]. Available: https://openai.com/index/introducing-gpt-oss/
- [48] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, "On the dangers of stochastic parrots: Can language models be too big?" in FAccT, 2021.
- [49] L. Ouyang, J. Wu, X. Jiang et al., "Training language models to follow instructions with human feedback," in NeurIPS, 2022.
- [50] Y. Bai et al., "Constitutional AI: Harmlessness from AI feedback," arXiv preprint arXiv:2212.08073, 2022.
- [51] P. Lewis, E. Perez, A. Piktus et al., "Retrieval-augmented generation for knowledge-intensive NLP," in NeurIPS, 2020.
- [52] BigScience Workshop, "BLOOM: A 176B-parameter open-access multilingual language model," Technical Report, 2022.
- [53] Anthropic, "Claude Opus 4 and 4.1 can now end a rare subset of conversations," Anthropic research/policy note, 2025. [Online]. Available: https://www.anthropic.com/research/end-subset-conversations