Published December 10, 2025 | Version 1.0
Thesis | Open Access

Attention Mechanisms in Transformers: A Comparative Survey and Structural Enhancements to Linear Attention

  • Université Sultan Moulay Slimane

Description

This master’s thesis examines the trade-off between computational efficiency and modeling performance in Transformer architectures by studying and improving attention mechanisms. The work consists of two parts. (1) A comparative empirical study evaluating five attention variants—standard Multi-Head Attention, FlashAttention, Sparse Attention, Sliding-Window Attention, and Linear Attention—across multiple Transformer model families: encoder-only (BERT-style), decoder-only (GPT-style), and full encoder–decoder architectures. This broad evaluation shows how each mechanism behaves under different structural constraints and reveals the consistent underperformance of linear attention in decoder-only setups, attributable to its weaker long-range modeling. (2) A technical contribution proposing two new hybrid mechanisms, Linear SparseAttention and Linear Sliding-Window Attention, which enhance the expressiveness of linear attention while preserving its linear-time complexity. Experiments show that both hybrids significantly outperform standard linear attention and narrow the performance gap with full attention, offering a promising path toward efficient, scalable Transformer models deployable in resource-constrained settings.
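The linear-time complexity mentioned above comes from replacing the softmax with a positive feature map φ, so that attention can be computed as φ(Q)(φ(K)ᵀV) without ever forming the n×n score matrix. The thesis's exact hybrid formulations are in the PDF; the sketch below only illustrates generic linear attention (with the common elu+1 feature map) next to standard softmax attention, using hypothetical function names, not the thesis's code:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # O(n * d^2): summarize keys/values once, then apply each query to the summary,
    # instead of the O(n^2 * d) pairwise score matrix of softmax attention.
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    KV = Kf.T @ V                    # (d, d_v) key-value summary
    Z = Qf @ Kf.sum(axis=0)          # (n,) per-query normalizer
    return (Qf @ KV) / Z[:, None]

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention, for shape comparison
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = rng.normal(size=(3, n, d))
out_lin = linear_attention(Q, K, V)
out_soft = softmax_attention(Q, K, V)
assert out_lin.shape == out_soft.shape == (n, d)
```

The hybrids described in the thesis additionally restrict or augment which positions contribute to the key-value summary (sparse or sliding-window patterns), which is what recovers some of full attention's expressiveness while keeping the linear cost.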

Files

PFE.pdf (4.6 MB)
md5:9cc5352ce96a1f7d7698dc30338c5336

Additional details

Dates

Accepted
2025-06-17

Software

Repository URL
https://github.com/Ely0rda/Thesis
Programming language
Python
Development Status
Abandoned

References

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. arXiv:1706.03762.
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre‑Training. OpenAI Technical Report.
  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language Models are Few‑Shot Learners. In Advances in Neural Information Processing Systems, 33. arXiv:2005.14165.
  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self‑supervised Learning of Language Representations. arXiv:1909.11942.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
  • Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT: A Distilled Version of BERT — Smaller, Faster, Cheaper and Lighter. arXiv:1910.01108.
  • He, P., Liu, X., Gao, J., & Chen, W. (2020). DeBERTa: Decoding‑enhanced BERT with Disentangled Attention. arXiv:2006.03654.
  • Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv:1904.10509.
  • Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long‑Document Transformer. arXiv:2004.05150.
  • Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontañón, S., … & Ahmed, A. (2020). Big Bird: Transformers for Longer Sequences. arXiv:2007.14062.
  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35. arXiv:2205.14135.