Attention Mechanisms in Transformers: A Comparative Survey and Structural Enhancements to Linear Attention
Description
This master’s thesis explores the trade-off between computational efficiency and modeling performance in Transformer architectures by studying and improving attention mechanisms. The work consists of two parts: (1) a comparative empirical study evaluating five attention variants (standard Multi-Head Attention, FlashAttention, Sparse Attention, Sliding-Window Attention, and Linear Attention) across three Transformer model families: encoder-only (BERT-style), decoder-only (GPT-style), and full encoder–decoder architectures. This cross-architecture evaluation shows how each mechanism behaves under different structural constraints and reveals that linear attention consistently underperforms in decoder-only setups because of its weaker long-range modeling. (2) a technical contribution proposing two new hybrid mechanisms, Linear SparseAttention and Linear Sliding-Window Attention, which enhance the expressiveness of linear attention while preserving its linear-time complexity. Experiments show that both hybrids significantly outperform standard linear attention and narrow the performance gap with full attention, offering a promising path toward efficient, scalable Transformer models that can be deployed in resource-constrained settings.
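For readers skimming the record, the core idea can be illustrated in a few lines of Python. The sketch below shows the common kernel-feature-map formulation of linear attention (following Katharopoulos et al., 2020, with φ(x) = elu(x) + 1) and one plausible way a sliding-window hybrid could mix an exact local softmax term with the cheap global linear term. The function names, the mixing weight `alpha`, and the additive combination are illustrative assumptions, not the thesis’ exact design; see the repository for the actual implementation.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # Positive feature map phi(x) = elu(x) + 1 keeps all "scores" non-negative.
    return F.elu(x) + 1.0

def linear_attention(q, k, v):
    # q, k: (batch, seq, dim); v: (batch, seq, dim_v).
    q, k = feature_map(q), feature_map(k)
    # Associating (phi(K)^T V) first costs O(N * d * d_v) instead of O(N^2 * d).
    kv = torch.einsum("bnd,bne->bde", k, v)            # (batch, dim, dim_v)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)   # (batch, seq, dim_v)

def sliding_window_attention(q, k, v, window):
    # Exact softmax attention restricted to a local band. For clarity this
    # builds the full N x N mask; an efficient version materializes only
    # the band, keeping the cost at O(N * window).
    n = q.shape[1]
    idx = torch.arange(n, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window
    scores = torch.einsum("bnd,bmd->bnm", q, k) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~band, float("-inf"))
    return torch.einsum("bnm,bme->bne", scores.softmax(dim=-1), v)

def linear_sliding_window_attention(q, k, v, window=64, alpha=0.5):
    # Hypothetical hybrid (an assumption, not the thesis' published design):
    # a precise local term plus a cheap global term, mixed by a scalar alpha.
    local = sliding_window_attention(q, k, v, window)
    global_ = linear_attention(q, k, v)
    return alpha * local + (1.0 - alpha) * global_

q = k = v = torch.randn(2, 128, 32)
out = linear_sliding_window_attention(q, k, v)  # shape: (2, 128, 32)
```

Associating φ(K)ᵀV before multiplying by φ(Q) is what drops the cost from O(N²) to O(N); a hybrid of this shape keeps that property as long as the local softmax term is computed only over the band.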
Files
- PFE.pdf (4.6 MB, md5:9cc5352ce96a1f7d7698dc30338c5336)
Additional details
Dates
- Accepted: 2025-06-17
Software
- Repository URL: https://github.com/Ely0rda/Thesis
- Programming language: Python
- Development Status: Abandoned
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. arXiv:1706.03762.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre‑Training. OpenAI Technical Report.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language Models are Few‑Shot Learners. In Advances in Neural Information Processing Systems, 33. arXiv:2005.14165.
- Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self‑supervised Learning of Language Representations. arXiv:1909.11942.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv:1910.01108.
- He, P., Liu, X., Gao, J., & Chen, W. (2020). DeBERTa: Decoding‑enhanced BERT with Disentangled Attention. arXiv:2006.03654.
- Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv:1904.10509.
- Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long‑Document Transformer. arXiv:2004.05150.
- Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontañón, S., … & Ahmed, A. (2020). Big Bird: Transformers for Longer Sequences. arXiv:2007.14062.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35. arXiv:2205.14135.