Tiny Scale Is All I Can Spare To Play With Transformer
Description
The introduction of the Transformer neural network architecture in the famous Attention Is All You Need paper has created a huge wave of AI development in recent years. Scaled dot-product attention lets information be processed with an efficiency and quality that the previous RNN-based models lacked. However, Transformer-based models come with their own challenges, particularly parameter efficiency for tiny models with ≤ 5M parameters. At such a small scale, a Transformer model essentially uses more parameters than it really should. This sub-ten-million-parameter domain is very underexplored, and for good reasons, but I wanted to explore it anyway. So in this paper I introduce Silia, a novel transformer architecture designed for efficient modelling and classification tasks under a severe parameter budget. Trained against the GPT-2 architecture (Andrej Karpathy's nanoGPT project) with the same "base" hyperparameters, training data and compute budget, Silia achieves comparable loss and generation quality with significantly fewer parameters.
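To make the parameter-budget concern concrete, here is a minimal sketch that counts the parameters of a GPT-2-style decoder as laid out in nanoGPT (biases enabled, 4x MLP expansion, output head tied to the token embedding). It is not taken from the Silia paper. The function name `gpt2_param_count` and the tiny configuration below are assumptions chosen only for illustration, and the paper's own analysis may differ.

```python
def gpt2_param_count(vocab_size, block_size, d_model, n_layer, tied_head=True):
    """Count parameters of a GPT-2-style decoder (illustrative helper, not from the paper)."""
    token_emb = vocab_size * d_model          # token embedding table
    pos_emb = block_size * d_model            # learned positional embedding table

    # Attention: fused QKV projection plus output projection, both with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model, then project back, both with biases.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)
    per_block = attn + mlp + layer_norms      # equals 12 * d_model**2 + 13 * d_model

    final_ln = 2 * d_model
    # With weight tying the output head reuses the token embedding table.
    head = 0 if tied_head else vocab_size * d_model

    embeddings = token_emb + pos_emb
    blocks = n_layer * per_block
    return {
        "embeddings": embeddings,
        "blocks": blocks,
        "total": embeddings + blocks + final_ln + head,
    }


if __name__ == "__main__":
    # Hypothetical tiny configuration that keeps the GPT-2 vocabulary size.
    tiny = gpt2_param_count(vocab_size=50257, block_size=256, d_model=128, n_layer=4)
    share = tiny["embeddings"] / tiny["total"]
    print(tiny, f"embedding share: {share:.1%}")
```

As a sanity check, the GPT-2 small configuration (`vocab_size=50257`, `block_size=1024`, `d_model=768`, `n_layer=12`) gives 124,439,808 parameters. In the hypothetical tiny configuration the embedding tables alone exceed 5M parameters and outweigh the Transformer blocks, which is one way a small model can spend its budget in places that add little modelling capacity.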
Files
| Name | Size |
|---|---|
| Silia.pdf (md5:a38e8387af376dd6f975816dc1de5e1f) | 323.1 kB |
Additional details
Dates
- Submitted
- 2026-06 (Submitted the paper)
Software
- Repository URL
- https://github.com/SrijanSriv211/Silia
- Programming language
- Python
- Development Status
- Active
References
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
- Noam Shazeer (2020). GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- Andrej Karpathy (2022). nanoGPT. GitHub. https://github.com/karpathy/nanogpt