Published June 10, 2026 | Version v1
Preprint Open

Tiny Scale Is All I Can Spare To Play With Transformer

Authors/Creators

  • 1. Independent Researcher

Description

The introduction of the Transformer neural network architecture in the famous Attention Is All You Need paper has driven a huge wave of AI development in recent years. Scaled dot-product attention lets information be processed with higher efficiency and quality than the earlier RNN-based models achieved. However, Transformer-based models come with their own challenges, particularly parameter efficiency in tiny models with ≤ 5M parameters. At such a small scale, a Transformer model essentially uses more parameters than it should. This sub-ten-million-parameter regime is very underexplored, and for good reasons, but I wanted to explore it anyway. In this paper I introduce Silia, a novel transformer architecture designed for efficient modelling and classification tasks under a severe parameter budget. Trained against a GPT-2 architecture baseline (Andrej Karpathy's nanoGPT project) with the same "base" hyperparameters, training data and compute budget, Silia achieves comparable loss and generation quality with significantly fewer parameters.
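As a rough illustration of how a parameter budget is spent at this scale, the sketch below counts the parameters of a standard GPT-2-style decoder (tied input/output embeddings, learned positional embeddings, a 4x-wide MLP, biases enabled). It describes only the baseline family, not Silia, whose architecture is not given in this description. The `GPT2Config` class, the `count_parameters` helper and the "tiny" configuration values are hypothetical and chosen for illustration; they are not taken from the paper or the repository.

```python
from dataclasses import dataclass


@dataclass
class GPT2Config:
    # Hypothetical container for illustration; not the paper's or nanoGPT's class.
    vocab_size: int
    block_size: int  # maximum context length
    n_layer: int
    n_embd: int
    bias: bool = True  # biases in Linear and LayerNorm layers


def count_parameters(cfg: GPT2Config) -> dict:
    """Count parameters of a standard GPT-2-style decoder with tied embeddings."""
    d = cfg.n_embd
    b = 1 if cfg.bias else 0

    token_emb = cfg.vocab_size * d  # also reused as the output head (weight tying)
    pos_emb = cfg.block_size * d    # learned positional embeddings

    # Attention: fused QKV projection (d -> 3d) plus output projection (d -> d).
    attn = (d * 3 * d + b * 3 * d) + (d * d + b * d)
    # MLP: expand d -> 4d, then project 4d -> d.
    mlp = (d * 4 * d + b * 4 * d) + (4 * d * d + b * d)
    # Two LayerNorms per block: a weight vector each, plus an optional bias vector.
    norms = 2 * (d + b * d)

    blocks = cfg.n_layer * (attn + mlp + norms)
    final_norm = d + b * d
    total = token_emb + pos_emb + blocks + final_norm
    return {
        "token_emb": token_emb,
        "pos_emb": pos_emb,
        "blocks": blocks,
        "final_norm": final_norm,
        "total": total,
    }


# GPT-2 small, as a sanity check against the well-known ~124M figure.
gpt2_small = count_parameters(GPT2Config(50257, 1024, 12, 768))

# A hypothetical tiny model that stays under the 5M budget.
tiny = count_parameters(
    GPT2Config(vocab_size=50257, block_size=256, n_layer=4, n_embd=64)
)
embedding_share = tiny["token_emb"] / tiny["total"]

print(f"GPT-2 small total: {gpt2_small['total']:,}")
print(f"tiny total: {tiny['total']:,} (token embedding share {embedding_share:.1%})")
```

Under these assumptions the tiny configuration stays below 5M parameters, yet most of them sit in the token embedding table rather than in the Transformer blocks. That is one well-known pressure on GPT-2-style models at this scale; whether it is the inefficiency the paper targets is not stated here.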

Files

Silia.pdf

Files (323.1 kB)

Name: Silia.pdf
Size: 323.1 kB
md5:a38e8387af376dd6f975816dc1de5e1f

Additional details

Dates

Submitted
2026-06
Submitted the paper

Software

Repository URL
https://github.com/SrijanSriv211/Silia
Programming language
Python
Development Status
Active

References

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
  • Noam Shazeer (2020). GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
  • Andrej Karpathy (2022). nanoGPT. GitHub. https://github.com/karpathy/nanogpt