Single-Kernel Fusion for Autoregressive Transformer Decoding via WebGPU Compute Shaders
Description
Fusing the entire autoregressive decoding loop into a single WebGPU compute shader dispatch eliminates the per-dispatch overhead that dominates browser-based LLM inference. On an Apple M2 Pro (N=30), the fused kernel achieves a 6.6-13.5x speedup over unfused dispatch, up to 29.9x over PyTorch MPS, and 34.1x at sequence length 256. Numerical correctness is validated against an f64 CPU reference (max error 3.019e-6). The deposit includes a parameterized WGSL shader generator, PyTorch baselines, a torch.compile comparison, and a 60/60-passing suite of paper-arithmetic verification tests. Code: https://github.com/abgnydn/webgpu-transformer-fusion
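The speedup comes from amortizing a fixed per-dispatch cost: unfused decoding pays it once per operation per token, while the fused kernel pays it once for the whole loop. A minimal arithmetic sketch of that effect (all constants below are hypothetical illustrations, not measurements from the paper):

```python
# Back-of-the-envelope model of dispatch-overhead amortization.
# Unfused decoding issues one GPU dispatch per op per token; the
# fused kernel issues a single dispatch for the entire decode loop.

def unfused_time_us(tokens, ops_per_token, overhead_us, compute_us_per_op):
    """Every op of every token is a separate dispatch."""
    dispatches = tokens * ops_per_token
    return dispatches * overhead_us + dispatches * compute_us_per_op

def fused_time_us(tokens, ops_per_token, overhead_us, compute_us_per_op):
    """One dispatch covers the whole autoregressive loop."""
    return overhead_us + tokens * ops_per_token * compute_us_per_op

if __name__ == "__main__":
    tokens, ops = 256, 20          # hypothetical decode length / ops per token
    overhead, compute = 50.0, 2.0  # hypothetical microseconds per dispatch / per op
    speedup = unfused_time_us(tokens, ops, overhead, compute) / \
              fused_time_us(tokens, ops, overhead, compute)
    print(f"modelled speedup: {speedup:.1f}x")
```

As the model shows, the speedup grows with sequence length because the unfused dispatch count scales with tokens while the fused overhead stays constant, consistent with the larger gains reported at sequence length 256.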
Files
| Name | Size |
|---|---|
| Single_Kernel_Fusion_for_Autoregressive_Transformer_Decoding_via_WebGPU_Compute_Shaders.pdf (md5:ed2d4904ec519f4c1d6fbc06a59953e9) | 444.9 kB |
Additional details
Related works
- Is supplement to: Preprint 10.5281/zenodo.19343570 (DOI)
Software
- Repository URL: https://github.com/abgnydn/webgpu-transformer-fusion