Published March 31, 2026 | Version v1
Preprint (Open Access)

Single-Kernel Fusion for Autoregressive Transformer Decoding via WebGPU Compute Shaders

  • Independent Researcher

Description

Fusing the entire autoregressive decoding loop into a single WebGPU compute shader dispatch eliminates the per-dispatch overhead that dominates browser-based LLM inference. On an Apple M2 Pro (N=30 runs), the fused kernel achieves a 6.6-13.5x speedup over unfused dispatch, up to 29.9x over PyTorch MPS, and 34.1x at sequence length 256. Numerical correctness is validated against an f64 CPU reference (max error 3.019e-6). The artifact includes a parameterized WGSL shader generator, PyTorch baselines, a torch.compile comparison, and 60/60 passing paper-arithmetic verification tests. Code: https://github.com/abgnydn/webgpu-transformer-fusion
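The intuition behind the headline numbers can be illustrated with a simple cost model (not taken from the paper; all names and constants below are hypothetical): if every per-token dispatch carries a fixed launch/synchronization overhead, an unfused decoder pays that overhead once per operation per token, while a fully fused decoder pays it once for the whole sequence.

```python
# Hedged sketch: a toy cost model for dispatch-overhead-dominated decoding.
# The function name, dispatch counts, and microsecond figures are illustrative
# assumptions, not measurements from the paper or its repository.

def decode_time_us(tokens: int, dispatches_per_token: int,
                   overhead_us: float, compute_us: float) -> float:
    """Total decode time: fixed overhead per dispatch plus compute per token."""
    total_dispatches = tokens * dispatches_per_token
    return total_dispatches * overhead_us + tokens * compute_us

# Unfused: e.g. 32 dispatches per token (one per layer operation).
unfused = decode_time_us(tokens=256, dispatches_per_token=32,
                         overhead_us=50.0, compute_us=100.0)

# Fused: a single dispatch for the entire 256-token decode loop,
# with the same total compute.
fused = 1 * 50.0 + 256 * 100.0

print(round(unfused / fused, 1))  # → 17.0
```

Under these made-up constants the modeled speedup is an order of magnitude, which is the regime the measured 6.6-13.5x fused-vs-unfused results fall into; the real gap depends on the actual per-dispatch cost of the browser's WebGPU implementation.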

Files

Single_Kernel_Fusion_for_Autoregressive_Transformer_Decoding_via_WebGPU_Compute_Shaders.pdf

Additional details

Related works

Is supplement to
Preprint: 10.5281/zenodo.19343570 (DOI)