Kernels Might Be What You Need: Efficient Sequence Modeling with K-Operators
Description
We introduce K-Operators, a kernel-decomposed sequence modeling architecture that
replaces attention entirely with structured causal kernel operators. On Tiny Shakespeare
character-level modeling, a 1.14M-parameter K-Operators model achieves 4.43 ± 0.05 validation
perplexity (PPL) across 7 seeds, approaching the 4.35 PPL of a 10.65M-parameter Transformer baseline
(nanoGPT) while using 9.3× fewer parameters and requiring no positional encodings.
The architecture decomposes sequence mixing into a hierarchy of operators: K1 layers for
position-wise feature mixing, K2 layers for causal sequence interaction via a learned base kernel
combined with low-rank gamma-decayed recurrence, and a K0 layer for final rescaling. These are
composed into a K-Stack backbone (K1 → K2(×N) → K1 → K0) and refined through a learned
iterative equilibrium loop governed by a scalar step-size η. Two interchangeable gamma-decay
backends (mask and block) offer different memory/speed trade-offs. Diagnostic analysis reveals
interpretable learned dynamics: the model progressively transfers sequence mixing from the
initialized base kernel to the adaptive recurrent path, develops per-layer functional specialization,
and learns to self-regulate the refinement loop, including robustness to a 10× learning-rate
misspecification via automatic η suppression.
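
The K2 mixing and the η-governed refinement described above can be illustrated with a minimal sketch. This is not the repository's implementation: the exact parameterization is not given here, so the sketch assumes a single scalar channel, a Toeplitz base kernel `k` indexed by offset t − s, position-wise low-rank factors `u` and `v`, and a single scalar decay `gamma`. All function and variable names are hypothetical. It shows that a quadratic "mask" form (explicit causal T × T mixing) and a step-wise recurrent form (a rank-r state carried forward) yield the same output, which is the kind of memory/speed trade-off the two backends refer to; the repository's block backend may be organized differently. The `refine` helper assumes the equilibrium loop is a damped fixed-point update, in which η = 0 leaves the input unchanged.

```python
import random


def k2_mask(x, k, u, v, gamma):
    """Quadratic 'mask' form: evaluate every causal (t, s) pair explicitly."""
    T = len(x)
    y = []
    for t in range(T):
        acc = 0.0
        for s in range(t + 1):  # causal: only positions s <= t contribute
            low_rank = sum(a * b for a, b in zip(u[t], v[s]))
            # base kernel tap plus gamma-decayed low-rank interaction
            acc += (k[t - s] + gamma ** (t - s) * low_rank) * x[s]
        y.append(acc)
    return y


def k2_recurrent(x, k, u, v, gamma):
    """Same output, with the decayed low-rank term carried in a rank-r state."""
    T, r = len(x), len(u[0])
    h = [0.0] * r
    y = []
    for t in range(T):
        # h_t = gamma * h_{t-1} + v_t * x_t
        h = [gamma * hi + vi * x[t] for hi, vi in zip(h, v[t])]
        base = sum(k[t - s] * x[s] for s in range(t + 1))
        y.append(base + sum(a * b for a, b in zip(u[t], h)))
    return y


def refine(f, z, eta, steps):
    """Assumed damped fixed-point loop: z <- z + eta * (f(z) - z)."""
    for _ in range(steps):
        fz = f(z)
        z = [zi + eta * (fi - zi) for zi, fi in zip(z, fz)]
    return z


if __name__ == "__main__":
    random.seed(0)
    T, r, gamma = 8, 2, 0.9
    x = [random.uniform(-1, 1) for _ in range(T)]
    k = [random.uniform(-1, 1) for _ in range(T)]
    u = [[random.uniform(-1, 1) for _ in range(r)] for _ in range(T)]
    v = [[random.uniform(-1, 1) for _ in range(r)] for _ in range(T)]
    y_mask = k2_mask(x, k, u, v, gamma)
    y_rec = k2_recurrent(x, k, u, v, gamma)
    print("max |mask - recurrent| =",
          max(abs(a - b) for a, b in zip(y_mask, y_rec)))
```

In this sketch, position enters only through the offset t − s and the order of the recurrence, so no separate positional encoding appears. Setting `eta` close to zero in `refine` effectively switches the loop off, which is one way to picture the η suppression mentioned above.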
Files
k2_dynamics.pdf
Additional details
Software
- Repository URL: https://github.com/AileenKoneko/K-language-model
- Programming language: Python
- Development status: Active