Kernels Might Be What You Need: Efficient Sequence Modeling with K-Operators

Published March 13, 2026 | Version 0.1

Preprint Open

We introduce K-Operators, a kernel-decomposed sequence modeling architecture that replaces attention entirely with structured causal kernel operators. On Tiny Shakespeare character-level modeling, a 1.14M-parameter K-Operators model achieves 4.43 ± 0.05 validation perplexity across 7 seeds, approaching the 4.35 PPL of a 10.65M-parameter Transformer baseline (nanoGPT) while using 9.3× fewer parameters and requiring no positional encodings.
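The headline figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the reported perplexity is the exponential of the per-character cross-entropy in nats, which is the usual convention but is not stated in the abstract; the variable names are illustrative only.

```python
import math

# Figures quoted in the abstract.
transformer_params = 10.65e6   # nanoGPT baseline
k_operator_params = 1.14e6     # K-Operators model
ppl_transformer = 4.35
ppl_k_operator = 4.43

# Parameter ratio: how many times smaller the K-Operators model is.
param_ratio = transformer_params / k_operator_params

# Assumption: PPL = exp(cross-entropy in nats per character),
# so the corresponding validation losses are the natural logs.
loss_transformer = math.log(ppl_transformer)
loss_k_operator = math.log(ppl_k_operator)

print(f"parameter ratio: {param_ratio:.1f}x")
print(f"loss gap (nats/char): {loss_k_operator - loss_transformer:.3f}")
```

Under that assumption, the perplexity gap of 0.08 corresponds to a difference of roughly 0.02 nats per character.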

The architecture decomposes sequence mixing into a hierarchy of operators: K1 layers for position-wise feature mixing, K2 layers for causal sequence interaction via a learned base kernel combined with low-rank gamma-decayed recurrence, and a K0 layer for final rescaling. These are composed into a K-Stack backbone (K1 → K2 (×N) → K1 → K0) and refined through a learned iterative equilibrium loop governed by a scalar step-size η. Two interchangeable gamma-decay backends (mask and block) offer different memory/speed trade-offs. Diagnostic analysis reveals interpretable learned dynamics: the model progressively transfers sequence mixing from the initialized base kernel to the adaptive recurrent path, develops per-layer functional specialization, and learns to self-regulate the refinement loop, including robustness to 10× learning rate misspecification via automatic η suppression.
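To make the K2 description concrete, here is a minimal NumPy sketch of one way a causal base kernel could be combined with a low-rank gamma-decayed recurrence. It is not the paper's implementation: the function names, the shapes, the projections `U` and `V`, and the damped update in `refine` are all assumptions made for illustration. The gamma-decay term is computed twice, once with an explicit decay mask and once with a sequential scan; the scan is only a stand-in for comparison and is not claimed to match the paper's block backend.

```python
import numpy as np

def decay_mask(gamma, T):
    """(T, T, r) tensor holding gamma[k] ** (t - s) for s <= t, else 0."""
    lag = np.arange(T)[:, None] - np.arange(T)[None, :]   # t - s
    causal = lag >= 0
    powers = gamma[None, None, :] ** np.maximum(lag, 0)[:, :, None]
    return powers * causal[:, :, None]

def k2_mask(x, base, U, V, gamma):
    """Hypothetical K2-style layer, mask form: materializes the decay mask."""
    u = x @ U                                              # (T, r) low-rank projection
    h = np.einsum("tsk,sk->tk", decay_mask(gamma, x.shape[0]), u)
    return np.tril(base) @ x + h @ V.T                     # causal base kernel + recurrent path

def k2_scan(x, base, U, V, gamma):
    """Same operator via a sequential recurrence h_t = gamma * h_{t-1} + u_t."""
    u = x @ U
    h = np.zeros_like(u)
    state = np.zeros(U.shape[1])
    for t in range(x.shape[0]):
        state = gamma * state + u[t]
        h[t] = state
    return np.tril(base) @ x + h @ V.T

def refine(x, step, eta, n_iters=3):
    """Assumed damped fixed-point update z <- z + eta * (step(z) - z)."""
    z = x
    for _ in range(n_iters):
        z = z + eta * (step(z) - z)
    return z

# Tiny example with random weights (T positions, d features, rank r).
rng = np.random.default_rng(0)
T, d, r = 6, 4, 2
x = rng.normal(size=(T, d))
base = rng.normal(size=(T, T))
U = rng.normal(size=(d, r))
V = rng.normal(size=(d, r))
gamma = np.array([0.5, 0.9])

y_mask = k2_mask(x, base, U, V, gamma)
y_scan = k2_scan(x, base, U, V, gamma)
print(np.allclose(y_mask, y_scan))
```

The mask form costs O(T² r) memory but is a single batched contraction, while the scan keeps only an r-dimensional state per step, which illustrates the kind of memory/speed trade-off the abstract mentions. In this sketch, setting `eta` to 0 makes `refine` return its input unchanged, one simple way a learned step size could switch the refinement loop off.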

Files

Name	MD5	Size
k2_dynamics.pdf	b57abc5b849f3fdf3586371484b3b34b	30.8 kB
k_operators_paper-3.pdf	448cd1829847418bc23bae5f4ca56c33	350.9 kB
k_operators_paper.tex	aea0663186e2540b58b7a7eb9f6071f8	40.6 kB