K-Operators: A Linear-Time Sequence Mixer with Learned Decayed Positional Kernels
Description
We introduce K-Operators, a sequence modeling architecture that achieves linear-time
execution by combining learned exponential decay with learnable causal positional kernels. The
core K2 layer decomposes sequence mixing into two complementary paths: (1) a low-rank
gamma-decayed recurrent interaction with per-channel learned decay rates spanning short to long
memory, and (2) a learnable causal base kernel Kbase that provides the asymmetric local
correction exponential decay alone cannot express.
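As a rough illustration of the two-path decomposition, the mixing can be sketched as below. This is not the released implementation: the names, shapes, and dense per-channel recurrence are simplifying assumptions (the paper's recurrent path is low-rank).

```python
import numpy as np

def k2_mix(x, gamma, k_base):
    """Sketch of a K2-style two-path mixer (hypothetical names/shapes).

    x:      (T, D) input sequence
    gamma:  (D,) per-channel decay rates in (0, 1)
    k_base: (K, D) causal base kernel with K local taps
    """
    T, D = x.shape
    # Path 1: gamma-decayed recurrence, h_t = gamma * h_{t-1} + x_t
    # (linear in T; the low-rank structure from the paper is omitted here)
    h = np.zeros((T, D))
    state = np.zeros(D)
    for t in range(T):
        state = gamma * state + x[t]
        h[t] = state
    # Path 2: causal convolution with the base kernel — an asymmetric
    # local correction that the exponential-decay path cannot express
    c = np.zeros((T, D))
    for t in range(T):
        for k in range(k_base.shape[0]):
            if t - k >= 0:
                c[t] += k_base[k] * x[t - k]
    return h + c
```

Both paths cost O(T) per channel, which is what keeps the overall mixer linear in sequence length.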
Systematic ablation across tokenization granularities reveals that removing either component
degrades performance even under equal parameter budgets: on WikiText-2 (subword), the full
architecture achieves 19.99 ± 0.09 PPL at 4.08M parameters (5-seed sweep) vs. 20.99 PPL
for an equal-capacity model without Kbase; on Tiny Shakespeare (character-level), 4.41 ± 0.01
PPL at 0.81M parameters (5-seed sweep) vs. 4.78 PPL without Kbase, with the full model within
0.06 PPL of a 10.65M-parameter Transformer baseline. The optimal contribution of Kbase scales inversely
with token granularity—∼4% for character-level, ∼0.5% for subword—but is never zero. This
ratio is discovered automatically via gradient descent with a sigmoid floor that acts as implicit
architectural regularization.
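A gate of this kind can be realized as a sigmoid with an additive floor, so gradient descent can shrink the Kbase contribution toward the floor but never to exactly zero. The function name and floor value below are hypothetical, not taken from the release:

```python
import numpy as np

def kbase_weight(alpha_logit, floor=0.005):
    # Sigmoid gate with a fixed floor: the learned mixing ratio for the
    # Kbase path stays strictly above `floor`, acting as an implicit
    # architectural regularizer that keeps the path alive during training.
    s = 1.0 / (1.0 + np.exp(-alpha_logit))
    return floor + (1.0 - floor) * s
```

Even when the optimizer pushes `alpha_logit` strongly negative, the Kbase path retains a small, non-zero share of the output, matching the observation that its optimal contribution is small but never zero.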
Uncapping the gamma decay range from [0.85, 0.995] to [0.15, 0.995] yields substantial gains:
the model learns to use the full spectrum, with some channels selecting γ ≈ 0.15 (2-token effective
window) while others maintain γ > 0.99 (100+ token memory). The architecture does not require
explicit positional encodings; positional information is instead captured implicitly through the
learned causal kernel structure.
We also describe an iterative equilibrium refinement loop with a learned step size η. While
the refinement is mathematically motivated, ablations show it consistently hurts performance in
our experiments; we document it for completeness and future investigation.
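For concreteness, a damped fixed-point loop of the kind described can be sketched as follows. Here η is a plain float rather than a learned parameter, and the specific update rule is our assumption of the standard form:

```python
def refine(z0, f, eta=0.5, n_steps=3):
    # Damped fixed-point iteration toward an equilibrium z* = f(z*):
    #   z <- z + eta * (f(z) - z)
    # eta = 1 recovers plain iteration z <- f(z); smaller eta damps the update.
    z = z0
    for _ in range(n_steps):
        z = z + eta * (f(z) - z)
    return z
```

For a contractive `f` this converges to the fixed point, which is the mathematical motivation; the ablation result reported above is that this refinement nonetheless degrades language-modeling performance in practice.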
Files (2.1 MB total)

1hg260.png

| Name | Size |
|---|---|
| md5:7f56683965e5e4c43a16b07e95e2d503 | 218.4 kB |
| md5:17cafc80886ab54aa2de0cc08ec47b97 | 166.1 kB |
| md5:ca5b892ea1f054e0afa227f9df288cca | 35.1 kB |
| md5:eddfa111acd20ca07cd58cefa11f40cd | 1.1 MB |
| md5:2aa4547684562a19776922ef2a82cd0a | 176.3 kB |
| md5:69a46d2d5a453f3d7e794e33eded17f8 | 366.4 kB |
Additional details

Related works
- Is new version of: Preprint 10.5281/zenodo.19004569 (DOI)
- Is supplemented by: https://github.com/AileenKoneko/K-language-model (URL)

Software
- Repository URL: https://github.com/AileenKoneko/K-language-model
- Programming language: Python
- Development status: Active