Published May 7, 2026 | Version v1
Publication | Open Access

Cut MoE Inference Costs by 60–80%

Authors/Creators

Description

Modular MoE restructures how expert weights are stored, routed, and updated. Instead of keeping every expert resident across 5–8 GPUs, we extract a frozen shared core, compress the per-expert residuals by 8–16× using hierarchical shared-core extraction combined with S2LC (Shared Spectral Low-Rank Compression), and load only the active domain module on demand. The result: 1–2 GPUs per instance, sub-millisecond domain switching, and the ability to add or roll back capabilities without retraining.
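
To make the storage arithmetic concrete, the sketch below shows one way the residual compression could look: a shared core taken as the element-wise mean of the expert weights, per-expert residuals factored with truncated SVD, and a single expert reconstructed on demand. This is an illustrative assumption, not the S2LC algorithm from the white paper; the function names and the mean-based core extraction are hypothetical.

import numpy as np

def extract_shared_core(expert_weights):
    # Frozen shared core, approximated here as the element-wise mean of
    # all expert weight matrices (a simple stand-in, not the paper's method).
    return np.mean(expert_weights, axis=0)

def compress_residual(residual, rank):
    # Spectral low-rank compression of one per-expert residual via truncated SVD.
    # Storing (U*S, V) at rank r instead of the full d_out x d_in matrix costs
    # r*(d_out + d_in) parameters instead of d_out*d_in.
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]

def load_expert(core, compressed_residual):
    # On-demand reconstruction: shared core plus the decompressed residual.
    A, B = compressed_residual
    return core + A @ B

# Toy example: 8 experts with 512x2048 weight matrices, residual rank 32.
rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 512, 2048))
core = extract_shared_core(experts)
compressed = [compress_residual(W - core, rank=32) for W in experts]

dense_params = experts[0].size                       # full per-expert storage
lowrank_params = sum(x.size for x in compressed[0])  # compressed per-expert storage
print(f"compression ratio: {dense_params / lowrank_params:.1f}x")

active = load_expert(core, compressed[3])            # swap in one domain module

With these toy dimensions the factored residual is roughly 12× smaller than the dense expert, which is the kind of ratio the 8–16× figure above refers to; the actual ratio in the paper depends on the chosen rank and layer shapes.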

Files (432.0 kB)

S2LC_White_Paper_v8.pdf, 432.0 kB (md5:e7aa7fd944f6bbee0f93731feb2eaef9)