Published May 7, 2026 | Version v1
Publication
Open
Cut MoE Inference Costs by 60–80%
Authors/Creators
Description
Modular MoE restructures how expert weights are stored, routed, and updated. Instead of keeping every expert resident across 5–8 GPUs, we extract a frozen shared core, compress the per-expert residuals by 8–16× using hierarchical shared-core extraction combined with S2LC (Shared Spectral Low-Rank Compression), and load only the active domain module on demand. The result: 1–2 GPUs per instance, sub-millisecond domain switching, and the ability to add or roll back capabilities without retraining.
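The general idea of a shared core plus compressed per-expert residuals can be illustrated with a small NumPy sketch. This is not the S2LC algorithm from the white paper: it assumes, purely for illustration, that the shared core is the element-wise mean of the expert weight matrices and that each residual is compressed independently with a truncated SVD. The function names, matrix sizes, and rank below are all hypothetical.

```python
import numpy as np


def extract_shared_core(experts):
    """Assumption: the shared core is the element-wise mean of all expert weights."""
    return np.mean(experts, axis=0)


def compress_residual(residual, rank):
    """Truncated SVD: keep the top-`rank` singular triplets of one residual."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    # Fold the singular values into the left factor so only two matrices are stored.
    return u[:, :rank] * s[:rank], vt[:rank, :]


def reconstruct(core, factors):
    """Rebuild an expert's weights from the shared core and its low-rank module."""
    left, right = factors
    return core + left @ right


def compression_ratio(shape, rank):
    """Dense residual parameter count divided by low-rank factor parameter count."""
    rows, cols = shape
    return (rows * cols) / (rank * (rows + cols))


# Toy data (hypothetical): 4 experts that differ from a common matrix
# by a rank-2 update each.
rng = np.random.default_rng(0)
dim, n_experts, update_rank = 256, 4, 2
base = rng.standard_normal((dim, dim))
experts = np.stack([
    base
    + rng.standard_normal((dim, update_rank)) @ rng.standard_normal((update_rank, dim))
    for _ in range(n_experts)
])

core = extract_shared_core(experts)

# Subtracting the mean mixes all updates, so each residual has rank
# at most n_experts * update_rank = 8.
rank = n_experts * update_rank
modules = [compress_residual(w - core, rank) for w in experts]

max_error = max(
    np.max(np.abs(reconstruct(core, m) - w)) for m, w in zip(modules, experts)
)
ratio = compression_ratio((dim, dim), rank)  # 65536 / (8 * 512) = 16.0
```

In this toy setup the residuals are exactly low-rank, so reconstruction is lossless up to floating-point error and the per-residual ratio is 16×. Real expert weights are not exactly low-rank, so the achievable ratio and the accuracy cost depend on the spectrum of the residuals. Under this scheme only the shared core stays resident, and a domain module is just the pair of small factors loaded on demand.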
Files
| Name | Size | Checksum |
|---|---|---|
| S2LC_White_Paper_v8.pdf | 432.0 kB | md5:e7aa7fd944f6bbee0f93731feb2eaef9 |