Published May 7, 2026 | Version v1
Publication | Open Access

Cut MoE Inference Costs by 60–80%

Authors/Creators

Description

Modular MoE restructures how expert weights are stored, routed, and updated. Instead of keeping every expert resident across 5–8 GPUs, we extract a frozen shared core, compress the per-expert residuals by 8–16× using hierarchical shared-core extraction combined with S2LC (Shared Spectral Low-Rank Compression), and load only the active domain module on demand. The result: 1–2 GPUs per instance, sub-millisecond domain switching, and the ability to add or roll back capabilities without retraining.
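
To make the storage arithmetic concrete, the sketch below shows one way the residual compression could look: a shared core taken as the element-wise mean of the expert weights, per-expert residuals factored with truncated SVD, and a single expert reconstructed on demand. This is an illustrative assumption, not the S2LC algorithm from the white paper; the function names and the mean-based core extraction are hypothetical.

import numpy as np

def extract_shared_core(expert_weights):
    # Frozen shared core, approximated here as the element-wise mean of
    # all expert weight matrices (a simple stand-in, not the paper's method).
    return np.mean(expert_weights, axis=0)

def compress_residual(residual, rank):
    # Spectral low-rank compression of one per-expert residual via truncated SVD.
    # Storing (U*S, V) at rank r instead of the full d_out x d_in matrix costs
    # r*(d_out + d_in) parameters instead of d_out*d_in.
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank, :]

def load_expert(core, compressed_residual):
    # On-demand reconstruction: shared core plus the decompressed residual.
    A, B = compressed_residual
    return core + A @ B

# Toy example: 8 experts with 512x2048 weight matrices, residual rank 32.
rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 512, 2048))
core = extract_shared_core(experts)
compressed = [compress_residual(W - core, rank=32) for W in experts]

dense_params = experts[0].size                       # full per-expert storage
lowrank_params = sum(x.size for x in compressed[0])  # compressed per-expert storage
print(f"compression ratio: {dense_params / lowrank_params:.1f}x")

active = load_expert(core, compressed[3])            # swap in one domain module

With these toy dimensions the factored residual is roughly 12× smaller than the dense expert, which is the kind of ratio the 8–16× figure above refers to; the actual ratio in the paper depends on the chosen rank and layer shapes.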

Files (432.0 kB)

S2LC_White_Paper_v8.pdf, 432.0 kB (md5:e7aa7fd944f6bbee0f93731feb2eaef9)