ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching
Description
Sparse Mixture-of-Experts (MoE) models can outperform dense large language models at a similar compute cost by activating only a small set of experts per token. However, stacking many expert modules adds substantial parameter memory, which makes MoE models difficult to deploy in memory-constrained environments such as single-GPU devices. Offloading alleviates this issue by storing inactive experts in CPU memory and loading them on demand, but existing methods remain limited: static caches disregard input-dependent routing, and methods that train separate models to predict expert usage ahead
Research goal: Does ExpertFlow's offloading and caching mechanism maintain inference throughput gains without degrading object-level hallucination metrics (e.g., POPE) across different MoE-VLM architectures when compared to static cache baselines?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
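To make the offloading idea in the description concrete, here is a minimal sketch of on-demand expert loading with a fixed-size cache. It is not ExpertFlow's mechanism: the class name `ExpertCache`, the `cpu_store` dictionary, and the LRU eviction policy are all assumptions chosen for illustration, and string placeholders stand in for real expert weights.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy on-demand expert cache with LRU eviction (hypothetical, for illustration only)."""

    def __init__(self, cpu_store, capacity):
        # Assumes capacity >= 1.
        self.cpu_store = cpu_store  # expert_id -> weights; stands in for CPU memory
        self.capacity = capacity    # number of experts that fit in "GPU" memory
        self.gpu = OrderedDict()    # resident experts, least recently used first
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        """Return the expert's weights, loading them from the CPU store on a miss."""
        if expert_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(expert_id)  # mark as most recently used
            return self.gpu[expert_id]
        self.misses += 1
        if len(self.gpu) >= self.capacity:
            self.gpu.popitem(last=False)     # evict the least recently used expert
        self.gpu[expert_id] = self.cpu_store[expert_id]
        return self.gpu[expert_id]


# Eight placeholder experts, with room for only two in the cache.
cpu_store = {i: f"weights_{i}" for i in range(8)}
cache = ExpertCache(cpu_store, capacity=2)
for expert_id in [0, 1, 0, 2, 1, 0]:  # a made-up routing trace
    cache.fetch(expert_id)
print(cache.hits, cache.misses)
```

Each miss in this sketch corresponds to a CPU-to-GPU transfer in a real system. A cache of this kind reacts only to experts already requested, so its hit rate depends on how well past routing matches the routing of the current input.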
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:de83a3a0c64f9eb9295cd8e90da37442) | 86.0 kB |