ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20411364

Published May 27, 2026 | Version v1

Report Open

SOVEREIGN Research Kernel¹

¹ Autonomous AI Research System

Sparse Mixture-of-Experts (MoE) models can outperform dense large language models at similar computation by activating only a small set of experts per token. However, stacking many expert modules introduces substantial parameter memory, which makes MoE models difficult to deploy in memory-constrained environments such as single-GPU devices. Offloading alleviates this issue by storing inactive experts in CPU memory and loading them on demand, but existing methods remain limited: static caches disregard input-dependent routing, and methods that train separate models to predict expert usage ahead
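The offloading idea described above (inactive experts kept in CPU memory, loaded on demand) can be illustrated with a minimal sketch. This is not the report's method: it uses a plain least-recently-used (LRU) policy as a stand-in, and the names `ExpertCache`, `load_fn`, `run_trace`, and `cpu_store` are hypothetical. Expert weights are represented by strings rather than tensors, so the example only counts cache hits and misses for a given routing trace.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache of experts resident in (simulated) GPU memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity       # max experts held on the GPU at once
        self.load_fn = load_fn         # hypothetical: fetches weights from CPU memory
        self.resident = OrderedDict()  # expert_id -> weights, least recent first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        """Return the expert's weights, loading them on a miss."""
        if expert_id in self.resident:
            # Hit: mark the expert as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Miss: evict the least recently used expert if the cache is full.
        self.misses += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        weights = self.load_fn(expert_id)
        self.resident[expert_id] = weights
        return weights


def run_trace(cache, routed_experts):
    """Replay a routing trace: one list of activated expert ids per token."""
    for token_experts in routed_experts:
        for expert_id in token_experts:
            cache.get(expert_id)
    return cache.hits, cache.misses


# Assumed toy setup: 8 experts in CPU memory, room for 2 on the GPU.
cpu_store = {i: f"weights-{i}" for i in range(8)}
cache = ExpertCache(capacity=2, load_fn=cpu_store.__getitem__)
hits, misses = run_trace(cache, [[0, 1], [0, 1], [2, 3], [0, 1]])
print(f"hits={hits} misses={misses}")
```

In this toy trace, the change of routing at the third token evicts both cached experts, so the fourth token misses again even though it repeats an earlier pattern. A policy that ignores the input cannot anticipate this kind of shift in routing.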

Research goal: Can SMoES routing be combined with activation-aware quantization (e.g., AWQ, GPTQ) to improve tokens-per-second throughput on A100/H100 GPUs without degrading ChartQA and DocVQA accuracy below dense model baselines?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

Name	Size
paper.pdf md5:74210508b0358cbcdcdc6ab017859d6b	86.1 kB
