How does the predictive expert caching latency and token scheduling overhead affect end-to-end tokens-per-seco
Description
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vi
Research goal: How does the predictive expert caching latency and token scheduling overhead affect end-to-end tokens-per-second throughput on multimodal reasoning benchmarks (MMMU, MathVista) for MoE-LLaVA compared to dense model baselines at 7B and 13B parameter scales?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.8/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:2bfc9540ec8c7c50a66ee19b2766de54) | 84.8 kB |