Scaling Laws for Native Multimodal Models
Description
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs), those trained from the ground up on all modalities, and conduct an extensive sca
Research goal: Does SMoES's soft modality-guided routing improve MoE-VLM accuracy on the MMMU benchmark compared to dense models of equivalent total parameter count, and how does this gap change when scaling from 7B to 34B total parameters?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.7/10.
Notes
Files
paper.pdf (82.8 kB)
md5:feda58e73ea325f52b5bf01b0b9cf52e