Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Szatkowski, Filip; Wójcik, Bartosz; Piórczyński, Mikołaj; Scardapane, Simone

doi:10.5281/zenodo.14409485

Published December 15, 2024 | Version v1

Publication Open

1. IDEAS NCBR
2. Warsaw University of Technology
3. Jagiellonian University
4. Sapienza University of Rome

Transformer models can face practical limitations due to their high computational
requirements. At the same time, such models exhibit significant activation sparsity,
which can be leveraged to reduce the inference cost by converting parts of the
network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role
played by activation sparsity, its impact on this process remains unexplored. We
demonstrate that the efficiency of the conversion can be significantly enhanced
by a proper regularization of the activation sparsity of the base model. Moreover,
motivated by the high variance of the number of activated neurons for different
inputs, we introduce a more effective dynamic-k expert selection rule that adjusts
the number of executed experts on a per-token basis. To achieve further savings,
we extend this approach to multi-head attention projections. Finally, we develop an
efficient implementation that translates these computational savings into actual
wall-clock speedup. The proposed method, Dense to Dynamic-k Mixture-of-Experts
(D2DMoE), outperforms existing approaches on common NLP and vision tasks,
reducing inference cost by up to 60% without significantly impacting performance.
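
The abstract does not spell out the dynamic-k selection rule, so the following is only a minimal sketch of what a per-token rule of this kind could look like. It assumes each expert receives a non-negative importance score for the current token (for example, a router's estimate of how much that expert contributes) and keeps every expert whose score reaches a fraction tau of the largest score. The function name `dynamic_k_select`, the scores, and the threshold `tau` are hypothetical and are not taken from the paper or its code.

```python
from typing import List


def dynamic_k_select(scores: List[float], tau: float) -> List[int]:
    """Return indices of experts whose score is at least tau * max(scores).

    Assumptions (illustrative, not from the paper):
    - `scores` holds non-negative per-expert importance scores for one token.
    - `tau` is a threshold in [0, 1]; larger values select fewer experts.
    """
    if not scores:
        return []
    cutoff = tau * max(scores)
    return [i for i, s in enumerate(scores) if s >= cutoff]


# Two hypothetical tokens with different score profiles.
peaked_token = [0.05, 0.90, 0.10, 0.02]  # one expert dominates
flat_token = [0.40, 0.50, 0.45, 0.48]    # importance is spread out

# The same threshold yields a different number of executed experts per token.
selected_peaked = dynamic_k_select(peaked_token, tau=0.5)
selected_flat = dynamic_k_select(flat_token, tau=0.5)
print(len(selected_peaked), len(selected_flat))
```

Unlike a fixed top-k rule, which always runs the same number of experts, a relative threshold like this lets tokens with concentrated scores run few experts and tokens with spread-out scores run more, which is the behaviour the abstract motivates with the high variance in the number of activated neurons.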

Files

Name	Size
Exploiting Activation Sparsity.pdf (md5:3a9f64edd6fed121be8b581333134494)	4.9 MB

Additional details

European Commission
ELIAS - European Lighthouse of AI for Sustainability 101120237

Repository URL: https://arxiv.org/abs/2310.04361

