What is the impact of dynamic token count on FLOPs efficiency and reasoning accuracy when processing variable-
Description
Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing less important tokens. While effective in vision tasks by discarding non-object regions, applying this technique to audio tasks presents unique challenges, as distinguishing relevant from irrelevant regions in time-frequency representations is less straightforward. In this study, for the first time, we applied token pruning to ViT-based audio classification m
Research goal: What is the impact of dynamic token count on FLOPs efficiency and reasoning accuracy when processing variable-complexity images with different tokenization strategies?
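To make the link between token count and FLOPs concrete, the sketch below estimates the multiply-accumulate cost of one standard transformer encoder block as a function of the number of tokens, and shows a simple top-k token selection. It is an illustration under stated assumptions, not the method of the report: the function names `block_macs` and `keep_top_k` are hypothetical, the cost formula is the common approximation that ignores softmax, layer normalization, and biases, and the importance scores are placeholders for whatever criterion a pruning method would supply.

```python
def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one transformer encoder block.

    Assumes standard multi-head self-attention plus a two-layer MLP;
    softmax, layer norm, and bias terms are ignored.
    """
    qkv_and_proj = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim    # QK^T and the weighted sum over V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim     # two linear layers of the MLP
    return qkv_and_proj + attention + mlp


def keep_top_k(scores: list[float], keep_ratio: float) -> list[int]:
    """Return indices of the highest-scoring tokens, in their original order.

    `scores` are hypothetical per-token importance values.
    """
    k = max(1, round(len(scores) * keep_ratio))      # always keep at least one token
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    full = block_macs(num_tokens=196, dim=768)       # e.g. a 14x14 patch grid
    pruned = block_macs(num_tokens=98, dim=768)      # half the tokens kept
    print(f"relative cost after pruning: {pruned / full:.3f}")
    print(keep_top_k([0.1, 0.9, 0.5, 0.7], keep_ratio=0.5))
```

Because the attention term grows quadratically with the token count while the projection and MLP terms grow linearly, halving the tokens reduces the estimated cost to slightly less than half. The sketch says nothing about accuracy, which depends on which tokens are removed.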
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:0a70e472681407a14541c67903044ae0) | 84.4 kB |