Microbenchmarking Instruction-Level Tensor Core Throttling in NVIDIA CMP 170HX
Description
This study takes the flagship NVIDIA CMP 170HX as its research subject, employing microbenchmarking methods to systematically investigate the instruction-level performance-limiting mechanisms of its Tensor Cores. The core findings and contributions are threefold. First, experiments reveal for the first time a fixed 256-cycle instruction execution throttling phenomenon in the CMP 170HX's Tensor Cores: the latency of a single MMA instruction is unaffected by the degree of instruction-level parallelism (ILP) and cannot be hidden through pipeline overlap. Furthermore, only 4 warps per streaming multiprocessor (SM) can issue Tensor Core instructions concurrently, ultimately limiting achieved FP16 Tensor Core throughput to 1/32 of the theoretical peak. Second, through a series of controlled experiments (ILP scaling, warp scaling, dependency-chain construction, and cross-pipeline interference), the throttling mechanism is precisely pinpointed as a dispatch-level hardware gating limitation, rather than physical damage to the execution units or decoding delays. Third, based on the experimental results, a theoretical model linking the microarchitecture to macroscopic throughput is constructed, completing a closed theoretical loop from the 256-cycle fixed latency and 4-warp issue limit to the measured aggregate throughput of 6.3 TFLOPS.
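The 256-cycle fixed latency and 4-warp issue limit combine directly into the stated 6.3 TFLOPS figure. A minimal sketch of that arithmetic, assuming a GA100-derived configuration of 70 SMs at a 1.41 GHz clock and the m16n8k16 HMMA tile shape (these specific parameters are assumptions typical of this part, not values quoted in the abstract):

```python
# Microarchitecture-to-throughput model for the throttled CMP 170HX.
# SM count, clock, and MMA tile shape are assumed, not from the record.

MMA_LATENCY_CYCLES = 256         # fixed per-instruction latency (measured)
ACTIVE_MMA_WARPS = 4             # warps/SM that may issue Tensor Core ops
FLOPS_PER_MMA = 16 * 8 * 16 * 2  # m16n8k16 FP16 MMA: M*N*K multiply-adds
NUM_SMS = 70                     # assumed GA100-derived SM count
CLOCK_HZ = 1.41e9                # assumed boost clock

# Sustained Tensor Core work per SM per cycle under the throttle.
flops_per_sm_per_cycle = ACTIVE_MMA_WARPS * FLOPS_PER_MMA / MMA_LATENCY_CYCLES

achieved_tflops = flops_per_sm_per_cycle * NUM_SMS * CLOCK_HZ / 1e12

# An unthrottled GA100-class SM sustains 2048 FP16 tensor FLOPs/cycle,
# so the throttled rate of 64 FLOPs/cycle is exactly 1/32 of peak.
PEAK_FLOPS_PER_SM_PER_CYCLE = 2048
ratio = flops_per_sm_per_cycle / PEAK_FLOPS_PER_SM_PER_CYCLE

print(f"{achieved_tflops:.1f} TFLOPS, 1/{round(1 / ratio)} of peak")
```

Under these assumed parameters the model reproduces both headline numbers: 64 FLOPs per SM per cycle yields roughly 6.3 TFLOPS, which is 1/32 of the unthrottled peak.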
Files
paper3-20260307v3.pdf (1.4 MB)
md5:b77f7e011cb624e310fae2cb357a34fb
Additional details
Related works
- Is continued by
- Preprint: 10.5281/zenodo.18994970 (DOI)
Dates
- Created: 2026-03-07