
Published March 13, 2026 | Version v1
Preprint Open

Microbenchmarking Instruction-Level Tensor Core Throttling in NVIDIA CMP 170HX

  • 1. Independent Researcher, Nanjing, China

Description

This study takes the flagship NVIDIA CMP 170HX as its research subject, using microbenchmarking methods to systematically investigate the instruction-level performance-limiting mechanisms of its Tensor Cores. The core findings and contributions are threefold. First, the experiments reveal for the first time a fixed 256-cycle instruction execution throttling phenomenon in the CMP 170HX's Tensor Cores: the latency of a single MMA instruction is unaffected by the degree of Instruction-Level Parallelism (ILP) and cannot be hidden through pipeline overlap. Furthermore, only 4 warps per Streaming Multiprocessor (SM) can issue Tensor Core instructions concurrently, so the achievable FP16 Tensor Core throughput is only 1/32 of the theoretical peak. Second, through multiple controlled experiments, including ILP scaling, warp scaling, dependency-chain construction, and cross-pipeline interference, the throttling mechanism is precisely pinpointed as a dispatch-level hardware gating limitation, rather than physical damage to the execution units or decoding delays. Third, based on the experimental results, a theoretical model linking microarchitecture to macroscopic throughput is constructed, closing the loop from the 256-cycle fixed latency and 4-warp issue limit to the measured aggregate throughput of 6.3 TFLOPS.
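The macroscopic model described above can be sketched as a back-of-the-envelope calculation. Note the hardware assumptions: the per-instruction FLOP count assumes the m16n8k16 FP16 HMMA shape (16 × 8 × 16 × 2 = 4096 FLOPs), and the SM count (70) and boost clock (~1.41 GHz) are assumed specifications of the GA100-based CMP 170HX, not values stated in this record:

```python
# Hedged sketch: derive the measured ~6.3 TFLOPS from the paper's two findings,
# a fixed 256-cycle MMA latency and a 4-warp-per-SM issue limit.

# Assumed hardware parameters (not stated in this record):
NUM_SMS = 70                       # GA100-based CMP 170HX SM count (assumption)
CLOCK_HZ = 1.41e9                  # approximate boost clock (assumption)
FLOPS_PER_MMA = 16 * 8 * 16 * 2    # m16n8k16 HMMA: one multiply + one add per MAC

# Measured limits reported in the paper:
LATENCY_CYCLES = 256               # fixed per-instruction latency
WARPS_PER_SM = 4                   # warps that can issue Tensor Core instructions

# Each issuing warp retires one MMA every 256 cycles (latency not hidden by ILP),
# so an SM sustains only 4 * 4096 / 256 = 64 FLOPs per cycle.
flops_per_sm_per_cycle = WARPS_PER_SM * FLOPS_PER_MMA / LATENCY_CYCLES
total_flops = NUM_SMS * flops_per_sm_per_cycle * CLOCK_HZ

print(f"Modeled throughput: {total_flops / 1e12:.2f} TFLOPS")
```

Under these assumptions the model yields roughly 6.3 TFLOPS, matching the measured aggregate throughput, and comparing it against the theoretical peak (all warps issuing back-to-back MMAs) reproduces the 1/32 ratio.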

Files

paper3-20260307v3.pdf (1.4 MB)
md5:b77f7e011cb624e310fae2cb357a34fb

Additional details

Related works

Is continued by
Preprint: 10.5281/zenodo.18994970 (DOI)

Dates

Created
2026-03-07