Microbenchmarking Instruction-Level Tensor Core Throttling in NVIDIA CMP 170HX
Description
This study takes the flagship NVIDIA CMP 170HX as its research subject, employing microbenchmarking methods to systematically investigate the instruction-level performance-limiting mechanisms of its Tensor Cores. The core findings and contributions are threefold. First, experiments reveal for the first time a fixed 256-cycle instruction execution throttling phenomenon in the CMP 170HX's Tensor Cores: the latency of a single MMA instruction is unaffected by the degree of instruction-level parallelism (ILP) and cannot be hidden through pipeline overlap. Furthermore, only 4 warps per streaming multiprocessor (SM) can issue Tensor Core instructions concurrently, ultimately limiting achieved FP16 Tensor Core throughput to 1/32 of the theoretical peak. Second, through a series of controlled experiments (ILP scaling, warp scaling, dependency-chain construction, and cross-pipeline interference), the throttling mechanism is precisely pinpointed as a dispatch-level hardware gating limitation, rather than physical damage to the execution units or decoding delays. Third, based on the experimental results, a theoretical model linking the microarchitecture to macroscopic throughput is constructed, completing a closed theoretical loop from the 256-cycle fixed latency and 4-warp issue limit to the measured aggregate throughput of 6.3 TFLOPS.
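The 256-cycle fixed latency and 4-warp issue limit combine directly into the stated 6.3 TFLOPS figure. A minimal sketch of that arithmetic, assuming a GA100-derived configuration of 70 SMs at a 1.41 GHz clock and the m16n8k16 HMMA tile shape (these specific parameters are assumptions typical of this part, not values quoted in the abstract):

```python
# Microarchitecture-to-throughput model for the throttled CMP 170HX.
# SM count, clock, and MMA tile shape are assumed, not from the record.

MMA_LATENCY_CYCLES = 256         # fixed per-instruction latency (measured)
ACTIVE_MMA_WARPS = 4             # warps/SM that may issue Tensor Core ops
FLOPS_PER_MMA = 16 * 8 * 16 * 2  # m16n8k16 FP16 MMA: M*N*K multiply-adds
NUM_SMS = 70                     # assumed GA100-derived SM count
CLOCK_HZ = 1.41e9                # assumed boost clock

# Sustained Tensor Core work per SM per cycle under the throttle.
flops_per_sm_per_cycle = ACTIVE_MMA_WARPS * FLOPS_PER_MMA / MMA_LATENCY_CYCLES

achieved_tflops = flops_per_sm_per_cycle * NUM_SMS * CLOCK_HZ / 1e12

# An unthrottled GA100-class SM sustains 2048 FP16 tensor FLOPs/cycle,
# so the throttled rate of 64 FLOPs/cycle is exactly 1/32 of peak.
PEAK_FLOPS_PER_SM_PER_CYCLE = 2048
ratio = flops_per_sm_per_cycle / PEAK_FLOPS_PER_SM_PER_CYCLE

print(f"{achieved_tflops:.1f} TFLOPS, 1/{round(1 / ratio)} of peak")
```

Under these assumed parameters the model reproduces both headline numbers: 64 FLOPs per SM per cycle yields roughly 6.3 TFLOPS, which is 1/32 of the unthrottled peak.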
Files
paper3-20260307v3.pdf (1.4 MB)
md5:b77f7e011cb624e310fae2cb357a34fb
Additional details
Related works
- Is continued by
- Preprint: 10.5281/zenodo.18994970 (DOI)
Dates
- Created: 2026-03-07