Memory-Aware Latency Prediction Model for Concurrent Kernels in Partitionable GPUs: Simulations and Experiments
Creators
- Unimore
Description
The current trend in recently released Graphics Processing Units (GPUs) is to exploit
transistor scaling at the architectural level; hence, larger and larger GPUs are released with
every new chip generation. Architecturally, this implies that the number of clusters of parallel
processing elements embedded within a single GPU die is constantly increasing, posing novel and
interesting research challenges for performance engineering in latency-sensitive scenarios. A
single GPU kernel is now unlikely to scale linearly when dispatched to a GPU that features a
larger cluster count. This is due either to VRAM bandwidth acting as a bottleneck or to the
inability of the kernel to saturate the massively parallel compute power available in these
novel architectures.
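To make this concrete, the following is a minimal roofline-style sketch (not taken from the paper) of why adding clusters eventually stops reducing a kernel's runtime: all throughput and traffic figures are made-up placeholders, and the real behavior depends on the specific kernel and GPU.

```python
# Hypothetical roofline-style sketch (not from the paper): estimates how a single
# kernel's runtime scales with the number of GPU clusters assigned to it.
# FLOP counts, per-cluster throughput, and VRAM bandwidth are placeholder values.

def kernel_runtime(flops, bytes_moved, clusters,
                   flops_per_cluster=1.0e12, vram_bandwidth=8.0e11):
    """Runtime is bounded by whichever resource saturates first."""
    compute_time = flops / (clusters * flops_per_cluster)  # shrinks with more clusters
    memory_time = bytes_moved / vram_bandwidth             # fixed: VRAM is shared
    return max(compute_time, memory_time)

# Doubling the cluster count stops helping once the kernel becomes memory-bound.
for c in (4, 8, 16, 32):
    print(c, kernel_runtime(flops=2.0e12, bytes_moved=1.6e11, clusters=c))
```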
In this context, novel scheduling approaches might be derived if we consider the GPU as a
partitionable compute engine in which multiple concurrent kernels can be scheduled on non-
overlapping sets of clusters. While such an approach is very effective in improving overall GPU
utilization, it poses significant challenges in estimating kernel execution latencies when
kernels are dispatched to variable-sized GPU partitions. Moreover, memory interference among
co-running kernels is a mandatory aspect to consider. In this work, we derive a practical yet
fairly accurate memory-aware latency estimation model for co-running GPU kernels.
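As an illustration of what such an estimate can look like, the sketch below assigns two kernels to non-overlapping cluster partitions and charges each of them for the aggregate VRAM traffic of all co-runners. This is not the model derived in the paper; the interference term, constants, and partition sizes are assumptions chosen only to show the general shape of a memory-aware estimate.

```python
# Illustrative memory-aware latency estimate for co-running kernels on
# non-overlapping cluster partitions. All constants are placeholder assumptions;
# this is not the model from the paper.

def corun_latency(kernels, total_clusters,
                  flops_per_cluster=1.0e12, vram_bandwidth=8.0e11):
    """kernels: list of (flops, bytes_moved, clusters_assigned) tuples."""
    assert sum(c for _, _, c in kernels) <= total_clusters, "partitions must not overlap"
    # Assumed interference model: the VRAM traffic of all co-runners contends on the
    # same memory interface, so each kernel's memory phase is stretched by the
    # aggregate traffic rather than by its own traffic alone.
    aggregate_memory_time = sum(b for _, b, _ in kernels) / vram_bandwidth
    return [max(f / (c * flops_per_cluster), aggregate_memory_time)
            for f, _, c in kernels]

# Two kernels splitting a 16-cluster GPU into a 12-cluster and a 4-cluster partition.
print(corun_latency([(2.0e12, 4.0e10, 12), (5.0e11, 8.0e10, 4)], total_clusters=16))
```

In this toy example, the kernel on the smaller partition ends up bounded by the shared memory traffic rather than by its own compute time, which is exactly the kind of effect a memory-aware latency model has to capture.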
Files
- JSSPP23.pdf (1.9 MB, md5:d23a4c0e75aab5b3f646503ed4cad5f6)