Published May 15, 2023 | Version v1
Conference paper | Open Access

Memory-Aware Latency Prediction Model for Concurrent Kernels in Partitionable GPUs: Simulations and Experiments

Description

The current trend in recently released Graphics Processing Units (GPUs) is to exploit
transistor scaling at the architectural level, so each new chip generation ships larger and
larger GPUs. Architecturally, this means that the number of clusters of parallel processing
elements embedded within a single GPU die keeps increasing, posing novel and interesting
research challenges for performance engineering in latency-sensitive scenarios. A single GPU
kernel is now unlikely to scale linearly when dispatched on a GPU that features a larger cluster
count, either because VRAM bandwidth acts as a bottleneck or because the kernel cannot
saturate the massively parallel compute power available in these novel architectures.
In this context, novel scheduling approaches can be derived by considering the GPU as a
partitionable compute engine in which multiple concurrent kernels are scheduled on non-
overlapping sets of clusters. While such an approach is very effective at improving overall
GPU utilization, it poses significant challenges in estimating kernel execution latencies
when kernels are dispatched to variable-sized GPU partitions. Moreover, memory interference
among co-running kernels is a mandatory aspect to consider. In this work, we derive a practical
yet fairly accurate memory-aware latency estimation model for co-running GPU kernels.
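
The abstract does not reproduce the model itself; as a rough illustration of the kind of estimate it refers to, the sketch below combines a roofline-style bound on partition size with a simple bandwidth-sharing term for memory interference. All names, parameters, and formulas here are assumptions made for illustration, not the authors' actual model (that is derived in the paper).

```python
# Illustrative sketch only: a roofline-style, memory-aware latency estimate for a
# kernel running on a partition of `clusters` GPU clusters while other kernels
# co-run on the remaining clusters. The formulas are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Kernel:
    flops: float        # total floating-point work of the kernel (FLOP)
    bytes_moved: float  # total VRAM traffic generated by the kernel (bytes)


@dataclass
class Gpu:
    flops_per_cluster: float  # peak compute throughput of one cluster (FLOP/s)
    vram_bandwidth: float     # peak VRAM bandwidth shared by all clusters (B/s)


def estimate_latency(kernel: Kernel, gpu: Gpu, clusters: int,
                     co_runner_traffic: float) -> float:
    """Estimate kernel latency (seconds) on a `clusters`-sized partition.

    co_runner_traffic: aggregate VRAM traffic (B/s) demanded by kernels on the
    other partitions, used to model memory interference as bandwidth sharing.
    """
    # Compute-bound term: work divided by the compute power of the partition.
    compute_time = kernel.flops / (clusters * gpu.flops_per_cluster)

    # Memory-bound term: the kernel only receives a share of VRAM bandwidth
    # when co-runners also stress memory (proportional-sharing assumption).
    own_demand = kernel.bytes_moved / max(compute_time, 1e-12)
    share = (own_demand / (own_demand + co_runner_traffic)
             if co_runner_traffic > 0 else 1.0)
    memory_time = kernel.bytes_moved / (gpu.vram_bandwidth * share)

    # Roofline-style combination: latency is dominated by the slower resource.
    return max(compute_time, memory_time)


# Hypothetical example: a kernel on an 8-cluster partition with a busy co-runner.
k = Kernel(flops=2e12, bytes_moved=8e10)
g = Gpu(flops_per_cluster=1e12, vram_bandwidth=9e11)
print(estimate_latency(k, g, clusters=8, co_runner_traffic=4e11))
```

Under this kind of formulation, adding clusters shrinks only the compute term, so memory-bound kernels stop scaling once the memory term dominates; the interference term captures how co-runners on other partitions further inflate that memory term.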

Files

JSSPP23.pdf (1.9 MB)
md5:d23a4c0e75aab5b3f646503ed4cad5f6

Additional details

Funding

IMOCO4.E – Intelligent Motion Control under Industry 4.E (grant no. 101007311), European Commission