Published September 26, 2024 | Version v1
Conference paper | Open access

Uncut-GEMMs: Communication-aware matrix multiplication on multi-GPU nodes

Description

General Matrix Multiplication (GEMM) is one of the most common kernels in high-performance computing (HPC) and machine-learning (ML) applications, frequently dominating their execution time and rendering its performance vital. As multi-GPU nodes have become common in modern HPC systems, GEMM is usually offloaded to GPUs, since its compute-intensive nature is a good match for their architecture. However, although the GEMM kernel itself is usually compute-bound, execution on multi-GPU systems also requires fine-grained communication and task scheduling to achieve optimal performance. While numerous multi-GPU level-3 BLAS libraries have addressed these issues in the past, they are bound by older design concepts that are not necessarily applicable to modern multi-GPU clusters, resulting in considerable deviation from peak performance. In this work, we thoroughly analyze the current challenges of multi-GPU GEMM regarding data movement, caching, and overlap, as well as the shortcomings of previous solutions, and provide a fresh approach to multi-GPU GEMM optimization. We devise a static scheduler for GEMM, enabling a variety of algorithmic, communication, and auto-tuning optimizations, and integrate them into an end-to-end open-source multi-GPU GEMM library. Our library is evaluated on an NVIDIA HGX system with 8 NVIDIA A100 GPUs, achieving on average a 1.37x and 1.29x performance improvement over state-of-the-art multi-GPU GEMM libraries, for double and single precision, respectively.
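
To make the communication/computation trade-off concrete, the sketch below shows a naive 1D column-split DGEMM over multiple GPUs using cuBLAS, with one CUDA stream per device so that the transfers and compute of different devices can overlap. It is an illustration of the problem setting only, not the Uncut-GEMMs scheduler: the function name split_dgemm, the GpuWork struct, the fixed column decomposition, and the assumption of pinned host buffers are all hypothetical, and error checking is omitted for brevity.

/* Minimal sketch: 1D column-split multi-GPU DGEMM with one CUDA stream per
 * device, so that data movement and compute of different GPUs overlap.
 * Assumes column-major A (MxK), B (KxN), C (MxN) in pinned host memory.
 * Illustrative only; not the Uncut-GEMMs scheduler. Error checks omitted. */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

typedef struct {
    cudaStream_t   stream;
    cublasHandle_t handle;
    double *dA, *dB, *dC;
    int n0, nloc;                 /* first owned column, number of owned columns */
} GpuWork;

void split_dgemm(int ngpus, int M, int N, int K,
                 double alpha, const double *A, const double *B,
                 double beta, double *C)
{
    GpuWork *w = (GpuWork *)malloc(ngpus * sizeof(GpuWork));

    /* Phase 1: enqueue copies and the local GEMM on every GPU's own stream. */
    for (int g = 0; g < ngpus; ++g) {
        w[g].n0   = (int)((long long)N * g       / ngpus);
        w[g].nloc = (int)((long long)N * (g + 1) / ngpus) - w[g].n0;

        cudaSetDevice(g);
        cudaStreamCreate(&w[g].stream);
        cublasCreate(&w[g].handle);
        cublasSetStream(w[g].handle, w[g].stream);

        cudaMalloc((void **)&w[g].dA, (size_t)M * K         * sizeof(double));
        cudaMalloc((void **)&w[g].dB, (size_t)K * w[g].nloc * sizeof(double));
        cudaMalloc((void **)&w[g].dC, (size_t)M * w[g].nloc * sizeof(double));

        /* A is replicated; B and C are split into contiguous column blocks. */
        cudaMemcpyAsync(w[g].dA, A, (size_t)M * K * sizeof(double),
                        cudaMemcpyHostToDevice, w[g].stream);
        cudaMemcpyAsync(w[g].dB, B + (size_t)w[g].n0 * K,
                        (size_t)K * w[g].nloc * sizeof(double),
                        cudaMemcpyHostToDevice, w[g].stream);
        cudaMemcpyAsync(w[g].dC, C + (size_t)w[g].n0 * M,
                        (size_t)M * w[g].nloc * sizeof(double),
                        cudaMemcpyHostToDevice, w[g].stream);

        /* Local GEMM on the same stream: C_loc = alpha*A*B_loc + beta*C_loc. */
        cublasDgemm(w[g].handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    M, w[g].nloc, K, &alpha,
                    w[g].dA, M, w[g].dB, K, &beta, w[g].dC, M);

        /* Result block back to the host, still asynchronously. */
        cudaMemcpyAsync(C + (size_t)w[g].n0 * M, w[g].dC,
                        (size_t)M * w[g].nloc * sizeof(double),
                        cudaMemcpyDeviceToHost, w[g].stream);
    }

    /* Phase 2: wait for all GPUs and release resources. */
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(w[g].stream);
        cudaFree(w[g].dA); cudaFree(w[g].dB); cudaFree(w[g].dC);
        cublasDestroy(w[g].handle);
        cudaStreamDestroy(w[g].stream);
    }
    free(w);
}

Even this coarse one-dimensional split lets the transfers of one GPU overlap with the computation of another; as the abstract describes, the paper's contribution lies in replacing such a fixed decomposition with a communication-aware static schedule, data caching, and auto-tuning.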

Files

Uncut_GEMMs_final_cluster2024-1.pdf (723.5 kB)
md5:9551fd983f6eb021e37e6b69410ac76f

Additional details

Funding

European Commission
HiDALGO2 – HPC and Big Data Technologies for Global Challenges (Grant agreement 101093457)

Software

Repository URL
https://github.com/p-anastas/PARALiA-GEMMex
Programming language
C, C++, CUDA
Development Status
Active