Uncut-GEMMs: Communication-aware matrix multiplication on multi-GPU nodes
Creators
Description
General Matrix Multiplication (GEMM) is one of the most common kernels in high-performance computing (HPC) and machine-learning (ML) applications, where it frequently dominates execution time, making its performance vital. As multi-GPU nodes have become common in modern HPC systems, GEMM is usually offloaded to GPUs, since its compute-intensive nature is a good match for their architecture. However, although the GEMM kernel itself is typically compute-bound, execution on multi-GPU systems also requires fine-grained communication and task scheduling to achieve optimal performance. While numerous multi-GPU level-3 BLAS libraries have addressed these issues in the past, they are bound by older design concepts that do not necessarily apply to modern multi-GPU clusters, resulting in considerable deviation from peak performance.

In this work, we thoroughly analyze the current challenges of multi-GPU GEMM regarding data movement, caching, and communication/computation overlap, examine the shortcomings of previous solutions, and provide a fresh approach to multi-GPU GEMM optimization. We devise a static scheduler for GEMM, enabling a variety of algorithmic, communication, and auto-tuning optimizations, and integrate these in an end-to-end open-source multi-GPU GEMM library. Our library is evaluated on an NVIDIA HGX system with 8 NVIDIA A100 GPUs, achieving on average 1.37x and 1.29x performance improvements over state-of-the-art multi-GPU GEMM libraries, for double and single precision, respectively.
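To make the communication and overlap challenge concrete, the sketch below splits C = A × B column-wise across a node's GPUs and uses one CUDA stream per device, so that host-to-device transfers on one GPU can overlap with compute on the others. This is a minimal illustration written against plain cuBLAS, not the Uncut-GEMMs scheduler or the PARALiA-GEMMex API; the function name multi_gpu_dgemm and the even column split are assumptions made for the example.

```cpp
// Minimal multi-GPU DGEMM sketch (assumed helper, not the library's API):
// C = A * B in column-major layout, with B and C split column-wise across
// GPUs and A replicated. One stream per GPU overlaps copies with compute.
// Error checking is omitted for brevity; assume N % num_gpus == 0.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void multi_gpu_dgemm(int M, int N, int K,
                     const double* A, const double* B, double* C,
                     int num_gpus) {
  const double one = 1.0, zero = 0.0;
  const int Nloc = N / num_gpus;  // columns of B/C owned by each GPU
  std::vector<cublasHandle_t> handles(num_gpus);
  std::vector<cudaStream_t> streams(num_gpus);
  std::vector<double*> dA(num_gpus), dB(num_gpus), dC(num_gpus);

  for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaStreamCreate(&streams[g]);
    cublasCreate(&handles[g]);
    cublasSetStream(handles[g], streams[g]);
    cudaMalloc(&dA[g], sizeof(double) * M * K);
    cudaMalloc(&dB[g], sizeof(double) * K * Nloc);
    cudaMalloc(&dC[g], sizeof(double) * M * Nloc);
    // Column blocks are contiguous in column-major storage, so each slice
    // is a single copy. For true copy/compute overlap, A/B/C should live
    // in pinned host memory (cudaMallocHost); pageable copies serialize.
    cudaMemcpyAsync(dA[g], A, sizeof(double) * M * K,
                    cudaMemcpyHostToDevice, streams[g]);
    cudaMemcpyAsync(dB[g], B + (size_t)g * K * Nloc,
                    sizeof(double) * K * Nloc,
                    cudaMemcpyHostToDevice, streams[g]);
    // Enqueued on the same stream, so it runs after the copies complete.
    cublasDgemm(handles[g], CUBLAS_OP_N, CUBLAS_OP_N, M, Nloc, K,
                &one, dA[g], M, dB[g], K, &zero, dC[g], M);
    cudaMemcpyAsync(C + (size_t)g * M * Nloc, dC[g],
                    sizeof(double) * M * Nloc,
                    cudaMemcpyDeviceToHost, streams[g]);
  }
  for (int g = 0; g < num_gpus; ++g) {  // drain all streams, then clean up
    cudaSetDevice(g);
    cudaStreamSynchronize(streams[g]);
    cudaFree(dA[g]);
    cudaFree(dB[g]);
    cudaFree(dC[g]);
    cublasDestroy(handles[g]);
    cudaStreamDestroy(streams[g]);
  }
}
```

Even this naive decomposition exposes the problem the paper targets: the full A matrix is replicated to every GPU and the split is fixed regardless of interconnect topology, which is exactly the kind of redundant, topology-oblivious data movement that a communication-aware scheduler aims to avoid.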
Files
Name | Size | MD5 |
---|---|---|
Uncut_GEMMs_final_cluster2024-1.pdf | 723.5 kB | md5:9551fd983f6eb021e37e6b69410ac76f |
Additional details
Software
- Repository URL: https://github.com/p-anastas/PARALiA-GEMMex
- Programming languages: C, C++, CUDA
- Development status: Active