Published June 29, 2025 | Version v1
Journal article Open

GPU Reliability in AI Clusters: A Study of Failure Modes and Effects

Authors/Creators

  • 1. Meta Platforms, Inc., USA

Description

This article presents a comprehensive examination of GPU reliability challenges in artificial intelligence clusters, addressing a critical gap in understanding how modern AI workloads affect accelerator hardware. It establishes a detailed taxonomy of GPU failure modes specific to AI workloads, with particular attention to thermal issues, power delivery instabilities, memory subsystem degradation, and manufacturing variations. It shows that the sustained high utilization characteristic of deep learning training creates unique stress patterns that accelerate hardware degradation through mechanisms distinct from those observed in traditional computing workloads. The article quantifies the cascading impacts of these failures on training convergence, model accuracy, system performance, and operational economics. To address these challenges, it develops and evaluates a suite of mitigation strategies spanning proactive monitoring techniques, predictive maintenance frameworks, fault-tolerant architectural designs, and software resilience mechanisms. Case studies across large-scale training clusters, edge deployments, and cloud environments provide contextual insights into how reliability varies across deployment modalities. The article offers both theoretical frameworks for understanding GPU reliability in AI contexts and practical recommendations for infrastructure operators seeking to improve system resilience without compromising computational performance. As AI hardware continues its rapid evolution toward higher power densities and greater architectural complexity, the reliability engineering approaches established here provide essential guidance for the sustainable scaling of AI infrastructure.

Files

SJECS-97-2025-298-306.pdf

Files (730.3 kB)

md5:4357dd4ab44dc7f719787b085d22732b