Published September 27, 2025 | Version v1
Working paper Open

GPU Behavior Genome: Stable, Change-Sensitive Embeddings for Fleet-Level GPU Telemetry in NASA HPC

  • 1. National Aeronautics and Space Administration
  • 2. Brown University

Description

This working paper introduces GPU Behavior Genome (GBG), a self-supervised representation learning system that produces a per-GPU fingerprint: a compact embedding that remains stable under normal workload drift yet reacts quickly to meaningful configuration, firmware, or cooling changes.

Unlike current DCGM dashboards and rule-based monitoring, GBG provides a semantic identity for each GPU across workloads and maintenance cycles. It enables:

  • Early warning of degradation and misconfigurations with few-shot checks

  • Fleet-scale forensics, answering “which nodes looked like those that later failed?”

  • Cross-generation transfer across GPU families with safe onboarding for new architectures

  • Adaptive verification via safety-aware contextual bandits that balance certainty with operational budgets

  • Explainability through Integrated Gradients and TimeSHAP evidence packs for operator trust
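
As a rough illustration of the fleet-scale forensics query above ("which nodes looked like those that later failed?"), the sketch below ranks GPUs by the cosine similarity of their fingerprints to those of known-failed GPUs. It is not taken from the paper: the function names, the `fleet` dictionary, the GPU identifiers, and the 3-dimensional vectors are all hypothetical, and cosine similarity is assumed here only as a common choice of embedding distance.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nodes_resembling_failures(fleet, failed_ids, top_k=3):
    """Rank non-failed GPUs by their highest similarity to any failed GPU's fingerprint."""
    failed = [fleet[gpu_id] for gpu_id in failed_ids if gpu_id in fleet]
    if not failed:
        return []
    scores = []
    for gpu_id, embedding in fleet.items():
        if gpu_id in failed_ids:
            continue  # only score GPUs that have not failed
        best = max(cosine_similarity(embedding, f) for f in failed)
        scores.append((gpu_id, best))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]

# Hypothetical fingerprints; real embeddings would come from the learned model.
fleet = {
    "gpu-a": [1.0, 0.0, 0.0],
    "gpu-b": [0.9, 0.1, 0.0],
    "gpu-c": [0.0, 1.0, 0.0],
    "gpu-d": [0.0, 0.0, 1.0],
}
ranked = nodes_resembling_failures(fleet, failed_ids={"gpu-a"}, top_k=2)
for gpu_id, score in ranked:
    print(f"{gpu_id}: {score:.3f}")
```

In this toy fleet, `gpu-b` ranks first because its fingerprint points in nearly the same direction as that of the failed `gpu-a`. A production system would likely replace the linear scan with an approximate nearest-neighbor index.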

Benchmarked against strong baselines (DCGM+rules, SR-CNN, Matrix Profile, LSTM-AE, Isolation Forest), GBG achieves high stability, accurate detection of staged changes, and efficient fleet-level operation with bounded overhead. Designed for NASA HPC clusters but generalizable to large-scale GPU fleets, GBG reframes monitoring from “threshold and react” to “fingerprint and verify.”

This work provides reproducibility artifacts, evaluation protocols, and deployment guidance, establishing a blueprint for embedding-centric GPU observability in mission operations and beyond.

Files

GPU_Behavior_Genome____Stable_Change_Sensitive_Embeddings_for_Fleet_Level_GPU_Telemetry_in_NASA_HPC.pdf