GPU Behavior Genome: Stable, Change-Sensitive Embeddings for Fleet-Level GPU Telemetry in NASA HPC
Description
GPU Behavior Genome (GBG) is a self-supervised representation learning system that produces a per-GPU fingerprint: a compact embedding that remains stable under normal workload drift yet reacts quickly to meaningful configuration, firmware, or cooling changes.
Unlike current DCGM dashboards and rule-based monitoring, GBG provides a semantic identity for each GPU across workloads and maintenance cycles. It enables:
- Early warning of degradation and misconfigurations with few-shot checks
- Fleet-scale forensics, answering “which nodes looked like those that later failed?”
- Cross-generation transfer across GPU families with safe onboarding for new architectures
- Adaptive verification via safety-aware contextual bandits that balance certainty with operational budgets
- Explainability through Integrated Gradients and TimeSHAP evidence packs for operator trust
Benchmarked against strong baselines (DCGM+rules, SR-CNN, Matrix Profile, LSTM-AE, Isolation Forest), GBG achieves high stability, accurate detection of staged changes, and efficient fleet-level operation with bounded overhead. Designed for NASA HPC clusters but generalizable to large-scale GPU fleets, GBG reframes monitoring from “threshold and react” to “fingerprint and verify.”
This work provides reproducibility artifacts, evaluation protocols, and deployment guidance, establishing a blueprint for embedding-centric GPU observability in mission operations and beyond.
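The "fingerprint and verify" idea can be illustrated with a minimal sketch. It is not the GBG implementation: it assumes each GPU already has an embedding vector, uses cosine similarity as a stand-in comparison metric, and invents the GPU ids, the toy 3-dimensional vectors, the 0.9 threshold, and the helper names `has_changed` and `most_similar` purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def has_changed(baseline, current, threshold=0.9):
    """Flag a GPU whose current embedding drifted away from its own baseline.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return cosine_similarity(baseline, current) < threshold


def most_similar(query, fleet, k=2):
    """Return the k GPU ids whose embeddings are closest to the query."""
    ranked = sorted(
        fleet,
        key=lambda gpu_id: cosine_similarity(query, fleet[gpu_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy fleet with made-up embeddings (assumption: real embeddings are
# higher-dimensional and produced by a trained encoder).
fleet = {
    "gpu-a": [1.0, 0.0, 0.0],
    "gpu-b": [0.9, 0.1, 0.0],
    "gpu-c": [0.0, 1.0, 0.0],
    "gpu-d": [0.0, 0.9, 0.3],
}

# Hypothetical embedding of a GPU shortly before a known failure.
failed_signature = [0.0, 1.0, 0.2]

lookalikes = most_similar(failed_signature, fleet, k=2)
drifted = has_changed(fleet["gpu-a"], fleet["gpu-c"])
```

Here `most_similar` plays the role of the forensic query ("which nodes looked like those that later failed?"), while `has_changed` compares a GPU against its own earlier fingerprint. A brute-force scan like this is only reasonable for small fleets; a real deployment would more likely use an approximate nearest-neighbor index.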
Files
| Name | Size |
|---|---|
| GPU_Behavior_Genome____Stable_Change_Sensitive_Embeddings_for_Fleet_Level_GPU_Telemetry_in_NASA_HPC.pdf (md5:800fd8d5b57daf1bd3b09913c670d9a7) | 379.9 kB |