Published May 20, 2024 | Version v1
Conference proceeding Open

Silent Data Corruptions in Computing Systems: Early Predictions and Large-Scale Measurements

  • 1. National and Kapodistrian University of Athens
  • 2. Meta (United States)

Description

Silent Data Corruptions (SDCs) due to defects in computing chips (CPUs, GPUs, AI accelerators) are a critical threat to the quality of large-scale computing in different application domains: cloud computing, high-performance computing, and edge computing. Recent public reports by cloud hyperscalers have emphasized that, apart from the usual suspects for SDCs (memory, storage, network), the processing elements of all types, which are the heart of the computations, generate an unexpectedly high rate of SDCs, which can cause erroneous calculations and severe information loss. We report, in a consolidated form, recent efforts to correlate early microarchitecture-level simulation-based predictions about the likelihood, rates, severity, and root causes of SDCs with large-scale in-field studies in cloud data centers. Early microarchitecture-level prediction of SDC characteristics (susceptible units, workloads, instructions) can shed light on the cryptic problem of SDCs. The findings of a diligent pre-silicon analysis can assist a better understanding of SDCs and can thus drive effective protection decisions, at either the hardware or the software level, at deployment stages.

Files

ets2024_gizopoulos.pdf (362.2 kB)
md5:c1ffe1e97b079d04bc3bc20c3fa589e7

Additional details

Funding

Vitamin-V – Virtual Environment and Tool-boxing for Trustworthy Development of RISC-V based Cloud Services 101093062
European Commission
NEUROPULS – NEUROmorphic energy-efficient secure accelerators based on Phase change materials aUgmented siLicon photonicS 101070238
European Commission