Silent Data Errors: Sources, Detection, and Modeling
- 1. Auburn University
- 2. Intel
- 3. University of Athens
Description
Chip manufacturers and hyperscalers are increasingly aware of the problem posed by Silent Data Errors (SDEs) and are taking steps to address it. Major computing facility operators such as Meta and Google have emphasized the critical role of SDEs in today’s microprocessors. Numerous studies in the literature have highlighted the severity of this issue, especially in datacenter applications operating at large scale. These errors can lead to data loss, and debugging them can take months of engineering effort. In this paper, we provide an overview of SDEs: we explain the problem, survey the current methods used to address it, and identify the gaps that remain. We also discuss the different sources of SDEs, including post-manufacturing testing failures, voltage and timing marginalities, and hard-to-detect faults. The paper emphasizes timing marginalities as a significant source of SDEs. Finally, we turn the spotlight on the architecture and system dimensions of the problem: we describe the challenges of measuring the true (still unknown) rates of SDEs from CPUs and emphasize the role of detailed microarchitectural simulation models for this purpose. We present data on the severity of SDEs and their predicted rates under various operating conditions, sources of faults, and technology fabrication nodes.
Files
vts2023_singh.pdf (4.7 MB)
md5:bca9814fda3f272ea4bbaad1725ab549
Additional details
Funding
- NEUROPULS – NEUROmorphic energy-efficient secure accelerators based on Phase change materials aUgmented siLicon photonicS 101070238
- European Commission
- REBECCA – Reconfigurable Heterogeneous Highly Parallel Processing Platform for safe and secure AI 101097224
- European Commission
- Vitamin-V – Virtual Environment and Tool-boxing for Trustworthy Development of RISC-V based Cloud Services 101093062
- European Commission