Published January 7, 2022 | Version pre print
Journal article Open

Experimental Findings on the Sources of Detected Unrecoverable Errors in GPUs

  • 1. INRIA
  • 2. STFC
  • 3. UFRGS
  • 4. Politecnico di Torino

Description

We investigate the sources of detected unrecoverable errors (DUEs) in graphics processing units (GPUs) exposed to a neutron beam. Illegal memory accesses and interface errors are among the most likely sources of DUEs. Enabling error-correcting code (ECC) increases the number of launch failure events. Our test procedure shows that ECC can reduce the DUEs caused by Illegal Address accesses by up to 92% for Kepler and by up to 98% for Volta. In addition, we analyze whether compiler optimizations affect the distribution of DUE sources for matrix multiplication. We found that the machine code generated at different optimization levels changes the distribution of DUE sources by no more than 24% on average.

Files

FINAL_VERSION-2.pdf (2.7 MB)
md5:042bb93beb5ac0084a10b799b727a947

Additional details

Funding

European Commission: PERIOD – Pursuing Efficient Reliability of Object Detection for automotive and aerospace applications (grant 886202)