Published January 7, 2022 | Version: preprint
Journal article
Open
Experimental Findings on the Sources of Detected Unrecoverable Errors in GPUs
Creators
- 1. INRIA
- 2. STFC
- 3. UFRGS
- 4. Politecnico di Torino
Description
We investigate the sources of detected unrecoverable errors (DUEs) in graphics processing units (GPUs) exposed to a neutron beam. Illegal memory accesses and interface errors are among the more likely sources of DUEs. Enabling error-correcting code (ECC) increases the number of launch failure events. Our test procedure shows that ECC can reduce the DUEs caused by illegal address accesses by up to 92% for Kepler and up to 98% for Volta. In addition, we analyze whether compiler optimizations affect the distribution of DUE sources for matrix multiplication. We found that the machine code generated at the different optimization levels changes the distribution of DUE sources by no more than 24% on average.
Files
Name | Size |
---|---|
FINAL_VERSION-2.pdf (md5:042bb93beb5ac0084a10b799b727a947) | 2.7 MB |