Conference paper Embargoed Access
Erol Gelenbe; Pawel Boryzsko; Miltiadis Siavvas; Joanna Domanska
We study programs that operate in the presence of possible failures and that must be restarted from the beginning after each failure. In such systems, checkpoints are introduced to reduce the large cost of program restarts when failures occur. We suggest that checkpoints should be placed so as to ensure effective reliability while keeping the computational overhead as low as possible and saving energy. We compute the total average program execution time in the presence of checkpoints, which limit re-execution after a failure to the part of the program that follows the most recent checkpoint. We also study the total energy consumption of the program under the same conditions, and formulate an optimization problem to minimize a weighted sum of the average computation time and the energy. This approach is placed in the context of Application Level Checkpointing and Restart (ALCR). We then focus on checkpoints placed at the beginning of a loop, and derive the optimum placement of checkpoints that minimizes a weighted combination of the program's execution time and energy consumption. Numerical results are presented to illustrate the analysis. Finally, we describe a software tool with a graphical interface, designed to assist a system designer in choosing the optimum checkpoint for a given program as a function of the failure rate and other parameters.
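The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the paper's model. It assumes Poisson failures with a constant rate, a program split into equal segments that each carry one checkpoint of fixed cost, and a restart from the most recent checkpoint at no extra cost. Under those assumptions the expected time to complete an uninterrupted period of length x is the standard result (e^(rate * x) - 1) / rate. The energy model (a constant average power plus a fixed energy per checkpoint) and every function and parameter name below are hypothetical, chosen only for illustration.

```python
import math


def expected_time(total_work, n_segments, ckpt_cost, failure_rate):
    """Expected run time with n equal segments, each including one checkpoint.

    Assumption: Poisson failures; a failure restarts only the current segment.
    """
    x = total_work / n_segments + ckpt_cost  # failure-free length of one segment
    if failure_rate == 0:
        return n_segments * x
    # Standard result for restart-on-failure under Poisson failures.
    return n_segments * math.expm1(failure_rate * x) / failure_rate


def weighted_cost(total_work, n_segments, ckpt_cost, failure_rate,
                  power, ckpt_energy, a, b):
    """Weighted sum a * time + b * energy (hypothetical energy model)."""
    t = expected_time(total_work, n_segments, ckpt_cost, failure_rate)
    # Assumption: constant average power, plus a fixed energy per checkpoint.
    energy = power * t + n_segments * ckpt_energy
    return a * t + b * energy


def best_segment_count(total_work, ckpt_cost, failure_rate,
                       power, ckpt_energy, a, b, max_segments=1000):
    """Brute-force search for the segment count with the lowest weighted cost."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: weighted_cost(total_work, n, ckpt_cost, failure_rate,
                                    power, ckpt_energy, a, b),
    )


if __name__ == "__main__":
    n = best_segment_count(total_work=1000.0, ckpt_cost=1.0, failure_rate=0.01,
                           power=1.0, ckpt_energy=0.5, a=1.0, b=1.0)
    print("segments:", n)
```

In this sketch, more checkpoints shorten the work lost per failure but add overhead in both time and energy, so the weighted cost has an interior minimum whenever the failure rate is positive. With a failure rate of zero the search returns a single segment, since extra checkpoints then only add cost.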
Files are currently under embargo but will be publicly accessible after December 21, 2022.