Optimum Checkpoints for Programs with Loops

Checkpoints are widely used to improve the performance of computer systems and programs in the presence of failures, and significantly reduce the cost of restarting a program each time that it fails. Application level checkpointing has been proposed for programs which may execute on platforms which are prone to failures, and also to reduce the execution time of programs which are prone to internal failures. Thus we develop a mathematical model to estimate the average execution time of a program in the presence of failures, without and with application level checkpointing, and use it to estimate the optimum interval in number of instructions executed between successive checkpoints. The case of programs with loops and nested loops is also discussed. The results are illustrated with several numerical examples.


Introduction
Cloud and Fog Computing allow diverse software applications to run on complex interconnected systems where reliability and security can be of significant concern. Major failures occur in such systems [1] due to complex effects between various factors, including human decisions and systemic interactions in the architecture, the software systems, and the network connections [2]. Furthermore, a recent report [3] states that "The main problems affecting the cloud are insecure interface APIs, shared resources, data breaches, malicious insiders, and misconfiguration issues" including active adversarial mechanisms [4]. Clearly, Cloud providers will do their best to improve the security and reliability of their platforms. However, we also need methods that can limit the average execution time of applications that run on the Cloud and Fog despite the intermittent failures of the platforms. This is particularly of interest for long-running applications or those that are run frequently and repeatedly.
One such mechanism that we investigate in this paper is the Application Level Checkpoint and Restart (ALCR) that is widely used to enhance the reliability of long-running programs [5][6][7] by periodically saving a copy or checkpoint of the current execution state of software. The most recent copy is then used to restart program execution in case of failure. Originally developed for transaction-oriented systems and databases [8][9][10][11][12], it has been widely adopted to improve the reliability of modern High Performance Computing (HPC) [13,14] software.
Long intervals of time between checkpoints will increase the overhead associated with system restart, while short intervals will increase the overhead caused by the checkpoints themselves. The checkpoint interval must then be optimized so as to minimize a program's expected execution time in the presence of failures [15][16][17]. In [18,19] the impact of asynchronous checkpointing strategies on the performance of distributed systems has been studied. Among the existing checkpointing strategies, ALCR [5,20] uses a small memory footprint [6,7], but requires significant expertise for the selection of source code locations in which checkpoints should be inserted. Yet existing ALCR tools and libraries facilitate the insertion of checkpoints in long-running loops, since computational loops constitute a significant source of failure-related re-executions [21,22]. However such tools do not provide a method to select the inter-checkpoint interval which has a significant influence on the average execution time of software.
In this paper we propose that the inter-checkpoint interval in a specific loop be selected optimally, based on a mathematical model, as a function of the program failure rate, the execution cost of establishing a checkpoint, and the execution time related to restarting the program after a failure. We suggest that this approach can be implemented as an API within an ALCR tool, to select the optimum checkpoint interval in program loops.
In the sequel, Section 2 reviews earlier work. Section 3 provides examples to help understand the ALCR mechanism and its associated costs. Section 4 describes the mathematical model and the numerical approach. The optimum checkpoint interval is discussed in Section 4.3. Section 5 presents numerical examples and Section 6 presents conclusions and future research.

Related Work
If no scheme is adopted to enhance the performance of a transaction oriented system in the presence of failures, all previously executed transactions would need to be re-executed in case of a failure. The Checkpoint and Rollback/Recovery mechanism saves a secure and faithful copy of the system state at predetermined instants (the checkpoints); in addition in the case of transaction oriented systems, it will save an "audit trail" of the sequence of transactions that were executed since the most recent checkpoint was established. In case of a failure, only the transactions that were saved in the audit trail since the most recent checkpoint are re-executed [10]. Multiple level checkpoints were introduced in [9,15] to deal with hierarchies of failures, and are also discussed in [23].
The selection of the optimum checkpoint interval (OCI) between two successive checkpoints will maximize the overall system or program availability [11], defined as the fraction of time when the system is available for useful operations. A badly chosen checkpoint interval results in high system response times and long average execution times [24,25]. Therefore, much research has focused on how system and failure rate parameters affect its value, as well as on providing formulas for the calculation of its optimum value [8,12,26]. More specifically, the first attempt at determining the OCI was made by Young [8], who provided a first-order approximation of the OCI using analytical methods. Another notable attempt was made by Daly [26], who extended Young's work [8] by relaxing the first-order assumptions and consequently providing a higher-order approximation of the OCI. In particular, Daly provided a perturbation solution for the optimization problem of selecting the inter-checkpoint interval and showed (through simulation) that the higher-order model leads to more accurate estimations of the OCI compared to first-order models. Finally, significant contributions in the field of OCI calculation were made by Gelenbe et al. [9][10][11][12].
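For orientation, the two approximations mentioned above can be evaluated in a few lines. The sketch below uses the closed forms in which the estimates of Young [8] and Daly [26] are commonly quoted; the function names, the parameter names `checkpoint_cost` and `mtbf` (mean time between failures), and the sample values are illustrative assumptions and do not come from this paper.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def daly_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Higher-order estimate in the form usually quoted from Daly's analysis."""
    if checkpoint_cost >= 2.0 * mtbf:
        # Checkpoints are so expensive that the estimate falls back to the MTBF.
        return mtbf
    ratio = checkpoint_cost / (2.0 * mtbf)
    series = 1.0 + math.sqrt(ratio) / 3.0 + ratio / 9.0
    return math.sqrt(2.0 * checkpoint_cost * mtbf) * series - checkpoint_cost


if __name__ == "__main__":
    # Assumed sample values: checkpoint cost 2 and MTBF 100, in the same time unit.
    print(young_interval(2.0, 100.0), daly_interval(2.0, 100.0))
```

Both estimates are expressed in units of time, whereas the model developed later in this paper works with the number of instructions executed between checkpoints.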
Software applications are also often hampered by failure-provoking implementation issues [27]. Fault tolerance mechanisms are required to enhance their reliability [28,29], and checkpointing is a useful solution [5,13]. However, modern applications are considerably more complex than early transaction-oriented systems [30]. Therefore, a periodic copy of their overall execution state should be taken, in order to enhance their reliability [21].
In [22], CPPC, an ALCR tool, is presented to reduce the manual effort required from developers, by automatically identifying judicious locations in which checkpoints can be introduced (in fact, long-running loops) and inserting checkpoints in the identified locations. In [21] an application-level checkpointing solution for hybrid MPI-OpenMP applications is suggested as an extension of CPPC. In [7] a library (CRAFT) was proposed for incorporating the application-level CR mechanism into software implemented in C++. Similarly to CPPC [22], the proposed library reduces the development time associated with the ALCR mechanism, by identifying lengthy loops and automatically inserting application-level checkpoints [43,44]. In [6] a tool named ITALC is proposed to assist developers in semi-automatically re-engineering software by introducing application-level checkpoints in automatically identified hotspots.
One shortcoming of existing ALCR tools and libraries is that they do not provide recommendations regarding the optimum checkpoint interval, which is selected manually by the developers, usually in an arbitrary manner. However, selecting the checkpoint interval is a challenging problem for all CR mechanisms regardless of type, since a suboptimal choice may lead to significant execution time overheads. In fact, several studies have examined the impact of checkpointing and especially of the checkpoint interval selection on the performance of software applications. For instance, in [45] Oldfield et al., using Daly's model [26] for the calculation of the OCI, examined the impact of application-directed checkpointing on next generation massively parallel processing (MPP) systems, which comprise hundreds of thousands of processors. The results of their study indicate that in the next generation systems traditional CR mechanisms will increasingly impact application performance, accounting for 50% of the total application execution time.
In [46], Stavrinides and Karatza investigated the impact of application-directed checkpointing on the performance of the Software-as-a-Service (SaaS) cloud. More specifically, they examined via simulation the impact of checkpoint interval selection on the scheduling performance of real-time fine-grained parallel applications with firm deadlines and approximate solutions that run on SaaS cloud, under different failure probabilities. The results of their study suggest that the checkpoint interval must be selected taking into account the failure probability, as well as the nature of the workload. In addition, the selected checkpoint interval should be above a specific threshold as unnecessarily frequent checkpointing may lead to performance degradation.
Based on the above analysis, the arbitrary selection of the checkpoint interval should be avoided, due to its potential impact on the application performance. To this end, we provide a numerical approach for the calculation of the optimum checkpoint interval of long-running loops, that minimizes the expected execution time of software applications that adopt the ALCR mechanism. The proposed method can be used along with the existing ALCR libraries, in order to enhance the performance of software applications.
Figure 1: The additional code that should be inserted for adding an application-level checkpoint in a lengthy loop, using CRAFT (Adapted from [7]).

Indicative Examples
In this section, some examples are provided regarding the changes that should be made to the source code of a software application in order to add checkpoints into long-running loops, using actual ALCR libraries. Their purpose is to help the reader understand the overall concept of the ALCR mechanism, and also to explain why the arbitrary selection of the checkpoint interval may negatively affect the execution time of a software application. These examples are also expected to facilitate the understanding of the proposed mathematical model that is presented in Section 4, as well as the numerical examples that are provided in Section 5. Figure 1 demonstrates how the source code of a software application should be modified in order to insert an application-level checkpoint in a long-running loop. As can be seen from the given example, a non-trivial number of library-specific statements (marked in red) should be added to the program in order to insert a checkpoint in the selected loop. The example is based on a specific ALCR library called CRAFT [7]. However, similar modifications are also required by other ALCR libraries, as can be seen from Figure 2.
The examples in Figures 1 and 2 show that the insertion of checkpoints into long-running loops necessitates significant source code modifications. In fact, a number of statements should be inserted in order to determine: (i) the location of the checkpoints, (ii) what data should be safely stored every time a checkpoint is generated, and (iii) how frequently a checkpoint should be generated (e.g. the value of cpFreq in Figure 1). These additional statements are calls to methods of a specific library, and may require several operations with significant execution time. For instance, the updateAndWrite() method in Figure 1 is expensive in execution time, since the creation of a checkpoint usually requires multiple memory accesses [6,7]. Thus the execution time of checkpointing must be added to the overall execution time of the software application. If failures are very infrequent and the cost of checkpointing is high, frequent checkpoints will result in an average execution time that is higher than that of the same application running without checkpoints. The total execution time of a program which contains checkpoints in its loops is therefore derived in Section 4, and numerical examples are provided in Section 5.
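The three decisions listed above (where to checkpoint, what to save, and how often) can be seen in a minimal, library-independent sketch. It is written in Python rather than with CRAFT, and it does not reproduce any CRAFT call: the function `run_with_checkpoints`, its parameters `cp_freq`, `cp_path` and `fail_at`, and the use of `pickle` for storage are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(n_iterations, cp_freq, cp_path, fail_at=None):
    """Sum 0..n_iterations-1, saving the loop state every cp_freq iterations."""
    if os.path.exists(cp_path):
        # Restart: resume from the most recent checkpoint.
        with open(cp_path, "rb") as fh:
            start, total = pickle.load(fh)
    else:
        start, total = 0, 0
    for i in range(start, n_iterations):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated platform failure")
        total += i  # useful work of one iteration
        if (i + 1) % cp_freq == 0:
            # Write to a temporary file, then rename, so that a failure
            # during the write cannot corrupt the previous checkpoint.
            tmp_path = cp_path + ".tmp"
            with open(tmp_path, "wb") as fh:
                pickle.dump((i + 1, total), fh)
            os.replace(tmp_path, cp_path)
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "loop.ckpt")
        try:
            run_with_checkpoints(100, 10, path, fail_at=57)
        except RuntimeError:
            pass  # the "platform" failed; the work of iterations 50..56 is lost
        print(run_with_checkpoints(100, 10, path))  # resumes at iteration 50
```

Every checkpoint here costs a serialization and a file write, which is the per-checkpoint overhead that has to be weighed against the amount of work lost on a failure.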

Expected Execution Time of a Program Without and With Checkpoints
Consider a program P that executes a total of M instructions; it may contain loops, so M counts every instruction that is actually executed. Assume that when the execution starts, there is an overhead associated with loading its data and code into memory, which consumes A time units. If the program is executed without any errors or failures, and if each instruction is executed in c time units, then the total execution time for P will be:
Now suppose that no failures or errors occur during the initial and final durations A and B, but that a failure may occur in any one of the instructions with probability g. We assume that a failure is detected after a delay of δ time units. The main notations and parameters used in this section are summarized in Table 1.

Expected Execution Time Without Checkpoints
When a failure is detected, the program has to be re-executed, and if failures occur during further executions, the execution may have to be repeated several times. Let τ(P) denote the total execution time of the program, and let $E\tau(P)$ be its expected value. Then:
If a failure occurs after u instructions have been executed, this only becomes known after δ time units; the program then has to be restarted and run again, so that the time $A + c \cdot u + \delta$ has been wasted. When there are no failures we see from (3) that since:
When g is very small so that $gM \ll 1$, we can use the following approximation directly from (3):
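One way to arrive at a closed form that matches this description is a standard renewal argument. The derivation below is a sketch under our own reading of the model, not a transcription of the paper's numbered equations: it assumes that instructions fail independently with probability g, that a failure at the u-th instruction wastes $A + c\,u + \delta$ before a full restart, and that a failure-free run costs $T_0 = A + cM + B$, with B taken to be a fixed final overhead.

```latex
\begin{align*}
E &= (1-g)^M\,T_0 \;+\; \sum_{u=1}^{M} g\,(1-g)^{u-1}\bigl(A + c\,u + \delta + E\bigr),
   \qquad T_0 = A + cM + B,\\[4pt]
E &= A + B + \Bigl(A + \delta + \frac{c}{g}\Bigr)\Bigl[(1-g)^{-M} - 1\Bigr],\\[4pt]
E &\approx T_0 + gM\Bigl(A + \delta + \frac{c\,(M+1)}{2}\Bigr)
   \qquad \text{when } gM \ll 1 .
\end{align*}
```

The second line follows by solving the first for E and summing the geometric series; letting $g \to 0$ in it recovers $T_0$, and the third line keeps only the terms of first order in g.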

Estimating the Failure Probability g
In order to use the above expressions, we will need $(1-g)^M$, the probability that no error or failure occurs during the program's execution; the probability that at least one failure occurs during the program's execution is $F = 1 - (1-g)^M$. Note that the notion of a failure in this case is that of any event that stops the execution of the program and which arises from the program's execution environment, i.e. the platform. If $gM \ll 1$ then $F \approx gM$.
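The quality of the approximation $F \approx gM$ is easy to check numerically. The short Python sketch below evaluates $F = 1 - (1-g)^M$ in a way that stays accurate for very small g; the function name and the sample values of g and M are assumptions chosen for illustration.

```python
import math


def failure_probability(g: float, M: int) -> float:
    """F = 1 - (1 - g)^M, computed stably for very small g."""
    return -math.expm1(M * math.log1p(-g))


if __name__ == "__main__":
    for g, M in [(1e-8, 100_000), (1e-5, 100_000), (1e-3, 10_000)]:
        print(f"g*M = {g * M:g}   F = {failure_probability(g, M):.6f}")
```

The approximation is only useful in the first regime; once gM approaches or exceeds 1, gM is no longer a probability and the exact expression has to be used.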
The value of g can be estimated as follows. Take a simple linear code that executes M instructions, and then repeats the execution, i.e. a single loop containing M sequential instructions. This code should not contain any ALCR or other checkpointing constructs.
1. Run the program repeatedly. Each time the program returns to the first instruction, increment the counter $N \leftarrow N + 1$.
2. If the program execution stops, increment a counter $N_F \leftarrow N_F + 1$. Update $g \approx N_F / (N \cdot M)$.
3. Then restart the program at its initial instruction and set $N \leftarrow N + 1$.
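The counting procedure above can be tried out against a simulated platform. In the Python sketch below the "platform" is a random number generator with a hidden per-instruction failure probability `true_g`; that parameter, the function name, and the number of passes are assumptions for illustration, since a real measurement would observe actual program stops instead.

```python
import random


def estimate_g(true_g: float, M: int, passes: int, seed: int = 0) -> float:
    """Estimate g as N_F / (N * M) from repeated passes over M instructions."""
    rng = random.Random(seed)
    p_pass_fails = 1.0 - (1.0 - true_g) ** M  # chance that one pass is interrupted
    N = 0    # number of times the program (re)starts at its first instruction
    N_F = 0  # number of passes that were stopped by a failure
    for _ in range(passes):
        N += 1
        if rng.random() < p_pass_fails:
            N_F += 1
    return N_F / (N * M)


if __name__ == "__main__":
    print(estimate_g(true_g=1e-6, M=1000, passes=200_000))
```

The estimator treats every pass as M executed instructions, so it is only reliable when gM is small and enough passes are run to observe a reasonable number of failures.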

Optimum Checkpoints
When the program must run for a long time, i.e. when M is large and the probability of failure Mg cannot be neglected, checkpoints can be placed at periodic intervals, say after every K executed instructions, but each of them incurs a cost B(K) in the amount of time needed to create the checkpoint, since the status of the program and all its data must be saved. B(K) may be an increasing function of K when the data that the program has modified during the interval of execution of K instructions needs to be saved. Thus the program will now execute a total of M instructions in $b(M, K) = M/K$ successive blocks, all of which are of length K, except possibly the last one, which contains the remaining instructions. Applying the previous analysis, we compute the total average execution time of the program with checkpoints:
Therefore the optimum checkpoint interval $K^*$ is the value of K that minimizes $E^{cp}\tau(P)$, which can be computed numerically from (17). In order to better illustrate the benefit of the ALCR we also define the percentage Gain:
where $E\tau(P)$ is the expected execution time of the program (or software application) P when ALCR is not used.
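The numerical search for $K^*$ can be sketched as follows. The block-time expression used here is our own simplified model and is not the paper's equation (17): it assumes independent instruction failures with probability g, a detection delay δ and a reload cost A on every failure, re-execution of only the current block, a checkpoint cost `cp_cost(K)` after every block except the last, and no final overhead. All function names and sample values are likewise assumptions.

```python
import math


def block_time(k, g, c, delta, A, end_cost):
    """Expected time to complete k instructions and then pay end_cost."""
    growth = math.expm1(-k * math.log1p(-g))  # (1 - g)^(-k) - 1
    return end_cost + (A + delta + c / g) * growth


def expected_time(M, K, g, c, delta, A, cp_cost):
    """Expected run time with a checkpoint after every K instructions."""
    full, rest = divmod(M, K)
    if rest == 0:
        full, rest = full - 1, K  # the final block needs no checkpoint
    return (A + full * block_time(K, g, c, delta, A, cp_cost(K))
            + block_time(rest, g, c, delta, A, 0.0))


def optimum_interval(M, g, c, delta, A, cp_cost):
    """Exhaustive search for the K in 1..M with the smallest expected time."""
    return min(range(1, M + 1),
               key=lambda K: expected_time(M, K, g, c, delta, A, cp_cost))


if __name__ == "__main__":
    params = dict(g=1e-4, c=1.0, delta=5.0, A=50.0, cp_cost=lambda K: 20.0)
    M = 10_000
    k_star = optimum_interval(M, **params)
    no_cp = expected_time(M, M, **params)  # K = M means no checkpoint at all
    with_cp = expected_time(M, k_star, **params)
    print(k_star, no_cp, with_cp, 100.0 * (no_cp - with_cp) / no_cp)
```

Exhaustive search is affordable here because each evaluation is a closed-form expression; for very large M a coarser grid or a scalar minimizer over K would be the natural replacement. The last printed value is a percentage gain defined as the relative reduction with respect to the run without checkpoints, which is an assumed definition.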

Program with a Long Loop
Suppose that a program contains a single loop with L instructions that is executed n times, so that the program executes $M = nL$ instructions. If a checkpoint is inserted every I loop iterations, so that the block of executed instructions between checkpoints is of length $K = IL$, then a total of $b(nL, IL) = n/I - 1$ checkpoints are placed, since the start of the loop will in itself require a checkpoint. From equation (9) with $M = nL$ and $K = IL$ we have:
If the number of instructions that are executed during a single loop iteration is L, the optimum number of iterations between two successive checkpoints is $I^* = K^*/L$.
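Since ALCR libraries express the checkpoint frequency in loop iterations, the search can be carried out directly over I. The sketch below does so under the same simplified failure model that we assume for illustration (independent instruction failures with probability g, detection delay δ, reload cost A, re-execution of the current block only, no checkpoint after the final block); it does not implement the paper's own equation, and all names and values are hypothetical.

```python
import math


def expected_loop_time(n, L, I, g, c, delta, A, cp_cost):
    """Expected run time of n iterations of L instructions, checkpoint every I iterations."""
    scale = A + delta + c / g

    def block(iterations, end_cost):
        k = iterations * L  # instructions executed between two checkpoints
        return end_cost + scale * math.expm1(-k * math.log1p(-g))

    full, rest = divmod(n, I)
    if rest == 0:
        full, rest = full - 1, I  # no checkpoint is needed after the last block
    return A + full * block(I, cp_cost(I * L)) + block(rest, 0.0)


def optimum_iterations(n, L, g, c, delta, A, cp_cost):
    """Return the I in 1..n that minimizes the expected run time."""
    return min(range(1, n + 1),
               key=lambda I: expected_loop_time(n, L, I, g, c, delta, A, cp_cost))


if __name__ == "__main__":
    i_star = optimum_iterations(n=1000, L=100, g=1e-5, c=1.0, delta=5.0,
                                A=50.0, cp_cost=lambda K: 20.0)
    print(i_star, i_star * 100)  # I* and the corresponding K* = I* * L
```

Searching over integer I avoids having to round $K^*/L$ afterwards, which matters when L is large compared with $K^*$.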

Nested Loops
Suppose that we identify, either manually or using an ALCR library, that the best location for adding checkpoints is a loop that contains one or more internal loops. These internal loops can be treated in a black-box manner as normal statements (e.g. method calls), which require the execution of a number of instructions. The number of instructions executed in the internal loops can be used to calculate the values of L and M of the selected outer loop, yielding the optimum number of loop iterations between checkpoints $I^*$.
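The black-box treatment amounts to folding the instruction counts of the inner loops into the value of L of the outer loop. A minimal Python sketch is given below; the nested tuple representation `(own_instructions, [(iterations, inner_body), ...])` and the sample counts are hypothetical and only serve to make the arithmetic explicit.

```python
def instructions_per_iteration(body):
    """Instructions executed by one iteration of a loop body.

    body = (own_instructions, inner_loops), where inner_loops is a list of
    (iterations, inner_body) pairs and inner_body has the same shape.
    """
    own, inner_loops = body
    return own + sum(n * instructions_per_iteration(inner) for n, inner in inner_loops)


if __name__ == "__main__":
    innermost = (10, [])              # 10 instructions, no nested loop
    middle = (4, [(50, innermost)])   # 4 own instructions + 50 inner iterations
    outer = (20, [(5, middle)])       # 20 own instructions + 5 middle iterations
    L = instructions_per_iteration(outer)
    n = 1000                          # assumed number of outer iterations
    print(L, n * L)                   # L and M = n * L for the outer loop
```

This assumes that the iteration counts of the inner loops are fixed; if they vary between outer iterations, an average count would have to be used instead.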
Another possible approach, especially in the case of nested loops with multiple levels of nesting, is to treat the nested loops individually. In this case, checkpoints could be generated in the innermost loop. However the latter approach would require additional development effort, in order to map iterations to actual instructions, since ALCR libraries usually focus on iterations.

Numerical Examples
In this section, the effect of the checkpoint interval K on the expected execution time of a software application is illustrated through a set of numerical examples. More specifically, the case of a software application with a single loop is considered, and the analysis is repeated for different loop sizes (i.e., different values of M). For each one of these cases, the expected execution time of the same program with and without the adoption of the ALCR mechanism is calculated, while the optimum checkpoint interval is computed based on the mathematical model proposed in Section 4. In particular, the results presented in this section were obtained through simulation, by implementing the mathematical equations presented in Section 4 in MATLAB, and by solving the corresponding optimization problem numerically.
For the purposes of the present experiment, four cases of loop sizes were considered. In particular, we considered the cases of a small, medium, large, and very large loop, comprising $10^3$, $10^4$, $10^5$, and $10^7$ iterations respectively. We selected these values of M so that there would be at least one order of magnitude of difference between the loop sizes of the different numerical examples. The reasoning behind this choice is that we wanted to examine whether the same behavior is observed for significantly different loop sizes.
In Figure 3, the case of a software application with a relatively small loop having M = 1000 is presented. Figure 3a compares the expected execution time of the application with and without the ALCR mechanism for different values of K, while Figure 3b shows the expected Gain of Section 4.3 for different values of K. The values that correspond to the optimum checkpoint interval $K^*$ are marked within a rectangle. Figure 3 illustrates the fact that the optimum checkpoint interval $K^*$ minimizes the overall execution time of the application and maximizes the overall expected Gain. The ALCR mechanism will therefore not reduce the expected execution time of a given software application unless the checkpoint interval is suitably selected. Indeed, for some poorly chosen values of K, the expected execution time of the application with checkpointing is higher than the expected execution time of the same application without checkpoints. Similar observations can be made for software with longer loops in Figures 4, 5 and 6. This emphasizes the importance of setting K to be close to, or at, $K^*$.
The examples of Figures 3, 4, 5 and 6 show that a significant reduction in the execution time of a software application can be achieved by the ALCR mechanism, if the checkpoint interval is selected to be at, or close to, the optimum $K^*$. In these examples the Gain ranges from 40% to 60%. However, suboptimal values of the checkpoint interval will lead to a smaller Gain or even to an average execution time which is larger than when ALCR is not used. Indeed, the checkpoint interval should not be selected arbitrarily and must be tuned to a value at, or close to, the optimum $K^*$.
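The qualitative behavior described above, a positive Gain near the optimum and a negative Gain for badly chosen intervals, can be reproduced with a small sweep over K. The Python sketch below is not the MATLAB code behind the figures and is not meant to reproduce their values: it uses a simplified failure model of our own (independent instruction failures, detection delay δ, reload cost A, constant checkpoint cost, no checkpoint after the final block), an assumed Gain defined as the relative reduction in expected time, and parameter values chosen only for illustration.

```python
import math


def expected_time(M, K, g, c, delta, A, cp_cost):
    """Expected run time with a checkpoint after every K instructions (assumed model)."""
    scale = A + delta + c / g

    def block(k, end_cost):
        return end_cost + scale * math.expm1(-k * math.log1p(-g))

    full, rest = divmod(M, K)
    if rest == 0:
        full, rest = full - 1, K  # no checkpoint after the final block
    return A + full * block(K, cp_cost) + block(rest, 0.0)


def gain_percent(M, K, g, c, delta, A, cp_cost):
    """Relative reduction (in %) with respect to running without checkpoints."""
    without = expected_time(M, M, g, c, delta, A, cp_cost)
    return 100.0 * (without - expected_time(M, K, g, c, delta, A, cp_cost)) / without


if __name__ == "__main__":
    params = dict(g=1e-4, c=1.0, delta=5.0, A=50.0, cp_cost=20.0)
    for K in (1, 10, 100, 625, 5000, 10_000):
        print(f"K = {K:6d}   Gain = {gain_percent(10_000, K, **params):9.1f} %")
```

With these assumed values, checkpointing after every instruction or every few instructions costs far more than it saves, while intervals of a few hundred instructions give a clear benefit.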

Conclusions and Future Work
This paper has proposed a method for setting the checkpoint intervals in ALCR for software applications which contain long-running loops, and which run on platforms that are subject to failures. We have shown that the optimum checkpoint interval, which minimizes the expected execution time of the program, depends on various parameters which can be incorporated into a single numerical expression. The expression can then be used as part of an ALCR tool to compute the optimum checkpoint interval for each individual loop in the program. The approach can be used through a set of MATLAB scripts that calculate the optimum checkpoint interval of different computational loops. We have illustrated our results via different numerical examples.
Several directions for future work can be considered. Energy consumption is a critical property that will be affected by checkpointing, as well as by the re-execution of a program in case of failure. Thus in recent work [47], checkpointing was considered from the perspective of its effect on energy consumption. Further research is needed to see how the checkpoint interval can be selected to achieve a compromise between energy consumption and execution times. The issue can become quite intricate if we consider the effect of the secondary memory medium. Since checkpointing will generally increase the use of secondary memory, and secondary memory related failures may increase with the amount of usage, a platform where many applications use ALCR may have a failure rate which increases with usage and age. Thus a time dependent value of the failure probability g will need to be considered in this case. Furthermore, with rotating secondary memories and some other devices, the use of ALCR will also increase energy consumption. These are all questions that require further research. In addition, an important requirement of time-critical software applications, apart from their overall performance and energy consumption, is the satisfaction of strict deadlines. More specifically, time-critical programs that run on systems like SaaS and HPC clouds often require their execution to be completed before specific deadlines. Failure to meet these time restrictions often leads to the rejection of their results. Hence, meeting these deadlines is critical, and it can be affected (either positively or negatively) by the adoption of checkpointing. Therefore, in the future, we are planning to take this requirement into account in the calculation of the optimum checkpoint interval, along with the performance and energy consumption of the applications.
Finally, another interesting area of investigation is the use of ALCR to restart applications after attacks. Indeed, ALCR could perhaps be used to disrupt the attacker, but in turn attackers may exploit ALCR to create an increase in workload in a system, leading to a form of Denial of Service through workload saturation. Thus the interaction of ALCR and checkpointing in general, and security, is also a worthwhile subject of investigation.