Published June 4, 2020 | Version v3
Software Open

Checkpointing vs. Supervision Resilience Approaches for Dynamic Tasks

Creators

Description

With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by work stealing of tasks, which is the target of two recent fault tolerance techniques: application-level checkpointing (ACP) and supervision/steal tracking (ST). These techniques have been devised for different task models: ACP for dynamic independent tasks (DIT), and ST for nested fork-join programs (FJ).

This paper transfers ST to the DIT setting, thus enabling a comparison between ACP and ST. The transfer includes several technical contributions. The comparison itself involves experiments, running time predictions, and simulations of job set executions. For both techniques, we consistently observe typical resilience overheads below 1%. The overheads are lower for ST in practically relevant cases, but ACP takes over at the order of millions of processes.

Files

artefact_CheckpointingVsSupervisionResilienceApproachesForDynamicTasks.zip

Files (315.0 kB)