There is a newer version of the record available.

Published June 4, 2020 | Version v1
Technical note Open

Checkpointing vs. Supervision Resilience Approaches for Dynamic Tasks

Creators

Description

With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by work stealing of tasks, which is the target of two recent resilience techniques. The first adopts application-level checkpointing and keeps checkpoints consistent after steals. The second combines supervision with steal tracking: it lets parent tasks supervise and restart their children, and it identifies intact subtasks from distributed history information. These techniques have been designed for different task models.

This paper transfers steal tracking to the other task model, thus enabling a comparison. Contributions include the choice of supervisors and the definition of history information. The comparison itself involves experiments, running time predictions, and simulations of job set executions. We consistently observe lower overheads for steal tracking in practically relevant cases, but application-level checkpointing becomes the cheaper option at scales on the order of millions of processes. Both techniques exhibit typical resilience overheads below 1%.

Files

artefact_CheckpointingVsSupervisionResilienceApproachesForDynamicTasks.zip

Files (288.8 kB)