From tasks graphs to asynchronous distributed checkpointing with local restart

Romain Lion; Samuel Thibault

doi:10.1109/FTXS51974.2020.00009

Published December 30, 2020 | Version v1

Conference paper | Open access

1. University of Bordeaux, Inria Bordeaux - Sud-Ouest, Bordeaux, France

The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. However, as the amount of data handled by current applications grows, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail. Meanwhile, the current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We propose a checkpointing scheme that is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus opening the way to very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can keep the additional network load reasonable: it is measured to be below 10% on a dense linear algebra example.

Files

Name	Size
publication-1.pdf (md5:f53e78a05c8d25c7ab18b83844c9c75c)	384.7 kB

Additional details

European Commission
EXA2PRO - Enhancing Programmability and boosting Performance Portability for Exascale Computing Systems (grant agreement 801015)

	All versions	This version
Views	446	438
Downloads	189	188
Data volume	74.6 MB	74.2 MB
