Efficient and Eventually Consistent Collective Operations

doi:10.5281/zenodo.4588540

Published March 8, 2021 | Version v1

Preprint | Open Access


1. Fraunhofer ITWM / Sorbonne University
2. INESC-ID & IST (ULisboa)
3. CINECA
4. Fraunhofer ITWM

Collective operations are common features of parallel programming models and are frequently used in High-Performance Computing (HPC) and machine/deep learning (ML/DL) applications. In strong scaling scenarios, collective operations can degrade overall application performance: as the core count increases, the load per rank decreases, while the time spent in collective operations grows logarithmically.
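
To make the strong-scaling argument concrete, the sketch below uses a deliberately simple, hypothetical cost model that is not taken from the paper. It assumes a fixed total workload split evenly across p ranks and a tree-based collective whose cost is a per-level latency alpha times ceil(log2 p), ignoring message-size (bandwidth) terms. The function names and parameters are illustrative only.

```python
import math


def strong_scaling_times(total_work: float, p: int, alpha: float) -> tuple[float, float]:
    """Per-rank (compute, collective) time under a hypothetical cost model.

    Assumptions (illustrative only): total work divides evenly over p ranks,
    and a tree-based collective costs alpha per tree level, i.e.
    alpha * ceil(log2(p)); bandwidth terms are ignored.
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    compute = total_work / p                         # shrinks as 1/p
    collective = alpha * math.ceil(math.log2(p))     # grows as log2(p)
    return compute, collective


def collective_share(total_work: float, p: int, alpha: float) -> float:
    """Fraction of per-rank time spent inside the collective."""
    compute, collective = strong_scaling_times(total_work, p, alpha)
    return collective / (compute + collective)


if __name__ == "__main__":
    for p in (2, 16, 128, 1024):
        compute, collective = strong_scaling_times(1000.0, p, alpha=1.0)
        share = collective_share(1000.0, p, alpha=1.0)
        print(f"p={p:5d}  compute={compute:9.4f}  collective={collective:4.1f}  share={share:6.1%}")
```

With these made-up parameters, the collective takes well under 1% of per-rank time at 2 ranks but roughly 90% at 1024 ranks. This is the regime in which cheaper, relaxed-consistency collectives become attractive.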

In this article, we propose a design of eventually consistent collectives suitable for ML/DL computations, which reduces communication in Broadcast and Reduce and explores the Stale Synchronous Parallel (SSP) synchronization model for the Allreduce collective. We also enrich the GASPI ecosystem with frequently used classic (consistent) collective operations, such as Allreduce for large messages and AlltoAll as used in an HPC code. Our implementations show promising preliminary results, with significant improvements over the vendor-provided MPI alternatives, especially for Allreduce and AlltoAll.
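
The abstract does not say how SSP is applied to Allreduce. As general background, SSP (introduced for parameter-server training) lets each worker run ahead of the slowest worker by at most a bounded number of iterations, the staleness bound, rather than synchronizing at every step. The sketch below models only that clock rule. The class `SSPClock`, its methods, and the choice to count completed iterations are hypothetical and do not describe the paper's GASPI implementation.

```python
class SSPClock:
    """Hypothetical bookkeeping for the Stale Synchronous Parallel (SSP) model.

    clocks[w] counts the iterations worker w has completed. A worker may
    begin another iteration only while it is at most `staleness` iterations
    ahead of the slowest worker; staleness=0 behaves like a BSP barrier.
    """

    def __init__(self, num_workers: int, staleness: int) -> None:
        if num_workers < 1 or staleness < 0:
            raise ValueError("need num_workers >= 1 and staleness >= 0")
        self.staleness = staleness
        self.clocks = [0] * num_workers

    def can_advance(self, worker: int) -> bool:
        # The slowest worker bounds how far any other worker may run ahead.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def try_advance(self, worker: int) -> bool:
        """Complete one iteration for `worker` if SSP allows it; else report a wait."""
        if not self.can_advance(worker):
            return False  # a real runtime would block here until stragglers catch up
        self.clocks[worker] += 1
        return True


if __name__ == "__main__":
    bsp = SSPClock(num_workers=2, staleness=0)
    ssp = SSPClock(num_workers=2, staleness=2)
    # Worker 0 attempts five iterations while worker 1 makes no progress.
    print("BSP:", sum(bsp.try_advance(0) for _ in range(5)), "iterations before blocking")
    print("SSP:", sum(ssp.try_advance(0) for _ in range(5)), "iterations before blocking")
```

When worker 1 stalls, worker 0 completes one iteration under BSP but three under SSP with a staleness bound of 2. An SSP-style Allreduce could exploit this slack by letting ranks proceed with reduction results that are a few iterations old.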


Funding

European Commission, grant 801051: EPEEC (European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing)

