

# Advanced parallel programming – MPI+X: MPI + OpenMP + OpenMP offloading

Claudia Blaas-Schenner and Ivan Vialov

VSC Research Center, TU Wien, Vienna, Austria

TREX Workshop: Code Tuning for the Exascale @ Bratislava, June 5, 2023

#### **Abstract**

TREX Workshop: Code Tuning for the Exascale, Slovak Academy of Sciences, Bratislava, Slovakia, Day 1 - 05.06.2023

Claudia Blaas-Schenner and Ivan Vialov (VSC Research Center, TU Wien, Vienna, Austria)

Advanced parallel programming – MPI+X: Modern HPC systems are clusters of shared-memory nodes, and the pre-exascale and exascale systems in particular are accelerated with one or more GPUs per node. While the Message Passing Interface (MPI) is the dominant model for parallelizing across nodes, MPI needs to be combined with other programming paradigms such as OpenMP to fully exploit shared memory within the nodes and to offload heavy compute tasks to the GPUs.

In this one-day tutorial, we will briefly cover MPI+OpenMP+OpenMP offloading.

We will explain how to properly tackle NUMA (non-uniform memory access) architectures and put a special focus on pinning. In the hands-on labs we will experiment with affinity, and the participants will get a good grasp of how pinning influences performance.

https://trex-coe.eu/events/trex-workshop-code-tuning-exascale



## **Acknowledgement** → subset of:



## **Hybrid Programming in HPC – MPI+X**

Claudia Blaas-Schenner<sup>1)</sup>

Georg Hager<sup>2)</sup>

Rolf Rabenseifner<sup>3)</sup>

rabenseifner@hlrs.de

claudia.blaas-schenner@tuwien.ac.at

- <sup>1)</sup> VSC Research Center, TU Wien, Vienna, Austria (hands-on labs)
- <sup>2)</sup> Erlangen National High Performance Computing Center (NHR@FAU), FAU, Germany
- <sup>3)</sup> High Performance Computing Center (HLRS), University of Stuttgart, Germany,

PTC ONLINE COURSE @ VSC Vienna, Dec 12-14, 2022

http://tiny.cc/MPIX-VSC

https://doi.org/10.5281/zenodo.7566873

### **General outline**

Introduction

#### **Programming Models**

- MPI + OpenMP on multi/many-core + Exercises
- MPI + Accelerators + Exercises

### Introduction

Hardware and programming models
Hardware Bottlenecks
Questions addressed in this tutorial
Remarks on Cost-Benefit Calculation

## Hardware and programming models



- MPI + threading
  - OpenMP
  - Cilk(+)
  - TBB (Threading Building Blocks)
- MPI + MPI shared memory
- MPI + accelerator
  - OpenACC
  - OpenMP accelerator support
  - CUDA
  - OpenCL, Kokkos, SYCL,...
- Pure MPI communication

## Options for running code on multicore clusters



- Which programming model is fastest?
  - MPI everywhere?
  - Fully hybrid MPI & OpenMP?
  - Something in between? (Mixed model)
- Hybrid programming is often slower than pure MPI
  - Examples, reasons, ...

## More Options with accelerators



#### Hierarchical hardware

Many levels

#### Hierarchical parallel programming

- Many options for MPI+X: one MPI process per
  - node
  - CPU
  - ccNUMA domain
  - [...]
  - core
  - hyper-thread

bottleneck?

#### Dual-CPU ccNUMA + accelerator node architecture

#### Actual topology of a modern compute node



#### Hardware bottlenecks

- Multicore cluster
  - Computation
  - Memory bandwidth
  - Intra-CPU communication (i.e., core-to-core)
  - Intra-node communication (i.e., CPU-to-CPU)
  - Inter-node communication
- Cluster with CPU+Accelerators
  - Within the accelerator
    - Computation
    - Memory bandwidth
    - Core-to-Core communication
  - Within the CPU and between the CPUs
    - See above
  - Link between CPU and accelerator



## Example: Hardware bottlenecks in SpMV

- Sparse matrix-vector multiply (SpMV) with stored matrix entries
  - Bottleneck: memory bandwidth of each CPU
- SpMV with calculated matrix entries (many complex operations per entry)
  - Bottleneck: computational speed of each core
- SpMV with highly scattered matrix entries
  - Bottleneck: inter-node communication
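To make the first case concrete, here is a minimal sketch of SpMV with stored matrix entries, assuming the common CSR (compressed sparse row) storage format. The function name `spmv_csr` and its argument names are illustrative and not taken from the course material.

```c
#include <stddef.h>

/* y = A*x for a sparse matrix A stored in CSR format.
 * row_ptr has n_rows+1 entries; col_idx and val hold the nonzeros. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* Per nonzero: one multiply-add, but the value and its column
         * index must be loaded from memory. With so little arithmetic
         * per byte, memory bandwidth is typically the limit. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

If the entries `val[k]` were computed on the fly by an expensive function instead of being loaded, the same loop would shift towards the compute-bound case listed above.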



#### Questions addressed in this tutorial

- What is the performance impact of system topology?
- How do I map my programming model on the system to my advantage?
  - How do I do the split into MPI+X?
  - Where do my processes/threads run? How do I take control?
  - Where is my data?
  - How can I minimize communication overhead?
- How does hybrid programming help with typical HPC problems?
  - Can it reduce communication overhead?
  - Can it reduce replicated data?
- How can I leverage multiple accelerators?
  - What are typical challenges?





## Programming models

- MPI + OpenMP on multi/many-core + Exercises
- MPI + MPI-3.0 shared memory + Exercise
- Pure MPI communication + Exercise
- MPI + Accelerators

# Programming models - MPI + OpenMP

- General considerations
- How to compile, link, and run
- Hands-on: Hello hybrid!
- System topology, ccNUMA, and memory bandwidth
- Memory placement on ccNUMA systems
- Topology and affinity on multicore
- Hands-on: Pinning
- Case study: The Multi-Zone NAS Parallel Benchmarks
- Hands-on: Masteronly hybrid Jacobi
- Overlapping communication and computation
- Communication overlap with OpenMP taskloops
- Hands-on: Taskloop-based hybrid Jacobi
- Main advantages, disadvantages, conclusions

## Programming models

- MPI + OpenMP

#### General considerations


## Potential advantages of MPI+OpenMP

#### Simple level

- Leverage additional levels of parallelism
  - Scaling to higher number of cores
  - Adding OpenMP with incremental additional parallelization
- Enable flexible load balancing on OpenMP level
  - Fewer MPI processes leave room for assigning workload more evenly
  - MPI processes with higher workload could employ more threads
  - Cheap OpenMP load balancing (tasking, dynamic/guided loops)
- Lower communication overhead (possibly)
  - Few "fat" MPI processes vs many "skinny" processes
  - Fewer messages and smaller amount of data communicated
- Lower memory requirements due to fewer MPI processes
  - Reduced amount of application halos & replicated data
  - Reduced size of MPI internal buffer space

#### Advanced level

Explicit communication/computation overlap

## MPI + any threading model

#### Special MPI init for multi-threaded MPI processes is required:

- Possible values for thread\_level\_required (increasing order):
  - MPI\_THREAD\_SINGLE: Only one thread will execute
  - MPI\_THREAD\_FUNNELED: Only the main<sup>1)</sup> thread will make MPI calls
  - MPI\_THREAD\_SERIALIZED: Multiple threads may make MPI calls, but only one at a time
  - MPI\_THREAD\_MULTIPLE: Multiple threads may call MPI, with no restrictions
    - May imply higher latencies due to some internal locks
- The returned thread\_level\_provided may be less or more than thread\_level\_required
  - Recommended check directly after MPI\_Init\_thread:

```
if (thread_level_provided < thread_level_required) MPI_Abort(...);
```

<sup>1)</sup> Main thread = thread that called MPI\_Init\_thread.
Recommendation: Call MPI\_Init\_thread from the OpenMP master thread → OpenMP master = MPI main thread
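The thread-level check could be wrapped as follows. This is a minimal sketch, not the course's reference code: the function name `hybrid_init` is hypothetical, and the `USE_MPI` guard follows the conditional-compilation convention of this tutorial so that the same file also builds without MPI.

```c
#include <stdio.h>
#ifdef USE_MPI
#include <mpi.h>
#endif

/* Initialize the (optionally MPI-parallel) program.
 * Returns 0 on success; rank and size are set in both build variants.
 * Call this outside of any OpenMP parallel region, so that the
 * OpenMP master thread becomes the MPI main thread. */
int hybrid_init(int *argc, char ***argv, int *rank, int *size)
{
#ifdef USE_MPI
    int required = MPI_THREAD_FUNNELED;   /* enough for masteronly style */
    int provided;
    MPI_Init_thread(argc, argv, required, &provided);
    if (provided < required) {
        fprintf(stderr, "MPI thread support too low: %d < %d\n",
                provided, required);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, rank);
    MPI_Comm_size(MPI_COMM_WORLD, size);
#else
    (void)argc; (void)argv;               /* unused without MPI */
    *rank = 0;
    *size = 1;
#endif
    return 0;
}
```

Requesting only the level that is actually needed (here MPI\_THREAD\_FUNNELED) is a deliberate choice in this sketch, since higher levels may cost extra latency.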

## Hybrid MPI+OpenMP masteronly style

```
for (iterations) {
    #pragma omp parallel
        numerical code
    /*end omp parallel */

    /* on master only */
        MPI_Isend();
        MPI_Irecv();
        MPI_Waitall();
} /* end for loop */
```

masteronly style: MPI only outside of parallel regions

#### Advantages

- Simplest possible hybrid model
- Thread-parallel execution and MPI communication strictly separate
- Minimally required MPI thread support level:
   MPI\_THREAD\_FUNNELED

#### **Major Problems**

- All other threads are sleeping while master thread communicates!
- Only one thread per process communicating
  - → possible underutilization of network bandwidth

## Masteronly style within large parallel region

```
#pragma omp parallel
for (iterations) {
  #pragma omp for
  for (i=0; ...) {
    // ... numerics
  } // implicit barrier here
  #pragma omp single
  {
    MPI_Isend();
    MPI_Irecv();
    MPI_Waitall();
  } // implicit barrier here
} /* end iter loop */
```

- Barrier before MPI required
  - May be implicit
  - Prevent race conditions on communication buffer data
    - Between multi-threaded numerics
    - and MPI access by master thread
  - Enforce flush of variables
- Barrier after MPI required
  - May be implicit
  - Numerical loop(s) may need communicated data
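The barrier placement can be tried out with the following sketch, which keeps one parallel region open across all iterations. To stay runnable without MPI, the communication is replaced by a stand-in, `exchange_halos_periodic`, which is a hypothetical helper that applies periodic boundaries inside one process. Note that `single` may be executed by any thread, so with real MPI calls the library should provide at least MPI\_THREAD\_SERIALIZED, or `master` plus explicit barriers should be used instead.

```c
/* Stand-in for the MPI halo exchange (hypothetical helper):
 * applies periodic boundary conditions within one process. */
static void exchange_halos_periodic(double *u, int n)
{
    u[0]     = u[n];
    u[n + 1] = u[1];
}

void jacobi_single_region(double *u, double *unew, int n, int iters)
{
    #pragma omp parallel
    for (int it = 0; it < iters; ++it) {   /* every thread runs this loop */
        #pragma omp for schedule(static)
        for (int i = 1; i <= n; ++i)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        /* implicit barrier: all reads of u are finished */

        #pragma omp for schedule(static)
        for (int i = 1; i <= n; ++i)
            u[i] = unew[i];
        /* implicit barrier: the communication buffer data is complete */

        #pragma omp single
        exchange_halos_periodic(u, n);
        /* implicit barrier: halos are visible before the next sweep */
    }
}
```

Adding `nowait` to any of these constructs would remove the corresponding barrier and reintroduce the race conditions described in the list above.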

## Programming models

- MPI + OpenMP

How to compile, link, and run


## How to compile, link and run

- Use appropriate OpenMP compiler switch (-openmp, -fopenmp, -mp, -qsmp=openmp, ...) and MPI compiler script (if available)
- Link with MPI library
  - Usually wrapped in MPI compiler script
  - If required, specify to link against thread-safe MPI library
    - Often automatic when OpenMP or auto-parallelization is switched on
- Running the code
  - Highly non-portable; consult system docs (if available...)
  - Figure out how to start fewer MPI processes than cores per node
  - Pinning (who is running where?) is extremely important → see later

## Compiling from a single source

#### Make use of pre-defined symbols

```
#ifdef _OPENMP   /* _OPENMP is defined when compiling with, e.g., -qopenmp */
      // all that is special for OpenMP
#endif
#ifdef USE_MPI   /* USE_MPI is defined with -DUSE_MPI */
      // all that is special for MPI
#endif
#ifdef USE_MPI
      MPI_Init(...);
      MPI_Comm_rank(..., &rank);
      MPI_Comm_size(..., &size);
#else            /* recommended for non-MPI builds */
      rank = 0;
      size = 1;
#endif
```
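The same idea applies to OpenMP runtime calls. The sketch below uses the standard `_OPENMP` macro to guard the `omp.h` include and the runtime call, so the file compiles with and without the OpenMP switch; the function names `max_threads` and `print_build_info` are illustrative only.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>      /* only needed in OpenMP builds */
#endif

/* Number of threads a parallel region would use; 1 in serial builds. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Describe how this binary was built. _OPENMP expands to the yyyymm
 * date of the OpenMP specification supported by the compiler. */
void print_build_info(void)
{
#ifdef _OPENMP
    printf("OpenMP enabled (_OPENMP=%d), max threads: %d\n",
           _OPENMP, max_threads());
#else
    printf("OpenMP disabled, running with 1 thread\n");
#endif
}
```

Directives themselves need no guard, because a compiler without the OpenMP switch ignores `#pragma omp` lines.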

## Compiling from a single source

#### Handling compilers

Intel MPI + Intel C

```
mpiicc -DUSE_MPI -qopenmp ...
icc -qopenmp ...
```

Intel MPI + Intel Fortran

```
mpiifort -fpp -DUSE_MPI -qopenmp ...
ifort -fpp -qopenmp ...
```

OpenMPI + gcc

```
mpicc -DUSE_MPI -fopenmp ...
gcc -fopenmp ...
```

OpenMPI + gfortran

```
mpif90 -cpp -DUSE_MPI -fopenmp ...
gfortran -cpp -fopenmp ...
```

## Examples for compilation and execution

- Cray XC40 (2 NUMA domains w/ 12 cores each), one process (12 threads) per socket
  - ftn -h omp ...
  - OMP\_NUM\_THREADS=12 aprun -n 4 -N 2 \
    -d \$OMP\_NUM\_THREADS ./a.out
- Intel Ice Lake (36-core 2-socket) cluster, Intel MPI/OpenMP, one process (36 threads) per socket
  - mpiifort -qopenmp ...
  - mpirun -ppn 2 -np 4 \
    -env OMP\_NUM\_THREADS 36 \
    -env I\_MPI\_PIN\_DOMAIN socket \
    -env KMP\_AFFINITY scatter ./a.out

## Examples for compilation and execution

- Intel Ice Lake (36-core 2-socket) cluster, Intel MPI/OpenMP + likwid-mpirun, one process (36 threads) per socket
  - mpiifort -qopenmp ...
  - likwid-mpirun -np 4 -pin S0:0-35\_S1:0-35 ./a.out
- Intel Skylake (24-core 2-socket) cluster, GCC + OpenMPI 4.1, one process (24 threads) per socket
  - mpif90 -fopenmp ...
  - OMP\_NUM\_THREADS=24 OMP\_PLACES=cores OMP\_PROC\_BIND=close \
     mpirun --map-by ppr:1:socket:PE=24 ./a.out
  - Ditto, two processes per socket (12 threads each)
    - OMP\_NUM\_THREADS=12 OMP\_PLACES=cores OMP\_PROC\_BIND=close \
      mpirun --map-by ppr:2:socket:PE=12 ./a.out

## Learn about node topology

- A collection of tools is available
  - numactl --hardware (numatools)
  - lstopo --no-io (part of hwloc)
  - cpuinfo -A (part of Intel MPI)
  - likwid-topology (part of LIKWID tool suite, http://tiny.cc/LIKWID)



## Learning about node topology






# Programming models

- MPI + OpenMP

Hands-On #1

Hello hybrid!


#### Hands-On #1

he-hy - Hello Hybrid! - compiling, starting

1. FIRST THINGS FIRST PART 1: find out about a (new) cluster: login node
2. FIRST THINGS FIRST PART 2: find out about a (new) cluster: batch jobs
3. MPI+OpenMP: :TODO: how to compile and start an application; how to do conditional compilation
4. MPI+OpenMP: :TODO: get to know the hardware (needed for pinning)

→ see: TODO.README

# Programming models - MPI + OpenMP

System topology, ccNUMA, and memory bandwidth


## What is "topology"?

#### Where in the machine does core (or hardware thread) #n reside?



Why is this important?

- Resource sharing (cache, data paths)
- Communication efficiency (shared vs. separate caches, buffer locality)
- Memory access locality (ccNUMA!)

## Compute nodes – caches

| Latency | ← typical → | Bandwidth           |
|---------|-------------|---------------------|
| 1–2 ns  | L1 cache    | 200 GB/s            |
| 3–10 ns | L2/L3 cache | 50 GB/s             |
| 100 ns  | memory      | 20 GB/s<br>(1 core) |



## Ping-Pong Benchmark – Latency

#### Intra-node vs. inter-node on VSC-3

- nodes = 2 sockets (Intel Ivy Bridge) with 8 cores + 2 HCAs
- inter-node = IB fabric = dual rail Intel QDR-80 = 3-level fat-tree (BF: 2:1 / 4:1)



```
myID = get_process_ID()
if (myID.eq.0) then
  targetID = 1
  S = get_walltime()
  call Send_message(buffer,N,targetID)
  call Receive_message(buffer,N,targetID)
  E = get_walltime()
  GBYTES = 2*N/(E-S)/1.d9   ! Gbyte/s rate
  TIME = (E-S)/2*1.d6       ! transfer time in µs
else
  targetID = 0
  call Receive_message(buffer,N,targetID)
  call Send_message(buffer,N,targetID)
endif
```

| Latency of MPI_Send() | OpenMPI | Intel MPI |
|-----------------------|---------|-----------|
| intra-socket          | 0.3 µs  | 0.3 µs    |
| inter-socket          | 0.6 µs  | 0.7 µs    |
| IB -1- edge           | 1.2 µs  | 1.4 µs    |
| IB -2- leaf           | 1.6 µs  | 1.8 µs    |
| IB -3- spine          | 2.1 µs  | 2.3 µs    |

| For comparison: typical latencies |         |
|-----------------------------------|---------|
| L1 cache                          | 1–2 ns  |
| L2/L3 cache                       | 3–10 ns |
| memory                            | 100 ns  |
| HPC networks                      | 1–10 µs |

→ Avoiding slow data paths is the key to most performance optimizations!
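The ping-pong pseudocode translates to C roughly as follows. This is a sketch, not the benchmark used for the numbers in the tables: the function names are hypothetical, the repetition count `reps` is an added assumption to average out timer resolution, and the MPI part is only compiled when `USE_MPI` is defined.

```c
#ifdef USE_MPI
#include <mpi.h>
#endif

/* Effective bandwidth in Gbyte/s: 2*N bytes travel during one round trip. */
double pingpong_gbytes_per_s(double n_bytes, double t_start, double t_end)
{
    return 2.0 * n_bytes / (t_end - t_start) / 1.0e9;
}

/* One-way transfer time in microseconds: half of the round-trip time. */
double pingpong_time_us(double t_start, double t_end)
{
    return (t_end - t_start) / 2.0 * 1.0e6;
}

#ifdef USE_MPI
/* Ranks 0 and 1 bounce a buffer of n bytes back and forth reps times.
 * Returns the averaged one-way time in microseconds. */
double pingpong(char *buffer, int n, int reps)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double start = MPI_Wtime();
    for (int r = 0; r < reps; ++r) {
        if (rank == 0) {
            MPI_Send(buffer, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buffer, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buffer, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buffer, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double end = MPI_Wtime();
    return pingpong_time_us(start, end) / reps;
}
#endif
```

Latency is measured with very small messages (n close to 0), bandwidth with large ones. Where ranks 0 and 1 are pinned (same socket, other socket, other node) determines which row of the table is being measured.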

### Ping-Pong 1-on-1 Benchmark – Effective Bandwidth



## Multiple communicating rings

Benchmark halo\_irecv\_send\_multiplelinks\_toggle.c

- Varying message size,
- number of communication cores per CPU, and

See HLRS online courses http://www.hlrs.de/training/self-study-materials

- → Practical → MPI.tar.gz
- → subdirectory MPI/course/C/1sided/



# OpenMP barrier synchronization cost

Comparison of barrier synchronization cost with increasing number of threads

- 2x Haswell 14-core (CoD mode)
- Optimistic measurements (repeated 1000s of times)
- No impact from previous activity in cache
- → Barrier sync time highly dependent on system topology & OpenMP runtime implementation



#### Accumulated bandwidth saturation vs. # cores



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# Programming models - MPI + OpenMP

Memory placement on ccNUMA systems


### A short introduction to ccNUMA

#### ccNUMA:

- whole memory is transparently accessible by all processors
- but physically distributed
- with varying bandwidth and latency
- and potential contention (shared memory paths)
- Memory placement occurs with OS page granularity (often 4 KiB)



#### How much bandwidth does non-local access cost?

Example: AMD "Naples" 2-socket system (8 chips, 2 sockets, 48 cores):

STREAM Triad bandwidth measurements [Gbyte/s]



# Avoiding locality problems

- How can we make sure that memory ends up where it is close to the CPU that uses it?
  - See next slides (first-touch initialization)
- How can we make sure that it stays that way throughout program execution?
  - See later in the tutorial (pinning)

Taking control is the key strategy!

# Solving Memory Locality Problems: First Touch

"Golden Rule" of ccNUMA:
 A memory page gets mapped into the local memory of the processor that first touches it!



- Consequences
  - Process/thread-core affinity is decisive!
  - With OpenMP, data initialization code becomes important even if it takes little time to execute ("parallel first touch")
  - Parallel first touch is automatic for pure MPI
  - If thread team does not span across NUMA domains, memory mapping is not a problem
- Automatic page migration may help if memory is used long enough

## Solving Memory Locality Problems: First Touch

"Golden Rule" of ccNUMA:

A memory page gets mapped into the local memory of the processor that first touches it!

- Except if there is not enough local memory available
- Some OSs allow influencing placement in more direct ways
  - → libnuma (Linux)
- Caveat: "touch" means "write," not "allocate" or "read"
- Example:

```
double *huge = (double*)malloc(N*sizeof(double));
// memory not mapped yet
for(i=0; i<N; i++) // or i+=PAGE_SIZE
   huge[i] = 0.0; // mapping takes place here!
```



# Simplest case: explicit initialization

```
integer,parameter :: N=10000000
double precision A(N), B(N)
A=0.d0
!$OMP parallel do
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end parallel do
```

```
integer, parameter :: N=10000000
double precision A(N),B(N)
!$OMP parallel
!$OMP do schedule(static)
do i = 1, N
 A(i) = 0.d0
end do
!$OMP end do
!$OMP do schedule(static)
do i = 1, N
 B(i) = function (A(i))
end do
!$OMP end do
!$OMP end parallel
```
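The two Fortran variants differ only in how A is initialized: serially (all pages of A land in the ccNUMA domain of the initializing thread) or in parallel with the same static schedule as the compute loop. A C counterpart of the second, parallel first-touch variant could look like this; the function names and the update formula are placeholders chosen for illustration.

```c
#include <stddef.h>

/* Parallel first touch: each thread writes, and thereby maps, the
 * pages of the chunk it will work on later. */
void init_first_touch(double *a, double *b, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i) {
        a[i] = 0.0;
        b[i] = 0.0;
    }
}

/* Compute loop with the same static schedule, so that every thread
 * finds "its" pages in local memory (assuming threads are pinned). */
void update(const double *a, double *b, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        b[i] = 2.0 * a[i] + 1.0;
}
```

For this to have any effect, the arrays must not have been written before, for example by a serial `memset` or initialization loop right after `malloc`.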

# Handling ccNUMA in practice

- Solution A
  - One (or more) MPI process(es) per ccNUMA domain
  - Pro: optimal page placement (perfectly local memory access) for free
  - Con: higher number (>1) of MPI processes on each node
- Solution B
  - One MPI process per node or one MPI process spans multiple ccNUMA domains
  - Pro: Smaller number of MPI processes compared to Solution A
  - Cons:
    - Explicitly parallel initialization needed to "bind" the data to each ccNUMA domain
       → otherwise loss of performance
    - Dynamic/guided schedule or tasking → loss of performance
- Thread binding is mandatory for A and B! Never trust the defaults!

# Conclusions from the observed topology effects

- Know your hardware characteristics:
  - Hardware topology (use tools such as likwid-topology)
  - Typical hardware bottlenecks
    - These are independent of the programming model!
  - Hardware bandwidths, latencies, peak performance numbers
- Know your software characteristics
  - Typical numbers for communication latencies, bandwidths
  - Typical OpenMP overheads
- Learn how to take control
  - See next chapter on affinity control
- Leveraging topology effects is a part of code optimization!



# **Programming models**

- MPI + OpenMP

# Topology and affinity on multicore


# Thread/Process Affinity ("Pinning")

- Highly OS-dependent system calls
  - But available on all OSs
  - Non-portable
- Support for user-defined pinning for OpenMP threads in all compilers
  - Compiler specific
  - Standardized in OpenMP (places)
  - Generic Linux: taskset, numactl, likwid-pin
- Affinity awareness in all MPI libraries
  - Not defined by the MPI standard (as of 4.0)
  - Necessarily non-portable feature of the startup mechanism (mpirun, ...)
- Affinity awareness in batch scheduler
  - Batch scheduler must work with MPI + OpenMP affinity
  - Difficult, non-portable, every combination is different

# Anarchy vs. affinity with OpenMP STREAM





There are several reasons for caring about affinity:

- Eliminating performance variation
- Making use of architectural features
- Avoiding resource contention



#### OMP\_PLACES and Thread Affinity (see OpenMP-4.0 page 7 lines 29-32, p. 241-243)

- A place consists of one or more processors.
  - A processor is the smallest unit to run a thread or task.
  - Threads on a place may migrate freely between the *processors* of that place.
  - Pinning happens on the level of *places*.
- Abstract names:
  - OMP\_PLACES=threads
    - → Each place corresponds to the single *processor* of a single hardware thread (hyper-thread)
  - OMP\_PLACES=cores
    - → Each place corresponds to the processors (one or more hardware threads) of a single core
  - OMP\_PLACES=sockets
    - → Each place corresponds to the processors of a single socket (consisting of all hardware threads of one or more cores)
  - OMP\_PLACES=abstract\_name(num\_places)
    - → In general, the number of places may be explicitly defined
- Or with explicit numbering, e.g. 8 places, each consisting of 4 processors:
  - setenv OMP\_PLACES "{0,1,2,3},{4,5,6,7},{8,9,10,11}, ... {28,29,
  - setenv OMP\_PLACES "{0:4},{4:4},{8:4}, ... {28:4}"
  - setenv OMP\_PLACES "{0:4}:8:4"
  - Interval notation: <lower-bound>:<number of entries>[:<stride>]

#### CAUTION:

The numbers highly depend on hardware and operating system, e.g.,

- {0,1} = hyper-threads of 1st core of 1st socket, or
- {0,1} = 1st hyper-thread of 1st core of 1st and 2nd socket, or ...

# OMP\_PROC\_BIND variable / proc\_bind() clause

#### Determines how places are used for pinning:

| Used for         | OMP_PROC_BIND | Meaning                                                                                                     |
|------------------|---------------|-------------------------------------------------------------------------------------------------------------|
|                  | FALSE         | Affinity disabled                                                                                           |
|                  | TRUE          | Affinity enabled, implementation defined strategy                                                           |
|                  | CLOSE         | Threads bind to consecutive places                                                                          |
|                  | SPREAD        | Threads are evenly scattered among places                                                                   |
|                  | MASTER        | Threads bind to the same place as the master thread that was running before the parallel region was entered |
| nested<br>OpenMP |               |                                                                                                             |

# Some simple OMP\_PLACES examples

Intel Xeon w/ SMT, 2x36 cores, 1 thread per physical core, fill 1 socket

```
OMP_NUM_THREADS=36
OMP_PLACES=cores
OMP_PROC_BIND=close
```

Intel Xeon Phi with 72 cores,
 32 cores to be used, 2 threads per physical core

```
OMP_NUM_THREADS=64
OMP_PLACES=cores(32)
OMP_PROC_BIND=close  # spread will also do
```

Intel Xeon, 2 sockets, 4 threads per socket (no binding within socket!)

```
OMP_NUM_THREADS=8
OMP_PLACES=sockets
OMP_PROC_BIND=close  # spread will also do
```

Intel Xeon, 2 sockets, 4 threads per socket, binding to cores

```
OMP_NUM_THREADS=8
OMP_PLACES=cores
OMP_PROC_BIND=spread
```

Always prefer abstract places instead of HW thread IDs!
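To check what a given combination of OMP\_NUM\_THREADS, OMP\_PLACES and OMP\_PROC\_BIND actually does, the binding can be queried from inside the program. The sketch below uses the place-query routines introduced with OpenMP 4.5; the function name `report_places` is illustrative, and the code falls back to a message when compiled without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Print on which place each thread of a parallel region runs.
 * Returns the number of places in the place list (0 in serial builds
 * or if no place list is available). */
int report_places(void)
{
    int num_places = 0;
#ifdef _OPENMP
    num_places = omp_get_num_places();
    #pragma omp parallel
    {
        /* omp_get_place_num() returns -1 if the thread is not bound */
        printf("thread %d of %d runs on place %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), num_places);
    }
#else
    printf("compiled without OpenMP: no places available\n");
#endif
    return num_places;
}
```

Running this with the settings from the examples above, and once without any of the variables set, shows directly how much the defaults differ from an explicit binding on a given system.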

# Pinning of MPI processes

- Highly system dependent!
- Intel MPI: env variable I\_MPI\_PIN\_DOMAIN
- OpenMPI: choose between several mpirun options, e.g.,
   -bind-to-core, -bind-to-socket, -bycore, -byslot ...
- Cray's aprun: pinning by default

 Platform-independent tools: likwid-mpirun (likwid-pin, numactl)

# Anarchy vs. affinity with a heat equation solver



#### Reasons for caring about affinity:

- Eliminating performance variation
- Making use of architectural features
- Avoiding resource contention



2x 10-core Intel Ivy Bridge, OpenMPI



#### likwid-mpirun: 1 MPI process per node

likwid-mpirun -np 2 -pin N:0-11 ./a.out




Intel MPI+compiler:

#### likwid-mpirun: 1 MPI process per socket



# MPI/OpenMP affinity: Take-home messages

- Learn how to take control of hybrid execution!
  - Almost all performance features depend on topology and thread placement! (especially if SMT/Hyperthreading is on)
- Always observe the topology dependence of
  - Intranode MPI performance
  - OpenMP overheads
  - Saturation effects / scalability behavior with bandwidth-bound code
- Enforce proper thread/process to core binding, using appropriate tools
   (→ whatever you use, but use SOMETHING)
- Memory page placement on ccNUMA nodes
  - Automatic optimal page placement for one (or more) MPI processes per ccNUMA domain (solution A)
  - Explicitly parallel first-touch initialization only required for multi-domain MPI processes (solution B)

# Programming models

- MPI + OpenMP

Hands-On #2

**Pinning** 


### Hands-On #1

he-hy - Hello Hybrid! - pinning

5. MPI-pure MPI: compile and run the MPI "Hello world!" program (pinning)

6. MPI+OpenMP: :TODO: compile and run the Hybrid "Hello world!" program

7. MPI+OpenMP: :TODO: how to do pinning

→ see: TODO.README

# Programming models

- MPI + OpenMP

Hands-On #3

**Masteronly hybrid Jacobi** 


## Example: MPI+OpenMP-Hybrid Jacobi solver

- Source code: See http://tiny.cc/MPIX-VSC
- This is a Jacobi solver (2D stencil code) with domain decomposition and halo exchange
- The given code is MPI-only. You can build it with make (take a look at the Makefile) and run it with something like this (adapt to local requirements):

```
$ <mpirun-or-whatever> -np <numprocs> ./jacobi.exe < input
```

Task: parallelize it with OpenMP to get a hybrid MPI+OpenMP code, and run it effectively on the given hardware.

- Notes:
  - The code is strongly memory bound at the problem size set in the input file
  - Learn how to take control of affinity with MPI and especially with MPI+OpenMP
  - Always run multiple times and observe performance variations
  - If you know how, try to calculate the maximum possible performance and use it as a "light speed" baseline


# Example cont'd

- Tasks (we assume N<sub>c</sub> cores per CPU socket):
  - Run the MPI-only code on one node with 1,...,N<sub>c</sub>,...,2\*N<sub>c</sub> processes (1 full node) and observe the achieved performance behavior
  - Parallelize appropriate loops with OpenMP
  - Run with OpenMP and 1 MPI process ("OpenMP-only") on 1,...,N<sub>c</sub>,...,2\*N<sub>c</sub> cores, compare with MPI-only run
  - Run hybrid variants with different MPI vs. OpenMP ratios
- Things to observe
  - Run-to-run performance variations
  - Does the OpenMP/hybrid code perform as well as the MPI code? If it doesn't, fix it!




# Programming models

- MPI + OpenMP

# Overlapping Communication and Computation


# Sleeping threads with masteronly style

```
for (iteration ....)
{
    #pragma omp parallel
       numerical code
    /* end parallel */

    /* on master only */
      MPI_Send(halos);
      MPI_Recv(halos);
} /*end for loop*/
```



#### Problem:

Sleeping threads are wasting CPU time

#### Solution:

- Overlapping of computation and communication
- Limited benefit:
  - Best case: reduces communication overhead from 50% to 0%
    - → speedup of 2x
  - Usual case: from 20% to 0%
    - → speedup of 1.25x
  - Requires significant work → later
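The two speedup numbers follow from a simple model, written out here as a worked formula. It assumes that the communication time is hidden completely behind computation and that overlapping adds no overhead; the symbols $T_\mathrm{comp}$, $T_\mathrm{comm}$ and $f$ are introduced for this illustration only.

$$
f = \frac{T_\mathrm{comm}}{T_\mathrm{comp} + T_\mathrm{comm}}, \qquad
S = \frac{T_\mathrm{comp} + T_\mathrm{comm}}{T_\mathrm{comp}} = \frac{1}{1 - f}
$$

$$
f = 0.5 \;\Rightarrow\; S = \frac{1}{0.5} = 2, \qquad
f = 0.2 \;\Rightarrow\; S = \frac{1}{0.8} = 1.25
$$

Since complete hiding requires $T_\mathrm{comm} \le T_\mathrm{comp}$, a speedup of 2 is the upper limit of this model.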

## Nonblocking vs. threading for overlapped comm.

- Why not use nonblocking calls?
  - Asynchronous progress not guaranteed
  - Options (implementation dependent):
    - Communication offload to NIC
    - Additional internal progress thread (MPI\_ASYNC... with MPICH)
  - Intranode and internode communication may be handled very differently
- Using threading for communication overlap
  - One or more threads/tasks handle communication, the rest of the team "does the work"
  - How to organize the work sharing among all threads?
    - Non-communicating threads
    - Communicating threads after communication is over
  - Not all of the work can usually be overlapped → see next slide

# Using threading/tasking for comm. overlap



#### Explicit overlapping of communication and computation

The basic principle appears simple:

```
#pragma omp parallel
{
  // ... do other parallel work
  if (omp_get_thread_num() < 1) {
    MPI_Send/Recv ... // comm. halo data
  } else {
    // Work on data that is independent
    // of halo data
  }
} // end omp parallel
// Now work on data that needs the
// halo data (all threads)
```

## Overlapping communication with computation

#### Three problems:

- Application problem: separate application into
  - code that can run before the halo data is received
  - code that needs halo data
  - May be hard to do
- Thread-rank problem: distinguish comm. / comp. via thread ID
  - Work sharing and load balancing is harder
  - Options
    - Fully manual work distribution
    - Nested parallelism
    - Tasking & taskloops
    - Partitioned comm (MPI-4.0)
- Optimal memory placement on ccNUMA may be difficult
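The "fully manual work distribution" option can be sketched as follows. This is a hypothetical helper, not code from the course: it assumes thread 0 is the only communicating thread and splits the halo-independent iterations as evenly as possible among the remaining threads.

```c
/* Half-open iteration range [lo, hi) owned by one thread. */
typedef struct { int lo; int hi; } range_t;

/* Split n halo-independent iterations among the worker threads
   1 .. nthreads-1; thread 0 is reserved for MPI communication
   and receives an empty range. Requires nthreads >= 2. */
range_t worker_range(int tid, int nthreads, int n)
{
    range_t r = {0, 0};
    if (tid < 1 || tid >= nthreads)
        return r;                     /* communicating (or invalid) thread */
    int workers = nthreads - 1;
    int w = tid - 1;                  /* worker index 0 .. workers-1 */
    int chunk = n / workers;
    int rest  = n % workers;          /* first `rest` workers get one extra */
    r.lo = w * chunk + (w < rest ? w : rest);
    r.hi = r.lo + chunk + (w < rest ? 1 : 0);
    return r;
}
```

Inside a parallel region each thread would call `worker_range(omp_get_thread_num(), omp_get_num_threads(), n)` and loop over its own range. The drawback named above is visible here: the split is static, so the communicating thread cannot help if communication finishes early.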

# Programming models - MPI + OpenMP

# Communication overlap with OpenMP taskloops


# OpenMP taskloop Directive - Syntax

- Immediately following loop executed in several tasks
- Not a work-sharing directive!
  - Should be executed only by one thread!

A task can be run by any thread, across NUMA nodes

→ perfect first touch impossible!

Fortran:

```
!$OMP taskloop [clause[[,]clause]...]
    do_loop
[!$OMP end taskloop]
```

- Loop iterations must be independent, i.e., they can be executed in parallel
- If used, the end taskloop directive must appear immediately after the end of the loop

C/C++:

   #pragma omp taskloop [clause[[,]clause]...] new-line
   for-loop

- The corresponding for-loop must have canonical shape → next slide

# OpenMP taskloop Directive - Details

```
clause can be one of the following:
  if([taskloop:] scalar-expr)                    [a task clause]
  shared(list)                                   [a task clause]
  private(list), firstprivate(list)              [a do/for clause] [a task clause]
  lastprivate(list)                              [a do/for clause]
  default(shared | none | ...)                   [a task clause]
  collapse(n)                                    [a do/for clause]
  grainsize(grain-size)                          (mutually exclusive with num_tasks)
  num_tasks(num-tasks)                           (mutually exclusive with grainsize)
  untied, mergeable                              [a task clause]
  final(scalar-expr), priority(priority-value)   [a task clause]
  nogroup
  reduction(operator:list)                       [a do/for clause] (since OpenMP 5.0!)
do/for clauses that are not valid on a taskloop:
  schedule(type[,chunk]), nowait
  linear(list[:linear-step]), ordered[(n)]
```

# OpenMP single & taskloop Directives

C/C++:

```
#pragma omp parallel
{
  #pragma omp single
  {
    #pragma omp taskloop
    for-loop
  } /*omp end single*/
} /*omp end parallel*/
```

A lot more tasks than threads may be produced to achieve a good load balancing.
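A minimal, self-contained sketch of the single + taskloop pattern. The function name, the loop body, and the use of `grainsize` are illustrative assumptions, not taken from the course material.

```c
/* a[i] = 2*b[i], split into tasks of roughly `grain` iterations each.
   One thread (single) creates the tasks; all threads of the team
   may execute them. `grain` must be positive. */
void scale_taskloop(double *a, const double *b, int n, int grain)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            /* taskloop waits for its own tasks (implicit taskgroup) */
            #pragma omp taskloop grainsize(grain)
            for (int i = 0; i < n; i++)
                a[i] = 2.0 * b[i];
        } /* end single */
    } /* end parallel */
}
```

With `n = 1000` and `grain = 64` this creates many more tasks than a typical team has threads, which is what gives the runtime room to balance the load. Compiled without OpenMP support, the pragmas are ignored and the loop runs serially with the same result.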



#### Comm. overlap with task & taskloop Directives - C/C++

```
#pragma omp parallel
{
  #pragma omp single
  {
    #pragma omp task
    { // MPI halo communication:
      MPI_Send/Recv...
      // numerical loop using halo data:
      #pragma omp taskloop
      for (i=0; i<100; i++)
        a[i] = b[i] + b[i-1] + b[i+1] + b[i-2]...;
    } /*omp end of halo task */

    // numerical loop without halo data:
    #pragma omp taskloop
    for (i=100; i<10000; i++)
      a[i] = b[i] + b[i-1] + b[i+1] + b[i-2]...;
  } /*omp end single */
} /*omp end parallel*/
```

Number of tasks may be influenced with grainsize or num_tasks clauses.



#### Partitioned Point-to-Point Communication

- New in MPI-4.0:
   Partitioned communication is "partitioned" because it allows for multiple contributions of data to be made, potentially, from multiple actors (e.g., threads or tasks) in an MPI process to a single communication operation.
- A point-to-point operation (i.e., send or receive)
  - can be split into partitions,
  - and each partition is filled and then "sent" with MPI\_Pready by a thread;
  - same for receiving
- Technically provided as a new form of persistent communication.

# **Programming models**

- MPI + OpenMP

Hands-On #4

Taskloop-based hybrid Jacobi

→ optional...


# Programming models - MPI + OpenMP

Main advantages, disadvantages, conclusions


### Load Balancing with hybrid programming

- On same or different level of parallelism
- OpenMP enables
  - cheap dynamic and guided load-balancing
  - via a parallelization option (clause on omp for / do directive)
  - without additional software effort
  - without explicit data movement
- On MPI level
  - Dynamic load balancing requires moving of parts of the data structure through the network
  - Significant runtime overhead
  - Complicated software → rarely implemented
- MPI & OpenMP
  - Simple static load balancing on MPI level, dynamic or guided on OpenMP level
     medium-quality, cheap implementation

```
#pragma omp parallel for schedule(dynamic)
for (i=0; i<n; i++) {
  /* poorly balanced iterations */ ...
}
```
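To make the "poorly balanced iterations" concrete, here is a small sketch with a triangular workload. The function name and the chunk size of 16 are assumptions chosen for illustration.

```c
/* Triangular workload: iteration i costs O(i) operations, so a static
   split into contiguous blocks gives the last thread far more work
   than the first one. schedule(dynamic) hands out chunks on demand. */
long triangular_sum(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        long row = 0;
        for (int j = 0; j <= i; j++)
            row += j;
        total += row;
    }
    return total;
}
```

Switching between `schedule(static)`, `schedule(dynamic, 16)`, and `schedule(guided)` only changes the clause, not the code or the data layout, which is the "cheap" load balancing referred to above.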

### MPI+OpenMP: Main advantages

- Increase parallelism
  - Scaling to higher number of cores
  - Adding OpenMP with incremental additional parallelization
- Lower memory requirements due to smaller number of MPI processes
  - Reduced amount of application halos & replicated data
  - Reduced size of MPI internal buffer space
  - Very important on systems with many cores per node
- Lower communication overhead (possibly)
  - Few multithreaded MPI processes vs many single-threaded processes
  - Fewer calls and smaller amount of data communicated
  - Topology problems from pure MPI are solved (i.e., application topology versus multilevel hardware topology)
- Provide for flexible load-balancing on coarse and fine levels
  - A smaller number of MPI processes leaves room for assigning workload more evenly
  - MPI processes with higher workload could employ more threads

#### Additional advantages when overlapping communication and computation:

No sleeping threads

#### MPI+OpenMP: Main disadvantages & challenges

- Non-Uniform Memory Access:
  - Not all memory access is equal: ccNUMA locality effects
  - Penalties for access across NUMA domain boundaries
  - First touch is needed for more than one NUMA domain per MPI process
  - Alternative solution:
     One MPI process on each NUMA domain (i.e., chip)
- Multicore / multisocket anisotropy effects
  - Bandwidth bottlenecks, shared caches
  - Intra-node MPI performance: Core ↔ core vs. socket ↔ socket
  - OpenMP loop overhead
- Amdahl's law on both the MPI and the OpenMP level
- Complex thread and process pinning

Masteronly style (i.e., MPI outside of parallel regions):

- Sleeping threads

Additional disadvantages when overlapping communication and computation:

- High programming overhead
- OpenMP is only partially prepared for this programming style → taskloop directive

#### Questions addressed in this tutorial

- What is the performance impact of system topology?
- How do I map my programming model on the system to my advantage?
  - How do I do the split into MPI+X?
  - Where do my processes/threads run? How do I take control?
  - Where is my data?
  - How can I minimize communication overhead?
  - ccNUMA first-touch placement
- How does hybrid programming help with typical HPC problems?
  - Can it reduce communication overhead?
  - Can it reduce replicated data?
- How can I leverage multiple accelerators?
  - What are typical challenges?

### Conclusions

# Major advantages of hybrid MPI+OpenMP

In principle, none of the programming models fits clusters of SMP nodes perfectly

#### Major advantages of MPI+OpenMP:

- Only one level of sub-domain "surface-optimization":
  - SMP nodes, or
  - Sockets or NUMA domains
- Second level of parallelization
  - Application may scale to more cores
- Smaller number of MPI processes implies:
  - Reduced size of MPI internal buffer space
  - Reduced space for replicated user-data

Most important arguments on many-core systems

#### Major advantages of hybrid MPI+OpenMP, continued

#### Reduced communication overhead

- No intra-node communication
- Longer messages between nodes and fewer parallel links may imply better bandwidth

- "Cheap" load-balancing methods on OpenMP level
  - Application developer can split the load-balancing issues between coarse-grained MPI and fine-grained OpenMP

# Disadvantages of MPI+OpenMP

- Using OpenMP
  - → may prohibit compiler optimization
  - → may cause significant loss of computational performance
- Thread fork / join overhead
- On ccNUMA SMP nodes:
  - Loss of performance due to missing memory page locality or missing first touch strategy
  - E.g., with the MASTERONLY scheme:
    - One thread produces data
    - Master thread sends the data with MPI
    - → data may be internally communicated from one NUMA domain to the other one
- Amdahl's law for each level of parallelism
- Using MPI-parallel application libraries? → Are they prepared for hybrid?
- Using thread-local application libraries? → Are they thread-safe?

#### MPI+OpenMP versus MPI+MPI-3.0 shared memory

#### MPI+MPI-3.0 shared memory

- Pro: Thread-safety is not needed for libraries.
- Con: No work-sharing support as with OpenMP directives.
- Pro: Replicated data can be reduced to one copy per node:
   May be helpful to save memory, if pure MPI scales in time, but not in memory
- Substituting intra-node communication by shared memory loads or stores has only limited benefit (and only on some systems), especially if the communication time is dominated by inter-node communication
- Con: No reduction of MPI ranks
   → no reduction of MPI internal buffer space
- Con: Virtual addresses of a shared memory window may be different in each MPI process
  - → no binary pointers
  - → i.e., linked lists must be stored with offsets rather than pointers

#### Conclusions

- Future hardware will be more complicated
  - Heterogeneous → GPU, FPGA, ...
  - Node-level ccNUMA is here to stay, but will only be one of your problems
- High-end programming → more complex → many pitfalls
- Medium number of cores → simpler (#cores / SMP-node still grows)
- MPI + OpenMP → workhorse on large systems
  - Major pros: reduced memory needs and second level of parallelism
- MPI + MPI shared memory → only for special cases and medium #processes
- Pure MPI communication → still viable if it does the job
- OpenMP only → on large ccNUMA nodes (almost gone in HPC)



# **Programming models**

- MPI + Accelerator

# General considerations

OpenMP offloading

Advantages & main challenges

#### Accelerator programming: Bottlenecks reloaded

Example: 2-socket Intel "Ice Lake" (2x36 cores) node with two NVIDIA A100 GPGPUs (PCIe 4)

|                                 | per GPGPU     | per CPU     |
|---------------------------------|---------------|-------------|
| DP peak performance             | 9.7 Tflop/s   | 2.3 Tflop/s |
| Machine balance                 | …             | 0.10 B/F    |
| eff. memory (HBM) bandwidth     | …             | 170 Gbyte/s |
| inter-device bandwidth (PCIe)   | ≈ 30 Gbyte/s  |             |
| inter-device bandwidth (NVLink) | > 500 Gbyte/s |             |

→ Speedups can only be attained if communication overheads are under control

→ Basic estimates help
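A basic estimate of this kind can be as simple as dividing the data volume by the link bandwidth and comparing the result with the time spent computing. The sketch below is illustrative only: the function names are made up, latency is ignored, and the bandwidth passed in is whatever the system sustains in practice.

```c
/* Time in seconds to move `bytes` over a link with sustained bandwidth
   `gbyte_per_s` (1 Gbyte = 1e9 bytes); latency is ignored. */
double transfer_time_s(double bytes, double gbyte_per_s)
{
    return bytes / (gbyte_per_s * 1.0e9);
}

/* Ratio of transfer time to compute time; values near or above 1 mean
   that data movement dominates and offloading is unlikely to pay off. */
double transfer_to_compute_ratio(double bytes, double gbyte_per_s,
                                 double compute_s)
{
    return transfer_time_s(bytes, gbyte_per_s) / compute_s;
}
```

For example, one double-precision array of 9600 × 9600 points is 737,280,000 bytes. At the ≈ 30 Gbyte/s PCIe figure above, one copy takes about 0.025 s, which is only acceptable if the array stays on the device for many kernel invocations.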



#### Accelerator + MPI: How does the data get from A to B?



#### DEVANA's Multi-GPU nodes: nvidia-smi tool

[nvidia-smi output (NVIDIA-SMI 525.8…) on a DEVANA GPU node: GPUs 0 to 3 are listed; the legible row shows 29C, P0, 50W / 400W, 0MiB / 40960MiB, 0% GPU utilization, compute mode Default, MIG disabled; no running processes found]

### DEVANA's Multi-GPU nodes: topology and interconnect

```
trainer2@n141 ~ > nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    CPU Affinity    NUMA Affinity
GPU0     X      NV4     NV4     NV4     NODE    NODE    NODE    0-31
GPU1    NV4      X      NV4     NV4     NODE    NODE    NODE    0-31
GPU2    NV4     NV4      X      NV4     SYS     SYS     SYS     32-63
GPU3    NV4     NV4     NV4      X      SYS     SYS     SYS     32-63
NIC0    NODE    NODE    SYS     SYS      X      NODE    NODE
NIC1    NODE    NODE    SYS     SYS     NODE     X      PIX
NIC2    NODE    NODE    SYS     SYS     NODE    PIX      X
```

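#### Legend:

- X: self
- NV4: connection over a bonded set of 4 NVLinks
- PIX: connection traversing at most a single PCIe bridge
- NODE: connection traversing PCIe and the interconnect between PCIe host bridges within a NUMA node
- SYS: connection traversing PCIe and the interconnect between NUMA nodes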

#### Questions to ask

- Is the MPI implementation CUDA aware?
  - Yes: Can use device pointers in MPI calls
  - No: Explicit DtoH/HtoD buffer transfers required
  - Copying to consecutive halo buffers may still be necessary
- Is NVLink available?
  - Yes: Direct GPU-GPU communication with MPI
    - Supported by: P100, V100, A100, H100
  - No: copies via host (even with NVIDIA GPUDirect)
- Unified Memory or explicit DtoH/HtoD transfers?
  - UM: Transparent sharing of host and device memory
- Actual bandwidths and latencies?
  - Highly system and implementation dependent!



# Options for hybrid accelerator programming

| multicore host                              |
|---------------------------------------------|
| MPI                                         |
| MPI+MPI3 shmem ext.                         |
| MPI+threading (OpenMP, pthreads, TBB, ...)  |
| threading only                              |
| PGAS (CAF, UPC, ...)                        |

| accelerator     |
|-----------------|
| CUDA, HIP       |
| OpenCL          |
| OpenACC         |
| OpenMP 4.0++    |
| special purpose |

Which model/combination is the best?

→ the one that allows you to address the relevant hardware bottleneck(s)

# **Programming models**

- MPI + Accelerator

**General considerations** 

OpenMP offloading

Advantages & main challenges

# What is OpenMP offloading?

- "Everybody knows OpenMP"
- API that supports offloading of regions of code (e.g., loops) from a host CPU to an attached accelerator in C, C++, and Fortran
- Set of compiler directives, run-time routines, and environment variables
- Simple programming model for using accelerators (focus on GPGPUs)
- Memory model:
  - Host CPU and device may have completely separate memory
  - Data movement between host and device is performed by the host via runtime calls
  - Memory on the device may not support coherence between execution units, or coherence may have to be enforced by explicit barriers
- Execution model:
  - Compute-intensive code regions are offloaded to the device and executed as kernels
  - Host orchestrates data movement, initiates computation, and waits for completion
  - Support for multiple levels of parallelism, including SIMD

#### A very simple OpenMP example (nvc 23.1-0): Vector Triad

```
int main() {
   double* restrict a = malloc(nsize * sizeof(double));
   double* restrict b = malloc(nsize * sizeof(double));
   double* restrict c = malloc(nsize * sizeof(double));
   double* restrict d = malloc(nsize * sizeof(double));
#pragma omp target enter data map(to:a[0:nsize], b[0:nsize], c[0:nsize])
   compute(a, b, c, d, N);
}

void compute(double *restrict a, double *b, ...) {
#pragma omp target teams distribute \
                   parallel for simd
   for (int i=0; i<N; ++i) {
      a[i] = b[i] + c[i] * d[i];
   }
}

/* Compiler feedback:
nvc -q -O3 -mp=gpu -gpu=managed -Minfo -c triad.F90
    17, #omp target teams distribute parallel for simd
        17, Generating "nvkernel_main_F1L17_2" GPU kernel
        19, Loop parallelized across teams and threads(128), schedule(static)
    17, Generating target enter data map(to: c[:nsize],b[:nsize],a[:nsize])
    25, #omp target teams distribute parallel for simd
        25, Generating "nvkernel_main_F1L25_4" GPU kernel
        28, Loop parallelized across teams and threads(128), schedule(static)
    38, Generating target exit data map(from: c[:nsize],b[:nsize],a[:nsize])
*/
```
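For experimenting outside the slide context, the triad can be written as a self-contained function. This is a sketch under assumptions: the explicit `map` clauses on the kernel are a choice made here, not necessarily the approach of the original code, which uses `target enter data` together with managed memory.

```c
/* Vector triad a = b + c*d, offloaded to the default device.
   The map clauses copy b, c, d to the device before the kernel
   and copy a back to the host afterwards. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, const double *restrict d, int n)
{
    #pragma omp target teams distribute parallel for simd \
            map(to: b[0:n], c[0:n], d[0:n]) map(from: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}
```

If no device is available, or the code is compiled without offloading support, the loop runs on the host and produces the same result. Mapping on every call moves all four arrays each time, so for repeated kernels a persistent data region is usually preferable.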

# Example: 2D Laplace equation

#### We want to solve this:

$$\begin{split} &\partial_{xx}u(x,y)+\partial_{yy}u(x,y)=0,\\ &(x,y)\in[0,1]\times[0,1]\setminus\partial\Omega \end{split}$$

#### subject to the boundary conditions:

$$u(x,0) = u(x,1) = x$$
$$u(0,y) = 0$$

$$u(1, y) = 1$$

#### numerically, using finite differences:

$$\left(\partial_{xx}u(x,y)\right)_{ij}\approx\frac{u_{i+1,j}-2u_{ij}+u_{i-1,j}}{\Delta x^2}.$$
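Only the $x$ direction is written out above. As a short worked step, assuming a uniform grid with $\Delta x = \Delta y$ (an assumption, not stated explicitly in the source), the same approximation in $y$ reads

$$\left(\partial_{yy}u(x,y)\right)_{ij}\approx\frac{u_{i,j+1}-2u_{ij}+u_{i,j-1}}{\Delta y^2}$$

and inserting both into the Laplace equation gives the Jacobi update:

$$\left(\partial_{xx}u+\partial_{yy}u\right)_{ij}=0\;\Rightarrow\;u_{ij}\approx\frac{1}{4}\left(u_{i+1,j}+u_{i-1,j}+u_{i,j+1}+u_{i,j-1}\right)$$

Each interior point is replaced by the average of its four neighbours.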

#### Converged solution:



# Example: Fortran 2D Jacobi solver offloading

#### Basic step:

```
allocate(a(0:ni+1,0:nj+1), b(0:ni+1,0:nj+1))
!$omp target enter data map(to:a(0:ni+1,0:nj+1), b(0:ni+1,0:nj+1))
!$omp target teams distribute parallel do
do j = 1, nj
  do i = 1, ni
    b(i,j) = (a(i,j-1) + a(i,j+1) + a(i-1,j) + a(i+1,j)) / 4d0
  end do
end do
call swap(b,a)
```

#### And check for the convergence:

```
error = 0d0
!$omp target teams distribute parallel do simd reduction(max:error)
do j = 1, nj
  do i = 1, ni
    error = max(error, abs(a(i,j)-b(i,j)))
  end do
end do
```

# Example: multi-GPU offloading with MPI; one node

Typical MPI 1D domain decomposition: distribute **a** and **b** over MPI ranks

```
allocate(a(0:ni+1,s-1:e+1), b(0:ni+1,s-1:e+1))
!$omp target enter data map(to:a(0:ni+1,s-1:e+1), b(0:ni+1,s-1:e+1))
!$omp target teams distribute parallel do
do j = s, e
   do i = 1, ni
        b(i,j) = (a(i,j-1) + a(i,j+1) + a(i-1,j) + a(i+1,j)) / 4d0
   end do
end do
call swap(b,a)
```

### Example: multi-GPU offloading with MPI; one node

Typical MPI 1D domain decomposition: distribute **a** and **b** over MPI ranks and send the rank's portion of the data to the corresponding GPU

```
gpuid = mpirank
allocate(a(0:ni+1,s-1:e+1), b(0:ni+1,s-1:e+1))
!$omp target enter data map(to:a(0:ni+1,s-1:e+1), b(0:ni+1,s-1:e+1)) device(gpuid)
!$omp target teams distribute parallel do device(gpuid)
do j = s, e
   do i = 1, ni
        b(i,j) = (a(i,j-1) + a(i,j+1) + a(i-1,j) + a(i+1,j)) / 4d0
   end do
end do
```

### Example: multi-GPU offloading with MPI; one node

#### Exchange halos (MPI\_SENDRECV or whatever you like):

```
call MPI_CART_CREATE(MPI_COMM_WORLD, 1, [mpisize], [.false.], .true., &
                     comm1d, mpierr)
call MPI_COMM_RANK(comm1d, mpirank, mpierr)
call MPI_CART_SHIFT(comm1d, 0, 1, left, right, mpierr)
call MPI_SENDRECV( &
               a(1,e), nx, MPI_DOUBLE_PRECISION, right, 0, &
               a(1,s-1), nx, MPI_DOUBLE_PRECISION, left, 0, &
               comm1d, MPI_STATUS_IGNORE, ierr)
call MPI_SENDRECV( &
               a(1,s), nx, MPI_DOUBLE_PRECISION, left, 1, &
               a(1,e+1), nx, MPI_DOUBLE_PRECISION, right, 1, &
               comm1d, MPI_STATUS_IGNORE, ierr)
```


# Example: multi-GPU offloading with MPI; multi-node

Each compute node sees only its own GPUs (4 on DEVANA). We split the communicator further to get the node-local ranks:
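The code for this step is not reproduced here, so the following is only a sketch of one common way to do it, written in C rather than Fortran. It assumes MPI-3 (`MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`); the helper names and the `USE_MPI` guard are hypothetical and exist only so that the mapping function can be compiled without an MPI installation.

```c
#ifdef USE_MPI
#include <mpi.h>

/* Rank of the calling process among the processes on the same node.
   Must be called after MPI_Init. */
int node_local_rank(void)
{
    MPI_Comm node_comm;
    int local_rank;
    /* Group all ranks that can share memory, i.e., that run on one node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);
    return local_rank;
}
#endif

/* Device id for a node-local rank; ranks wrap around if a node runs
   more ranks than it has devices. */
int device_for_local_rank(int local_rank, int num_devices)
{
    return local_rank % num_devices;
}
```

With four ranks per node and four GPUs per node, each rank gets its own device. The result would take the place of `gpuid = mpirank` in the single-node example, with the number of devices obtained, for instance, from `omp_get_num_devices()`.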

#### Job submission on multi-GPU clusters

```
trainer2@login02 ~ > cat onenode.sh
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=ngpu
#SBATCH --job-name=mpiompgpu_onenode
#SBATCH --err=mpiompgpu_onenode.err
#SBATCH --out=mpiompgpu_onenode.out
#SBATCH --gres=gpu:4
module load nvhpc/23.1 GCC/11.3.0
mpirun -np 4 ./jacobi_mpi_gpu
```

```
trainer2@login02 ~ > cat twonodes.sh
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=ngpu
#SBATCH --job-name=mpiompgpu_twonodes
#SBATCH --err=mpiompgpu_twonodes.err
#SBATCH --out=mpiompgpu_twonodes.out
#SBATCH --gres=gpu:4
module load nvhpc/23.1 GCC/11.3.0
mpirun -np 8 ./jacobi_mpi_gpu
```

### Example: multi-GPU multi-node benchmarking

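A word of caution: the first iterations often include one-time costs such as device initialization and warm-up, so we sometimes have to run the benchmark for longer and discard the timings of the first half of the iterations.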
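A minimal sketch of such a measurement follows, assuming a plain CPU Jacobi sweep as a stand-in for the offloaded kernel and `system_clock` as the timer. The sizes `n` and `niter` are hypothetical. The clock is started only after the first half of the iterations has completed.

```fortran
program warmup_timing
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer, parameter :: niter = 200   ! hypothetical iteration count
   integer, parameter :: n = 512       ! hypothetical grid size
   integer(int64) :: t0, t1, rate
   double precision, allocatable :: a(:,:), b(:,:)
   double precision :: t_measured
   integer :: it, i, j, ntimed

   allocate(a(0:n+1,0:n+1), b(0:n+1,0:n+1))
   a = 0d0
   a(0,:) = 1d0      ! simple fixed boundary value on one edge
   b = a

   ntimed = niter - niter/2
   call system_clock(count_rate=rate)

   do it = 1, niter
      ! Discard the first half: start the clock at iteration niter/2 + 1
      if (it == niter/2 + 1) call system_clock(t0)
      do j = 1, n
         do i = 1, n
            b(i,j) = (a(i,j-1) + a(i,j+1) + a(i-1,j) + a(i+1,j)) / 4d0
         end do
      end do
      a(1:n,1:n) = b(1:n,1:n)
   end do
   call system_clock(t1)

   t_measured = dble(t1 - t0) / dble(rate)
   print '(a,i0,a,es10.3,a)', 'time per iteration over last ', ntimed, &
         ' iterations: ', t_measured / dble(ntimed), ' s'
end program warmup_timing
```

In an MPI run one would typically also synchronize the ranks (for example with `MPI_BARRIER`) before reading the clock, so that all ranks time the same interval.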

Benchmarking 2D Laplace with 9600<sup>2</sup> grid points on DEVANA (4 A100 GPUs per node):

| N GPUs | Execution time, s |
|--------|-------------------|
| 1      | 12.81             |
| 2      | 6.78              |
| 4      | 4.01              |
| 8      | 2.71              |
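
To read the table in terms of scaling, speedup can be defined as the ratio T(1)/T(N) and parallel efficiency as speedup divided by N. A short sketch that evaluates both for the timings above follows; the program and variable names are illustrative only.

```fortran
program scaling
   implicit none
   integer, parameter :: ngpus(4) = [1, 2, 4, 8]
   ! Execution times in seconds, copied from the table above
   double precision, parameter :: t(4) = [12.81d0, 6.78d0, 4.01d0, 2.71d0]
   double precision :: speedup, efficiency
   integer :: k

   print '(a)', ' GPUs  speedup  efficiency'
   do k = 1, size(ngpus)
      speedup = t(1) / t(k)                    ! relative to the 1-GPU run
      efficiency = speedup / dble(ngpus(k))    ! 1.0 would be ideal scaling
      print '(i5,f9.2,f12.2)', ngpus(k), speedup, efficiency
   end do
end program scaling
```

For these timings the speedup is about 1.89, 3.19, and 4.73 on 2, 4, and 8 GPUs, which corresponds to a parallel efficiency of roughly 94%, 80%, and 59%. With 4 GPUs per node, the 8-GPU run is the only one that spans two nodes.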

# Programming models

- MPI + Accelerator
  - **General considerations**
  - OpenMP offloading
  - Advantages & main challenges

### MPI+Accelerators: Main advantages

- Hybrid MPI/OpenMP can leverage accelerators and yield a performance increase over pure MPI on multicore CPUs
- The compiler/pragma-based API provides a relatively easy way to use coprocessors
- OpenMP 4.0/4.5/5.1 extensions provide the flexibility to use a wide range of heterogeneous coprocessors (GPU, APU, heterogeneous many-core types)

### MPI+Accelerators: Main challenges

- Considerable implementation effort for basic usage, depending on complexity of the application
- Efficient usage of pragmas requires good understanding of performance issues
  - Performance is not only about code; data structures can be decisive as well
- Support for accelerator pragmas still restricted to certain environments
  - NVIDIA GPUs have best support

#### Questions addressed in this tutorial

- What is the performance impact of system topology?
- How do I map my programming model on the system to my advantage?
  - How do I do the split into MPI+X?
  - Where do my processes/threads run? How do I take control?
  - Where is my data?
  - How can I minimize communication overhead?
- How does hybrid programming help with typical HPC problems?
  - Can it reduce communication overhead?
  - Can it reduce replicated data?
- How can I leverage multiple accelerators?
  - What are typical challenges?
    - Data structures are decisive; inter-device communication support varies



# Thank you for your interest!

TREX Workshop: Code Tuning for the Exascale @ Bratislava, June 5, 2023