## HPC codesign in GROMACS

### Szilárd Páll pszilard@kth.se

Workshop on Software Co-Design Actions in European Flagship HPC Codes, ISC 2022, June 2, 2022









### **BioExcel Center of Excellence**



Goals:

- Develop key applications (incl. GROMACS) for exascale
- Develop workflow solutions
- Training/support to academia and industry
- Establish a long-term organizational structure

Shared under CC BY-SA 4.0. doi.org/10.5281/zenodo.6620848

# FAST. FLEXIBLE. FREE.

### Classical MD code

- supports all major force-fields
- broad algorithm support

### Development:

Stockholm, Sweden & partners worldwide

### Large user base:

- One of the top HPC codes worldwide
- deployed on most clusters
- tens of thousands of academic & industry users
- Open source: LGPLv2.1
- Open development:
  - code review & bug-tracker: https://gitlab.com/gromacs




### Focus on high performance:

efficient algorithms & highly-tuned parallel code

### Bottom-up performance oriented design:

- absolute performance over "just scaling"
- Focus on portability
  - Linux distro integration and CI
  - regular testing on all HPC architectures
  - SIMD portability library, GPU abstraction layer
  - open standards-based languages/APIs
- Modern development workflow
  - mandatory open code review for >10 years
  - tiered CI testing / verification





Figures: arbitrary unit cells; eighth-shell domain decomposition; virtual interaction sites; triclinic unit cell with load balancing and staggered cell boundaries.

## MD: computational challenge

Figure labels: pair-search step every 50-200 iterations; ~ millisecond or less.

- Simulation vs real-world time-scale gap
  - Every simulation: 10<sup>8</sup>–10<sup>15</sup> steps
  - Every step: 10<sup>6</sup>–10<sup>9</sup> FLOPs
- Main goal of parallelization:
  - study molecular systems: tackle the time- or length-scale challenge
  - typically requires: **strong scaling**, increasingly **ensemble**
- MD codes at peak: ~**100 μs / step** (on commodity hardware)
  - <100 atoms / core at peak
  - <10000 atoms / GPU
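
To make the time-scale gap concrete, the step counts and the ~100 μs / step peak rate above can be turned into wall-clock estimates. This is a minimal back-of-the-envelope sketch; the 2 fs time step and the function names are assumptions for illustration and do not come from the slides.

```python
# Back-of-the-envelope estimate of the simulation time-scale gap.
# Assumption (not from the slides): a 2 fs integration time step.

SECONDS_PER_DAY = 86_400.0


def wall_time_days(n_steps: float, seconds_per_step: float) -> float:
    """Wall-clock days needed to run n_steps at a fixed cost per step."""
    return n_steps * seconds_per_step / SECONDS_PER_DAY


def ns_per_day(seconds_per_step: float, timestep_fs: float = 2.0) -> float:
    """Simulated nanoseconds per wall-clock day."""
    steps_per_day = SECONDS_PER_DAY / seconds_per_step
    return steps_per_day * timestep_fs * 1e-6  # 1 fs = 1e-6 ns


if __name__ == "__main__":
    peak = 100e-6  # ~100 microseconds per step, the peak rate quoted above
    for n_steps in (1e8, 1e12, 1e15):
        days = wall_time_days(n_steps, peak)
        print(f"{n_steps:.0e} steps -> {days:.3g} days")
    print(f"{ns_per_day(peak):.0f} ns/day at 2 fs per step")
```

Under these assumptions, even at the peak step rate the upper end of the step range corresponds to far more wall-clock time than a single run can afford, which is why strong scaling and ensembles both matter.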

## Multiple levels of hardware parallelism



- **Compute cluster or cloud**: networked computers: topology, bandwidth, latency
- **Compute node / workstation**: NUMA topology, PCIe
- **Multicore CPU & manycore GPU**: caches, interconnects
- **Vector units**: up to 512-bit vector units/core, up to 16 single-precision ops/clock

## Multiple levels of hardware parallelism, multiple levels of parallelization

- Mapping the problem to the hardware: **expose parallelism** (algorithms) & **express parallelism** (implementation)
- Need to choose the right granularity & abstraction (problem- & hardware-specific)


## HPC nodes today/soon



### JUWELS-Booster: 2 CPUs + 4 GPUs with NVLink + 4 NICs

## Multiple levels of **hardware parallelism**, multiple levels of **parallelization**

### Exascale challenge:

- Increasing parallelism
  - → need to express more concurrency
- Increasing complexity (interconnects, memories, NUMA)
  - → tackle using runtimes or in application?
- Increasing diversity
  - → zoo of programming models
  - → algorithms, portability/testing, performance portability
- Heterogeneity is here to stay
  - ignore or embrace?
  - wait for integration or tune for many generations?



## **GROMACS** parallelization

- Multi-level hierarchical parallelization: target each level of hardware parallelism individually
  - Intra-node:
    - OpenMP multi-threading
      - static loop schedule, cache-optimized work decomposition layout, sparse reductions
    - SIMD C++ library abstraction:
      - 14 flavors supported
    - GPU abstraction layer
      - CUDA, OpenCL, SYCL
    - thread-MPI: pthreads-based MPI for ease of use
  - Inter-node:
    - MPI: SPMD / MPMD
    - Dynamic load balancing, task balancing







## Why codesign?

- interdisciplinarity challenge
  - → many hard problems need cross-disciplinary solutions
- MD: need for performance
  - GROMACS: design focus
- portability
  - GROMACS: design focus
- hardware evolution...

## **GROMACS & codesign**

- Petascale → Exascale
  - required algorithm & parallelization redesign
  - Codesign has been & remains a core component
- Physics / math + algorithms + HW
  - mainly intra-team/community
  - innovate (reformulate algorithms, accuracy-based algorithms)
  - enable (domain experts method dev, CS experts micro-bench / port)
- Algorithms + HW + vendors / CS-experts
  - mainly inter-team collaboration
  - align goals so that the collaboration benefits both sides!
  - Long-term: many steps forward and several major successes

## Algorithm redesign for modern architectures


### **Cluster pair-interaction** algorithm for SIMD/SIMT



4x4 setup on SIMD-16
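
The idea behind a 4x4 cluster setup can be illustrated in plain Python: particles are grouped into fixed-size clusters, the pair list stores cluster pairs rather than particle pairs, and each cluster pair is evaluated as a full 4x4 tile with out-of-range pairs masked out. This is a hedged sketch, not the GROMACS implementation: the lexicographic sort, the bounding-box test, the absence of periodic boundaries, and all function names are assumptions chosen for brevity.

```python
import itertools
import random

CLUSTER = 4  # particles per cluster: 4x4 = 16 pair evaluations per cluster pair


def make_clusters(coords):
    """Group particle indices into fixed-size clusters (crude lexicographic sort)."""
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    return [order[k:k + CLUSTER] for k in range(0, len(order), CLUSTER)]


def bounding_box(coords, cluster):
    lo = tuple(min(coords[i][d] for i in cluster) for d in range(3))
    hi = tuple(max(coords[i][d] for i in cluster) for d in range(3))
    return lo, hi


def box_distance2(a, b):
    """Squared distance between two axis-aligned boxes (0 if they overlap)."""
    d2 = 0.0
    for d in range(3):
        gap = max(a[0][d] - b[1][d], b[0][d] - a[1][d], 0.0)
        d2 += gap * gap
    return d2


def cluster_pair_list(coords, clusters, cutoff):
    """List cluster pairs whose bounding boxes are within the cutoff."""
    boxes = [bounding_box(coords, c) for c in clusters]
    rc2 = cutoff * cutoff
    return [(ci, cj)
            for ci in range(len(clusters))
            for cj in range(ci, len(clusters))
            if box_distance2(boxes[ci], boxes[cj]) <= rc2]


def count_interactions(coords, clusters, pairs, cutoff):
    """Evaluate every i x j tile in full; mask out-of-range and duplicate pairs."""
    rc2 = cutoff * cutoff
    in_range = evaluated = 0
    for ci, cj in pairs:
        for i in clusters[ci]:
            for j in clusters[cj]:
                evaluated += 1
                if ci == cj and i >= j:
                    continue  # mask self and double-counted pairs in a cluster
                r2 = sum((coords[i][d] - coords[j][d]) ** 2 for d in range(3))
                if r2 < rc2:
                    in_range += 1
    return in_range, evaluated


def brute_force(coords, cutoff):
    """Reference: count particle pairs within the cutoff directly."""
    rc2 = cutoff * cutoff
    return sum(1 for i, j in itertools.combinations(range(len(coords)), 2)
               if sum((coords[i][d] - coords[j][d]) ** 2 for d in range(3)) < rc2)


if __name__ == "__main__":
    rng = random.Random(0)
    coords = [tuple(rng.uniform(0.0, 3.0) for _ in range(3)) for _ in range(64)]
    clusters = make_clusters(coords)
    pairs = cluster_pair_list(coords, clusters, cutoff=1.0)
    in_range, evaluated = count_interactions(coords, clusters, pairs, cutoff=1.0)
    print(in_range, brute_force(coords, 1.0), evaluated)
```

The tile loop does more distance checks than there are pairs inside the cutoff. That extra work is the price of a regular, branch-light inner loop that maps onto wide SIMD or SIMT units.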



### **Accuracy-based automated list buffer** improves SIMD algorithm parallel efficiency



### **Dual pair list** with dynamic pruning



### **Multi-level heterogeneous data and task load-balancing**: intra-GPU, intra-node, inter-node





## Embracing heterogeneity

- Heterogeneous design at the core:
  - "somewhat" complex schedule
    - → "But there is also always some reason in madness."
  - Heterogeneity for performance & **flexibility**: think of the (sometimes) silent codesign partners, method devs





## Dual pair list

- Trading costly data regularization for force computation is not ideal!
- Instead: keep regularized particle data longer, shift the cost trade-off
- Use two buffers and lists: outer / inner
- Periodically re-prune: outer → inner
- List lifetime / search frequency:
  - outer list less frequently (costly)
  - inner list more frequently (cheap)
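
The outer/inner scheme above can be sketched with a toy random-walk system. An expensive search builds the outer list with a large buffer, and a cheap pruning pass re-derives the inner list from it more often. This is only an illustration under stated assumptions: the buffers here come from a hard bound on per-step displacement (a simplification of an accuracy-based buffer), and the names `search`, `prune`, `nst_outer` and `nst_prune` are hypothetical.

```python
import itertools
import math
import random


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(coords, radius):
    """Costly step: full pair search over all particles (O(N^2) here)."""
    r2 = radius * radius
    return [(i, j) for i, j in itertools.combinations(range(len(coords)), 2)
            if dist2(coords[i], coords[j]) <= r2]


def prune(coords, outer_list, radius):
    """Cheap step: re-check only the pairs already in the outer list."""
    r2 = radius * radius
    return [(i, j) for i, j in outer_list if dist2(coords[i], coords[j]) <= r2]


def run(n_steps=40, n_particles=40, cutoff=1.0, max_move=0.01,
        nst_outer=20, nst_prune=5, seed=1):
    rng = random.Random(seed)
    coords = [[rng.uniform(0.0, 4.0) for _ in range(3)]
              for _ in range(n_particles)]
    # Each coordinate moves by at most max_move per step, so two particles
    # approach each other by at most 2*sqrt(3)*max_move per step (hard bound).
    approach = 2.0 * math.sqrt(3.0) * max_move
    r_outer = cutoff + approach * (nst_outer - 1)
    r_inner = cutoff + approach * (nst_prune - 1)
    missed = 0
    sizes = []
    for step in range(n_steps):
        if step % nst_outer == 0:
            outer = search(coords, r_outer)         # infrequent, costly
        if step % nst_prune == 0:
            inner = prune(coords, outer, r_inner)   # frequent, cheap
        # The per-step "force" loop would traverse only the inner list.
        in_cutoff = set(search(coords, cutoff))     # reference, checking only
        missed += len(in_cutoff - set(inner))
        sizes.append((len(outer), len(inner), len(in_cutoff)))
        for p in coords:
            for d in range(3):
                p[d] += rng.uniform(-max_move, max_move)
    return missed, sizes


if __name__ == "__main__":
    missed, sizes = run()
    print("missed pairs:", missed)
    print("(outer, inner, in-cutoff) at step 0:", sizes[0])
```

The inner list stays close to the number of pairs actually inside the cutoff, while the outer list absorbs the large buffer that makes infrequent searching safe.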



# Accuracy-based balancing: **dual pair list** reducing decomposition & search cost

Figure label: pair-search step every 20-100 iterations.







## Vendor-collaboration codesign: long-term practice

Figure: change in CUDA runtime API overhead.



## **Direct GPU communication**

- Alan Gray & Gaurav Garg (NVIDIA)
- Goal:
  - avoid CPU staging, accelerate critical path
  - target intra-node interconnects, e.g. NVLink
- Two flavors:
  - thread-MPI: single-node (since 2021)
    - P2P copies (put/get); exchanging CUDA events allows remote sync
    - Single process + multiple GPUs: bottlenecks required CUDA driver threading optimizations
  - CUDA-aware MPI: multi-node (since 2022)
    - requires host sync before issuing MPI call







### Multi-GPU/rank force offload scheme



## Multi-GPU/rank GPU-resident scheme






# Multi-node GPU resident & direct GPU communication



## Multi-GPU resident step: single-node P2P direct GPU comm

- The entire inner loop including communication can be enqueued ahead of time
  - if there is no CPU task (Other F)
  - enables more efficient scheduling
  - overlap launch cost with work
  - CUDA graphs
- Challenges:
  - integrating CPU tasks
  - load balancing
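
Why enqueueing ahead of time helps can be seen in a toy timeline model of one in-order GPU stream. The CPU pays a fixed cost per launch, and the GPU runs each kernel as soon as it has been launched and the previous one has finished. This is a deliberately simplified sketch: constant launch and kernel times, a single stream, and the numbers in the example are all assumptions, and it does not model CUDA graphs themselves.

```python
def gpu_timeline(n_kernels, launch_cost, kernel_time, sync_every=None):
    """Completion time of the last kernel in a single in-order stream.

    launch_cost: CPU time to enqueue one kernel.
    kernel_time: GPU run time of one kernel.
    sync_every:  if set, the CPU waits for the GPU after every `sync_every`
                 launches (e.g. once per step) before enqueueing more work.
    """
    cpu = 0.0      # time at which the CPU is free to launch again
    gpu_end = 0.0  # time at which the GPU finishes the last enqueued kernel
    for k in range(n_kernels):
        cpu += launch_cost
        # A kernel starts once it has been launched and the GPU is idle.
        gpu_end = max(gpu_end, cpu) + kernel_time
        if sync_every and (k + 1) % sync_every == 0:
            cpu = max(cpu, gpu_end)  # CPU blocks until the GPU drains
    return gpu_end


if __name__ == "__main__":
    # Hypothetical numbers in microseconds: 5 us per launch, 10 us per kernel.
    print("enqueue ahead:      ", gpu_timeline(100, 5.0, 10.0))
    print("sync every 10:      ", gpu_timeline(100, 5.0, 10.0, sync_every=10))
    print("sync after each one:", gpu_timeline(100, 5.0, 10.0, sync_every=1))
```

In this model, every synchronization point exposes one launch latency on the critical path. When kernels are shorter than the launch cost the stream becomes launch-bound even without synchronization, which is the regime where lowering per-launch overhead matters most.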



# Multi-node GPU resident step & GPU-aware MPI comm



## Direct GPU communication performance

- Major benefit on fast interconnects with GPU-resident steps
- Modest improvements on low-end interconnects




## DD halo exchange peak strong scaling

- JUWELS-Booster:
  - 2x24-core AMD EPYC Rome
  - 4xA100
- ~50% parallel efficiency up to 12 nodes
  - only ~20000 atoms/GPU
- Peak at 48-64 nodes: >500 ns/day
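
The atoms-per-GPU figure follows directly from the node counts. The sketch below assumes the system is the 1M atom STMV benchmark named in the plot label (rounded to 1,000,000 atoms) and uses the 4 GPUs per node listed above; the helper names are hypothetical, and the efficiency example uses made-up numbers purely to show the formula.

```python
def atoms_per_gpu(n_atoms: int, n_nodes: int, gpus_per_node: int = 4) -> float:
    """Average number of atoms handled by each GPU."""
    return n_atoms / (n_nodes * gpus_per_node)


def parallel_efficiency(speedup: float, scale_factor: float) -> float:
    """Strong-scaling efficiency: achieved speedup over the resource increase."""
    return speedup / scale_factor


if __name__ == "__main__":
    n_atoms = 1_000_000  # "1M atom STMV", rounded as on the slide
    for nodes in (12, 48, 64):
        print(f"{nodes:3d} nodes -> {atoms_per_gpu(n_atoms, nodes):8.0f} atoms/GPU")
    # Hypothetical illustration only: a 6x speedup on 12x the nodes is 50%.
    print(parallel_efficiency(speedup=6.0, scale_factor=12.0))
```

At 12 nodes this gives about 20,800 atoms per GPU, consistent with the ~20000 quoted above. At the 48-64 node peak each GPU holds only a few thousand atoms.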





Figure: strong-scaling plot (1M atom STMV).

## PME decomposition

### GROMACS team + Gaurav Garg (NVIDIA)

- remove the limitation of a single dedicated PME GPU
- 3D FFT strong-scaling challenge: typical sizes of 32<sup>3</sup>-256<sup>3</sup> hardly scale
- Released in 2022:
  - Hybrid mode: FFT on CPU
- In development (upstreamed):
  - major algorithmic and parallelization optimizations





## Direct GPU communication with PME decomposition

- Major benefit on fast interconnects with GPU-resident steps
- Modest improvements on low-end interconnects




### PME scaling improvements with cuFFTmp (in development)



- Strong-scales reasonably well to 16-24 nodes:
  - only 10-15k (!) atoms per GPU
  - further improvements planned
- Peak can still be lower than CPU-only machines
  - algorithm improvements needed
  - next-gen hardware expected to help



### **Codesign project with NVIDIA**

## Asynchronous scheduling: CUDA graphs




## Multi-GPU graph scheduling



## Asynchronous scheduling: CUDA graphs

- Multi-rank: leverage thread-MPI using pthreads for UVA direct async copies




Figure: CUDA graph dump generated with `cudaGraphDebugDotPrint()`. Legible node labels include a MEMCPY node, `nbnxn_gpu_x_to_nbat_x_kernel`, a non-bonded `nbnxn_kernel_…` node, kernels templated on `NumTempScaleValues` and `VelocityScalingType`, and LINCS and SETTLE kernels.


## CUDA graph scheduling performance






## SYCL for portability & performance

- SeRC & Intel: oneAPI CoE: Andrey Alekseenko (KTH)
- First GPU backend with DPC++
  - early prototype released in GROMACS 2021
- hipSYCL support added first as a portability check
- starting with the 2022 release, GROMACS adopted **hipSYCL for production AMD support**
- SYCL to replace OpenCL as the portability GPU backend
  - already broader feature set coverage
  - broad vendor support: AMD, NVIDIA, Intel


### **Codesign project with Intel**


### SYCL in GROMACS relative to native

Figure panels (whole-application and relative GPU kernel performance; plots not reproduced):

- hipSYCL vs native HIP on ROCm 4.5.2
- oneAPI/DPC++ on OpenCL/L0 vs native OpenCL (oneAPI 2022.0 except L0 2021.4)
- **Intel DG1**: run time vs system size for whole application and sum of kernels, with oneAPI/OpenCL (PME) and oneAPI/Level0
- Kernel time relative to OpenCL (less is faster), PME electrostatics, 384k system size: Non-bonded F, Non-bonded FV, NB list pruning, PME Spread, PME Solve, PME Gather, for oneAPI/OpenCL and oneAPI/Level0

## Take-aways

- Codesign is key for algorithm reformulation / redesign
  - it has enabled keeping up with hardware evolution
  - need to be forward-looking
  - disruptive vs constructive
- Long-term investment
- Interdisciplinarity helps but also brings challenges
- Collaboration needs long-term alignment
  - much easier intra-team/community
  - harder and often challenging cross-team
- Plan for progress but allow for incidental collaboration
  - an accommodating SW design will allow new domain-science contributions


### Acknowledgments

GROMACS: Andrey Alekseenko, Artem Zhmurov, **Berk Hess**, Erik Lindahl, Magnus Lundborg, **Paul Bauer**

Mark Abraham (Intel), Roland Schulz (Intel)

Alan Gray (NVIDIA), Gaurav Garg (NVIDIA)









