



Friedrich-Alexander-Universität Erlangen-Nürnberg

## "Simple" performance modeling: The Roofline Model

### Loop-based performance modeling: Execution vs. data transfer

R.W. Hockney and I.J. Curington: *$f_{1/2}$: A parameter to characterize memory and communication bottlenecks*. Parallel Computing 10, 277-286 (1989). DOI: 10.1016/0167-8191(89)90100-2

W. Schönauer: *Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers*. Self-edition (2000)

S. Williams: *Auto-tuning Performance on Multicore Computers*. PhD thesis, UCB Technical Report No. UCB/EECS-2008-164 (2008)

### Analytic white-box performance models

An analytic white-box performance model is a simplified mathematical description of the hardware and its interaction with software. It is able to predict the runtime/performance of code from "first principles."

### A simple performance model for loops



#### **Roofline Model**

## Naïve Roofline Model

How fast can tasks be processed? **P** [flop/s]

The bottleneck is either

- The execution of work: $P_{\text{peak}}$ [flop/s]
- The data path: $I \cdot b_S$ [flop/byte × byte/s]
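The two candidate bottlenecks combine into a single bound by taking the smaller one. The following is a minimal sketch of that rule; the function name `naive_roofline` and its parameter names are illustrative choices, not taken from the slides.

```c
#include <assert.h>

/* Naive Roofline bound (illustrative helper, name is an assumption).
 *   p_peak    : peak performance P_peak in flop/s
 *   intensity : computational intensity I in flop/byte
 *   bandwidth : bandwidth b_S of the data path in byte/s
 * Returns the upper performance limit in flop/s. */
double naive_roofline(double p_peak, double intensity, double bandwidth)
{
    assert(p_peak > 0.0 && intensity > 0.0 && bandwidth > 0.0);
    /* [flop/byte] * [byte/s] = [flop/s] */
    double data_limit = intensity * bandwidth;
    /* The slower of the two limits determines the bound. */
    return (data_limit < p_peak) ? data_limit : p_peak;
}
```

With hypothetical inputs, `naive_roofline(100e9, 0.5, 40e9)` is limited by the data path, whereas `naive_roofline(100e9, 4.0, 40e9)` is limited by execution.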



## The Roofline Model in computing – Basics

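#### Apply the naïve Roofline model in practice

- Machine parameter #1: peak performance $P_{\text{peak}}$ [flop/s]
- Machine parameter #2: bandwidth $b_S$ of the data path [byte/s]
- Code characteristic: computational intensity $I$ [flop/byte]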



# Prerequisites for the Roofline Model

- Data transfer and core execution overlap perfectly!
  - Either the limit is core execution or it is data transfer
- Slowest limiting factor "wins"; all others are assumed to have no impact
  - If two bottlenecks are "close," no interaction is assumed
- Data access latency is ignored, i.e., perfect streaming mode
  - Achievable bandwidth is the limit
- Chip must be able to saturate the bandwidth bottleneck(s)
  - Always model the full chip
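The perfect-overlap assumption is what turns the model into a simple minimum. As a short derivation sketch, let $N$ be the work in flops and $V$ the data volume in bytes moved over the bottleneck data path (both symbols are notation introduced here for illustration). If execution and data transfer overlap perfectly, the runtime is the larger of the two contributions:

$$T = \max\left(T_{\text{exec}}, T_{\text{data}}\right), \qquad T_{\text{exec}} = \frac{N}{P_{\text{peak}}}, \qquad T_{\text{data}} = \frac{V}{b_S}$$

$$P = \frac{N}{T} = \min\left(P_{\text{peak}}, \frac{N}{V} \cdot b_S\right) = \min\left(P_{\text{peak}}, I \cdot b_S\right)$$

Without overlap, the two times would add up instead, and the resulting performance would be lower than either limit.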







### Roofline for architecture and code comparison

With Roofline, we can

- Compare capabilities of different machines
- Compare performance expectations for different loops

- Roofline always provides an upper bound, but is it realistic?
  - Simple case: Loop kernel has loop-carried dependencies → cannot achieve peak
  - Other bandwidth bottlenecks may apply



### A refined Roofline Model

1. $P_{\text{max}}$ = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily $P_{\text{peak}}$) $\rightarrow$ e.g., $P_{\text{max}}$ = 176 GFlop/s
2. $b_{\rm S}$ = Applicable (saturated) peak bandwidth of the slowest data path utilized $\rightarrow$ e.g., $b_{\rm S}$ = 56 GByte/s
3. *I* = Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance $B_C = I^{-1}$) $\rightarrow$ e.g., *I* = 0.167 Flop/Byte $\rightarrow$ $B_C = 6$ Byte/Flop

Performance limit:

$$P = \min(P_{\max}, I \cdot b_S) = \min\left(P_{\max}, \frac{b_S}{B_C}\right)$$

with $b_S$ in [Byte/s] and $B_C$ in [Byte/Flop].
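The code-balance form of the limit can be evaluated directly with the example numbers from the list above ($P_{\text{max}}$ = 176 GFlop/s, $b_S$ = 56 GByte/s, $B_C$ = 6 Byte/Flop). The sketch below is illustrative: the function names are assumptions, and the ratio $b_S / P_{\text{max}}$ used for the classification is what the literature often calls machine balance, a term not used on these slides.

```c
#include <assert.h>

/* Refined Roofline limit expressed with code balance B_C = 1/I [Byte/Flop].
 * p_max and b_s must use matching units (e.g., GFlop/s and GByte/s). */
double roofline_from_balance(double p_max, double b_s, double code_balance)
{
    assert(p_max > 0.0 && b_s > 0.0 && code_balance > 0.0);
    double bw_limit = b_s / code_balance;   /* [Byte/s] / [Byte/Flop] = [Flop/s] */
    return (bw_limit < p_max) ? bw_limit : p_max;
}

/* Returns 1 if the data path is the bottleneck, i.e., if the loop needs
 * more bytes per flop than the machine can deliver (B_C > b_S / P_max). */
int is_bandwidth_bound(double p_max, double b_s, double code_balance)
{
    return code_balance > b_s / p_max;
}
```

For the example values, `roofline_from_balance(176.0, 56.0, 6.0)` evaluates to 56/6 ≈ 9.3 GFlop/s, far below $P_{\text{max}}$, so the loop is bandwidth bound.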

#### Multiple bottlenecks:

- Decode/retirement throughput
- Port contention (direct or indirect)
- Arithmetic pipeline stalls (dependencies)
- Overall pipeline stalls (branching)
- L1 Dcache bandwidth (LD/ST throughput)
- Scalar vs. SIMD execution
- L1 Icache bandwidth
- Alignment issues

[Figure: block diagram of a Skylake core, showing the front end (1.5k-entry µOP cache, 5-way decoder, microcode sequencer), the reorder buffer (224 entries), the register files (integer: 16 architectural, 180 physical; vector: 32 architectural, 168 physical), the scheduler with ports 0 to 7 (maximum throughput 4 µOPs/cy), 27 execution units, the load buffer (72 entries), the store buffer (56 entries), and 64 B/cy data paths to the L1 cache]

Tool for *P*<sub>max</sub> analysis: OSACA, http://tiny.cc/OSACA (DOI: 10.1109/PMBS49563.2019.00006, DOI: 10.1109/PMBS.2018.8641578)


### Bandwidth-bound (simple case)

1. Accurate traffic calculation (write-allocate, strided access, ...)
2. Practical $\neq$ theoretical BW limits
3. Saturation effects → consider full socket only

### Core-bound (may be complex)

1. Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports
2. Limit is linear in # of cores
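The two regimes scale differently with the number of active cores: the core-bound limit grows linearly, while the bandwidth-bound limit is capped by the saturated bandwidth of the socket. A minimal sketch of this idealized behavior follows; the function name is an assumption, and because a single core usually cannot saturate the memory interface, the result should only be trusted for the full socket, in line with the saturation remark above.

```c
#include <assert.h>

/* Idealized Roofline limit for n active cores of one socket.
 *   p_max_core : applicable peak performance per core [GFlop/s]
 *   intensity  : computational intensity I [Flop/Byte]
 *   b_s        : saturated socket bandwidth [GByte/s]
 * Assumption: the bandwidth limit I*b_S is independent of n, which is
 * only realistic once enough cores are active to saturate the data path. */
double roofline_ncores(int n, double p_max_core, double intensity, double b_s)
{
    assert(n > 0);
    double core_limit = n * p_max_core;    /* linear in the number of cores */
    double bw_limit = intensity * b_s;     /* shared resource, does not scale */
    return (core_limit < bw_limit) ? core_limit : bw_limit;
}
```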





## Refined Roofline model: graphical representation

### Multiple ceilings may apply

- Different bandwidths / data paths → different inclined ceilings
- Different P<sub>max</sub> → different flat ceilings

In fact,  $P_{max}$  should always come from code analysis; generic ceilings are usually impossible to attain



#### Hardware features of (some) Intel Xeon processors

| Microarchitecture       | Ivy Bridge EP          | Broadwell EP           | Cascade Lake SP                    | Ice Lake SP                      |
|-------------------------|------------------------|------------------------|------------------------------------|----------------------------------|
| Introduced              | 09/2013                | 03/2016                | 04/2019                            | 06/2021                          |
| Cores                   | ≤ 12                   | ≤ 22                   | ≤ 28                               | ≤ 40                             |
| LD/ST throughput per cy, AVX(2)/AVX512 | 1 LD + ½ ST | 2 LD + 1 ST | 2 LD + 1 ST | 2 LD + 1 ST |
| LD/ST throughput per cy, SSE/scalar | 2 LD or 1 LD & 1 ST | 2 LD + 1 ST | 2 LD + 1 ST | 2 LD + 1 ST |
| ADD throughput          | 1 / cy                 | 1 / cy                 | 2 / cy                             | 2 / cy                           |
| MUL throughput          | 1 / cy                 | 2 / cy                 | 2 / cy                             | 2 / cy                           |
| FMA throughput          | N/A                    | 2 / cy                 | 2 / cy                             | 2 / cy                           |
| L1-L2 data bus          | 32 B/cy                | 64 B/cy                | 64 B/cy                            | 64 B/cy                          |
| L2-L3 data bus          | 32 B/cy                | 32 B/cy                | 16+16 B/cy                         | 16+16 B/cy                       |
| L1/L2 per core          | 32 KiB / 256 KiB       | 32 KiB / 256 KiB       | 32 KiB / 1 MiB                     | 48 KiB / 1.25 MiB                |
| LLC                     | 2.5 MiB/core inclusive | 2.5 MiB/core inclusive | 1.375 MiB/core<br>exclusive/victim | 1.5 MiB/core<br>exclusive/victim |
| Memory                  | 4ch DDR3               | 4ch DDR4               | 6ch DDR4                           | 8ch DDR4                         |
| Memory BW (meas.)       | ~ 48 GB/s              | ~ 62 GB/s              | ~ 115 GB/s                         | ~ 160 GB/s                       |

<u>manual.html</u>



### Example: P<sub>max</sub> of vector triad on Haswell/Broadwell

```
double *A, *B, *C, *D;
for (int i=0; i<N; i++) {
    A[i] = B[i] + C[i] * D[i];
}
```

Assembly code (AVX2+FMA, no additional unrolling):

```
..B2.9:
    vmovupd     ymm2, [rdx+rax*8]        # LOAD
    vmovupd     ymm1, [r12+rax*8]        # LOAD
    vfmadd213pd ymm1, ymm2, [rbx+rax*8]  # LOAD+FMA: ymm1 = ymm2*ymm1 + mem
    vmovupd     [rdi+rax*8], ymm1        # STORE (FMA result is in ymm1)
    add         rax, 4
    cmp         rax, r11
    jb          ..B2.9
# remainder loop omitted
```

Iterations are independent $\rightarrow$ throughput assumption justified!

Best-case execution time?


Minimum number of cycles to process one AVX-vectorized iteration (equivalent to 4 scalar iterations) on one core?

 $\rightarrow$  Assuming full throughput:

```
Cycle 1: LOAD + LOAD + STORE
Cycle 2: LOAD + LOAD + FMA + FMA
Cycle 3: LOAD + LOAD + STORE
```

Answer: 1.5 cycles (two vectorized iterations complete every 3 cycles)
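The cycle count can also be obtained without writing out a schedule: under the full-throughput assumption, each unit needs (number of instructions) / (throughput per cycle) cycles, and the slowest unit sets the pace. The sketch below assumes the per-cycle limits used in the schedule above (2 LOADs, 1 STORE, 2 FMAs) and deliberately ignores all other bottlenecks such as decode or retirement throughput; the struct and function names are illustrative.

```c
#include <assert.h>

/* Per-cycle throughput limits of one core (simplified port model). */
struct port_model {
    double loads_per_cy;
    double stores_per_cy;
    double fmas_per_cy;
};

/* Minimum cycles per loop iteration, assuming all units overlap perfectly
 * and only LOAD, STORE, and FMA throughput matter. */
double min_cycles_per_iter(struct port_model m, int loads, int stores, int fmas)
{
    double t_ld = loads / m.loads_per_cy;
    double t_st = stores / m.stores_per_cy;
    double t_fma = fmas / m.fmas_per_cy;
    double t = t_ld;
    if (t_st > t) t = t_st;
    if (t_fma > t) t = t_fma;
    return t;   /* the most heavily loaded unit is the bottleneck */
}
```

One vectorized triad iteration issues 3 LOADs, 1 STORE, and 1 FMA, so a model with `{2.0, 1.0, 2.0}` gives max(1.5, 1.0, 0.5) = 1.5 cycles, limited by the LOAD throughput.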

### Example: P<sub>max</sub> of vector triad on Haswell @ 2.3 GHz



Vector triad A(:)=B(:)+C(:)\*D(:) on a 2.3 GHz 14-core Haswell chip

Consider full chip (14 cores):

- Memory bandwidth: $b_{\rm S} = 50$ GB/s
- Code balance (incl. write-allocate): $B_{\rm C} = (4+1)$ Words / 2 Flops = 20 B/F $\rightarrow$ $I = 0.05$ F/B
- $\rightarrow$ $I \cdot b_{\rm S} = 2.5$ GF/s (0.5% of peak performance)
- $P_{\text{peak}}$ / core = 36.8 Gflop/s ((8+8) Flops/cy × 2.3 GHz)
- $P_{\text{max}}$ / core = 12.27 Gflop/s (8 Flops per 1.5 cy × 2.3 GHz, using the cycle count derived above)
- $\rightarrow P_{\text{max}} = 14 \times 12.27 \text{ Gflop/s} = 172 \text{ Gflop/s}$ (33% of peak)

 $P = \min(P_{\max}, I \cdot b_S) = \min(172, 2.5) \text{ GFlop/s} = 2.5 \text{ GFlop/s}$
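The arithmetic of this example can be reproduced step by step. The following sketch hard-codes the triad's properties as stated above (double precision, 4 words of explicit traffic plus 1 word of write-allocate, 2 flops per scalar iteration, 4 scalar iterations per AVX iteration); the struct and function names are illustrative.

```c
#include <assert.h>

struct roofline_result {
    double code_balance;  /* B_C [Byte/Flop] */
    double intensity;     /* I [Flop/Byte] */
    double p_max_chip;    /* applicable peak of the full chip [GFlop/s] */
    double bw_limit;      /* I * b_S [GFlop/s] */
    double p;             /* Roofline limit [GFlop/s] */
};

/* Roofline prediction for the vector triad A[i] = B[i] + C[i] * D[i]. */
struct roofline_result triad_roofline(int cores, double clock_ghz,
                                      double b_s_gbs, double cycles_per_vec_iter)
{
    struct roofline_result r;
    const double bytes_per_word = 8.0;        /* double precision */
    const double words_per_iter = 4.0 + 1.0;  /* 3 loads + 1 store + write-allocate */
    const double flops_per_iter = 2.0;        /* one multiply, one add */
    const double flops_per_vec_iter = 4.0 * flops_per_iter; /* AVX: 4 doubles */

    r.code_balance = words_per_iter * bytes_per_word / flops_per_iter;
    r.intensity = 1.0 / r.code_balance;
    r.p_max_chip = cores * flops_per_vec_iter / cycles_per_vec_iter * clock_ghz;
    r.bw_limit = r.intensity * b_s_gbs;
    r.p = (r.bw_limit < r.p_max_chip) ? r.bw_limit : r.p_max_chip;
    return r;
}
```

Calling `triad_roofline(14, 2.3, 50.0, 1.5)` reproduces $B_C$ = 20 B/F, $P_{\text{max}}$ ≈ 172 GFlop/s, and a limit of 2.5 GFlop/s set by the memory bandwidth.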

#### Code balance: more examples



#### A not so simple Roofline example

Example: do i=1,N; s=s+a(i); enddo in single precision on an 8-core 2.2 GHz Sandy Bridge socket @ "large" N
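What makes this loop less simple than the triad is the loop-carried dependency on `s`: each addition has to wait for the result of the previous one, so the loop cannot run at full ADD throughput as written. A common remedy, shown here as a C sketch and not necessarily the variant discussed on the slides, is modulo unrolling with several independent partial sums. The function names and the unrolling factor of 4 are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward reduction: every addition depends on the previous one. */
float sum_naive(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Modulo-unrolled reduction: four independent dependency chains that the
 * hardware can process concurrently in the pipelined ADD unit. */
float sum_unrolled4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* remainder loop */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

The unrolled version changes the order of the floating-point additions, so its result may differ slightly from the naive one. This is also why a compiler typically applies this transformation only when it is allowed to reorder floating-point operations.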





1. Hit the BW bottleneck by good serial code (e.g., Ninja C++ → Fortran)
2. Increase intensity to make better use of BW bottleneck (e.g., spatial loop blocking)
3. Increase intensity and go from memory bound to core bound (e.g., temporal blocking)
4. Hit the core bottleneck by good serial code (e.g., -fno-alias, SIMD intrinsics)








# Diagnostic / phenomenological Roofline modeling



# **Diagnostic modeling**

- What if we cannot predict the intensity/balance?
  - Code very complicated
  - Code not available
  - Parameters unknown
  - Doubts about correctness of analysis
- Measure data volume V<sub>meas</sub> (and work N<sub>meas</sub>)
  - Hardware performance counters
  - Tools: likwid-perfctr, PAPI, Intel VTune, ...
- Insights + benefits
  - Compare analytic model and measurement  $\rightarrow$ validate model
  - Can be applied (semi-)automatically
  - Useful in performance monitoring of user jobs on clusters
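Once the data volume and the work have been measured, the diagnostic step is plain arithmetic: compute the measured intensity and performance, then compare with the roof. The sketch below assumes the counter values have already been obtained by one of the tools above and are passed in as plain numbers; it calls no tool API, and all names are illustrative.

```c
#include <assert.h>

struct diag_result {
    double intensity;    /* measured I = N_meas / V_meas [Flop/Byte] */
    double performance;  /* measured P = N_meas / runtime [Flop/s] */
    double roof;         /* min(P_max, I * b_S) [Flop/s] */
    double fraction;     /* performance / roof */
};

/* Diagnostic Roofline: place a measurement relative to the roof.
 *   flops_meas : measured work N_meas [Flop]
 *   bytes_meas : measured data volume V_meas [Byte]
 *   runtime_s  : measured runtime [s]
 *   p_max, b_s : machine model [Flop/s], [Byte/s] */
struct diag_result diagnose(double flops_meas, double bytes_meas,
                            double runtime_s, double p_max, double b_s)
{
    assert(flops_meas > 0.0 && bytes_meas > 0.0 && runtime_s > 0.0);
    struct diag_result d;
    d.intensity = flops_meas / bytes_meas;
    d.performance = flops_meas / runtime_s;
    double bw_limit = d.intensity * b_s;
    d.roof = (bw_limit < p_max) ? bw_limit : p_max;
    d.fraction = d.performance / d.roof;
    return d;
}
```

A fraction close to 1 indicates that the code already runs near the limit for its measured intensity. A small fraction suggests that some assumption of the model does not hold for this code, or that there is room for optimization.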





# Roofline and performance monitoring of clusters



https://github.com/RRZE-HPC/likwid/wiki/Tutorial%3A-Empirical-Roofline-Model

## **Roofline conclusion**

- Roofline = simple first-principles model for the upper performance limit of data-streaming loops
  - Machine model  $(P_{max}, b_S)$  + application model (I)
  - Conditions apply, extensions exist
- Two modes of operation
  - Predictive: Calculate *I*, calculate upper limit, validate model, optimize, iterate
  - Diagnostic: Measure I and P, compare with roof
- Challenge of predictive modeling: Getting P<sub>max</sub> and I right