# Diagonally-Addressed Matrix Nicknack: Sparse Matrix Vector (SpMV) Product

This data set contains run-time measurements of the SpMV product on an Intel Skylake [Xeon Silver 4110][wikichip]
for matrices stored in the Compressed Sparse Rows (CSR) format as well as the Diagonally-Addressed CSR (DA-CSR) format.
All matrices have been taken from the [SuiteSparse Matrix Collection].
The following implementations have been measured.

* [Eigen] version 3.4.0 for CSR
* Intel [MKL] version 2021.1 for CSR
* Several custom (DAMN) implementations for CSR and DA-CSR

See [10.5281/zenodo.8104335] for the source code that was used to generate this data set.

[wikichip]: https://en.wikichip.org/wiki/intel/xeon_silver/4110
[Eigen]: https://eigen.tuxfamily.org/
[MKL]: https://en.wikipedia.org/wiki/Math_Kernel_Library
[SuiteSparse Matrix Collection]: https://sparse.tamu.edu/
[10.5281/zenodo.8104335]: https://doi.org/10.5281/zenodo.8104335 

## Exploring the Data and Regenerating Publication Figures

Install [Julia] version 1.8.5 or newer and
execute the following to generate some figures describing the data set.

```bash
julia --project RUNME.jl
```

Alternatively, explore the data by running

```bash
julia --project -e 'import Pkg; Pkg.instantiate()'
julia --project -e 'using Pluto; Pluto.run()'
```

and opening `RUNME.jl` in the [Pluto] server.

[Julia]: https://julialang.org/
[Pluto]: https://plutojl.org/

## Contents and Directory Structure

This data set contains measurements for several configurations `CONFIG`.

* `no-omp`: Benchmarks compiled without [OpenMP] and linked against sequential [MKL]
* `omp2`, ..., `omp8`: Benchmarks compiled with [OpenMP], linked against GNU-threaded [MKL], and executed with 2-8 threads

In all cases, the threads were pinned to the first cores of the processor using `taskset`.
Hyper-Threading was switched off.
Each configuration contains the following files and directories.

* `spmv/`: matrix-specific SpMV benchmarks (see below)
* `damn-*.out` and `damn-*.err`: execution logs to stdout and stderr, respectively
* `info.yaml`: hardware, build, and environment information (see below)
* `matrices_nnz*.txt`: list of matrices measured; one `group/name` identifier per line
* `prod.csv` and `triad.csv`: hardware-specific benchmarks (see below)
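
As a rough illustration of how these pieces fit together, the following Python sketch builds the expected path of one benchmark file for every matrix listed in a `matrices_nnz*.txt` file. The function name `spmv_csv_paths` and its parameters are hypothetical; the path layout is the one described in this document, and whether a given file exists for every matrix is not checked.

```python
from pathlib import PurePosixPath


def spmv_csv_paths(config, matrix_ids, benchmark):
    """Build the expected CSV path for each `group/name` identifier."""
    paths = []
    for line in matrix_ids:
        ident = line.strip()
        if not ident:  # skip blank lines
            continue
        paths.append(PurePosixPath(config) / "spmv" / ident / f"{benchmark}.csv")
    return paths


# Lines as they would appear in a matrices_nnz*.txt file
listing = ["Oberwolfach/rail_1357\n", "Oberwolfach/t2dah\n"]
for path in spmv_csv_paths("no-omp", listing, "spmv_A32_x64_y64_nnz16"):
    print(path)
```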

### General Info

The file `${CONFIG}/info.yaml` contains information on the executing hardware,
build configuration, and run-time environment.

Schema:

* `gitcommit`: Git commit SHA
* `hostname`: host name (as reported by `hostname`)
* `cpu`: CPU name (as reported by `/proc/cpuinfo`)
* `mem`: Main memory in GiB (derived from `/proc/meminfo`)
* `cache`: CPU cache sizes (as reported by `getconf`)
* `env`: `DAMN`- and `OMP`-related environment variables, and `SLURM_JOB_ID`
* `build`: CMake build information
* `comment`: some remarks
* `energy_kWh`: total energy consumption in kWh (as reported by [Slurm])

[Slurm]: https://slurm.schedmd.com/

### SpMV Benchmark

The SpMV product describes the operation

```math
y \gets \alpha A x + \beta y
```

where $A$ denotes a matrix, $x$ and $y$ denote vectors,
and $\alpha$ and $\beta$ scalars.
$y$ is updated in place.
This operation has been measured for several combinations of data types,
which are encoded in the name of the resulting data file `spmv_A??_x??_y??_nnz??.csv`.

* `32`-bit or `64`-bit scalars for `A`, `x`, and `y` (e.g. `A32` or `A64`)
* `16`-bit or `32`-bit row pointers for `A` (suffix `nnz16` or `nnz32`)

The scalar type of `y` also determines the working precision.
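
For reference, a minimal pure-Python sketch of this operation for a matrix in CSR format is given below. It is not one of the measured implementations; the names `csr_spmv`, `rowptr`, `colidx`, and `values` are chosen for illustration only. Python floats are always double precision, so the mixed-precision aspect is not modelled here.

```python
def csr_spmv(alpha, rowptr, colidx, values, x, beta, y):
    """Compute y <- alpha * A @ x + beta * y in place for A in CSR format."""
    for row in range(len(rowptr) - 1):
        acc = 0.0
        # Entries of this row occupy values[rowptr[row]:rowptr[row + 1]]
        for k in range(rowptr[row], rowptr[row + 1]):
            acc += values[k] * x[colidx[k]]
        y[row] = alpha * acc + beta * y[row]
    return y


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
rowptr = [0, 2, 3, 5]
colidx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = [1.0, 1.0, 1.0]
csr_spmv(2.0, rowptr, colidx, values, x, 0.5, y)
print(y)  # [10.5, 12.5, 38.5]
```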

#### Example

```
${CONFIG}/spmv/Oberwolfach/rail_1357/spmv_A32_x64_y64_nnz16.csv
```

* Benchmarks for `Oberwolfach/rail_1357` matrix
* Matrix entries stored as `float`: `A32`
* Vector entries of `x` and `y` stored as `double`: `x64` and `y64`
* All computations performed using `double`: `y64`
* Matrix `rowptr` indices stored as `int16_t`: `nnz16`
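
The naming scheme above can be decoded mechanically. The following Python sketch extracts the bit widths from a file name; the helper `parse_spmv_filename` is hypothetical and not part of the data set's tooling.

```python
import re

PATTERN = re.compile(r"spmv_A(\d+)_x(\d+)_y(\d+)_nnz(\d+)\.csv")


def parse_spmv_filename(filename):
    """Return the bit widths encoded in an SpMV benchmark file name."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    a_bits, x_bits, y_bits, nnz_bits = (int(g) for g in match.groups())
    return {"A": a_bits, "x": x_bits, "y": y_bits, "nnz": nnz_bits}


print(parse_spmv_filename("spmv_A32_x64_y64_nnz16.csv"))
# {'A': 32, 'x': 64, 'y': 64, 'nnz': 16}
```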

#### Schema

* `A_name`: `group/name` identifier of the matrix within the [SuiteSparse Matrix Collection]
* `A_format`: CSR or DA-CSR
* `A_bandwidth`: matrix bandwidth (after Reverse Cuthill-McKee reordering, if applied)
* `A_oindex`: int16 or int32; type used for the matrix row pointers ("outer indices")
* `A_iindex`: int8, int16, or int32; type used for the matrix column indices ("inner indices")
* `A_scalar`: float32 or float64; type used for the matrix entries
* `X_scalar`: float32 or float64; type used for the entries of `x`
* `Y_scalar`: float32 or float64; type used for the entries of `y`
* `impl_vendor`: DAMN, Eigen, or MKL; implementation vendor
* `impl_desc`: implementation description for DAMN, library version for MKL and Eigen
* `impl_nacc`: number of accumulator variables
* `impl_simd_size`: number of scalars comprising a single accumulator variable
* `impl_simd_arch`: description of SIMD architecture, e.g. fma3+avx2 (as reported by [xsimd])
* `traffic_B`: memory traffic in bytes imposed by `A`, `x`, and `y`
* `min_elapsed_s`: minimum elapsed time in seconds, see `t*_s` for individual measurements
* `med_elapsed_s`: median elapsed time in seconds, see `t*_s` for individual measurements
* `err%`: Median Absolute Percentage Error (MdAPE) of elapsed times (as reported by [nanobench])
* `cpucycles`: CPU cycles (as reported by [nanobench])
* `instructions`: CPU instructions executed (as reported by [nanobench])
* `t1_s`, ..., `t11_s`: (average) elapsed time of epochs 1-11 in seconds
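
As a sketch of how these columns relate, the following Python snippet recomputes the minimum and median from the per-epoch times of one row and derives an effective bandwidth. The function `summarize` is hypothetical, the sample row is made up and contains only the columns used here, and the bandwidth computation assumes that `traffic_B` is given in bytes.

```python
import csv
import io
import statistics


def summarize(row):
    """Recompute summary statistics from the per-epoch times of one CSV row."""
    epochs = [float(row[f"t{i}_s"]) for i in range(1, 12)]  # t1_s ... t11_s
    fastest = min(epochs)
    return {
        "min_s": fastest,
        "med_s": statistics.median(epochs),
        # Assumption: traffic_B is given in bytes
        "bytes_per_s": float(row["traffic_B"]) / fastest,
    }


# Made-up sample restricted to the columns used here; not real measurements
header = ["impl_vendor", "traffic_B"] + [f"t{i}_s" for i in range(1, 12)]
values = ["DAMN", "1000"] + ["0.002"] * 10 + ["0.001"]
sample = io.StringIO(",".join(header) + "\n" + ",".join(values) + "\n")

for row in csv.DictReader(sample):
    print(row["impl_vendor"], summarize(row))
```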

[xsimd]: https://xsimd.readthedocs.io/
[nanobench]: https://nanobench.ankerl.com/

### Triad and Prod Benchmarks

These benchmarks measure the memory bandwidth of the executing hardware.
We implemented these benchmarks before learning about [likwid].

* `prod`: an element-wise vector product, `a[i] = b[i] * c[i]`
* `triad`: an implementation similar to [likwid]'s `triad` benchmark, `a[i] = b[i] * c[i] + d[i]`
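
In plain Python, the two kernels amount to the loops below. This sketch only illustrates the arithmetic and is not one of the measured implementations; the function names are chosen for illustration.

```python
def prod(a, b, c):
    """Element-wise product: a[i] = b[i] * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] * c[i]


def triad(a, b, c, d):
    """Triad: a[i] = b[i] * c[i] + d[i]."""
    for i in range(len(a)):
        a[i] = b[i] * c[i] + d[i]


a = [0.0, 0.0, 0.0]
prod(a, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(a)  # [4.0, 10.0, 18.0]
triad(a, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [1.0, 1.0, 1.0])
print(a)  # [5.0, 11.0, 19.0]
```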

The results are stored at the following locations.

```
${CONFIG}/prod.csv
${CONFIG}/triad.csv
```

Schema:

* `description`: naive or [xsimd]
* `traffic_h`: human-readable traffic bin, e.g. "4 KiB"
* `traffic_GiB`: traffic in GiB
* `min_elapsed_s`: minimum elapsed time in seconds, see `t*_s` for individual measurements
* `med_elapsed_s`: median elapsed time in seconds, see `t*_s` for individual measurements
* `err%`: Median Absolute Percentage Error (MdAPE) of elapsed times (as reported by [nanobench])
* `cpucycles`: CPU cycles (as reported by [nanobench])
* `instructions`: CPU instructions executed (as reported by [nanobench])
* `t1_s`, ..., `t11_s`: (average) elapsed time of epochs 1-11 in seconds
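
Since these benchmarks are meant to measure memory bandwidth, a natural derived quantity is `traffic_GiB / min_elapsed_s`. The Python sketch below computes it per implementation and traffic bin. The function `bandwidth_by_bin` is hypothetical, the sample rows are made up, and the literal `description` values `naive` and `xsimd` are assumed.

```python
import csv
import io


def bandwidth_by_bin(rows):
    """Map (description, traffic_h) to the achieved bandwidth in GiB/s."""
    result = {}
    for row in rows:
        key = (row["description"], row["traffic_h"])
        result[key] = float(row["traffic_GiB"]) / float(row["min_elapsed_s"])
    return result


# Made-up sample restricted to the columns used here; not real measurements
sample = io.StringIO(
    "description,traffic_h,traffic_GiB,min_elapsed_s\n"
    "naive,1 GiB,1.0,0.125\n"
    "xsimd,1 GiB,1.0,0.0625\n"
)
for key, gib_per_s in bandwidth_by_bin(csv.DictReader(sample)).items():
    print(key, gib_per_s)
```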

[likwid]: https://hpc.fau.de/research/tools/likwid/
[OpenMP]: https://www.openmp.org/

# Appendix: Remarks on Oberwolfach/t2dah

According to the logs, the MKL benchmarks using 32-bit scalars failed for three matrices,
regardless of whether OpenMP was used and of the number of threads.
These benchmarks were launched on the compute nodes listed in the following table
and were re-run on the same nodes with the MKL case excluded.

| Config | t2dah_a | t2dah_e | t2dah |
|---|--:|--:|--:|
| no-omp | node020 | node053 | node007 |
| omp2 | node074 | node067 | node069 |
| omp4 | node058 | node014 | node057 |
| omp6 | node101 | node102 | node102 |
| omp8 | node065 | node074 | node065 |

The following briefly describes how this information was obtained,
taking the "no-omp" case as an example.
According to `no-omp/damn-401436.out`, three MKL benchmarks failed with the following error message.
Note that the path refers to the location within [10.5281/zenodo.8104335].

```
path/to/damn/bench/spmv/mkl-csr.cpp:12: FATAL ERROR: test case CRASHED: SIGSEGV - Segmentation violation signal
```

The corresponding matrices can be identified using `sacct` or by inspecting `damn-401436.err`.

```
$ cd no-omp
$ sacct -j 401436 --format JobName%40,State | grep FAILED
               Oberwolfach/t2dah_a/nnz32     FAILED 
               Oberwolfach/t2dah_e/nnz32     FAILED 
                 Oberwolfach/t2dah/nnz32     FAILED 
```

```
$ grep failed damn-401436.err
level=error message="benchmark failed or crashed" matrix=Oberwolfach/t2dah_a benchmark=spmv_A32_x32_y32_nnz32
level=error message="benchmark failed or crashed" matrix=Oberwolfach/t2dah_e benchmark=spmv_A32_x32_y32_nnz32
level=error message="benchmark failed or crashed" matrix=Oberwolfach/t2dah benchmark=spmv_A32_x32_y32_nnz32
```

The benchmarks for these matrices ran on the following compute nodes.

```
$ grep t2dah damn-401436.err | grep starting
level=info message="starting benchmark" hostname=node020 matrix=Oberwolfach/t2dah_a nnz_bits=32
level=info message="starting benchmark" hostname=node053 matrix=Oberwolfach/t2dah_e nnz_bits=32
level=info message="starting benchmark" hostname=node007 matrix=Oberwolfach/t2dah nnz_bits=32
```
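
The two `grep` steps above can also be combined in a short script. The following Python sketch parses the `key=value` log lines and maps each failed matrix to the host on which its benchmark started. The helpers `parse_logfmt` and `failed_hosts` are hypothetical. The sketch assumes that the log lines have the shape shown above, and it matches on the matrix name only, ignoring `nnz_bits`.

```python
import shlex


def parse_logfmt(line):
    """Split a `key=value key="quoted value"` log line into a dict."""
    tokens = shlex.split(line)
    return dict(token.split("=", 1) for token in tokens if "=" in token)


def failed_hosts(lines):
    """Map each failed matrix to the host on which its benchmark started."""
    hosts, failed = {}, []
    for record in map(parse_logfmt, lines):
        message = record.get("message")
        if message == "starting benchmark":
            hosts[record["matrix"]] = record["hostname"]
        elif message == "benchmark failed or crashed":
            failed.append(record["matrix"])
    return {matrix: hosts.get(matrix) for matrix in failed}


log = [
    'level=info message="starting benchmark" hostname=node020 '
    "matrix=Oberwolfach/t2dah_a nnz_bits=32",
    'level=error message="benchmark failed or crashed" '
    "matrix=Oberwolfach/t2dah_a benchmark=spmv_A32_x32_y32_nnz32",
]
print(failed_hosts(log))  # {'Oberwolfach/t2dah_a': 'node020'}
```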

These benchmarks were repeated on the same nodes with the MKL case excluded.
Only the corresponding `spmv_A32_x32_y32_nnz32.csv` has been copied into the present data set.

```bash
DAMN_MATRIX=Oberwolfach/t2dah path/to/tools/benchmark_matrix.sh -tce=MKL*
```
