# Performance-portable solid mechanics via matrix-free p-multigrid

Welcome, Reproducibility Reviewer. We appreciate your work. This image is
available via:

    $ docker pull jedbrown/perf-portable-solids:cuda

You will find our software stack built for Haswell and CUDA sm_80 (Ampere
generation) GPUs. Please let us know if you would like an image built for a
different architecture. Feel free to inspect the included Dockerfile to see how
this image was created.

The software developed for this paper resides in three open source packages, all
of which have documentation sites.

* [PETSc](https://petsc.org) version 3.17, [on GitLab](https://gitlab.com/petsc/petsc)
* [libCEED](https://libceed.org) version 0.10.1, [on GitHub](https://github.com/CEED/libCEED)
* [Ratel](https://ratel.micromorph.org) version 0.1.1, [on GitLab](https://gitlab.com/micromorph/ratel) 

These three packages are prebuilt in this image, so you can run them directly. Note
that several external packages were installed using PETSc's `--download-*`
feature. This includes a recent version of Open MPI configured to provide
GPU-aware MPI.

## Running the solver

You can run a small experiment using:

```console
$ mpiexec -n 2 ratel-quasistatic -dm_plex_shape schwarz_p -dm_plex_tps_thickness .2 -dm_plex_tps_extent 2,2,2 -dm_plex_tps_layers 2 -dm_plex_tps_refine 2 -material fs-current-nh -E 1 -nu .3 -bc_clamp 1 -bc_traction 2 -bc_traction_2 .02,0,0 -order 2 -snes_monitor -ksp_converged_reason -dm_view -ksp_rtol 1e-3 -snes_converged_reason -mg_levels_ksp_max_it 2 -ksp_view_singularvalues -ceed /gpu/cuda
```

The `-dm_plex_tps_extent` argument controls problem size. We ran experiments up
to `-dm_plex_tps_extent 40,40,40` on OLCF Crusher (1.5 GDoF to 1.9 GDoF). The
`-order` argument controls the polynomial order of the finite element basis. See
Section IV.B for discussion of suitable parameters to meet accuracy criteria
with different order bases.
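
To vary problem size and basis order together, a small dry-run loop can help.
The sketch below only prints commands; `ratel_cmd` is a hypothetical helper
(not part of Ratel), and the extents and orders listed are illustrative
choices rather than the configurations used in the paper.

```sh
# Hypothetical helper (not part of Ratel): print the small-experiment command
# for a given per-direction extent (n) and polynomial order (p).
ratel_cmd() {
  n=$1
  p=$2
  echo "mpiexec -n 2 ratel-quasistatic -dm_plex_shape schwarz_p" \
    "-dm_plex_tps_thickness .2" \
    "-dm_plex_tps_extent $n,$n,$n" \
    "-dm_plex_tps_layers 2 -dm_plex_tps_refine 2" \
    "-material fs-current-nh -E 1 -nu .3" \
    "-bc_clamp 1 -bc_traction 2 -bc_traction_2 .02,0,0" \
    "-order $p -snes_monitor -ksp_converged_reason -dm_view" \
    "-ksp_rtol 1e-3 -snes_converged_reason -mg_levels_ksp_max_it 2" \
    "-ksp_view_singularvalues -ceed /gpu/cuda"
}

# Dry run: list the commands for a few illustrative extents and orders.
for extent in 2 4; do
  for order in 2 3; do
    ratel_cmd "$extent" "$order"
  done
done
```

Each printed line is a complete command; pipe the output to `sh` to execute
the runs one after another.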

The `-ceed /gpu/cuda` option is the only parameter needed to use GPUs. You may
delete it to run exclusively on CPUs.

Most of the reported numerical experiments used BoomerAMG from hypre as the
coarse solver, which requires the additional run-time options shown below. Note
that hypre configured for GPUs cannot run purely on CPUs, so this configuration
will not work if `-ceed /gpu/cuda` is omitted.

```console
$ mpiexec -n 2 ratel-quasistatic -dm_plex_shape schwarz_p -dm_plex_tps_thickness .2 -dm_plex_tps_extent 2,2,2 -dm_plex_tps_layers 2 -dm_plex_tps_refine 2 -material fs-current-nh -E 1 -nu .3 -bc_clamp 1 -bc_traction 2 -bc_traction_2 .02,0,0 -order 2 -snes_monitor -ksp_converged_reason -dm_view -ksp_rtol 1e-3 -snes_converged_reason -mg_levels_ksp_max_it 2 -ksp_view_singularvalues -ceed /gpu/cuda -mg_coarse_pc_type hypre -mg_coarse_pc_hypre_boomeramg_coarsen_type pmis -mg_coarse_pc_hypre_boomeramg_interp_type ext+i -mg_coarse_pc_hypre_boomeramg_no_CF -mg_coarse_pc_hypre_boomeramg_P_max 6 -mg_coarse_pc_hypre_boomeramg_relax_type_down Chebyshev -mg_coarse_pc_hypre_boomeramg_relax_type_up Chebyshev -mg_coarse_pc_hypre_boomeramg_strong_threshold 0.5
```
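
Because the BoomerAMG options make the command line long, it may be convenient
to keep them in a file and load it with PETSc's standard `-options_file`
option. The sketch below writes the same coarse-solver options used above to a
file; the name `hypre.opts` is an arbitrary choice, not something shipped in
the image.

```sh
# Write the BoomerAMG coarse-solver options to a PETSc options file.
# One option per line; lines starting with '#' are comments.
cat > hypre.opts <<'EOF'
# BoomerAMG coarse-solver options, copied from the command above
-mg_coarse_pc_type hypre
-mg_coarse_pc_hypre_boomeramg_coarsen_type pmis
-mg_coarse_pc_hypre_boomeramg_interp_type ext+i
-mg_coarse_pc_hypre_boomeramg_no_CF
-mg_coarse_pc_hypre_boomeramg_P_max 6
-mg_coarse_pc_hypre_boomeramg_relax_type_down Chebyshev
-mg_coarse_pc_hypre_boomeramg_relax_type_up Chebyshev
-mg_coarse_pc_hypre_boomeramg_strong_threshold 0.5
EOF
```

Appending `-options_file hypre.opts` to the first (non-hypre) command should
then be equivalent to the full command above.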

GPU-aware MPI can be disabled by passing `-use_gpu_aware_mpi 0`, which will
cause packed buffers to be copied into host memory before calling MPI.

Diagnostic output (containing fields such as von Mises stress) can be generated
using `-view_diagnostics` (one file when the solve completes) or
`-ts_monitor_diagnostic_quantities_vtk`, which creates a time series. Although
the examples above do not need it, stronger loading requires time stepping, in
which case `-ts_dt 0.25` can be used (four loading steps to a final time of 1).
The resulting `diagnostic*.vtu` files are ready to open in ParaView. If you wish
to warp the structure, use the `Calculator` filter to create an output called
`Displacement` with the expression

    "diagnostic_quantities.displacement_x"*iHat + "diagnostic_quantities.displacement_y"*jHat + "diagnostic_quantities.displacement_z"*kHat

then apply the `Warp By Vector` filter, and color by von Mises stress as in
Figure 6.

Note that it is also possible to run on meshes created using Gmsh (and various
other mesh generators). Some examples, including those used in the Section IV.A
convergence study, are in `meshes/` and can be specified using
`-dm_plex_filename`. The following command produces a diagnostic output file
that can be visualized; cf. Figure 4.

```console
$ mpiexec -n 2 ratel-quasistatic -dm_plex_filename meshes/holes-hex-q1-r0.msh -material fs-current-nh -E 2.4 -nu .4 -bc_clamp 1 -bc_traction 2 -bc_traction_2 0,.2,0 -order 2 -snes_monitor -ksp_converged_reason -dm_view -ksp_rtol 1e-3 -snes_converged_reason -mg_levels_ksp_max_it 2 -ksp_view_singularvalues -ceed /gpu/cuda -view_diagnostics
```

## Logs from experiments and replicating figures

We ran a suite of experiments on OLCF Crusher and Summit, NERSC Perlmutter, and
LLNL Lassen. These log files are organized in the `runs/` directory. The figures
are created by parsing output files into a Pandas `DataFrame` and plotting them
using Seaborn. All such figures can be reproduced by running `make` from within
the `runs/` directory.

Each output file contains a convergence log, profiling information, and
extensive provenance including which machine it was run on, compiler flags,
run-time arguments, etc.
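
For a quick look at a single output file without the plotting pipeline, the
convergence log can be pulled out with `awk`. This is a shell-only sketch, not
the Pandas-based parser used for the figures. The helper names are
hypothetical, and it assumes the standard PETSc line formats printed by
`-snes_monitor` and `-ksp_converged_reason`.

```sh
# Hypothetical helpers: extract convergence data from one output file.
# Assumes standard PETSc monitor lines of the form
#   "<it> SNES Function norm <value>"
#   "Linear solve converged due to <reason> iterations <count>"

# Print "iteration,norm" for each nonlinear residual line.
snes_norms() {
  awk '/SNES Function norm/ { print $1 "," $NF }' "$1"
}

# Print the Krylov iteration count of each converged linear solve.
ksp_iterations() {
  awk '/Linear solve converged due to/ { print $NF }' "$1"
}
```

Call either function with the path to a log file, for example
`snes_norms path/to/logfile`, and redirect the output to a CSV file if you
want to load it elsewhere.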

Note that `runs/` contains some log files with earlier versions of the code or
with configurations that ended up not being reported (e.g., with/without
GPU-aware MPI). The log files referenced in the `Makefile` were all produced
with an equivalent version of the code. A curious reader may wish to explore
other files or further data that was not plotted. A few figures that were
omitted from the manuscript due to space constraints are also created, and the
plotting script is flexible enough to support further comparisons.
