PETSc4FOAM : a library to plug-in PETSc into the OpenFOAM framework

OpenFOAM is a major player in the open-source CFD arena thanks to its flexibility, but its complexity also makes it difficult to correctly define performance figures and scaling. One of the main bottlenecks preventing a full enabling of OpenFOAM on massively parallel clusters is its MPI-parallelism paradigm, embodied in the Pstream library, which limits scalability to the order of a few thousand cores. The proposed work aims to create an interface to external linear algebra libraries for solving sparse linear systems, such as PETSc and Hypre, thus providing users with greater choice and flexibility when solving their cases, and giving access to the expertise that those communities have developed over decades and that is not currently accessible within the OpenFOAM framework.


Introduction
OpenFOAM is a major player in the open-source CFD arena thanks to its flexibility, but its complexity also makes it difficult to correctly define performance figures and scaling in an HPC environment. From previous works it is known that the scalability of the linear solvers restricts parallel usage to the order of a few thousand cores [1][2][3]. Meanwhile, the technological trend of exascale HPC is moving towards the use of millions of cores, as reported in the November 2019 Top 500 list. The current trend for HPC clusters is to use nodes with a high density of cores, of the order of O(10) cores per node, in conjunction with accelerators (GPUs, FPGAs, Xeon Phi, RISC-V), with more cache levels, and with on the order of O(100,000) interconnected nodes and O(1,000,000) available cores [4]. The current work aims to improve the HPC performance of OpenFOAM by means of a library that plugs PETSc into the OpenFOAM framework. PETSc stands for Portable, Extensible Toolkit for Scientific Computation and is pronounced PET-see (the S is silent). The paper is structured as follows: Section 1. briefly reviews the activity related to the HPC optimization of the OpenFOAM problem-solving component. In Section 2. the OpenFOAM framework is introduced, with a focus on its parallel aspects and its current HPC bottlenecks. Section 3. sets forth the concept of PETSc as an OpenFOAM plug-in for the linear algebra solvers: the relevant works on coupling linear algebra solver packages into OpenFOAM are described, and the PETSc framework with its main components is introduced. Further on, the sparse matrix storage formats of OpenFOAM (LDU) and PETSc (CSR) and the converter (ldu2csr) are presented, and the structure of the PETSc4FOAM library is outlined. Section 4. describes the test-case used, with reference to the OpenFOAM HPC benchmark suite. Section 5. shows the numerical results and quantifies the benefits of the library for the chosen test-case. Finally, Section 6. concludes the article with an outlook on further work.

Proposed work
The current work proposes a direct approach to improving performance by reducing the time spent solving linear equation systems. Specifically, PETSc4FOAM has been developed to use PETSc (and other linear algebra solver packages) as an external library within the OpenFOAM framework. The contributions of this work are:
i. Release of an HPC test-case repository, made publicly available [23] through the OpenFOAM HPC Technical Committee [24]
ii. Release of an interface to the PETSc linear solver toolkit, hosted on a repository to be publicly released in the near future [25]
   (a) A converter translates the matrix from the LDU format to one of the matrix data structures available in PETSc, CSR by default
   (b) The full set of available solvers and preconditioners can be invoked via a dictionary or an rc-type file, for usability purposes
iii. The possibility to use, within OpenFOAM, any linear algebra solver package compatible with the PETSc structure, such as SuperLU_DIST, SuperLU_MCDT, MUMPS (direct solvers) or Trilinos, Hypre (algebraic preconditioners), or to include a custom one by means of the PETSc4FOAM library.

History of OpenFOAM
OpenFOAM (for "Open-source Field Operation And Manipulation") is a free, open-source CFD software package developed primarily by OpenCFD Ltd since 2004 [26,27]. Currently, there are three main variants of OpenFOAM that are released as free and open-source software under the GNU General Public License Version 3 [28][29][30]. The development model follows the "cathedral" style [32], where code contributions from researchers are accepted back into the main distribution only after very strict control of the code base. The source code is available with each software release, but the code developed between releases is restricted to an exclusive group of software developers.

The execution flow of OpenFOAM
First and foremost, OpenFOAM is a C++ CFD toolbox for customized numerical solvers (over sixty of them) that can perform simulations of basic CFD, combustion, turbulence modeling, electromagnetics, heat transfer, multi-phase flow, stress analysis, and even financial mathematics. It has a large user base across most areas of engineering and science, in both commercial and academic organisations. The number of OpenFOAM users has been steadily increasing over the past years; it is now estimated to be of the order of thousands, with the majority of them being engineers in Europe. It is one of the most widely used open-source CFD codes in HPC environments. As a general CFD framework, a simulation with OpenFOAM follows the three common procedures of CFD simulation: pre-processing, problem solving, and post-processing, as shown in Figure 1. Preliminary work, such as geometric modeling, mesh generation, and parallel partitioning, is done in the pre-processing procedure. Problem solving is the main procedure of OpenFOAM, which includes physical modeling, equation discretization, and numerical solving. The reconstruction and visualization of data are managed in post-processing. The current work focuses on the problem-solving part of the OpenFOAM framework. The main features of the numerical method OpenFOAM relies upon are: segregated, iterative solution of the linear systems, unstructured finite volume method (FVM), co-located variables, and equation coupling (e.g. SIMPLE, PIMPLE, etc.). It uses C++ and object-oriented programming to develop a syntactical model of equation mimicking and scalar-vector-tensor operations [33]. Figure 2 shows a code snippet for the Navier-Stokes equations. Equation mimicking is quite obvious: fvm::laplacian means an implicit finite volume discretization of the Laplacian operator, and similarly fvm::div of the divergence operator. On the other hand, fvc::grad means an explicit finite volume discretization of the gradient operator.
A comma within parentheses denotes a product of the enclosed quantities, including tensor products. The discretization is based on low-order (typically second-order) gradient schemes. Different iterative solvers, based on SpMV (sparse matrix-vector multiplication), can be selected for the solution of the linear systems, ranging from Conjugate Gradient methods up to multigrid methods. The temporal schemes can be selected among first- and second-order implicit Crank-Nicolson and Euler schemes [33]. Figure 3 shows the dynamic execution flow of OpenFOAM's problem-solving procedure. First of all, the governing equation in mathematical form is converted to the OpenFOAM format. Each of the operators is discretized by the Finite Volume Method (FVM), generating an lduMatrix, which is the matrix format used in OpenFOAM (see Section 3.5.). Finally, all lduMatrices are assembled into a final linear equation system that can be solved by linear solvers such as PCG (Preconditioned Conjugate Gradient) in OpenFOAM [34].

Parallel aspect of OpenFOAM
The method of parallel computing used by OpenFOAM is based on domain decomposition, in which the geometry and associated fields are broken into pieces and allocated to separate processors for solution. The parallel paradigm is based on standard MPI. A convenient interface named Pstream, a stream-based parallel communication library, is used to plug any MPI library into OpenFOAM; it is a light wrapper around the selected MPI interface [35]. Traditionally, FVM parallelization uses the halo-layer approach: data for cells next to a processor boundary are duplicated. The halo layer covers all processor boundaries and is explicitly updated through parallel communication. Instead, OpenFOAM operates with a zero-halo-layer approach, which gives flexibility in the communication pattern: FVM operations "look parallel", without data dependency [36].

Actual HPC bottlenecks of OpenFOAM
From previous works it is known that the scalability of the linear solvers restricts parallel usage to the order of a few thousand cores [1,3]. To date, the well-known bottlenecks preventing a full enabling of OpenFOAM on massively parallel clusters are:
i. The limit in the parallelism paradigm (Pstream library)
ii. A sub-optimal sparse matrix storage format (LDU) that does not enable any cache-blocking mechanism (SIMD, vectorization)
iii. The I/O data storage system
The recently formed HPC Technical Committee (TC) [37] aims to work together with the community to overcome the aforementioned HPC bottlenecks. The priorities of the HPC TC are:
• Improve the linear algebra solver limitation
• Create an open and shared repository of HPC benchmarks
• GPU enabling of OpenFOAM
• Parallel I/O: ADIOS 2 I/O is now available as a regular OpenFOAM module [38]
The current work focuses on the first priority of the list, which in turn affects the second and third HPC bottlenecks. Specifically, a library named PETSc4FOAM is implemented to use PETSc [39], and other linear algebra solver packages, as external libraries within the OpenFOAM framework. As a representative example, Figure 4 reports the profiling of the S test-case of the lid-driven cavity flow, described later in Sec. 4.1. From the Intel Advisor profiling it is observed that most of the computational time, namely 67.6%, is spent in the pressure solver part (the Foam::fvMatrix<double>::solveSegregated function), and around 27% is spent in the velocity solver section (the Foam::solve<Foam::Vector<double>> function).

Related work: coupling linear algebra solver packages into OpenFOAM
The HPC parallel aspects of OpenFOAM are well known and described in many works [1,3,12]. There is an overall agreement in the community that the linear algebra solving process is the most time-consuming task and the one that affects its parallel scalability. Previous examples of plugging a linear algebra solver library into OpenFOAM are detailed in [40,41]. Duran et al. [40] studied bio-medical fluid flow simulation by extending the incompressible, laminar solver icoFoam with direct solvers (kernel class) such as SuperLU_DIST 3.3 and SuperLU_MCDT (Many-Core Distributed) [42] for the large penta-diagonal and hepta-diagonal matrices arising from the simulation of blood flow in arteries on a structured mesh domain. They observed speed-up on up to 16,384 cores for the largest matrix, of size up to 64 million × 64 million. The authors support the idea of decoupling the effort spent in the linear algebra solver from the effort spent in the numerical solution of PDEs using the finite volume algorithm on unstructured, arbitrary meshes. This approach is beneficial for the OpenFOAM framework, not only as a performance gain but also as a software quality improvement and a reduction of the time spent in software development and maintenance. Furthermore, plugging PETSc into OpenFOAM provides open access to state-of-the-art linear algebra solvers and preconditioners (e.g. the BoomerAMG preconditioner provided by the Hypre package for massively parallel computations [43]), developed and maintained by the PETSc community or by other public research institutes. PETSc is a library well known for its high performance and good scalability. For these reasons it has been adopted in many areas (CSM, CFD, aerodynamics, two-phase flow, porous flow, to name a few) and in many frameworks [44][45][46] or application codes [47][48][49][50]. Success stories of performance improvement achieved by switching to PETSc, see [47,[51][52][53], and the recent work [41] have inspired this work. Li et al.
[41] report an optimization of the OpenFOAM framework that inserts the PETSc library to speed up the solution of linear equation systems. Numerical results on an HPC cluster showed that such optimization can reduce the time for solving the PDEs by about 27% in a 2-D lid-driven cavity flow and by more than 50% in a 2-D flow past a circular cylinder in a uniform stream, when compared to the best solution with the original OpenFOAM. Both cases consist of a fixed mesh size of 192,000 cells; strong scaling tests have been performed from 6 up to 786 cores. The analyses were profiled with the Valgrind tool. The final test-case reported is a 3-D incompressible flow around a submarine, with a mesh size of 1 million cells; the minimal execution time of OpenFOAM-PETSc is almost 40% less than that of the original OpenFOAM. The current work relies upon this precursor effort and aims to push the activity forward in terms of:
• Public release of a library under the supervision of the main developers of OpenFOAM
• Very large test-cases (of the order of tens of millions of cells) running on massively parallel clusters (of the order of thousands of cores)
• Use of HPC profiling tools to spot bottlenecks in an HPC scenario
• Systematic comparison among the same or equivalent solvers and/or preconditioners for the sparse linear algebraic system
• Evaluation of the residual norm in the same way as done by OpenFOAM, that is, the scaled L1 norm
• Caching in memory of the coefficient matrix A and its preconditioner
• Possibility to extend the library in order to use GPUs through CUDA or OpenCL in PETSc [54]

The PETSc framework
PETSc is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations and sparse matrix computations. PETSc includes a large suite of parallel linear and nonlinear equation solvers and ODE integrators that are easily used in application codes written in C, C++, Fortran and Python. PETSc provides many of the mechanisms needed within parallel application codes, such as simple parallel matrix and vector assembly routines that allow the overlap of communication and computation. It supports MPI, GPUs through CUDA or OpenCL, and hybrid MPI-GPU parallelism [39,55]. PETSc consists of a variety of libraries (similar to classes in C++), each of which manipulates a particular class of objects and the operations one would like to perform on those objects. Some of the PETSc modules include: index sets, vectors, matrices, over thirty Krylov subspace methods, and dozens of preconditioners, including multigrid, block solvers, and sparse direct solvers. Each module consists of an abstract interface and one or more implementations using particular data structures. The library enables easy customization and extension of both algorithms and implementations. Figure 5 is a diagram of the interrelationships among the different pieces of PETSc, which shows the library's hierarchical organization [55].

The execution flow of OpenFOAM-PETSc
After the insertion of the PETSc library into the OpenFOAM framework, the execution flow of OpenFOAM is unchanged except for the solving part. Instead of calling one of the built-in solvers based on the LDU matrix format, the final lduMatrix is converted into a PETSc matrix, which can be digested by a PETSc solver, made of a KSP (Krylov SubsPace) object and a PC (PreConditioner) object. The matrix conversion introduces a slight overhead (see Fig. 14 later on), but it has the advantage of having no impact on the OpenFOAM source code.

Iterative solution of sparse linear systems: preconditioning techniques
The solution of large sparse linear systems of the form Ax = b, where A = [a_ij] (with a_ij = a_ji ∀ i ≠ j for symmetric matrices) is an n × n matrix and b is a given right-hand-side vector, derives typically from the discretization of PDEs of elliptic and parabolic type, such as the governing equations OpenFOAM relies upon (eq. (6) later on). Direct methods are based on the factorization of the coefficient matrix A into easily invertible matrices. They can efficiently solve, in a reasonable amount of time, linear systems of fairly large size, particularly when the underlying problem is two-dimensional. Unfortunately, direct methods scale poorly with problem size in terms of operation counts and memory requirements, especially on problems arising from the discretization of PDEs in three space dimensions. Detailed, three-dimensional multiphysics simulations lead to linear systems comprising hundreds of millions or even billions of equations and as many unknowns. For such problems, in the majority of cases, iterative methods are the only option available [56]. As widely reported in the literature [57], the term preconditioning refers to transforming the system of eq. (1) into another system with more favorable properties for iterative solution; a preconditioner is a matrix that effects such a transformation. Mathematically, preconditioning attempts to improve the spectral properties of the coefficient matrix. For symmetric positive definite (SPD) problems, the rate of convergence of the conjugate gradient method depends on the distribution of the eigenvalues of A. Hopefully, the transformed (preconditioned) matrix will have a smaller spectral condition number and/or eigenvalues clustered around 1.
For nonsymmetric (nonnormal) problems the situation is more complicated, and the eigenvalues may not describe the convergence of nonsymmetric matrix iterations like GMRES (Generalized Minimal Residual Algorithm) for solving nonsymmetric linear systems.
If M is a non-singular matrix that approximates A (in some sense), then the left-preconditioned linear system M⁻¹Ax = M⁻¹b (eq. (2)) has the same solution as the original system but may be easier to solve. Here M is the preconditioner. In cases where M⁻¹ is explicitly known (as with polynomial preconditioners or sparse approximate inverses), the preconditioner is M⁻¹ rather than M. The system of eq. (2) is preconditioned from the left, but one can also precondition from the right, solving AM⁻¹y = b and then recovering x = M⁻¹y. When Krylov subspace methods are used, it is not necessary to form the preconditioned matrices M⁻¹A or AM⁻¹ explicitly (this would be too expensive, and we would lose sparsity). Instead, matrix-vector products with A and solutions of linear systems of the form Mz = r are performed (or matrix-vector products with M⁻¹ if this is explicitly known). In addition, split preconditioning is also possible: with M factored as M = M₁M₂, one solves M₁⁻¹AM₂⁻¹y = M₁⁻¹b with x = M₂⁻¹y. Which type of preconditioning to use depends on the choice of the iterative method, problem characteristics, and so forth. Generally speaking, the linear system must be preconditioned to achieve convergence. Convergence is accelerated if M⁻¹ resembles, in some way, A⁻¹; conversely, M⁻¹ must at the same time be sparse (or cheap to apply), so as to keep the cost for the preconditioner computation, storage and application to a vector as low as possible. Generally speaking, preconditioning is "the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly" [58].
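As a toy illustration of the mechanics described above, the sketch below runs a conjugate gradient iteration with Jacobi (diagonal) preconditioning, where applying M⁻¹ amounts to an element-wise division by the diagonal of A. This is purely illustrative, not the PETSc or OpenFOAM implementation: real problems use sparse storage and much stronger preconditioners such as incomplete Cholesky or multigrid.

```python
# Preconditioned conjugate gradient (PCG) in pure Python, for a small
# SPD system. The preconditioner solve M z = r reduces, for Jacobi,
# to z_i = r_i / a_ii; note that M^-1 A is never formed explicitly.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def pcg(A, b, tol=1e-12, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                        # r = b - A*0
    z = [ri / A[i][i] for i, ri in enumerate(r)]    # z = M^-1 r (Jacobi)
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if max(abs(ri) for ri in r) < tol:          # unscaled exit test
            break
        z = [ri / A[i][i] for i, ri in enumerate(r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Small SPD, Poisson-like tridiagonal system
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x = pcg(A, b)   # exact solution is [1, 2, 3]
```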

Sparse matrix storage formats: ldu2csr converter
While for dense matrices it is usually reasonable to store all matrix entries in consecutive order, sparse matrices are characterized by a large number of zero elements, and storing those is not only unnecessary but would also result in significant storage overhead. Different storage layouts exist that aim to reduce the memory footprint of the sparse matrix by storing only a fraction of the elements explicitly, assuming all other elements to be zero. The matrix A shown on the right-hand side of Fig. 6 is a representative example, deriving from the numerical discretization of a Poisson solver over the two-dimensional 3 × 3 numerical grid displayed on the left-hand side of the same figure.

csrMatrix format
The default matrix format in PETSc is Compressed Sparse Row (CSR), but the toolkit implements many other matrix storage formats (ELL, SELL, Hybrid, to name a few). The CSR format has a high storage efficiency. Fig. 7 shows the CSR representation of the same two-dimensional computational grid and corresponding matrix A shown in Fig. 6. The non-zero elements of the matrix A are stored in a single contiguous array, named value. Two auxiliary vectors, named respectively col_index and row_index, are needed to visit the matrix coefficients: col_index provides the column indices of those values, while row_index has one element per row of the matrix A and encodes the index in value where the given row starts. The size of value and col_index is the number of non-zero entries, while the size of row_index is the number of rows plus one.
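The three CSR arrays described above can be made concrete with a short sketch (using a generic tridiagonal example, not the matrix of Fig. 6):

```python
# Build the CSR arrays (value, col_index, row_index) from a small
# dense matrix, then use them for a sparse matrix-vector product.

def dense_to_csr(A):
    value, col_index, row_index = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0.0:
                value.append(a)       # non-zeros, row by row
                col_index.append(j)   # column of each stored value
        row_index.append(len(value))  # where the next row starts
    return value, col_index, row_index

def csr_matvec(value, col_index, row_index, x):
    # y_i = sum of value[k] * x[col_index[k]] over row i's slice
    y = []
    for i in range(len(row_index) - 1):
        start, end = row_index[i], row_index[i + 1]
        y.append(sum(value[k] * x[col_index[k]] for k in range(start, end)))
    return y

A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
value, col_index, row_index = dense_to_csr(A)
# value     = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
# col_index = [0, 1, 0, 1, 2, 1, 2]
# row_index = [0, 2, 5, 7]
```

Note how row_index has 4 entries for 3 rows, and value/col_index both have one entry per non-zero, exactly as described in the text.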

ldu2csr converter
PETSc provides the abstract type Mat and the MatSetValue() API to set the a_ij coefficients for every type of matrix. According to the value set with MatSetType(), a different storage format is used. Some of the MatTypes available in PETSc are reported in Table 1. For the sake of completeness: MATMPIAIJMKL creates a sparse parallel matrix whose local portions are stored as SEQAIJMKL matrices (a matrix class that inherits from SEQAIJ but uses some operations provided by the Intel MKL); the MATMPISBAIJ matrix type is based on the block compressed sparse row format, and only the upper triangular portion of the matrix is stored; in the MATSELL format the block size makes the connection between the slim CSR format and the GPU-friendly ELLPACK format, for cross-platform efficient use [60].

MatType         Description
MATAIJ          "aij" type for sparse matrices
MATSBAIJ        "sbaij" type for symmetric block sparse matrices
MATMPIAIJ       "mpiaij" type for parallel sparse matrices
MATMPIAIJMKL    type for parallel sparse matrices with Intel MKL support
MATMPISBAIJ     "mpisbaij" type for distributed symmetric sparse block matrices
MATHYPRE        "hypre" type for sequential and parallel sparse matrices based on the hypre IJ interface
MATSELL         "sell" type for sparse parallel matrices in SELL format

To convert an LDU matrix to a CSR matrix, we propose a traversal-based algorithm, as described in [41] and shown in pseudo-code form in Algorithm 1. For good matrix assembly performance, PETSc requires pre-allocating the matrix storage by setting the number of non-zeros per row both in the diagonal and in the off-diagonal portion of the local submatrix. In order to reduce the overhead of the conversion, we implemented the same optimizations proposed by [41]. Firstly, a special treatment for symmetric matrices is introduced, in which the a_ij element is equal to the corresponding a_ji element; therefore, we can double the conversion efficiency by converting only the diagonal and the upper triangular portion of A (lines 4-14 of Algorithm 1). Secondly, we can reduce the access frequency of the matrix elements: as described above, the non-zero values of the matrix are stored, in CSR format, in a contiguous array value, row by row, ordered left-right and top-down, which is the same order used to store the upper triangular matrix U in the vector upper for the LDU format. Thus, we can copy row by row from the LDU into the CSR matrix, rather than element by element (lines 21-29 of Algorithm 1). The LDU matrix is traversed once and for all to compute the exact number of non-zero values per row, as motivated above, and once per time step (except when caching, see Sec. 3.6. later on) to set the coefficients in the PETSc matrix.
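The traversal described above can be sketched as follows. This is a simplified pure-Python illustration with hypothetical helper names, not the PETSc4FOAM implementation or Algorithm 1 itself; OpenFOAM's real lduAddressing is more involved. The LDU format stores the diagonal per cell and one off-diagonal coefficient per internal face, addressed by the face's owner (lowerAddr) and neighbour (upperAddr) cells, with owner < neighbour.

```python
# Sketch of an LDU -> CSR conversion: gather (row, col, value) triples
# from the diagonal and face arrays, then compress them row by row.
# For symmetric matrices a_ul = a_lu, so the lower array can be skipped.

def ldu_to_coo(diag, upper, lower, lower_addr, upper_addr, symmetric=False):
    """Collect (row, col, value) triples from an LDU matrix."""
    entries = [(i, i, d) for i, d in enumerate(diag)]
    for f, (l, u) in enumerate(zip(lower_addr, upper_addr)):
        entries.append((l, u, upper[f]))                         # upper triangle
        entries.append((u, l, upper[f] if symmetric else lower[f]))
    return entries

def coo_to_csr(entries, n):
    """Sort triples row-major and build the three CSR arrays."""
    entries.sort(key=lambda e: (e[0], e[1]))
    value = [v for _, _, v in entries]
    col_index = [c for _, c, _ in entries]
    row_index = [0] * (n + 1)
    for r, _, _ in entries:
        row_index[r + 1] += 1
    for i in range(n):                 # prefix sum over row counts
        row_index[i + 1] += row_index[i]
    return value, col_index, row_index

# 1-D chain of 3 cells with 2 internal faces: cell0-cell1, cell1-cell2
diag = [4.0, 4.0, 4.0]
upper = [-1.0, -1.0]
lower_addr = [0, 1]   # owner of each face
upper_addr = [1, 2]   # neighbour of each face
entries = ldu_to_coo(diag, upper, None, lower_addr, upper_addr, symmetric=True)
value, col_index, row_index = coo_to_csr(entries, 3)
```

In the real library the sort is unnecessary because, as noted above, the upper vector is already stored in CSR row order, so rows can be copied directly; the prefix-sum pass corresponds to the non-zeros-per-row count that PETSc needs for pre-allocation.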
The performance of the algorithm has been tested on a 3-D Poisson problem (representative of the pressure equation of the Navier-Stokes system) over a uniform mesh, see Table 2 and Fig. 8. The number of rows, equal to the cube of the one-dimensional mesh size, is increased by doubling the mesh in each direction, up to a value of 8 million rows. The average time (time) is the time averaged over different tests of the same configuration (run ID), using two different Mat types: MATMPISBAIJ and MATMPIAIJMKL. Each test runs a kernel code which consists essentially of the matrix conversion. ratio is the ratio between the time employed to convert the matrix in the corresponding run ID and that of the previous one. mat ratio is the ratio between the average times of the two Mat types analyzed for the same run ID. Figure 8 shows that the algorithm scales linearly (with a factor of 8) with run ID from 1 up to 4, for both Mat types analyzed. Moreover, it is possible to gain a speed-up of 1.7 if the symmetry property of the matrix is exploited, i.e. when using the MATMPISBAIJ type in place of the MATMPIAIJMKL one. For the sake of completeness, Fig. 8 also reports the results for the MATMPIAIJ type; its trend overlaps with that of the MATMPIAIJMKL type.

The PETSc4FOAM library
The PETSc4FOAM library has been designed to be decoupled from the compilation of OpenFOAM; it is released as a standalone git submodule [25]. The dependencies of the library are, obviously, OpenFOAM and PETSc. The code structure of the library is shown in Figure 9. The code is split into three main folders: etc, tutorials and src. We collect in etc all the utility files, in src the source files of the library, and in tutorials some examples that show how to use it. All the Allwclean and Allwmake files are required by the wmake build system. The source files are split into three folders: Make, csrMatrix/solvers and utils. In Make there are two files needed by the wmake build system. In the csrMatrix/solvers folder we added a petscSolver.C file that contains the implementation of the petscSolver class, a C++ wrapper of the PETSc KSP solver object. This class, which inherits from lduMatrix::solver, is responsible, among other steps, for running the LDU2CSR algorithm in order to convert the LDU matrix into a Mat CSR matrix, if needed. In utils we collect all the classes that are needed by the petscSolver class. In particular:
• petscControls.C: a utility class responsible for correctly initializing and finalizing the PETSc library
• petscLinearSolverContexts.C: a utility class responsible for keeping alive all the PETSc objects needed by petscSolver.C between successive iterations
• petscCacheManager.C: a manager class to cache the CSR matrix and its preconditioner between successive iterations
• petscWrappedVector.H: a utility class to wrap an OpenFOAM vector (e.g. the rhs) with a PETSc vector using a shallow-copy paradigm
• petscUtils.C: a collection of utility functions used by all the classes of the PETSc4FOAM library
The user has two viable ways to set up the PETSc options:
i. the PETSc approach, by setting the options in a petscrc file
ii. the OpenFOAM approach, by setting them in a petsc/options subdictionary of the fvSolution file.
All the options not related to PETSc are set using the OpenFOAM approach. An example is reported in the code of Listing 1.
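To fix ideas, a petsc/options subdictionary in fvSolution could look like the following. This is a hypothetical sketch, not the paper's Listing 1; the exact keywords accepted by PETSc4FOAM may differ, and the option names mirror PETSc's usual command-line database entries.

```
solvers
{
    p
    {
        solver          petsc;   // dispatch to the PETSc4FOAM wrapper

        petsc
        {
            options
            {
                ksp_type    cg;      // Krylov method (KSP object)
                pc_type     hypre;   // preconditioner (PC object)
            }
        }
    }
}
```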

caching update
Listing 1 shows an example of usage of the caching algorithm implemented in the library for the solver matrix and its preconditioner. This algorithm stores in RAM the matrix coefficients, in CSR format, and its preconditioner at a given time step, for reuse in subsequent time steps according to the frequency set by the user. The user can select the cache update frequency among the following choices:
• always (default)
• never
• periodic
• adaptive
With the first option, "always", the cache update is done at each time step of the time-dependent simulation; this is the library's default configuration. With the second option, "never", the storing of the matrix coefficients is done once, at the beginning of the time-dependent simulation, and the coefficients are re-used for all the time steps. In the case of a constant-coefficient matrix, no error is introduced, and a gain in terms of performance is obtained by avoiding converting and storing the matrix at each time step. The third option, "periodic", means that the cache is updated every n steps. The fourth option, "adaptive", is the most complex one; it implements an adaptive procedure that computes the number of iterations after which the cache is updated. It is based on a formula proposed in [62], where T₀ is the solver time after the matrix cache update, T₁ is the solver time at the first step after the matrix cache update, and T is the current solver time.
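The first three policies can be sketched as a small decision routine (a hypothetical illustration; the library's petscCacheManager.C is the real implementation). The "adaptive" policy is deliberately omitted here, since its update formula comes from [62] and is not reproduced in this text.

```python
# Decide, per time step, whether the cached CSR matrix and its
# preconditioner must be rebuilt, for the "always", "never" and
# "periodic" policies described above.

class CacheManager:
    def __init__(self, policy="always", period=10):
        self.policy = policy
        self.period = period

    def needs_update(self, step):
        if self.policy == "always":
            return True                      # rebuild at every step (default)
        if self.policy == "never":
            return step == 0                 # build once, reuse forever
        if self.policy == "periodic":
            return step % self.period == 0   # rebuild every `period` steps
        raise ValueError(f"unknown policy: {self.policy}")

mgr = CacheManager(policy="periodic", period=5)
updates = [step for step in range(12) if mgr.needs_update(step)]
# updates == [0, 5, 10]
```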

OpenFOAM HPC Benchmark suite
The code repository of the HPC TC [23] is a shared repository with relevant data-sets and information, created in order to:
• provide user guides and initial scripts to set up and run the different data-sets on different HPC architectures
• provide the community with a homogeneous term of reference to compare different hardware architectures, software environments, configurations, etc.
• define a common set of metrics/KPIs (Key Performance Indicators) to measure performance

3-D Lid-driven cavity flow
The first test-case chosen is a 3-D version of the lid-driven cavity flow tutorial [63]. This test-case has simple geometry and boundary conditions, involving transient, isothermal, incompressible laminar flow in a three-dimensional box domain. The icoFoam solver, which solves the Navier-Stokes equations using the PISO algorithm, is adopted [64,65].

Definition of geometrical and physical properties
In this simple geometry, all the boundaries of the cubic domain are walls. The top wall moves in the x-direction at a speed of 1 m/s, while the other five are stationary. Three different sizes have been selected: S, M and XL, for Small, Medium and eXtra-Large test-case, respectively. Table 3 and Fig. 10 show the geometrical and physical properties of the different test-cases, which have an increasing number of cells: 1 million (S), 8 million (M) and 64 million (XL) cells, obtained by halving ∆x when moving from the smaller to the bigger test-case. The Courant number Co = U ∆t/∆x is kept under the stability limit, and it is halved when moving to bigger test-cases. The time step ∆t is therefore reduced, proportionally to Co and ∆x, by a factor of 4. The physical time to reach the steady state in laminar flow 1 is T = 0.5. For the results shown in the following Sec. 5., the XL test-case is used with a number of iterations equal to 100, to reduce the computational time.
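The factor of 4 follows directly from rearranging the Courant relation for the time step, ∆t = Co ∆x / U: halving both Co and ∆x quarters ∆t. The values below are illustrative ratios only, not the actual mesh spacings of Table 3.

```python
# dt = Co * dx / U: moving from one test-case size to the next,
# dx is halved and Co is halved, so dt shrinks by a factor of 4.

def time_step(Co, dx, U=1.0):
    return Co * dx / U

dt_S  = time_step(Co=0.5,   dx=1.0)    # baseline, arbitrary units
dt_M  = time_step(Co=0.25,  dx=0.5)    # dx halved, Co halved
dt_XL = time_step(Co=0.125, dx=0.25)
# dt_S / dt_M == 4.0 and dt_M / dt_XL == 4.0
```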

Methodology
The algorithm implemented in the icoFoam solver consists mainly of two parts: the solution of the momentum equation and the solution of the pressure (or continuity) equation. The momentum equation is discretized by the Finite Volume Method (FVM) using under-relaxation to ensure matrix diagonal dominance. This leads to a linear system that can be solved in a highly efficient way. The pressure equation in icoFoam is essentially the discrete form of the Poisson equation, and it has been shown by many studies [66] that it can be solved efficiently by the conjugate gradient (CG) algorithm if a proper choice of preconditioner is made. Since the profiling analysis (see Fig. 4) has shown that most of the computational time is spent during the solution of the pressure equation (around 67%), we focus our effort on the comparison of different preconditioners, keeping fixed the solver/preconditioner pair for the momentum equation. The comparison is performed between the pressure solvers reported in Table 4. FOAM-DIC-PCG is an iterative solver provided by the OpenFOAM package that combines a diagonal-based incomplete-Cholesky preconditioner (DIC) with a Conjugate Gradient solver (PCG). DIC is a common preconditioner in OpenFOAM, thanks to its easy configuration [67]. FOAM-GAMG-PCG is an iterative solver implemented in OpenFOAM that uses a Conjugate Gradient (CG) solver accelerated by a generalized geometric-algebraic multigrid (GAMG) method that supports both geometric and algebraic multigrid. Its parameters are set in Table 5, chosen according to our experience. PETSc-AMG-CG is the PETSc counterpart of the FOAM-GAMG-PCG method. In this solver we use an algebraic multigrid method, named BoomerAMG, provided by the third-party library of PETSc named Hypre. All the parameters of BoomerAMG are set in Table 5.
These parameter values have been chosen empirically, after a series of tests aiming to minimize the single-iteration time and maximize the scalability in the range of 8-128 nodes.
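To make the role of the preconditioner concrete, the sketch below implements a plain preconditioned Conjugate Gradient in pure Python and compares no preconditioning against a simple Jacobi (diagonal) preconditioner on a small Poisson-like system. This is only a toy analogue of the DIC/AMG preconditioners discussed above; the matrix and tolerances are illustrative, not the paper's set-up.

```python
# Minimal preconditioned CG (PCG) sketch on a dense list-of-lists matrix.
def pcg(A, b, precond, tol=1e-8, max_it=1000):
    """Solve A x = b with CG; precond(r) applies M^-1 to a residual."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # r = b - A*0
    z = precond(r)
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for it in range(1, max_it + 1):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if max(abs(ri) for ri in r) < tol:
            return x, it
        z = precond(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        beta = rz_new / rz
        rz = rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x, max_it

# SPD tridiagonal test matrix with a strongly varying diagonal, so that
# diagonal scaling actually pays off (a flat diagonal would hide the effect).
n = 50
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 2.0 + (i % 4) * 100.0
    if i > 0: A[i][i - 1] = -1.0
    if i < n - 1: A[i][i + 1] = -1.0
b = [1.0] * n

identity = lambda r: r[:]                               # no preconditioning
jacobi = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]

x_plain, it_plain = pcg(A, b, identity)
x_jac, it_jac = pcg(A, b, jacobi)
print("iterations (no precond):", it_plain)
print("iterations (Jacobi):    ", it_jac)
```

The same trade-off drives the results below: a better preconditioner costs more per application (or in set-up) but cuts the iteration count.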
PETSc-AMG-CG + caching is the same solver described above, but the matrix is converted only once at the beginning and cached, together with the preconditioner, for all the time steps. In this particular case of a constant-coefficients matrix, the numerical solution is equivalent to the case without caching.

Two exit criteria can be used to stop the iterative solvers. In the first case, FIXEDITER, the number of iterations, and hence the computational load, is fixed; two solving methods with different convergence rates will exit with different norms. This case is useful for comparing different hardware configurations while keeping the computational load constant.
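The caching idea can be sketched in a few lines: for a constant-coefficient matrix, the format conversion and the preconditioner set-up run once at the first time step and are reused afterwards. The names `convert` and `setup_pc` below are hypothetical stand-ins for the library's internals (the ldu2csr conversion and PETSc's PCSetUp, respectively).

```python
# Minimal sketch of conversion/preconditioner caching across time steps.
class CachedSolverContext:
    def __init__(self, convert, setup_pc):
        self._convert = convert      # stand-in for the ldu2csr conversion
        self._setup_pc = setup_pc    # stand-in for the preconditioner set-up
        self._csr = None
        self._pc = None

    def solve(self, ldu_matrix, rhs, solve_fn):
        if self._csr is None:                    # first time step only
            self._csr = self._convert(ldu_matrix)
            self._pc = self._setup_pc(self._csr)
        return solve_fn(self._csr, self._pc, rhs)

# Toy usage: count how many times conversion/set-up actually run.
calls = {"convert": 0, "setup": 0}

def convert(m):
    calls["convert"] += 1
    return m

def setup_pc(csr):
    calls["setup"] += 1
    return "pc"

ctx = CachedSolverContext(convert, setup_pc)
for step in range(100):                          # 100 time steps
    ctx.solve("LDU", [1.0], lambda m, pc, b: b)
print(calls)                                     # conversion and set-up ran once
```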
In the second case, FIXEDNORM, the exit norm is the same, whereas the number of iterations is generally different. Unfortunately, the residual norm implemented in the standard release of OpenFOAM is not available in PETSc, which prevents a direct use of the second criterion. The residual norm implemented in OpenFOAM is the L1-norm of the residual, scaled by a normalisation factor built from the average of the solution vector:

r = Σ_i |b_i − (Ax)_i| / n,   with   n = Σ_i ( |(Ax)_i − (A x̄)_i| + |b_i − (A x̄)_i| ),   (8)

where x̄ is the cell-average of the solution x. The FOAM-L1 norm has been implemented in the PETSc4FOAM package as a new convergence criterion: a new function, named foamKSPConverge, computes the custom residual norm of eq. (8) and is registered in PETSc as an argument of the KSPSetConvergenceTest function. For the results shown in Sec. 5., all tests are performed using the FIXEDNORM criterion, with an exit norm value of 1.0e-04.
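The scaled FOAM-L1 norm that foamKSPConverge evaluates can be sketched in pure Python as follows; this is an illustrative re-implementation of the formula (dense matrix-vector products for clarity), not the PETSc4FOAM code itself.

```python
# Scaled L1 residual norm in OpenFOAM's style:
#   r = sum|b - A x| / n,  n = sum(|A x - A xbar| + |b - A xbar|) + small,
# where xbar is the average of the solution vector.
def foam_l1_norm(A, x, b, small=1e-20):
    n = len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    xbar = sum(x) / n                                  # average of the solution
    Axbar = [sum(A[i][j] * xbar for j in range(n)) for i in range(n)]
    norm_factor = sum(abs(Ax[i] - Axbar[i]) + abs(b[i] - Axbar[i])
                      for i in range(n)) + small       # avoid division by zero
    return sum(abs(b[i] - Ax[i]) for i in range(n)) / norm_factor

A = [[2.0, 0.0], [0.0, 2.0]]
b = [2.0, 4.0]
print(foam_l1_norm(A, [1.0, 2.0], b))   # exact solution -> 0.0
print(foam_l1_norm(A, [0.0, 0.0], b))   # zero initial guess -> ~1.0
```

In PETSc, a function of this kind would be installed as the custom convergence test via KSPSetConvergenceTest, exiting when the value drops below the chosen exit norm (here 1.0e-04).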

Platform and software stack
All

Performance and scalability results
Firstly, we investigate the strong scalability of the PISO algorithm, keeping the preconditioner/solver pair of the momentum equation fixed and using the solvers for the pressure equation described in Table 4. The mesh size is kept constant (case XL of Table 3) while the number of processes ranges from 288 (8 nodes) to 4608 cores (128 nodes). Correspondingly, the number of cells per core varies from ≈220,000 down to ≈13,000; note that ≈20,000 cells per core is commonly considered the minimum threshold to obtain good scalability when running OpenFOAM in parallel. Total (wall-clock) time results are shown in Table 6 and in Figure 11. The average time per time step is shown in Figure 12. Speed-up is shown in Figure 13. Table 7 reports the PCSetUp time, the single-iteration time, the number of iterations and the total time using 8 nodes (288 cores).
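The cells-per-core range quoted above follows directly from the mesh size and the core counts; a few lines of Python reproduce the bookkeeping (the 36 cores per node follow from 288 cores on 8 nodes).

```python
# Cells per core for the strong-scaling runs on the XL case (64M cells).
n_cells = 64_000_000
cores_per_node = 288 // 8            # 36 cores per node
for nodes in [8, 16, 32, 64, 128]:
    cores = nodes * cores_per_node
    cpc = n_cells / cores
    flag = "below" if cpc < 20_000 else "above"
    print(f"{nodes:4d} nodes ({cores:5d} cores): "
          f"{cpc:9.0f} cells/core, {flag} the 20k threshold")
```

Only the 128-node run falls below the 20,000 cells-per-core threshold, which is relevant for interpreting the scalability results below.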
The results clearly show that when using few nodes (e.g. 8 nodes) the multigrid preconditioners outperform the incomplete-factorization ones. Table 7 reports that, even if the set-up cost of the multigrid preconditioner is much higher than that of the incomplete factorizations, both the single-iteration time and the number of iterations to converge are lower for the multigrid preconditioners. The net result is a faster preconditioner/solver combination for PISO-FOAM-GAMG-PCG and PISO-PETSc-AMG-CG in the 8-node case. A further improvement is obtained with caching (i.e. PISO-PETSc-AMG-CG + caching), since the PCSetUp time is null in this case. The same result does not hold when using many more nodes. The CG preconditioned by incomplete factorizations scales super-linearly with respect to the multigrid methods, as shown in Figure 13, while PETSc-AMG-CG has excellent linear scalability up to 64 nodes (2304 cores) and FOAM-GAMG-PCG only up to 16 nodes (576 cores). Therefore, the difference in total time between the CG preconditioned by incomplete factorizations and the two multigrid methods is less pronounced at 32 nodes than at 8 nodes. The net result is that PISO-PETSc-AMG-CG is 6 times faster than PISO-FOAM-DIC-PCG at 8 nodes, but only twice as fast at 128 nodes. Better results can be achieved if caching is turned on in the PISO-PETSc-AMG-CG solver, both in terms of scalability and performance. From this analysis it is clear that PISO-PETSc-AMG-CG is the best solver: even if it is not the fastest solver at 8 nodes, its excellent scalability and rate of convergence outperform all the other solvers in all the other cases (from 16 to 128 nodes). As already said, PETSc-AMG-CG is the pressure solver of the PISO algorithm. Figure 14 shows the percentage of time spent in the pressure solver with respect to the momentum solver and the LDU2CSR conversion in the range of 8-128 nodes.
The plot shows that the overhead of the conversion is around 2% and decreases slightly as the number of nodes increases. The ratio between the pressure solver and the momentum solver is not constant, decreasing from 79% to 55% when moving from 8 to 64 nodes; in particular, a bigger jump is found moving from 64 to 128 nodes. Figure 15 shows the average time per time step of the PETSc functions PCSetUp, PCApply and KSPSolve in the interval of 8-128 nodes. The plot shows the same trend as Figure 11, that is, excellent scalability up to 64 nodes. The worsening of the scalability at 128 nodes is explained by the very low number of cells per core, below the threshold of 20,000 cells per process (and, as a consequence, a very small average time per time step to be measured). We are confident that excellent scalability up to tens of thousands of cores can be achieved with this solver set-up on a bigger case, keeping the number of cells per core above the threshold of 20,000. Moreover, the addition of a PETSc solver for the momentum equations has still to be investigated, and a further time reduction is expected (see Fig. 14).
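The LDU2CSR conversion whose ~2% overhead is reported above can be sketched as follows: OpenFOAM's LDU storage (diagonal, upper and lower coefficient arrays plus per-face owner/neighbour addressing) is translated into the CSR triplet (row_ptr, col_idx, vals) that PETSc's AIJ matrices expect. This is an illustrative pure-Python re-implementation, not the PETSc4FOAM code itself.

```python
# LDU -> CSR conversion sketch. In OpenFOAM's convention, for each internal
# face f: owner[f] < neighbour[f], upper[f] = A[owner][neighbour] and
# lower[f] = A[neighbour][owner].
def ldu2csr(n_cells, diag, upper, lower, owner, neighbour):
    # Collect one (row, col, val) entry per non-zero
    entries = [(i, i, diag[i]) for i in range(n_cells)]
    for f in range(len(owner)):
        o, nb = owner[f], neighbour[f]
        entries.append((o, nb, upper[f]))     # upper triangle
        entries.append((nb, o, lower[f]))     # lower triangle
    entries.sort()                            # CSR wants rows (then cols) sorted

    row_ptr = [0] * (n_cells + 1)
    col_idx, vals = [], []
    for r, c, v in entries:
        row_ptr[r + 1] += 1                   # count non-zeros per row
        col_idx.append(c)
        vals.append(v)
    for r in range(n_cells):                  # prefix-sum the row counts
        row_ptr[r + 1] += row_ptr[r]
    return row_ptr, col_idx, vals

# 1-D chain of 3 cells with 2 internal faces: a tridiagonal matrix
row_ptr, col_idx, vals = ldu2csr(
    n_cells=3,
    diag=[2.0, 2.0, 2.0],
    upper=[-1.0, -1.0], lower=[-1.0, -1.0],
    owner=[0, 1], neighbour=[1, 2],
)
print(row_ptr)   # [0, 2, 5, 7]
print(col_idx)   # [0, 1, 0, 1, 2, 1, 2]
```

Because the cost is a sort plus two linear passes over the non-zeros, the conversion scales with the matrix size, consistent with the small and slowly decreasing overhead observed above.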

Remark 1
In the simulations reported in this work, the computational set-up used is at full computational density, that is, using all the available cores per node (fully populated node). In other words, we performed the strong-scaling tests using a number of MPI processes per node equal to the number of cores per node. This configuration is adequate for tests but not the best one for production runs, since it is highly memory-bound. In fact, Fig. 16 reports the bound percentage and the memory bandwidth versus the computational density (i.e. the number of cores used per node) for the FOAM-DIC-PCG configuration. It is observed that the average and peak memory bandwidth decrease slightly when passing from the fully to the half-populated node configuration, with the peak memory bandwidth close to the theoretical maximum value. A dramatic decrease of these quantities occurs in the quarter-populated node configuration: the bound quantity goes to zero and the code switches from a MEMORY-BOUND to an MPI-BOUND regime.

Table 6: Solution time [s] with different numbers of cores for the solvers of Table 4.

Table 7: Preconditioner set-up time, single PISO iteration time, CG iterations (at the second PISO iteration) and total time using 288 cores (8 nodes).

Conclusion
This work has presented a direct approach to improving the performance of OpenFOAM in an HPC environment by reducing the time for solving linear systems of equations. Specifically, a library, PETSc4FOAM, has been developed to use PETSc, and other linear algebra solver packages, as external libraries within the OpenFOAM framework. The PETSc4FOAM library relies upon an ldu2csr converter to pass from the LDU format to CSR. The performance of this conversion has been tested using two different PETSc Mat types (MATMPISBAIJ and MATMPIAIJMKL): it scales linearly with increasing matrix size, and its overhead in the reported tests is around 2%, decreasing slightly as the number of nodes increases. The test-case investigated is a 3-D version of the lid-driven cavity flow with 64 million cells (test-case XL of Tab. 3), run with the icoFoam solver. The pressure solver is the most computationally intensive part of the code (typically around 70% of the total time). Our numerical tests focus on the comparison of different preconditioners, keeping the solver/preconditioner pair for the momentum equations fixed. The iterative methods have been run in the FIXEDNORM case, that is, using the same exit norm for the two different classes of iterative methods.
The results clearly show that when using few nodes (e.g. 8 nodes) the multigrid preconditioners outperform the incomplete-factorization ones. In fact, PISO-PETSc-AMG-CG is 6 times faster than PISO-FOAM-DIC-PCG at 8 nodes (288 cores), but only twice as fast at 128 nodes (4608 cores). The reason is that the CG preconditioned by incomplete factorizations scales super-linearly with respect to the multigrid methods in the fully populated node configuration. Conversely, FOAM-GAMG-PCG scales well only up to 16 nodes (576 cores), while PETSc-AMG-CG has excellent linear scalability up to 64 nodes (2304 cores). Better results can be achieved if caching is turned on in the PISO-PETSc-AMG-CG solver, both in terms of scalability and performance. From this analysis, we can conclude that PISO-PETSc-AMG-CG + caching is the best solver, due to its excellent scalability and rate of convergence.