Advantages of static condensation in implicit compressible Navier-Stokes DGSEM solvers

We consider implicit time-marching schemes for the compressible Navier-Stokes equations, discretised using the Discontinuous Galerkin Spectral Element Method with Gauss-Lobatto nodal points (GL-DGSEM). We compare classic implicit strategies for the full Jacobian system to our recently developed static condensation technique for GL-DGSEM (Rueda-Ramírez et al. (2019), A Statically Condensed Discontinuous Galerkin Spectral Element Method on Gauss-Lobatto Nodes for the Compressible Navier-Stokes Equations [1]). The Navier-Stokes system is linearised using a Newton-Raphson method and solved using an iterative preconditioned-GMRES solver. Both the full and statically condensed systems benefit from a Block-Jacobi preconditioner. We include theoretical estimates for the various costs involved (i.e. calculation of full and condensed Jacobians, factorising and inverting the preconditioners, GMRES steps and overall costs) to clarify the advantages of using […]

Preprint submitted to Computers & Fluids, May 28, 2020

solvers for turbulent problems, which is out of the scope of this work. Point ILU has also been successfully used for aerodynamic applications in [40,41]. Persson and Peraire [32] and Gopalakrishnan and Kanschat [42] showed that element-block based preconditioners are essential to eliminate high p-dependent errors. It is also very natural to exploit the element-block structure of the Jacobian (especially in parallel computations, due to the block locality that enables block inversions to be performed locally), as most of these methods require the direct factorisation of block matrices. Note that this can become troublesome for high polynomial orders, especially in three-dimensional flows. In this work, we select a Block-Jacobi preconditioner and show that, when condensing the system, the preconditioner scales more gently for high polynomial orders than the preconditioner for the full system. This translates into lower costs for all the steps where the preconditioner is required (i.e. factorisation of the blocks and GMRES steps involving the preconditioner), and paves the way to using high polynomial orders efficiently.

Our comparisons are novel in that the static condensation technique, recently developed for GL-DGSEM by the authors, is directly compared against state-of-the-art implicit preconditioned-GMRES solvers to show […] an analytical Jacobian. Further details on how the Jacobian can be obtained, along with the peculiarities and sparsity patterns resulting from using Gauss-Lobatto nodal points, can be found in our previous works [1,44].
where Q is a vector that stores the conservative variables in all degrees of freedom of the domain, F(Q) encompasses both discrete convective and diffusive fluxes, M is the mass matrix, which is diagonal in the nodal DGSEM approach, and S is a source term.

We replace the continuous-in-time derivative in (1) by a discrete implicit time integration scheme using Backward Differentiation Formulas (BDF) of order 1 and 2 (BDF1 or BDF2),

$$\frac{\partial Q}{\partial t} \approx \frac{\Delta Q}{\Delta t},$$

where the operator $\Delta Q/\Delta t$ is a function of the solution at the next time step, $Q^{s+1}$ (the unknown), the current time step, $Q^s$, and possibly previous time steps. When treated implicitly, the nonlinear operator F in equation (1) is evaluated for the unknown solution, $Q^{s+1}$. Considering this, equation (1) can then be rewritten as […]

Note that in the DGSEM approach the mass matrix M is diagonal and can be trivially inverted, leading to an efficient discontinuous Galerkin method.
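The BDF replacement above can be sketched in a few lines (a minimal illustration; the function name and calling convention are ours, not part of the paper's solver):

```python
def bdf_time_derivative(Q_new, Q_hist, dt, order):
    """Discrete approximation of dQ/dt for BDF1/BDF2.

    Q_hist holds previous solutions, most recent first:
    Q_hist[0] = Q^s, Q_hist[1] = Q^{s-1}.
    """
    if order == 1:   # BDF1: (Q^{s+1} - Q^s) / dt
        return (Q_new - Q_hist[0]) / dt
    if order == 2:   # BDF2: (3 Q^{s+1} - 4 Q^s + Q^{s-1}) / (2 dt)
        return (3.0 * Q_new - 4.0 * Q_hist[0] + Q_hist[1]) / (2.0 * dt)
    raise ValueError("only BDF1 and BDF2 are implemented")
```

As a sanity check, BDF2 is exact for quadratic solutions: with Q(t) = t² and dt = 1, it returns 4.0 at t = 2, matching the exact derivative 2t.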

When computing steady flows, we are not interested in producing an accurate solution in time, and therefore we use an implicit BDF of order 1 to advance until steady state. However, for unsteady cases we will use an implicit BDF of order 2 and shorter time steps to obtain accurate solutions in time.

The nonlinear system of equations, (3), can be solved using Newton-Raphson iterations to obtain the linear system […] where $A = \frac{\partial R}{\partial Q}(\tilde{Q}^{s+1})$ is the Jacobian matrix evaluated at $\tilde{Q}^{s+1}$, an approximation to the unknown solution $Q^{s+1}$. The right-hand side is […]

In the GL-DGSEM framework, we can statically condense system (4) to obtain the following form,

$$\begin{pmatrix} A_{bb} & A_{bi} \\ A_{ib} & A_{ii} \end{pmatrix} \begin{pmatrix} \Delta Q_b \\ \Delta Q_i \end{pmatrix} = \begin{pmatrix} R_b \\ R_i \end{pmatrix}, \qquad (5)$$

where the subindexes b and i denote boundary and interior nodes, respectively.
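The Newton-Raphson loop around the linear solve can be sketched as follows (a toy residual R and Jacobian stand in for the DGSEM operators; names are illustrative):

```python
import numpy as np

def newton_solve(R, dRdQ, Q0, tol=1e-12, max_iter=20):
    """Newton-Raphson: repeatedly solve A dQ = -R(Q) and update Q."""
    Q = np.array(Q0, dtype=float)
    for _ in range(max_iter):
        A = dRdQ(Q)                            # Jacobian at the current iterate
        dQ = np.linalg.solve(A, -R(Q))
        Q = Q + dQ
        if np.linalg.norm(dQ, np.inf) < tol:   # Newton tolerance on ||dQ||
            break
    return Q

# Toy nonlinear system: R(Q) = (Q0^2 - 2, Q0*Q1 - 2), root (sqrt(2), sqrt(2))
R = lambda Q: np.array([Q[0]**2 - 2.0, Q[0]*Q[1] - 2.0])
dRdQ = lambda Q: np.array([[2.0*Q[0], 0.0], [Q[1], Q[0]]])
Q = newton_solve(R, dRdQ, [1.0, 1.0])
```

In the solver described in the text, the inner `np.linalg.solve` call is replaced by the preconditioned-GMRES iteration on the full or condensed system.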

The main interest of the method is to obtain a block diagonal matrix $A_{ii}$ that can be inverted cheaply and locally (element by element). Additionally, the boundary matrix, including the degrees of freedom linking boundaries between elements, is greatly reduced by the use of Gauss-Lobatto points in DGSEM [1]. The resulting system is equivalent to the full system, but can be decoupled into two subsystems. The first one, for the skeleton of the mesh, our condensed system of equations, is

$$A_{cond}\,\Delta Q_b = R_b - A_{bi} A_{ii}^{-1} R_i, \qquad A_{cond} = A_{bb} - A_{bi} A_{ii}^{-1} A_{ib},$$

where it is trivial to substitute and solve for the second system,

$$\Delta Q_i = A_{ii}^{-1}\left(R_i - A_{ib}\,\Delta Q_b\right). \qquad (6)$$

One of the main advantages of the static condensation is the reduced size of the matrix $A_{cond}$ (with only the mesh-skeleton degrees of freedom) in comparison with the original Jacobian matrix A (with all the degrees of freedom in the mesh). We can quantify the number of degrees of freedom for our GL-DGSEM discretisation. The Jacobian matrix A has size

$$n = N_{el} \cdot nb, \qquad (7)$$

where $N_{el}$ is the number of elements and $nb$ is the size of each element-block.
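The condensation and back-substitution above can be verified on a small dense example (sizes and matrix entries are arbitrary toy choices; in the DGSEM setting $A_{ii}$ is block diagonal and inverted element by element):

```python
import numpy as np

rng = np.random.default_rng(0)
n_b, n_i = 4, 6                          # toy boundary / interior sizes
n = n_b + n_i
A = rng.standard_normal((n, n)) + 10.0 * np.eye(n)   # well-conditioned test matrix
R = rng.standard_normal(n)

# Partition into boundary (b) and interior (i) blocks
Abb, Abi = A[:n_b, :n_b], A[:n_b, n_b:]
Aib, Aii = A[n_b:, :n_b], A[n_b:, n_b:]
Rb, Ri = R[:n_b], R[n_b:]

# Condensed (Schur-complement) system for the mesh skeleton
Aii_inv = np.linalg.inv(Aii)             # block diagonal in practice -> local inverses
A_cond = Abb - Abi @ Aii_inv @ Aib
dQb = np.linalg.solve(A_cond, Rb - Abi @ Aii_inv @ Ri)

# Trivial back-substitution for the interior nodes
dQi = Aii_inv @ (Ri - Aib @ dQb)

# The condensed solve reproduces the full solve
assert np.allclose(np.concatenate([dQb, dQi]), np.linalg.solve(A, R))
```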
Then, assuming mesh elements with isotropic polynomial order P, we can describe the size of each block nb as a function of P, the dimension d (e.g. d = 3 for 3D meshes) and the number of conservative variables (or equations) in the computational domain for the Navier-Stokes equations, $N_{eq}$ (e.g. $N_{eq} = 5$ in 3D):

$$nb = N_{eq}\,(P+1)^d. \qquad (8)$$

Equation (7) can also be used to describe the size of the matrices $A_{ii}$ and $A_{bb}$, involved in the Schur complement computation and included in the statically condensed system (5), with $n_{ii} = N_{el} \cdot nb_{ii}$ and $n_{bb} = N_{el} \cdot nb_{bb}$, the only difference being the block sizes. Here, the block size of the element-skeleton matrix $nb_{bb}$ directly corresponds to the size of the block of the final Schur complement $A_{cond}$. The blocks for the condensed matrix arise from having decoupled the element interior i from the element boundary nodes b, leaving fewer degrees of freedom per block. Thus, the size of the block of matrix $A_{ii}$, which corresponds to the interior of the elements, is

$$nb_{ii} = N_{eq}\,(P-1)^d. \qquad (9)$$

Consequently, the size of the block of $A_{bb}$ and $A_{cond}$ can be defined as the difference between the size of the element-block and the interior element part,

$$nb_{bb} = nb - nb_{ii} = N_{eq}\left[(P+1)^d - (P-1)^d\right], \qquad (10)$$

and with these blocks, the final size of the matrices can be easily computed from equation (7).
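These block-size formulas can be checked directly (helper names are ours; the closed forms for $nb_{bb}$ are the ones quoted later in the text):

```python
def nb_full(P, d, Neq=5):
    """Element-block size of the full Jacobian: Neq * (P+1)^d."""
    return Neq * (P + 1)**d

def nb_ii(P, d, Neq=5):
    """Interior block size: Neq * (P-1)^d."""
    return Neq * (P - 1)**d

def nb_bb(P, d, Neq=5):
    """Skeleton block size: element block minus interior block."""
    return nb_full(P, d, Neq) - nb_ii(P, d, Neq)

# Closed forms quoted in the text: Neq*(6P^2 + 2) in 3D and Neq*4P in 2D
for P in range(2, 9):
    assert nb_bb(P, 3) == 5 * (6 * P**2 + 2)
    assert nb_bb(P, 2) == 5 * (4 * P)
```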

Additionally, it is possible to obtain estimates for the number of non-zero entries nnz in the full and condensed Jacobian. This is not a trivial task, and […] Table 1, for 3D and 2D.
Table 1: Explicit formulas for the leading terms of block sizes, estimation of the number of non-zeros nnz per block, and matrix non-zero entries, for the full and condensed systems in 2D and 3D. All are provided as functions of the number of elements $N_{el}$, polynomial order P and number of conservative variables in the 3D domain, i.e. $N_{eq} = 5$ for the compressible Navier-Stokes equations.

[Table 1 body not recoverable from the extraction: block sizes and nnz formulas for the full and condensed systems in 3D and 2D.]

Finally, the condensed system presents smaller and denser blocks, and the block stencil of the condensed system is wider than that of the full system. As a result, the nnz of the condensed system is larger than that of the full system. Regarding the total number of non-zero entries in the matrix, the scalings show that the full system will asymptotically contain more non-zero entries for large polynomial orders. However, due to the denser connectivity in the condensed system, its non-zero entries can be higher for low polynomial orders.

In the Continuous Galerkin formulation for simple diffusion or advection- […] for the GL-DGSEM approach may be found in our previous work [1]. In this work, we concentrate on comparing the efficiency of solving the linear system of equations, i.e. solving the full system (4) versus solving the two subsystems of the condensed system (6) using iterative methods. To account for the iterative costs, we will use the matrix sizes and numbers of non-zeros included in Table 1.

• Step 13: cost of solving the linear system, (4) for the full system or (6) for the condensed system, using the preconditioned-GMRES solver at each time step, until $\|\Delta Q\|_1 < \mathrm{TOL}_{Newton}$.

Algorithm 1: Time-marching scheme including the Newton-Raphson linearisation (listing not recoverable from the extraction; it branches on InaccurateJacobian and CondensedSystem).

Step 13 solves the linear system using preconditioned-GMRES (further discussed below), and one must account for its cost in every Newton iteration and for every time step. Steps 5 to 9 need to be computed when the Jacobian […]

The necessary operations to obtain the condensed system (6) are detailed here:

• Factorising and inverting the block diagonal matrix representing the inner-element degrees of freedom, $A_{ii}$,

• Computing $A_{ii}^{-1} A_{ib}$ and assembling the right-hand side of equation (6),

• Obtaining the solution for the interior nodes, $\Delta Q_i = A_{ii}^{-1}(R_i - A_{ib}\,\Delta Q_b)$.

All of these operations are included in one unique cost, referred to as the condensation cost in the following sections. These operations are performed in Step 8. […] Using (7) and (9), the resulting cost of factorising this matrix is $N_{el}\,N_{eq}^3\,(P-1)^9$ in 3D and $N_{el}\,N_{eq}^3\,(P-1)^6$ in 2D.
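The factorisation cost estimates can be written down as simple leading-order operation counts (function names are ours; the full-block count is the one quoted below for the full system):

```python
def aii_factorisation_cost(Nel, P, d, Neq=5):
    """Leading cost of factorising the dense interior blocks A_ii:
    Nel * nb_ii^3 with nb_ii = Neq*(P-1)^d, i.e. Nel*Neq^3*(P-1)^(3d)."""
    return Nel * (Neq * (P - 1)**d)**3

def full_factorisation_cost(Nel, P, d, Neq=5):
    """Leading cost of factorising the full element blocks of size
    Neq*(P+1)^d, i.e. Nel*Neq^3*(P+1)^(3d)."""
    return Nel * (Neq * (P + 1)**d)**3

# In 3D the interior factorisation scales as Nel*Neq^3*(P-1)^9, much cheaper
# than the full-block factorisation Nel*Neq^3*(P+1)^9.
assert aii_factorisation_cost(1, 3, 3) == 5**3 * 2**9
assert full_factorisation_cost(1, 3, 3) == 5**3 * 4**9
```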

The second important operation is the Sparse Matrix-Matrix multiplication (SpGEMM) […] $A_{ii}^{-1}$ has dense blocks of size $nb_{ii} = N_{eq}(P-1)^d$, and the number of non-zeros is larger in $A_{ii}^{-1}$ than in the very sparse $A_{ib}$ (see Appendix C.27 for the estimation of the number of non-zeros in off-diagonal blocks of the Jacobian matrix, which scales as $N_{eq}^2 (P+1)^2 (4P+1)$). Taking into account that the size of the blocks of the Schur complement is $nb_{bb} = N_{eq}(6P^2+2)$ in 3D and $nb_{bb} = 4 N_{eq} P$ in 2D, we approximate the cost of the SpGEMM […]

[…] A, of size $N_{el} \cdot nb$, which has an operation count of $N_{el} N_{eq}^3 (P+1)^9$ in 3D and $N_{el} N_{eq}^3 (P+1)^6$ in 2D. If the condensed system is considered, we factorise the skeleton-element blocks of matrix $A_{cond}$, of size $N_{el} \cdot nb_{bb}$, which has a cost […]

Algorithm 2: Preconditioned GMRES solver (listing not recoverable from the extraction; it iterates for j = 1,...,m, with the preconditioner applied in step 5).

The main costs within the GMRES iterative solver arise from Sparse Matrix-Vector products (SpMV) […] in terms of $(P, N_{eq}, d)$, as summarised in Table 2. Since the preconditioner is a locally dense matrix (the block diagonal part is dense, while the off-diagonal parts are empty), we can bound the number of non-zero entries by the total number of entries in the diagonal blocks, $nnz = N_{el} \cdot nb^2$. Therefore, the cost of the preconditioner-SpMV, $Z = P^{-1} V$, presented in step 5 of Algorithm 2, can be expressed as $N_{el} N_{eq}^2 (P+1)^6$ in 3D and $N_{el} N_{eq}^2 (P+1)^4$ in 2D if the full system is considered. For the condensed system, the costs are $N_{el} N_{eq}^2 (6P^2+2)^2$ and $16 N_{el} N_{eq}^2 P^2$ for 3D and 2D, respectively. The main preconditioned-GMRES costs are included in Table 2. […] i.e. $O(P^4)$ for the condensed system and $O(P^6)$ for the full system. These advantages are also expected in 2D simulations.
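A minimal sketch of a left-preconditioned GMRES with a Block-Jacobi preconditioner, in the spirit of Algorithm 2 (pure NumPy, no restarts; the matrix sizes and values are toy choices, and the helper names are ours):

```python
import numpy as np

def gmres_left(A, b, apply_Pinv, m=50, tol=1e-10):
    """Left-preconditioned GMRES (no restarts): solves P^{-1} A x = P^{-1} b."""
    n = b.size
    r0 = apply_Pinv(b)                       # initial residual for x0 = 0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1)); H = np.zeros((m + 1, m))
    V[:, 0] = r0 / beta
    for j in range(m):
        w = apply_Pinv(A @ V[:, j])          # preconditioner-SpMV: Z = P^{-1} V
        for i in range(j + 1):               # modified Gram-Schmidt
            H[i, j] = w @ V[:, i]
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        e1 = np.zeros(j + 2); e1[0] = beta   # small least-squares problem
        y, *_ = np.linalg.lstsq(H[:j + 2, :j + 1], e1, rcond=None)
        if np.linalg.norm(H[:j + 2, :j + 1] @ y - e1) < tol or H[j + 1, j] < 1e-14:
            break
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :j + 1] @ y

# Toy block-sparse system: Nel diagonally dominant element blocks + weak coupling
rng = np.random.default_rng(0)
Nel, nb = 6, 4
N = Nel * nb
A = 0.05 * rng.standard_normal((N, N))
for k in range(Nel):
    A[k*nb:(k+1)*nb, k*nb:(k+1)*nb] += 4.0 * np.eye(nb)

# Block-Jacobi preconditioner: invert each diagonal block locally
Pinv = [np.linalg.inv(A[k*nb:(k+1)*nb, k*nb:(k+1)*nb]) for k in range(Nel)]
def apply_Pinv(v):
    return np.concatenate([Pinv[k] @ v[k*nb:(k+1)*nb] for k in range(Nel)])

b = rng.standard_normal(N)
x = gmres_left(A, b, apply_Pinv)
assert np.linalg.norm(A @ x - b) < 1e-8
```

The per-iteration cost of `apply_Pinv` is exactly the preconditioner-SpMV bounded above by $N_{el} \cdot nb^2$ non-zeros, which is where the condensed system's smaller blocks pay off.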
Table 2: Summary of the estimated leading costs of the main operations in Algorithm 1, for 2D and 3D. Full and condensed systems are included.

P   i_Jac   Full system   Condensed system   nnz_full    nnz_cond
2   3       6.4   3.5     6.4   3.5          6.5×10^5    1.2×10^6
3   3       6.1   4.4     6.1   4.4          2.2×10^6    6.6×10^6
4   3       5.9   5.7     6.1   5.5          6.0×10^6    2.2×10^7
5   3       6.1   6.7     6.1   6.5          1.3×10^7    5.6×10^7
6   4       6.3   7.7     6.4   7.6          2.8×10^7    1.2×10^8
7   4       6.5   8.6     6.6   8.5          5.3×10^7    2.2×10^8
8   5       6.6   10.0    6.6   9.9          9.4×10^7    3.9×10^8

[The two averaged-iteration subheaders per system were lost in the extraction.]

Although the averaged linear solver iteration count is the same for both systems, […] (Table 2) for high enough polynomial orders. Discrepancies at low orders are attributed to the relatively small 3D problem considered and the effect of boundary conditions. In any case, it can be seen that, despite the cost of condensing the system, the solver cost benefits from the condensation (Figure 2b), leading to overall faster solves, which illustrates the beneficial effect of using a condensed system for the higher polynomial orders.

In this section, we show that the statically condensed DGSEM is able to outperform the standard full system for the same step size, and that both methods provide accurate results. We provide results using an explicit RK3 […]

We now explore the different costs. Table 5 […] do not constitute a big portion of the total simulation time (see Figure 7a); thus the advantage for the condensed system is clearly seen in Figure 7b.

This is due to the fact that the Jacobian matrix is updated less frequently in this problem, and therefore the relative cost of the solver set-up in the total simulation cost is smaller. For this particular test case and range of polynomial orders, the solver set-up cost for the full system is cheaper than the theoretical prediction. However, it is still more costly than the condensation cost.

It can be seen that the static-condensation method provides the same accuracy, up to a given tolerance, as the full system, but it is up to 40% faster for the highest polynomial orders (P = 4, 5). As in the previous section, we also present the detailed results of the solver cost in Figures 6a and 6b.

Again, the condensed system has more non-zeros nnz (Table 5), but the faster preconditioner-SpMV compensates for this cost and leads to faster simulations. The theoretical and measured preconditioner-SpMV operation counts for both systems agree well.

Finally, we can conclude that our static condensation time-marching method is more efficient for large polynomial orders than the full system technique, even for unsteady problems, whilst providing accurate results.

Table 5 (caption, partially recovered): averaged Newton iterations $i_{Newton}$, along with the number of non-zero entries in the full ($nnz_{full}$) and condensed ($nnz_{cond}$) systems. For all cases considered in the table, the number of time steps needed to compute one cycle is $i_t = 280$, for polynomial orders P = 2,...,5.

The theoretical estimates agree well with our simulations and provide solid […] preconditioners, but we note that the cost is lower for the condensed system, since the system size is smaller. Also, the difference between the full and the […]

In this section, we compare the full and statically condensed systems for a range of Mach numbers, $0.1 \leq Ma \leq 0.8$, and Reynolds numbers, $200 \leq Re \leq 1000$.

We use Block-Jacobi preconditioning for all cases. […] the curvilinear mapping, as will be evident in the next sections.

A system of nonlinear conservation laws reads […] where q is the state vector of conserved quantities, […] the numerical flux, $\hat f^a_{q^\pm}$ is its Jacobian with respect to the solution on the element boundary, and $q^+_q$ is the Jacobian of the boundary condition.

The first term of (C.2) generates the densest sparsity pattern. This term can be rewritten using the contravariant fluxes [69] as […] that is mapped to physical space with high-order polynomials […]

The degree-of-freedom indexes h and w can be replaced by the tensor-product coordinate indexes, $h \to (i,j,k)$ and $w \to (r,s,t)$. This allows us to rewrite the basis functions as a tensor-product combination of Lagrange interpolating polynomials, […] As a result, (C.3) can be rewritten as […] where $\delta$ is the Kronecker delta function. Equation (C.7) only takes non-zero values if (s = j and t = k), or (r = i and t = k), or (r = i and s = j).
where $J^\nu = \partial \overset{\leftrightarrow}{f}^\nu / \partial q$ is the Jacobian of the diffusive flux with respect to q, $J_m$ is the Jacobian of the mapping (C.4) at node m, $\omega_m$ are the quadrature weights for the volume integral, $G = \partial \overset{\leftrightarrow}{f}^\nu / \partial(\nabla q)$ is the Jacobian of the diffusive flux with respect to $\nabla q$, $\hat q$ is the numerical trace of the solution on the element boundary, $\hat q_{q^\pm}$ is the derivative of this numerical trace with respect to the solutions on the element boundary, $\vec n$ is the outward-pointing normal vector on the boundary, $\hat f^\nu_{q^+}$ and $\hat f^\nu_{\nabla q^+}$ are the Jacobians of the viscous surface numerical flux with respect to the solution and its gradient, respectively, and $\partial \hat f^\nu / \partial q^+$ and $\partial \hat f^\nu / \partial(\nabla q^+)$ are the Jacobians of the viscous surface numerical flux on the physical boundaries. Note that the terms with the subscript […] contain all the information of the boundary condition on the viscous surface numerical flux.

The first term of the summation in (C.14) is the one that generates the densest sparsity, as it is the multiplication of two volume integrals. This term can be expanded as […] (C.17). The volume integrals on the left, which depend on the third-order tensors $G_m$, imply two-point connectivities (as in (C.8)) for the degrees of freedom m and h. The volume integrals on the right imply two-point connectivities for the degrees of freedom w and m. In consequence, each degree of freedom $h \to (i,j,k)$ is connected with non-zeros to all degrees of freedom $w \to (r,s,t)$ that lie on the same $\xi-\eta$, $\eta-\zeta$ and $\xi-\zeta$ planes of reference element coordinates. Hence, the number of non-zero entries in the Jacobian matrix in each diagonal block is […]

It is important to point out that the sparsity pattern generated by (C.17) contains all the non-zero entries needed for the other diffusive terms and for the advective terms. As can be seen, the diffusive terms generate dense diagonal blocks in 2D.
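The two connectivity patterns (lines of degrees of freedom for the advective term (C.7), planes for the diffusive term (C.17)) can be counted by brute force; this is an illustration only, and the closed forms in the comments follow from inclusion-exclusion:

```python
import itertools

def advective_nonzeros_per_row(P):
    """Count dofs (r,s,t) connected to (i,j,k) by the line pattern of (C.7):
    (s=j and t=k) or (r=i and t=k) or (r=i and s=j)."""
    i = j = k = 0   # the count is the same for every (i,j,k)
    return sum(1 for r, s, t in itertools.product(range(P + 1), repeat=3)
               if (s == j and t == k) or (r == i and t == k) or (r == i and s == j))

def diffusive_nonzeros_per_row(P):
    """Plane pattern of (C.17): (r,s,t) on the same xi-eta, eta-zeta or
    xi-zeta plane as (i,j,k), i.e. t=k, r=i or s=j."""
    i = j = k = 0
    return sum(1 for r, s, t in itertools.product(range(P + 1), repeat=3)
               if r == i or s == j or t == k)

# Lines: 3(P+1) - 2 connected dofs; planes: 3(P+1)^2 - 3(P+1) + 1
for P in range(1, 6):
    assert advective_nonzeros_per_row(P) == 3 * (P + 1) - 2
    assert diffusive_nonzeros_per_row(P) == 3 * (P + 1)**2 - 3 * (P + 1) + 1
```

The plane count growing like $O(P^2)$ per row, against $O(P)$ for the line pattern, is what makes the diffusive terms dominate the diagonal-block sparsity.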
Now we estimate the number of non-zeros in the condensed system. Due to the two matrix-matrix products (see Section 3.1) needed to compute the Schur complement, the number of non-zero entries in the condensed system significantly increases. The non-zero entries in each block are constrained by the block size, which has complexity (10) ($nb_{bb} = 4 N_{eq} P$ in 2D and $nb_{bb} = N_{eq}(6P^2+2)$ in 3D). However, the SpGEMM operations introduce new non-zero entries into the matrix $A_{cond}$. Additionally, the stencil of the block structure in the Schur complement is wider (non-compact) than in the Jacobian matrix. Therefore, the upper bounds for the non-zero entries in the condensed system are

3D: $nnz_{cond} = C_{NeighNeigh}\, N_{el}\, N_{eq}^2\, (6P^2+2)^2$, (C.30)

2D: $nnz_{cond} = C_{NeighNeigh}\, N_{el}\, N_{eq}^2\, (4P)^2$, (C.31)

where the constants $C_{NeighNeigh,3D} = 25$ and $C_{NeighNeigh,2D} = 13$ place an