EFFICIENT ALGEBRAIC MULTIGRID PRECONDITIONERS ON CLUSTERS OF GPUS

Many scientific applications require the solution of large and sparse linear systems of equations using Krylov subspace methods; in this case, the choice of an effective preconditioner may be crucial for the convergence of the Krylov solver. Algebraic MultiGrid (AMG) methods are widely used as preconditioners, because of their optimal computational cost and their algorithmic scalability. The wide availability of GPUs, now found in many of the fastest supercomputers, poses the problem of implementing these methods efficiently on high-throughput processors. In this work we focus on the application phase of AMG preconditioners, and in particular on the choice and implementation of smoothers and coarsest-level solvers capable of exploiting the computational power of clusters of GPUs. We consider block-Jacobi smoothers that use sparse approximate inverses in the local solves, and we report weak scalability results obtained on up to 128 GPUs of the JURECA supercomputer.


Introduction
Many engineering and scientific applications require the solution of linear systems

Ax = b,  (1)

where the matrix A is large and sparse. In our context, large means millions or even billions of unknowns, whereas sparse means that most coefficients in A are zero and it is convenient to employ special storage formats. The most widely used solvers for systems of this type are Krylov subspace methods [1]. Their time to solution is determined by the number of iterations performed and by the time per iteration, and hence their efficiency is the result of a tradeoff between iteration complexity and speed of convergence. A critical feature is the application of preconditioning, corresponding, e.g., to a transformation of the form

M^{-1} A x = M^{-1} b,

which is aimed at speeding up the convergence of Krylov methods through the improvement of the "quality" of the system matrix. Note that the product M^{-1} A is not computed explicitly; rather, the Krylov solvers are modified so that at each iteration they perform the computation of M^{-1} v, where v is a suitable vector.
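To make the role of the preconditioner concrete, the following toy sketch in Python/SciPy (illustrative code, not the PSBLAS implementation discussed later) applies a Jacobi preconditioner inside the Conjugate Gradient method; note that the solver only ever requests the action M^{-1} v.

```python
# Toy illustration: preconditioned CG asks only for the action M^{-1} v.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

n = 1000
# 1D Poisson matrix: sparse and symmetric positive definite.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

d_inv = 1.0 / A.diagonal()  # Jacobi preconditioner: M = diag(A)
M = LinearOperator((n, n), matvec=lambda v: d_inv * v)

# The rtol keyword requires SciPy >= 1.12 (older versions use tol).
x, info = cg(A, b, M=M, rtol=1e-6)
print(info)  # 0 on convergence
```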
Multigrid methods are among the most efficient numerical preconditioners for solving large-scale systems of equations. In particular, they are optimal, in the sense that their computational cost grows linearly with the number of unknowns, when the linear systems come from the discretization of elliptic partial differential equations [2]. Here we focus on the Algebraic MultiGrid (AMG) approach, which, unlike the geometric one, does not make explicit use of information about the problem from which the linear system comes, but exploits only the linear system itself, with the goal of achieving methods that can be applied to wide classes of problems [3,4]. Furthermore, we consider systems with symmetric positive definite matrices, which until now have been the main target of AMG research activity.
The linear complexity of AMG preconditioners generally translates into algorithmic scalability, i.e., the number of iterations of AMG-preconditioned Krylov solvers does not depend on the size of the problem. This allows efficient parallel implementations on multiple CPUs, for example through a domain decomposition approach, in which rows of the matrix are assigned to different computing nodes. Because of their linear complexity and their parallel scalability, AMG preconditioners are expected to be methods of choice in the emerging exascale scenario [5].
The widespread adoption of General Purpose Graphics Processing Units (GPGPUs), currently found in many of the fastest supercomputers in the Top 500 list (https://www.top500.org/lists/), requires algorithms exposing a high degree of parallelism, because GPUs are high-throughput many-core processors. Since AMG methods are obtained by combining different components (smoother, coarsening algorithm, coarsest-level solver, restriction and prolongation operators), a full exploitation of GPU capabilities requires each component to be optimized for this type of architecture.
We focus on the application phase of AMG preconditioners, and in particular on the choice and implementation of AMG smoothers and coarsest-level solvers capable of harnessing the computational power offered by a cluster of GPUs. We consider block-Jacobi smoothers and solvers that use sparse approximate inverses to perform the local solves required by each block, instead of the usual factorization methods. The choice of sparse approximate inverses is motivated by the much larger amount of parallelism exposed by sparse matrix-vector products as compared to the parallelism available in sparse triangular solves. Furthermore, suitable implementations of sparse approximate inverse preconditioners have already proved their efficiency on a single GPU (see, e.g., [6,7]). In our work, the smoothers and local solvers are used within the AMG framework offered by the MLD2P4 package of preconditioners [8,9,10], exploiting sparse matrix data structures and Krylov solvers from the PSBLAS library [11]. We conducted weak scalability tests on the JURECA supercomputer at the Jülich Supercomputing Centre (JSC), obtaining good weak scalability on up to 128 GPUs and 256 million equations.
Previous research efforts have been devoted to studying AMG on GPUs, showing the acceleration that can be achieved by exploiting these devices (see, e.g., [12,13,14,15,16,17,18]). However, to the best of our knowledge, the AMG smoothers and local solvers we employ in this work have not been considered in other AMG preconditioners tailored for GPUs. Furthermore, in most cases, only a single GPU or small-scale GPU clusters have been considered.
In this work we do not consider the setup phase of the preconditioner, which will be the subject of future work. Of course, an efficient AMG setup on multiple GPUs is important for the overall efficiency of the preconditioner. Nevertheless, in situations where we need to solve multiple linear systems with the same coefficient matrix, or sequences of linear systems with slowly varying matrices that allow reuse of the preconditioner, it is desirable to obtain the best possible efficiency during the solution phase even at the expense of a greater setup time, because this will be amortized over multiple solution steps.
The remainder of this article is organized as follows. In Section 2 we briefly describe AMG preconditioners, identifying their main sparse-matrix kernels. In Section 3 we discuss the main issues concerning the implementation of these kernels on GPUs. In Section 4 we illustrate the behaviour of the AMG preconditioners on a cluster of GPUs using linear systems coming from a groundwater modelling application. Finally, we provide some concluding remarks in Section 5.

AMG Preconditioners
Multigrid methods achieve their efficiency by the recursive application of two complementary processes: relaxation and coarse-grid correction. The relaxation (or smoothing) consists in the application of an iterative method, e.g., Jacobi or Gauss-Seidel, to reduce highly oscillatory error components, while the coarse-grid correction corresponds to the solution of the resulting residual equation in an appropriately chosen coarse space, aimed at reducing the leftover error components. This procedure is applied recursively, generating a sequence of spaces and corresponding residual equations of smaller and smaller size.
In the geometric multigrid approach, the coarse spaces and the associated restriction and prolongation operators, needed for transferring information from a space to the next coarser one and vice versa, are defined by the geometry of the problem. Conversely, AMG methods build the hierarchy of spaces and the corresponding transfer operators by performing an algebraic coarsening process, which uses only the entries of the system matrix, without assuming explicit knowledge of the problem from which the linear system originates. We consider an algebraic coarsening procedure based on the aggregation technique, where coarse-space unknowns are aggregates of the original unknowns. In particular, the AMG preconditioners available in the MLD2P4 library [10] use a decoupled version of the smoothed aggregation algorithm described in [19,20]. Since the implementation on GPUs of the setup of the AMG hierarchy is beyond the scope of this work, we refer the reader to [21] for details on this aggregation algorithm and its current implementation in MLD2P4.
Here we briefly describe the so-called application phase of the AMG preconditioner, also known as the multigrid cycle, which is the objective of our implementation on clusters of GPUs, using the notation introduced below. Let

A_1, A_2, ..., A_nlev

be the hierarchy of matrices resulting from the coarsening procedure, where A_k has dimension n_k, with n_1 > n_2 > ... > n_nlev, and n_1 = n, where A and n are defined in (1). We assume that the matrices have been built by using the Galerkin approach, i.e.,

A_{k+1} = P_k^T A_k P_k,  for each level k < nlev,

where P_k ∈ R^{n_k × n_{k+1}} is a linear prolongation operator from level k+1 to level k, and its transpose P_k^T is a restriction operator from level k to level k+1. This is a common choice in AMG methods and is used in the MLD2P4 library. Finally, we denote by S_k ∈ R^{n_k × n_k} the relaxation operator at level k, i.e., the pre-smoothing step has the following form:

x_k = x_k + S_k (b_k - A_k x_k),  (6)

with obvious meaning of x_k and b_k; the post-smoothing step has the same structure as in (6), with S_k^T in place of S_k. For the sake of simplicity, in (6) we describe the application of a single smoothing iteration. Different multigrid cycles can be obtained by combining the previous operators. The most widely used one is the so-called V-cycle, described in Algorithm 1.

Algorithm 1 (V-cycle)
1: procedure V-cycle(k, b_k, x_k)
2:   if (k = nlev) then
3:     x_k = A_nlev^{-1} b_nlev                (coarsest-level solve)
4:   else
5:     x_k = x_k + S_k (b_k - A_k x_k)         (pre-smoothing)
6:     b_{k+1} = P_k^T (b_k - A_k x_k)         (restriction of the residual)
7:     x_{k+1} = V-cycle(k+1, b_{k+1}, 0)
8:     x_k = x_k + P_k x_{k+1}                 (coarse-grid correction)
9:     x_k = x_k + S_k^T (b_k - A_k x_k)       (post-smoothing)
10:  end if
11:  return x_k
12: end procedure
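The recursion in Algorithm 1 translates almost line by line into code. The following serial sketch in Python/SciPy is only illustrative: the hierarchy As, the prolongators Ps, the smoother closures pre/post and coarse_solve are hypothetical names, not the MLD2P4 API.

```python
# Serial sketch of Algorithm 1 (V-cycle); not the MLD2P4 implementation.
# Assumptions: As[k] are the sparse matrices A_1..A_nlev built with the
# Galerkin approach (As[k+1] = Ps[k].T @ As[k] @ Ps[k]); pre[k]/post[k]
# implement x <- x + S_k (b - A x) and its transposed variant; and
# coarse_solve could be, e.g., scipy.sparse.linalg.splu(As[-1].tocsc()).solve.
import numpy as np

def v_cycle(k, As, Ps, pre, post, coarse_solve, b, x):
    if k == len(As) - 1:                         # k = nlev
        return coarse_solve(b)                   # coarsest-level solve
    x = pre[k](As[k], b, x)                      # pre-smoothing
    bc = Ps[k].T @ (b - As[k] @ x)               # restrict the residual
    xc = v_cycle(k + 1, As, Ps, pre, post, coarse_solve,
                 bc, np.zeros_like(bc))          # recursion, zero initial guess
    x = x + Ps[k] @ xc                           # coarse-grid correction
    return post[k](As[k], b, x)                  # post-smoothing
```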
Several relaxation methods can be chosen, defined by operators S_k with different characteristics. For example, in the Jacobi method S_k = diag(A_k)^{-1}, where diag(A_k) is the diagonal matrix with diagonal entries equal to the diagonal entries of A_k; in this case, the application of S_k corresponds to a highly parallel vector update operation (see the sketch below). More robust iterative methods, such as Gauss-Seidel and block-Jacobi, are often needed to obtain effective preconditioners; however, their application requires a kernel to solve sparse triangular systems, and such a kernel is not well suited to effective implementation on GPUs (see the next section for more details). Block-Jacobi iterations also require the factorization of sparse matrices, which can be done once during the setup of the preconditioner. We also note that the solution of the coarsest-level system (k = nlev) must be carefully addressed in parallel settings. Direct solvers usually lead to preconditioners that are more effective in reducing the iterations of Krylov solvers, but this does not guarantee parallel efficiency. On the other hand, the iterative methods usually applied as smoothers can be used as coarsest-level solvers too, with the aim of achieving a better tradeoff between preconditioner quality and parallel performance.
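As an illustration of the Jacobi case, a single sweep reduces to an SpMV plus elementwise vector operations, with no triangular solve involved; a minimal sketch (toy Python/SciPy code, not the MLD2P4 kernels):

```python
import numpy as np
import scipy.sparse as sp

def jacobi_sweep(A, b, x):
    """One Jacobi relaxation: x <- x + diag(A)^{-1} (b - A x).
    Only an SpMV and elementwise updates: GPU-friendly operations."""
    return x + (b - A @ x) / A.diagonal()

A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
x = jacobi_sweep(A, np.ones(100), np.zeros(100))
```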

AMG on GPUs
The term GPGPU was first introduced in connection with devices produced by NVIDIA Co. NVIDIA GPUs are made up of scalable arrays of multithreaded streaming multiprocessors. Each multiprocessor is composed of a fixed number of scalar processors, one or more instruction fetch units, and on-chip fast memory, as illustrated in Figure 1. The host computer is connected to the GPU device through a bus, as shown in Figure 2, normally having a much smaller bandwidth than that of the device global memory. In the programming model proposed by NVIDIA, a host program running on the CPU launches kernel programs that execute on the GPU device. Each thread has an identifier, and each vector instruction is executed on a set of threads called a warp.
In the sequel we present results obtained on a parallel machine where each node is equipped with a standard (multicore) CPU and one or more GPUs. In our implementation, one CPU core acts as the host for each GPU device.
The overall data parallelization strategy is dictated by the MLD2P4/PSBLAS programming environment, i.e., a generalized row-block distribution is used [10]. In particular, the parallel matrix-vector products require on each process the availability of the so-called halo data, i.e., vector entries located on different processes but whose values are necessary to complete the local computation, as specified by the sparsity pattern of the matrix (a sketch of how the halo arises is given after the following list). From the description of the multigrid cycle in Section 2, it is clear that its efficient implementation on clusters of GPUs depends on the efficient implementation of the following computational kernels:
• vector sum;
• sparse matrix by vector product;
• relaxation and coarsest-level solve.
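The sketch below illustrates the halo concept for a row-block distribution (toy Python/SciPy code; PSBLAS computes and caches this information internally): the column indices referenced by the local rows that fall outside the locally owned range identify exactly the vector entries to be received from other processes before an SpMV.

```python
import numpy as np
import scipy.sparse as sp

def halo_indices(local_rows, first_owned, last_owned):
    """Global indices of x needed by the local SpMV but owned elsewhere."""
    cols = np.unique(local_rows.tocoo().col)
    return cols[(cols < first_owned) | (cols > last_owned)]

# Rows 40..59 of a tridiagonal matrix reference columns 39..60, so the
# halo consists of one entry on each side of the owned range.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
print(halo_indices(A[40:60, :], 40, 59))  # -> [39 60]
```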
Note that the sparse matrix by vector product is involved not only in the transfer of data across the levels of the AMG hierarchy, but also in the usual computation of residuals in a Krylov method.
The GPU implementation of these kernels is handled through a plugin for the underlying PSBLAS library [11,22]. As discussed at length in [22], an efficient implementation of these computational kernels cannot be achieved unless we limit data transfers between the GPU device and the CPU host. Each GPU-enabled data structure (vector or sparse matrix) is equipped with memory allocated on both the host (CPU) and the device (GPU); at the start of the computation the contents of the vector or sparse matrix are usually only available on the host but, as soon as a computational kernel is invoked, the data is copied to the device side and is subsequently kept there. All supported operations are executed purely on the GPU; vector and matrix data remain on the GPU unless the program explicitly requests a copy back. Thus, for most of the computation, the only data that are regularly transferred from GPU to CPU and vice versa are the following (a sketch of this policy is given below):
• scalars specified on the CPU and passed to the GPU for the execution of vector sums of the type y ← αx + βy and matrix-vector products of the type y ← αAx + βy;
• portions of vectors needed to implement the halo data exchange.
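The following sketch mimics this device-resident policy using the CuPy library in Python (an assumption made purely for illustration: the actual PSBLAS plugin is written in CUDA): data crosses the bus once on the way in, scalars travel with each call, and results come back only on explicit request.

```python
import scipy.sparse as sp
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as gpu_csr

n = 100_000
A_host = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
A_dev = gpu_csr(A_host)                  # one-time host-to-device copy
x = cp.ones(n)                           # allocated directly in device memory
y = cp.zeros(n)
alpha, beta = 1.0, 0.5                   # scalars: the only per-call host data
for _ in range(10):
    y = alpha * (A_dev @ x) + beta * y   # y <- alpha*A*x + beta*y, all on GPU
y_host = cp.asnumpy(y)                   # explicit device-to-host copy at the end
```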
The sparse matrix by vector (SpMV) product is a fundamental kernel of the multigrid cycle. It is well known to be a memory-bound kernel, and its efficient implementation is the subject of an immense amount of research. For a thorough discussion of the software design of SpMV kernels we refer the reader to [23] and the references therein. In our experiments we use variants of the ELLPACK storage format, in particular the HLL format described in [23,24] and implemented in the PSBLAS GPU plugin.
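To convey the idea behind ELLPACK-like formats, the toy code below pads every row to a fixed width, so that an SpMV can process the rows in lockstep with regular, coalescing-friendly memory accesses (plain ELLPACK only; HLL further partitions the rows into blocks, each with its own width, to limit the padding overhead):

```python
import numpy as np
import scipy.sparse as sp

def to_ellpack(A):
    """Pad every row of a CSR matrix to the width of its widest row."""
    A = A.tocsr()
    width = int(np.diff(A.indptr).max())
    vals = np.zeros((A.shape[0], width))             # padding value: 0.0
    cols = np.zeros((A.shape[0], width), dtype=int)  # padded slots -> column 0
    for i in range(A.shape[0]):
        lo, hi = A.indptr[i], A.indptr[i + 1]
        vals[i, : hi - lo] = A.data[lo:hi]
        cols[i, : hi - lo] = A.indices[lo:hi]
    return vals, cols

def ellpack_spmv(vals, cols, x):
    # Zero-valued padding contributes nothing to the row sums.
    return (vals * x[cols]).sum(axis=1)

A = sp.random(50, 50, density=0.1, format="csr")
vals, cols = to_ellpack(A)
assert np.allclose(ellpack_spmv(vals, cols, np.ones(50)), A @ np.ones(50))
```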
The choice and the implementation of the smoother is the most critical issue of the multigrid cycle on clusters of GPUs. Previous experience on clusters of CPUs has shown that the use of block-Jacobi iterations, either as smoothers or coarsest-level solvers, allows one to obtain AMG preconditioners that achieve a good balance between acceleration of Krylov solvers and parallel performance [9,25,26,27,28]. In this case the relaxation operator is the block-diagonal matrix

S_k = diag((A_k^1)^{-1}, (A_k^2)^{-1}, ..., (A_k^{np})^{-1}),

where A_k^i is the diagonal block of A_k corresponding to the block of rows assigned to compute node i in the data distribution, and np is the number of compute nodes. This requires the inversion of each block A_k^i, which can be carried out by performing a Cholesky factorization followed by triangular solves. In the context of preconditioning, a sparse incomplete factorization is usually computed, obtaining an approximation of A_k^i. As already observed, the incomplete factorization is carried out only once, during the setup phase of the preconditioner, while the triangular solves must be performed at each relaxation and coarsest-level solution (a serial sketch is given below).
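The following serial sketch shows one block-Jacobi sweep with factorization-based local solves (illustrative Python/SciPy code: in the distributed setting each process owns a single diagonal block, and MLD2P4 would compute an incomplete factorization at setup time, whereas here exact LU factors from SciPy stand in):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def setup_block_solvers(A, ranges):
    """Setup phase, done once: factorize each diagonal block A_k^i."""
    return [splu(A[lo:hi, lo:hi].tocsc()).solve for lo, hi in ranges]

def block_jacobi_sweep(A, b, x, ranges, solvers):
    """x <- x + S_k (b - A x), S_k = diag((A_k^1)^{-1}, ..., (A_k^np)^{-1})."""
    r = b - A @ x
    for (lo, hi), solve in zip(ranges, solvers):
        x[lo:hi] += solve(r[lo:hi])  # triangular solves with the local factors
    return x

A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
ranges = [(0, 50), (50, 100)]
solvers = setup_block_solvers(A, ranges)
x = block_jacobi_sweep(A, np.ones(100), np.zeros(100), ranges, solvers)
```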
Unfortunately, the efficient implementation of sparse triangular solves on GPUs is a difficult task. For example, in [29] it is reported that Krylov solvers preconditioned with the incomplete LU and Cholesky factorizations available in the CUDA cuSPARSE library achieve on average only a speedup of 2 over the CPU implementation provided by the Intel MKL library; furthermore, the sparsity pattern of the matrix strongly affects the performance of the triangular solves. The difficulties associated with sparse triangular solves on GPUs are confirmed by the study in [30], where it is also shown that the performance of a sparse triangular solver on a GPU is strongly dependent on the features of the system matrix, while it is almost independent of the floating-point precision or the GPU architecture employed. The main problem is that GPUs require a massive amount of data parallelism to be exploited, and the sparsity of the triangular factors produces strong data dependencies that limit the amount of available parallelism and yield unbalanced computations. Moving towards a denser matrix would improve the triangular solve, but would be undesirable in terms of memory footprint, time to set up and time to apply the preconditioner.
Since the sparse matrix-vector product is much more amenable to an efficient implementation on GPUs, it is quite appealing to use sparse approximate inverses of the local blocks in the block-Jacobi methods,

(A_k^i)^{-1} ≈ Z_k^i D_k^i (Z_k^i)^T,

with Z_k^i upper triangular and D_k^i diagonal, since their application requires only matrix-vector products.
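Applying such a factored approximate inverse then amounts to two sparse matrix-vector products and one elementwise scaling, as in the sketch below (toy Python/SciPy code; the actual kernels belong to the MLD2P4/PSBLAS plugins):

```python
import numpy as np
import scipy.sparse as sp

def apply_factored_ainv(Z, d, v):
    """Approximate (A_k^i)^{-1} v as Z (d * (Z^T v)):
    two SpMVs plus a diagonal scaling -- no triangular solve."""
    return Z @ (d * (Z.T @ v))

# Toy usage with a sparse upper-triangular Z and diagonal entries d.
Z = sp.triu(sp.eye(100, format="csr")
            + 0.1 * sp.random(100, 100, density=0.02, format="csr"),
            format="csr")
d = np.full(100, 0.5)
w = apply_factored_ainv(Z, d, np.ones(100))
```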
The approximate inverses are computed with a plugin for MLD2P4 that has been described in [6,31]; in particular, the plugin provides the INVK method, which builds an approximate inverse by approximately inverting incomplete triangular factors, and whose INVK(0,1) variant is used in the experiments reported below.

Numerical Experiments
We tested the preconditioners on linear systems coming from a groundwater modelling application, made available within the Horizon 2020 EoCoE project. The model is obtained by combining the continuity equation with Darcy's law; a homogeneous permeability tensor is considered and no-flow boundary conditions are imposed. The discretization is performed by a cell-centered finite volume scheme (two-point flux approximation) on a Cartesian grid, and the resulting matrix is symmetric positive definite, with nonzero entries distributed over seven diagonals [35]. The computational domain is uniformly partitioned along the three coordinate planes, obtaining a row-block distribution of the matrices which minimizes the surface-to-volume ratio. We performed weak scalability tests, keeping approximately 2 million equations per process. The results obtained are representative of the behaviour of the AMG preconditioner on linear systems coming from more general isotropic elliptic equations.
The experiments were carried out using the cluster module of the JURECA supercomputer at JSC, comprising 1872 compute nodes, each with 2 Intel Xeon E5-2680 v3 Haswell CPUs, 75 of which are equipped with 2 NVIDIA Tesla K80 dual-GPU boards, and a Mellanox EDR InfiniBand network. PSBLAS 3.5, combined with the extended matrix formats and GPU plugins, and MLD2P4 2.1, combined with the AINV plugin, were installed using the GNU 5.4.0 C and Fortran compilers, MVAPICH2 2.3 and CUDA 8.0.61. The tests were run using both CPUs and GPUs, as well as CPUs only, to quantify the efficiency gain due to the use of GPUs. Up to 128 GPUs were used in our experiments; thus the dimensions of the linear systems range from 2 million to 256 million unknowns.
The linear systems were solved using the GPU implementation of the Conjugate Gradient (CG) method provided by PSBLAS with the GPU plugin, coupled with two versions of the V-cycle preconditioner: one using 1 sweep of the block-Jacobi method as pre-smoother and post-smoother, with INVK(0,1) as local solver, and the other using 2 sweeps of the simple Jacobi method in each smoothing phase. In the following, we refer to the previous smoothers as BJAC(INVK) and JAC2, respectively. In both versions, 10 sweeps of BJAC(INVK) were applied as coarsest-level solver. The multilevel hierarchy was built by running (on CPUs) the decoupled smoothed aggregation algorithm mentioned in Section 2, using its default parameters. The resulting number of levels was 4, except for 128 processing elements (i.e., for the largest matrix), where 5 levels were obtained. The zero vector was chosen as starting guess, and the CG iterations were halted as soon as the ratio between the 2-norm of the residual and the 2-norm of the right-hand side became smaller than 10^{-6}. The HLL sparse matrix format was used when running the experiments on CPUs plus GPUs, while the CSR format was used for tests on CPUs only.
Figure 3 shows the execution times, in seconds, required for the solution of the linear systems on GPUs and on CPUs, using the two V-cycle versions described above. We see that on GPUs the use of BJAC(INVK) and JAC2 as smoothers leads to about the same execution time, while BJAC(INVK) appears more expensive on CPUs. Furthermore, there is a significant gain in using GPUs vs CPUs: the speedup over CPUs ranges from 5.9 to 10 when BJAC(INVK) is used as smoother, and from 5 to 9.4 when JAC2 is applied. In our opinion, this is a satisfactory result, given the memory-bound nature of the computation and the decreasing size of the matrices as the AMG hierarchy is traversed, which decreases the amount of computation on GPUs and increases the percentage of data to be exchanged among the processing elements.
To better compare the two V-cycles, we report in Table 1 the number of CG iterations obtained with the two versions of the V-cycle as the number of processing elements varies. Of course, the number of iterations is the same for the GPU and the CPU implementations, since the same coarsening strategy is run on CPUs in both cases. We observe that the number of iterations corresponding to BJAC(INVK) is smaller than the number of iterations with JAC2; this compensates for the larger time required by a single smoothing step with BJAC(INVK). We expect BJAC(INVK) to be more efficient on linear systems requiring more robust smoothers. We also note that the increase in the number of CG iterations is modest when going from 1 to 64 processing elements; furthermore, there is a slight decrease on 128 processing elements. This confirms that the use of BJAC(INVK) iterations at the coarsest level does not deteriorate too much the effectiveness of the preconditioner as compared to the application of a direct solver, while it allows good scalability. The small reduction of the iteration count on 128 processing elements is due to the higher density of the diagonal blocks at the fifth level of the AMG hierarchy, which leads to better approximations by INVK(0,1). We also forced the coarsening algorithm to generate 4 levels on 128 processing elements, obtaining a larger coarsest matrix. In this case, BJAC(INVK) was less effective at the coarsest level, leading to an increase in the number of CG iterations. However, while the time on CPUs increased, the time on GPUs became slightly smaller, because the diagonal blocks of the larger coarsest-level matrix expose more parallelism in matrix-vector products on GPUs.

Conclusions
We have presented an implementation of the application phase of AMG preconditioners for clusters of GPUs, obtained by integrating efficient sparse-matrix kernels for GPUs in the framework of the MLD2P4 package. The main issue has been the choice of smoothers and coarsest-level solvers able to achieve a good tradeoff between exploitation of the computational power of GPUs and communication among the processing elements. The results obtained on linear systems from a groundwater simulation, made available within the Horizon 2020 EoCoE project, show speedups of 5 to 10 over CPUs and good scalability up to 128 GPUs and 256 million equations. Future work will be devoted to the implementation of the AMG setup phase, focusing on the coarsening algorithm and the construction of the hierarchy of matrices.
It is worth noting that the integration of GPU-tailored sparse-matrix kernels in the modular architecture of MLD2P4 has led to good results without the need to restructure the application phase of the AMG preconditioners. Although a complete re-coding in CUDA of the AMG preconditioners can in principle lead to higher efficiency in exploiting GPU capabilities, we believe that our approach is quite useful, because it allows running on GPU devices with close-to-optimal efficiency while at the same time making optimal reuse of existing software.

Figure 3. Solution of the linear systems using CG with the two versions of V-cycle: weak scaling on GPUs and CPUs (top) and speedup of GPUs vs CPUs (bottom).

Table 1. CG iterations with the two versions of V-cycle.