Scalability of OpenFOAM for bio-medical flow simulations

We study a bio-medical fluid flow simulation using the incompressible, laminar OpenFOAM flow solver icoFoam with both iterative linear equation solvers and direct solvers (kernel class) such as SuperLU_DIST 3.3 and SuperLU_MCDT (Many-Core Distributed), applied to the large penta-diagonal and hepta-diagonal matrices coming from the simulation of blood flow in arteries on a structured mesh domain. A realistic simulation of blood flow in the heart or in the vessels of the whole body is a complex problem and may take a very long time, thousands of hours, for the main tasks such as pre-processing (meshing), decomposition and solving the large linear systems. Our aim is to test the potential scaling capability of the fluid solver for multi-petascale systems. We started from relatively small instances of the whole simulation and solved large linear systems, measuring the wall clock time of single time steps; these runs give important clues for larger versions of the problem. Following our general strategy, we then gradually increase the problem size and the number of time steps to obtain a better picture. We test the performance of the solver icoFoam on TGCC Curie (a Tier-0 system) at CEA, France (see [31]). We achieved scaled speed-up for the largest matrices in our dataset, of size 64 million × 64 million, on runs of up to 16,384 cores. In other words, we find that the scalability improves as the problem size increases for this application. As the matrix size quadrupled, the speed-up improved by at least 50 % near the speed-up saturation point. This shows that there is no structural problem in the software up to this scale, which is an important and encouraging result for the problem. Moreover, we embedded direct solvers (kernel class) such as SuperLU_DIST 3.3 and SuperLU_MCDT in addition to the solvers provided by OpenFOAM.
Since future exascale systems are expected to have heterogeneous, many-core distributed nodes, we believe that our SuperLU_MCDT software is a good candidate for such systems. SuperLU_MCDT worked without any problem up to 16,384 cores for the large penta-diagonal matrices (2D problems) and hepta-diagonal matrices (3D problems) coming from the incompressible blood flow simulation.


Introduction
It is essential to have a fast, robust and scalable library, with potential for multi-petascale systems or future exascale systems, to solve sparse linear systems AX = B arising in many science and engineering applications. In this paper, we solve sparse linear systems with large penta-diagonal and hepta-diagonal matrices coming from the simulation of blood flow in arteries with a structured mesh domain. The incompressible, laminar OpenFOAM solver icoFoam is used as the flow solver.
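For orientation, the following is a minimal Python sketch (not part of the OpenFOAM code) of such a system: the 5-point stencil on an n × n structured 2D mesh yields a penta-diagonal matrix A, and AX = B can then be solved. The dense solve here is only a small-scale stand-in for a sparse direct solver such as SuperLU; all names are illustrative.

```python
import numpy as np

def laplacian_2d(n):
    """Assemble the penta-diagonal matrix of the 5-point stencil on an n x n grid."""
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j          # lexicographic cell index
            A[k, k] = 4.0
            if i > 0:     A[k, k - n] = -1.0   # south neighbour
            if i < n - 1: A[k, k + n] = -1.0   # north neighbour
            if j > 0:     A[k, k - 1] = -1.0   # west neighbour
            if j < n - 1: A[k, k + 1] = -1.0   # east neighbour
    return A

n = 8
A = laplacian_2d(n)
b = np.ones(n * n)
x = np.linalg.solve(A, b)     # direct solve (dense stand-in for a sparse LU)
print(np.allclose(A @ x, b))  # residual at machine precision
```

Each row has at most five nonzeros (the cell and its four face neighbours), which is exactly the penta-diagonal pattern produced by the structured 2D mesh.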
The sloshing of blood in the heart is important for understanding heart rate turbulence, the potential damage of sloshing to the walls of vessels, and heart attack. When we examine the sinus rhythm with Q wave and T wave for a human heart seen on electrocardiography, the QT interval is a measure of the time between the beginning of the Q wave and the end of the T wave in the heart's electrical cycle (see [25]). A prolonged QT interval is considered a risk factor for ventricular tachyarrhythmia disorders and sudden death (see [26]). The QT interval is used as one of the input parameters for devices that regulate the heart, such as cardiac pacemakers. We believe that this study can contribute to the development of artificial blood vessels and of temporary or permanent cardiac pacemakers that are sensitive to the sloshing of blood and to various behaviours of the QT interval.
Several realistic bio-fluid flow simulations have been carried out (see [1][2]). The geometries were extracted from real patients or generated using real patient data from CT or MRI scans. Flow rates measured at the vessel inlet by ultrasound are used to obtain a velocity profile at the inlet of the simulation geometry (see [1][2][3][4]).
In this study, we describe speed-up for variable problem sizes as well. In other words, we deal not only with linear speed-up for a fixed problem size, but also with scaled speed-up, which is consistent with Gustafson's law (see [22] and [23]). Focusing on the overall performance of many cores globally, rather than on the efficiency of particular cores, may provide encouraging results (see also [24]). We achieved scaled speed-up for large matrices up to 64 million × 64 million and speed-up up to 16,384 cores on Curie (see [21]). We observed linear speed-up up to 4096 cores on Curie for a 64 million × 64 million matrix. These are significant results for this problem. In the literature, there are several scalability results using OpenFOAM with iterative solvers up to about one thousand cores (see [8][9][10][11][12] and references therein) for different applications and studies.
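Gustafson's law states that the scaled speed-up on N cores is S(N) = N − α(N − 1), where α is the serial fraction of the (scaled) workload. A small illustrative sketch follows; the serial fraction of 0.01 is a hypothetical value, not a measurement from this study.

```python
def gustafson_speedup(n_cores, serial_fraction):
    """Scaled speed-up per Gustafson's law: S(N) = N - alpha * (N - 1)."""
    return n_cores - serial_fraction * (n_cores - 1)

# With a (hypothetical) 1 % serial fraction, scaled speed-up stays close
# to ideal even at high core counts, unlike Amdahl's fixed-size bound.
for cores in (1024, 4096, 16384):
    print(cores, gustafson_speedup(cores, 0.01))
```

This is why growing the problem with the core count, as done in the scaled-speed-up experiments here, can keep efficiency high where fixed-size speed-up would saturate.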
On the other hand, we completed the integration of direct solvers such as SuperLU_DIST (see [16]) and SuperLU_MCDT (see [14] and [15]) into OpenFOAM. We performed the scalability tests for the integration of the direct solvers into OpenFOAM.
The remainder of this work is organised as follows: In Section 2, the test environment is presented. In Section 3, simulation test results with icoFoam and SuperLU_MCDT solvers are discussed. Section 4 summarises this work.

HPC Tools and Test environment
OpenFOAM (see [7], [29]) is an open source Computational Fluid Dynamics (CFD) toolbox. It is useful for simulating complex fluid flows involving turbulence, heat transfer and solid dynamics. It is a generic CFD software package with many tools for the main tasks of a simulation, such as pre-processing (meshing), decomposition and solution. Here, the term solver refers not only to the linear system solver but also to the Navier-Stokes solver and simulator. In this project, specifically, we used OpenFOAM to simulate blood flow in arteries as an application.
Here, we generated a structured mesh using blockMesh as the mesh generator. To decompose the generated mesh, we employed the decomposePar tool. After the decomposition, we used icoFoam as an incompressible laminar flow simulator/solver. It can use several iterative linear system solvers with different preconditioners. So far, we have tested a preconditioned bi-conjugate gradient linear solver with a diagonal incomplete-LU preconditioner, as an option in the icoFoam solver. A 5-banded (penta-diagonal) or 7-banded (hepta-diagonal) sparse matrix arises at each time step. All the simulations in this study were performed with OpenFOAM 2.1.1.
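The band structure of these matrices follows directly from the lexicographic ordering of the structured mesh: besides the main diagonal, each spatial dimension contributes a pair of off-diagonals. The following sketch (illustrative, not OpenFOAM code) lists the diagonal offsets for 2D and 3D meshes.

```python
def band_offsets(nx, ny, nz=1):
    """Diagonal offsets of the structured-mesh matrix under lexicographic
    cell ordering: main diagonal, +/-1 (x-neighbours), +/-nx (y-neighbours)
    and, for 3D meshes, +/-nx*ny (z-neighbours)."""
    offsets = [0, 1, -1, nx, -nx]
    if nz > 1:
        offsets += [nx * ny, -(nx * ny)]
    return sorted(offsets)

print(len(band_offsets(1000, 1000)))       # 5 bands: penta-diagonal (2D)
print(len(band_offsets(400, 400, 400)))    # 7 bands: hepta-diagonal (3D)
```

This is why the matrices in Table 1 have at most 5 or 7 nonzeros per row, depending on whether the mesh is 2D or 3D.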
The simulations were conducted on TGCC Curie (a Tier-0 system) (see [21]). It is a Linux environment with an InfiniBand QDR Full Fat Tree network and a disk system with 100 GB/s bandwidth. Each node has two eight-core Sandy Bridge EP (E5-2680) processors at 2.7 GHz, 64 GB of RAM and one local SSD disk. The simple decomposition method was used for partitioning the mesh into sub-domains. The decomposition of a matrix of size 32 million × 32 million into 8192 partitions was done in serial and took more than 2 hours, so parallel decomposition techniques are needed as the matrix size and the number of partitions increase.

Test results
The total run time of a simple case took about 1500 hours without preconditioning on one core for one period of the cardiac cycle, measured on the Linux Nehalem Cluster (see [28]) available at the National Center for High Performance Computing (UHeM) (see [5]). With preconditioning techniques, the run time was reduced to 15 hours on eight cores for one period of the cardiac cycle. This was a laminar, rigid-wall case with a small portion of the geometry. There were more complex simulations with longer periods (see [2,3,5]) that we ran on the Linux Nehalem Cluster (see [28]). Moreover, we needed at least 10 periods of the cardiac cycle for several cases, because periodic convergence occurs only after 10 cycles of the simulation; the necessary CPU time therefore went up to a few thousand (2000-5000) hours per case. The results of this white paper show that increasing the mesh size improves scaling on parallel computers.

Table 1 describes a dozen matrices coming from the simulation of blood flow. For example, mC_1M_D_t is a matrix encountered at the twelfth time step, at time 0.0006 s of the simulation, where the time step size is 0.00005 s; the other matrices are obtained at time 0.00005 s of the simulation. The mC_8M matrix corresponds to 8M cells in the fluid domain and has size 8 million × 8 million.

Table 1. Description of matrices with different numbers of nonzeros (NNZ) per row. The matrices come from 2D-3D meshes obtained for simple incompressible blood flow. The structured mesh produces penta- or hepta-diagonal matrices; all remaining entries are zero.
The tests were done for only 1 time step due to time limitations, while real case runs are conducted for millions of time steps. The most time-consuming part of the simulation was the decomposition of the mesh. For example, decomposing 64M cells of data took over 3 hours for 8192 partitions and over 7 hours for 16384 partitions. The decomposition was run on one core, since parallel decomposition is not supported. Meshing time was the same for all partition numbers; the total meshing time for 64M cells (i.e. a problem matrix of size 64M × 64M) was about one hour. The simple decomposition method was preferred since the running cases used a structured mesh. This technique simply splits the geometry into pieces by direction, such as 32 pieces in the x direction and 32 pieces in the y direction. When the geometry is more complex and an unstructured mesh is used, more advanced techniques can be selected, such as METIS (see [18]) or Scotch (see [19]). Also, since the mesh is structured, the mC_64M matrix corresponds to 64M cells in the fluid domain. Wall clock times are reported for matrices of increasing size: up to 8192 cores for the 32 million × 32 million matrix in Table 3, and up to 16384 cores for the 64 million × 64 million matrix in Table 4 (the smaller cases appear in Table 2). The higher per-core memory demand at small core counts can negatively affect the wall clock time, causing superlinear speed-up up to 2048 cores, since the reference time in the numerator is larger than expected. Another reason for the superlinear speed-up is the nonlinear characteristics of the iterative solver and the pre-conditioner performance. Moreover, we observe that the scalability becomes better as the problem size increases. Figure 2 illustrates this scaled speed-up for large matrices of size up to 64 million × 64 million and up to 16384 cores.
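The splitting performed by the simple method can be sketched as follows. This is an illustrative re-implementation, not OpenFOAM's actual code, and the 8000 × 8000 grid with a 32 × 32 split is a hypothetical configuration chosen only to reproduce a 64M-cell domain.

```python
def simple_decompose(nx, ny, px, py):
    """Mimic OpenFOAM's 'simple' decomposition: split an nx x ny structured
    mesh into px pieces in x and py pieces in y; returns the cell counts
    (cells_x, cells_y) of each sub-domain."""
    def split(n, p):
        # Distribute n cells over p pieces as evenly as possible.
        base, extra = divmod(n, p)
        return [base + (1 if i < extra else 0) for i in range(p)]
    return [(cx, cy) for cx in split(nx, px) for cy in split(ny, py)]

parts = simple_decompose(8000, 8000, 32, 32)
print(len(parts))                         # 1024 sub-domains
print(sum(cx * cy for cx, cy in parts))   # 64,000,000 cells in total
```

Because the split is purely directional, it is cheap and predictable for structured meshes, whereas graph-based partitioners such as METIS or Scotch are needed to balance unstructured meshes.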

SuperLU_MCDT solver results
SuperLU_MCDT is a distributed direct solver; the software will be uploaded to a website (see [17]) after academic permissions from Istanbul Technical University. Here, among several options, we used symbolic factorisation, ParMETIS (see [18]) for column permutation and Intel MKL (see [27]) as the BLAS library. The tuning of super-nodal storage parameters is important for performance, and we selected the tuned parameters relax:100 and maxsuper:110 (see [14]). Table 5 shows the factorisation time and the total time for each matrix.
We define the optimal minimum number of cores as the number of cores that provides the minimum wall clock time for a given problem size, where a right match occurs between the problem size and the available resources such as memory, in the presence of communication overhead. We find that the optimal minimum number of cores depends on the sparsity level and the size of the matrix. As the sparsity level of the matrix decreases and the order of the matrix increases, we expect the optimal minimum number of cores to increase slightly. For example, while 512 cores suffice for the mC_8M, mC_16M, mC_1M_D and mC_2M_D matrices, 2048 cores provide the minimum wall clock time for the mC_4M_D, mC_5M_D, mC_6M_D and mC_8M_D matrices in this portfolio. Overall, we observed that the wall clock time with SuperLU_MCDT is longer than that of the icoFoam solver, as expected, because direct solvers generally take longer than iterative solvers. For example, the icoFoam solver outperforms SuperLU_MCDT for the mC_8M matrix (see Figure 3). We obtained very similar results with SuperLU_DIST 3.3 for this set of matrices. We find that the communication overhead coming from ParMETIS [18] becomes one of the dominating factors in the distribution of wall clock time on the large sparse matrices for certain large numbers of cores, for example greater than 256 cores, depending on the pattern, sparsity level and order of the matrix, consistent with the results of Duran et al. [13]. For example, Table 6 shows the distribution of wall clock time (s) for the mC_8M matrix and the impact of the number of supernodes and of the ParMETIS communication overhead on performance. We obtained similar results for the other matrices in Table 1. Figure 4 and Figure 5 show the performance of SuperLU_MCDT for mC_1M_D_t and mC_8M_D, respectively.

Summary and future work
In this study, we tested the scalability of the existing OpenFOAM solver icoFoam and of SuperLU_MCDT for various sizes of penta-diagonal and hepta-diagonal matrices coming from the simulation of blood flow in arteries with a structured mesh domain. We observed speed-up up to 16,384 cores for the largest matrix in our tests using icoFoam. Moreover, we find that the scalability becomes better as the problem size grows. Figure 2 shows this scaled speed-up for large matrices of size up to 64 million × 64 million and up to 16,384 cores. We observed linear speed-up up to 4096 cores on Curie (see [21]) for a 64 million × 64 million matrix.
In general, we obtained reasonable performance results with SuperLU_MCDT, in addition to the advantages of the direct solvers for robustness.
To demonstrate the advantage of our many-core enabled direct method over iterative methods in blood flow simulations, coefficient matrices with high condition numbers arising in transient flow conditions need to be tested. It is a well-known fact that during flow simulations both the coefficient matrix and the right-hand-side vector change at each time step. This change is especially drastic during severe flow dynamics in the simulation and, in most cases, shows itself as an ill-conditioned spectrum and high condition numbers. It is important to observe that, for high condition numbers, the solution-time gap between direct and iterative solvers is expected to narrow: for more complex flow conditions, the solution time of iterative solvers increases for a fixed solution precision, while the cost of the direct method stays essentially fixed. It is also worth noting that iterative methods must work on both the coefficient matrix and the right-hand-side vector, which change at each time step, whereas our direct solver works only on the coefficient matrix. This is a further potential advantage of our direct solver for long simulation times.
Our many-core aware direct sparse solver is better able to exploit the potential benefits of many-core distributed systems than other sparse direct solvers, especially for unsymmetric matrices. Future exascale systems are expected to have heterogeneous, many-core distributed nodes, and we believe that our SuperLU_MCDT software is a good candidate for these future systems, given its scalability on the servers we tested. While SuperLU_MCDT worked without any problem up to 16,384 cores for the large sparse matrices coming from the incompressible blood flow simulation, there was observable performance gain up to 2048 cores for sufficiently large matrices, for example the mC_4M_D, mC_5M_D, mC_6M_D and mC_8M_D matrices with seven nonzeros per row. Potential challenges for our many-core aware software are resilience, accelerator support and hypergraph partitioning for both better scalability and sustainability. We are making efforts to minimise the communication overhead coming from ParMETIS for large numbers of cores and are searching for alternative solutions.
As future work, we will test the icoFoam simulator with other iterative solvers such as generalised geometric-algebraic multi-grid and incomplete-Cholesky preconditioned conjugate gradient. We will also test other flow simulators such as nonNewtonianIcoFoam and pisoFoam: nonNewtonianIcoFoam is used to simulate non-Newtonian flows, while pisoFoam targets incompressible turbulent flow.