Linear scaling computation of the Fock matrix. V. Hierarchical Cubature for numerical integration of the exchange-correlation matrix

Hierarchical cubature is a new method for achieving linear scaling computation of the exchange-correlation matrix central to Density Functional Theory. Hierarchical cubature combines a k-dimensional generalization of the binary search tree with adaptive numerical integration involving an entirely Cartesian grid. Hierarchical cubature overcomes strong variations in the electron density associated with nuclear cusps through multiresolution rather than spherical-polar coordinate transformations. This unique Cartesian representation allows use of the exact integration error during grid construction, supporting O(log N) range-queries that exploit locality of the Cartesian Gaussian based electron density. Convergence is controlled by τr, which bounds the local integration error of the electron density. An early onset of linear scaling is observed for RB3LYP/6-31G *  *  calculations on water clusters, commencing at (H2O)30 and persisting with decreasing values of τr. Comparison with nuclear weight schemes sugges...

Hierarchical cubature is a new method for achieving linear scaling computation of the exchange-correlation matrix central to Density Functional Theory. Hierarchical cubature combines a k-dimensional generalization of the binary search tree with adaptive numerical integration involving an entirely Cartesian grid. Hierarchical cubature overcomes strong variations in the electron density associated with nuclear cusps through multiresolution rather than spherical-polar coordinate transformations. This unique Cartesian representation allows use of the exact integration error during grid construction, supporting O(log N) range-queries that exploit locality of the Cartesian Gaussian based electron density. Convergence is controlled by r , which bounds the local integration error of the electron density. An early onset of linear scaling is observed for RB3LYP/ 6-31G ** calculations on water clusters, commencing at ͑H 2 O) 30 and persisting with decreasing values of r . Comparison with nuclear weight schemes suggests that the new method is competitive on the basis of grid points per atom. Systematic convergence of the RPBE0/6-31G ** Ne 2 binding curve is demonstrated with respect to r . © 2000 American Institute of Physics. ͓S0021-9606͑00͒32442-4͔

I. INTRODUCTION
Hybrid Hartree-Fock/Density Functional Theory ͑HF/DFT͒ 1-5 has proven to be an important recent development in electronic structure theory, with a high accuracy to cost ratio that has dramatically extended the size and reliability of ab initio calculations in many new areas. Hybrid HF/ DFT methods are economical because they are Self-Consistent-Field ͑SCF͒ or independent particle models, where the fundamental variable is the density matrix P. Recently, O(N) methods for solving various aspects of SCF theory have been introduced that enjoy a computational cost scaling only linearly with system size. While the promise of these methods for performing large scale HF/DFT applications is real, programs able to achieve full linear scaling for HF/DFT have not been forthcoming, due perhaps to a number of difficult bottlenecks that must be overcome simultaneously.
In HF/DFT, the Fock matrix is where h is the core Hamiltonian, J is the Coulomb or Hartree matrix, a 0 ϳ0.20 is a mixing parameter for the exact Hartree-Fock exchange matrix K x HF , and K xc DFT is the exchangecorrelation matrix. Typically K xc DFT is derived from an exchange-correlation functional in the generalized gradient approximation ͑GGA͒, E xc GGA ͓͔, with a 0 of the DFT exchange excluded. [1][2][3][4][5] Due to inefficiencies associated with the implementation of Hartree-Fock exchange in plane wave, grid, wavelet, Slater and numerical orbital methods, the only viable approach for large scale HF/DFT calculations appears to in-volve the use of standard quantum chemical basis sets that employ the Cartesian-Gaussian, Linear Combination of Atomic-Orbital ͑CG-LCAO͒ approximation, [6][7][8][9] for which well developed analytic methods exist to evaluate the relevant two-electron integrals. 10 There are at least four major bottlenecks that must be overcome to achieve linear scaling HF/DFT ͑three for computing F, and one for solving the SCF equations͒. In this series, [11][12][13][14] linear scaling methods for computing CG-LCAO based J and K x HF matrices have been developed, and accuracies comparable to direct SCF 15,16 have been demonstrated for three-dimensional systems. Also, bottlenecks associated with Löwdin orthogonalization 17,18 and Roothann- Hall 19,20 solution of the SCF equations have recently been overcome with sparse-blocked matrix technologies and density matrix minimization, 21,22 likewise demonstrating linear scaling and error control for three-dimensional benchmark suites. On the other hand, methods for linear scaling computation of the CG-LCAO exchange-correlation matrix K xc DFT 23,24 are less well developed, achieving linear scaling only for small basis sets and extremely large or low dimensional systems. 25 Products of the CG-LCAO basis functions, ab (r) ϭ a (r) b (r), are the fundamental participants in formation of the exchange-correlation matrix. In particular, modern algorithms compute the exchange-correlation matrix with where restricted SCF theory has been assumed and the second term disappears in the local density approximation. 26 Because the exchange-correlation functional E xc ͓͔ is in general nonanalytic, the corresponding matrix elements must be computed numerically. a͒ Electronic mail: MChalla@LANL.Gov; URL: http://www.t12.lanl.gov/ home/mchalla/ CG-LCAO basis functions are typically contractions of primitive Gaussian functions: a (r)ϭ ͚ i C i a i (r). Both CG-LCAO basis function products, and the density, may be resolved into primitive distributions , which are linear combinations of Hermite-Gaussian functions involving the same ''mother'' Gaussian with scale factor and center Q. [27][28][29] For typical CG-LCAO basis functions, spans ͓10 Ϫ2 ,10 5 ͔. Also, due to the Gaussian product theorem the coefficient vectors c p and d q decay as Gaussians with separation of the product basis function centers, typically spanning ͓10 Ϫ12 ,10 2 ͔. [27][28][29] Thus, extreme variability in both the range and the magnitude of the suggest a multiscale approach to computation of K xc DFT . The strongly peaked nature of the electron density about nuclear centers requires special treatment, and has inspired sophisticated algorithms for domain decomposition and variable transformation, yielding integration schemes with a minimal number of grid points per atom ͗N Pts ͘. There are at least three main decomposition paradigms: nuclear weight functions, [30][31][32][33][34][35][36][37][38] Voronoi polyhedra, 39 and parallelepipeds. 40,41 These methods all employ a transformation to spherical coordinates, with the Jacobian suppressing nuclear cusps in the electron density. 42 However, these cusps are not strong singularities, and it is interesting to consider an adaptive and purely Cartesian approach to numerical integration, where subdivision achieves arbitrary resolution of strongly peaked domains while sparing effort in regions that are slowly varying. Such a scheme is appealing in the development of linear scaling methods as it suggests a ''fast'' tree-based approach.
This concept is pursued to fruition, and a new linear scaling method for computing the exchange-correlation matrix is presented that fully exploits the mutiscale structure of and ab by avoiding work at the level of primitive distributions. This is accomplished through hierarchical representation of both the density and the integration grid using the k-d tree data structure, [43][44][45] allowing efficient orthogonal range queries to dynamically select only numerically significant contributions. The k-d tree data structure is used to implement a decomposition paradigm that employs recursive bisection and nonproduct Cartesian Gaussian integration rules (C 3 rules͒, resulting in a Hierarchical Cubature able to resolve the strong variations and multiple length scales encountered in numerical integration of the exchangecorrelation matrix.

Hierarchical
Cubature ͑HiCu͒ employs the k-dimensional binary search tree ͑k-d tree͒ [43][44][45] data structure to represent and evaluate both the electron density and the integration grid. The k-d tree associated with the density is the RhoTree, and the k-d tree associated with the grid is the CubeTree. The k-d tree is a k-dimensional generalization of the binary search tree, allowing rapid location and evaluation of minimally overlapping grid and density components.
Each node in a k-d tree includes a k-dimensional hyperrectangle or bounding box ͑BBox͒ that encloses the spatial data contained by it and its children. A range query is a search of the k-d tree for all data within a given Euclidean distance. Performing a range query involves starting at the root of a k-d tree, and recursively searching for BBoxes that overlap a point ͑or a BBox about the point, depending on how the search is formulated͒. Recursive descent of the k-d tree is halted when a nonoverlapping BBox is encountered, while those nodes with overlap are traversed until all leaf nodes within range have been accessed, resulting in an O(log 2 N) search. [43][44][45] The two k-d tree structures employed by HiCu, the Rho-Tree and the CubeTree, are shown in Fig. 1. In the case of the RhoTree, a point-BBox implementation is used, wherein the range is incorporated into the RhoTree by expanding each BBox by a certain extent. Alternatively, the interaction of the CubeTree with a primitive distribution p is formulated in terms of BBox-BBox overlaps.

A. The RhoTree
The electron density is represented by the fivedimensional RhoTree. The five coordinates include three Cartesian directions corresponding to the distribution centers Q, the distribution amplitude, and the distribution scale factors. The RhoTree is constructed by recursive bisection starting at the root, which encompasses the entire density. With each subdivision, a new three-dimensional BBox containing the centers Q is generated, and a new maximum distribution amplitude and minimum distribution exponent are determined. Bisection proceeds until each distribution is resolved into a leaf node. This is shown in Fig. 2. Given a target error in the density, each BBox in the RhoTree is expanded by the amount which is unique to each node. Thus, the contribution a node makes outside its BBox is assumed to be zero. It is now possible to evaluate the density at a given point by starting at the root of the RhoTree and traversing only those nodes having a BBox containing the point. This is efficient because the test for point-box overlap is cheap, because the search is O(log 2 N), and because exclusion of insignificant density components occurs early in traversal as a result of including ordinates for the amplitude and scale factor.

B. The CubeTree
The CubeTree is constructed through recursive bisection of a Cartesian domain enclosing the density, applying a fixed fully symmetric C 3 ͑cubature͒ rule [46][47][48][49] to nodes with in-creasingly small volumes in areas where the integrand varies rapidly. This is similar to conventional methods for the adaptive solution of multiple integrals, 50-52 which employ local error estimates based on sequences of embedded grids, [53][54][55] where rules of different degree are constrained to employ common points.
The differences between HiCu and conventional methods of adaptive integration are ͑1͒ the k-d tree structure is persistent, with each node of the CubeTree containing a BBox; ͑2͒ the exact error is used instead of an estimate, allowing freedom in the choice of a cubature rule.
The accuracy of the HiCu grid is determined by a local error threshold ᮀ ; the CubeTree is extended recursively until the error in each leaf node, Error, is bounded by ᮀ . This approach improves the grid systematically with respect to the electron count N El , but it is not guaranteed to improve with respect to the exchange-correlation energy, which is a global measure of K xc DFT . However, except for very loose thresholds, ᮀ տ10 Ϫ4 , the relative errors in E xc and G xc are found to be of the same order as the relative error in N El . Others have come to similar conclusions, with the relative error ⑀ r in N El taken as a measure of grid accuracy in a number of recent studies. 32,33,35,38 Recursive construction of the CubeTree starts at the root with a BBox that integrates the density to within a requested relative accuracy. For each node ͑Cube͒ in the CubeTree, the RhoTree is traversed, performing a simple test for BBox-BBox intersection. For overlapping BBoxes, contri- In construction of the CubeTree, locality of the density is fully exploited by evaluating only those primitive components that contribute to a grid point. This is accomplished through range queries, which involve walking on the RhoTree performing a check of each node for significance, such that children of insignificant nodes are not traversed. At a higher level, this involves a test for BBox-BBox intersection, and for BBoxes that do overlap, examining the RhoTree children for those that contain a grid point. In evaluation of the exchange-correlation matrix elements, locality of the primitive distributions p is exploited through range queries that involve walking on the CubeTree, testing for BBox-BBox intersection.

FIG. 2. A terminal branch in the RhoTree. The branch node involves a
BBox that encloses the centers Q of two primitive Gaussian distributions ͑solid lines͒ in a minimal way. This minimal BBox is expanded by an amount Extent, given by Eq. ͑7͒, to create an expanded BBox, shown with dashes. Outside the expanded BBox, contributions from the distributions are less than . The branch node is split into two leaf nodes, each containing one primitive distribution with a unique Extent and BBox.
butions to the analytic integral over a Cube's BBox are accumulated and evaluation of the density proceeds at the level of grid points, recurring down the RhoTree testing for individual point-BBox overlaps. This is shown in Fig. 1.

C. Evaluation of K xc
Elements of the exchange-correlation matrix are computed using Eq. ͑2͒ at the level of primitive distributions, p . Each p involves a different Extent, which determines a unique BBox via a straightforward generalization of Eq. ͑7͒. Primitive matrix elements are computed by traversing the CubeTree, checking for BBox-BBox overlap as shown in Fig. 1.

B. Thresholds
Recursive construction of the CubeTree starts at the root with a BBox that analytically integrates the density to within ⑀ r ϭ r* 10 Ϫ1 , where r is the target relative error in N El . This target value is coupled to the local error threshold ᮀ by ᮀ ϭ␦ r r . The value ␦ r ϭ0.2 was found to work well for the systems and basis sets used here, yielding ⑀ r Շ r .
Since the HiCu grid is rebuilt each SCF cycle, it is inefficient to use full precision of the density early in construction of the CubeTree. Therefore, the CubeTree is constructed iteratively, using progressively tighter values of , which controls accuracy of the density according to Eq. ͑7͒. In practice, setting ϭ10 Ϫ3 ᮀ yields results that are indistinguishable from those obtained with the full density. With every iteration, ᮀ is decreased toward its target value, each BBox of the RhoTree is re-expanded, and the HiCu grid is refined. Starting at ᮀ ϭ100 r and progressing to ᮀ ϭ␦ r r in 50 iterations yields significant speed ups.
Matrix elements are computed for atom pairs (A,B) with min ͉AÀB͉ 2 ϽϪlog(10 Ϫ3 r ), and entered block-wise into the BCSR sparse matrix representation of K xc DFT . 21,22 To summarize, a target accuracy r is specified, determining the local absolute cubature error ᮀ . The HiCu grid is constructed iteratively using progressively more accurate values of the density. At maximum resolution, each leaf in the CubeTree has achieved an absolute error less than ᮀ ϭ␦ r r using numerical values of the density that are accurate to within ϭ10 Ϫ3 ᮀ .

C. Implementation
Program HiCu has been coded in FORTRAN90 using object oriented principles, and implemented within the MONDOSCF suite of linear scaling SCF programs. 56 Both the RhoTree and the CubeTree are implemented as doubly linked lists, using pointers as shown in Fig. 3 for the Cube-Tree. This allows dynamic memory allocation and tree traversal with recursive subroutines and functions. It is worth pointing out that significant performance improvements were obtained by minimizing argument lists in recursive calls. This approach has resulted in a very compact and efficient code, in contrast with the integer arrays used to implement doubly linked lists in the FORTRAN77 version of the Quantum Chemical Tree Code. 11,28,29 Evaluation of the density relies heavily on the exponential function EXP, while analytic integration of the density requires the repeated evaluation of both EXP and ERFC. In HiCu, EXP has been implemented using fourth order Chebychev interpolation, and ERFC with third order Chebychev interpolation. Both functions achieve an absolute accuracy of 10 Ϫ12 , and yield a significant performance enhancement relative to full precision of the intrinsic functions provided by the compiler.
All programs have been compiled and run on a 550 MHz PIII running version 6.1 of RedHat LINUX, 57 using version 3.03 of the Portland Group pgf90 compiler 58 with the options -O2 -tp p6 -pc 64 -Mdalign -Mnoonetrip.

IV. RESULTS AND DISCUSSION
Results have been obtained for the standard suite of three-dimensional water clusters (N H 2 O ϭ10,20,30,40,50,60, and 70͒, which have been used in a number of linear scaling studies. 11,12,14,21,22,28,29,[59][60][61] Restricted B3LYP calculations using the VWN3 variation 62 and the 6-31G** basis set were carried out for this sequence of water clusters using r ϭ10 Ϫ4 ,10 Ϫ5 ,10 Ϫ6 ,and 10 Ϫ7 , with timings shown in Fig. 4. The guess density for these calculations was obtained by basis set switch 21 from a converged RHF/STO-3G density matrix. Timings corresponding to the mixed basis set have not been averaged in. After the switch, timings per K xc DFT build were found to be nearly identical, and only values corresponding to the converged density are reported. That is, no averaging whatsoever has been performed. The results in Fig. 4 show a very early onset of linear scaling, which persists for decreasing values of r . In Fig. 5, the HiCu grid obtained with r ϭ10 Ϫ3 is shown for (H 2 O) 70 . The grid points are contained by the FIG. 3. The Cube object. Left is a link to the left child, and Right is a link to the right child, or the next node to traverse in the doubly linked list. Leaf is a logical flag indicating either branch or leaf-node status. IXact is the exact value of the density, integrated analytically over BBox. Grid, Wght, and Vals are allocatable, and hold the grid points ͑3,NGrid͒, the weights ͑NGrid͒, and the values ‫ץ‬E xc ‫ץ/‬ , ‫ץ‬E xc ‫͉"͉ץ/‬ 2 , and " ͑5,NGrid͒. During grid construction these arrays are deallocated when a node is split.
CubeTree root node's BBox, and form a telescoping structure about each water molecule due to peaks in the electron density and multiscale properties of the grid. The nonoverlapping Cartesian properties of the HiCu grid should make fine grained domain decomposition in parallel implementations straightforward.
Unfortunately, direct comparison with the ''industry standard'' LINXC 63 is not possible due to license restrictions, 64 and because of unpublished errors and undisclosed methods for achieving averaged CPU times. How-ever, a qualitative comparison of performance may be achieved by comparison with the results of Ref. 63, wherein linear scaling of J and K xc DFT is not achieved with basis sets and molecular systems nearly identical to the benchmark suites used in this study, even for clusters as large as (H 2 O) 392 . Here, the onset of linear scaling appears between (H 2 O) 20 and (H 2 O) 30 .
A more direct comparison with other methods is possible on the basis of grid statistics. For the water clusters studied here, ͗N Pts ͘ depends on the target accuracy, and approaches an asymptote as shown in Fig. 6. This may be due to surfacevolume effects, as the domain of integration is approximately cubic, while the water clusters are approximately spherical. Thus, as the cluster size increases the interstitial region ͑cube-sphere͒ becomes small relative to the volume of the sphere.
In Fig. 7, the dependence of ͗N Pts ͘ on ⑀ r is plotted, from which the efficiency curve ⑀ r ϭC 10 Ϫn͗N Pts ͘ , ͑11͒ is obtained, allowing a semiquantitative comparison with other methods that have published statistics. In Table I, the values n and C are listed for the HiCu grid and for grids using the nuclear weights of Becke, Delley, and Lin, Jaffe and Hess ͑LJH͒. 38 However, the HiCu values have been acquired using computed worst case errors while the values taken from LJH correspond to target accuracies. Also, different molecular systems and basis sets were used to arrive at these values. It should be noted that HiCu adapts to the basis set, with the size of the HiCu grid depending on the ratio of largest to smallest Gaussian exponent. When this ratio is large, the CubeTree will have to recur to higher resolution to achieve a given accuracy. Nevertheless, on the basis of this comparison, the HiCu grid appears competitive with nuclear weight schemes for low to medium accuracies. A challenge facing all numerical DFT methods is to minimize the magnitude of grid errors, which may lead to discontinuities in the molecular potential energy surface. The HiCu grid error decreases systematically with decreasing values of r , as shown in Fig. 8 for the RPBE0/6-31G** Ne 2 binding curve ͑see Ref. 4 for details of this model chem-istry͒. For r ϭ10 Ϫ6 , the relative error in the total energy is on the order of 10 Ϫ6 , with an absolute error in the binding curve of 10 Ϫ3 hartree. Relative errors in forces should be of the same order as those in the total energy, and for ab initio molecular dynamics an accuracy of 5-6 digits should be sufficient for many applications. At r ϭ3ϫ10 Ϫ8 , errors in the binding curve have decreased by at least an order, and for r ϭ10 Ϫ9 a completely smooth curve obtains.

V. CONCLUSIONS
A new method, Hierarchical Cubature ͑HiCu͒, has been developed for achieving linear scaling computation of the exchange-correlation matrix within the Cartesian-Gaussian LCAO framework. HiCu has demonstrated an early onset of linear scaling for three-dimensional systems, which is persistent with increasing accuracy and for nontrivial basis sets.
HiCu is unique in employing an entirely Cartesian grid for computation of the exchange-correlation matrix, overcoming nuclear cusps in the electron density with a multiscale approach rather than resorting to conventional spherical-polar coordinate transformations. The purely Cartesian representation supports the k-d tree data structure enabling efficient range queries, and provides an exact expression for the integration error at each level of resolution. This nonoverlapping Cartesian representation may also enable the efficient implementation of fine grained parallelism.
One of the most demanding aspects of HiCu is the resolution of nuclear-electron cusps. As the Gaussian LCAO approximation tends toward completeness and the ratio of maximum to minimum Gaussian exponents becomes large, the HiCu grid can be expected to extend itself accordingly. However, as the nuclear-electron cusp is a weak singularity, the CubeTree will not become unbounded. Accordingly, the use of pseudopotentials should lead to substantial gains in efficiency. It may also be possible to employ coordinate transformations within each cube that yield more accurate results than the native rule. During the development of HiCu, a number of one-dimensional transformations were tested for regularization of the integrand, including Pederson's variational mesh approach. 40 These transformations were found to yield significant advantages in the case of a one-dimensional density, but to be ineffective in three dimensions. Threedimensional transformations such as those described in Refs. 65 and 66 may be more effective, but have not been tried.
A number of similarities exist between HiCu and the Quantum Chemical Tree Code ͑QCTC͒ for computation of the Coulomb matrix J. 11,28,29 As the HiCu grid is traversed, work is performed only for overlapping nodes. Likewise QCTC avoids work by invoking the multipole approximation for nodes with insignificant penetration. With a sufficiently with the maximum ͑worst͒ ⑀ r ͑taken over the entire series of water clusters͒. A fit to the last three points is also shown. Using all four points yields ͗N Pts ͘ϭ40 ⑀ r Ϫ.37 , but a less agreeable fit. accurate multipole approximation, the evaluation of penetration acceptability criteria ͑PAC͒ in the current implementation of QCTC dominates when representation of the density proceeds to the finest resolution. The more efficient k-d tree methods developed here for accessing overlapping elements of the density may thus be applied directly to the next generation of QCTC.

ACKNOWLEDGMENTS
This work has been carried out under the auspices of the U.S. Department of Energy under Contract No. W-7405-ENG-36. Thanks to Tommy Sewell and Jeff Hay for a careful reading of the manuscript.