Active-learning-based surrogate models for empirical performance tuning

Performance models have profound impact on hardware-software codesign, architectural explorations, and performance tuning of scientific applications. Developing algebraic performance models is becoming an increasingly challenging task. In such situations, a statistical surrogate-based performance model, fitted to a small number of input-output points obtained from empirical evaluation on the target machine, provides a range of benefits. Accurate surrogates can emulate the output of the expensive empirical evaluation at new inputs and therefore can be used to test and/or aid search, compiler, and autotuning algorithms. We present an iterative parallel algorithm that builds surrogate performance models for scientific kernels and workloads on single-core, multicore, and multinode architectures. We tailor to our unique parallel environment an active learning heuristic popular in the literature on the sequential design of computer experiments in order to identify the code variants whose evaluations have the best potential to improve the surrogate. We use the proposed approach in a number of case studies to illustrate its effectiveness.


I. INTRODUCTION
Automatic performance tuning is a response to the ever-growing complexity of high-performance computing (HPC) architectures and the difficulties of manual tuning of scientific applications. Autotuning consists of identifying relevant code optimization techniques, assigning a range of parameter values using hardware expertise and application-specific knowledge, and then either exhaustively evaluating or searching this parameter space to find high-performing parameter configurations for the given machine.
The primary goal of performance models in autotuning has been to avoid running the corresponding code configuration on the target machine by predicting performance metrics of a given parameter configuration [1]. Such models are attractive, for example, for the design and development of search algorithms. Models can help prune large-scale search spaces [2], and their ease of evaluation allows for rigorous experimental comparisons among different search algorithms. Autotuning without models is costly because it requires running many different code variants on the target machine. Models also provide low setup cost for search algorithm developers who are typically not experts in autotuning. Moreover, accurate performance models allow researchers to perform large-scale experiments without requiring access to HPC systems except during the final phase of experimental analysis [1]. However, developing performance models is increasingly difficult because of the complexity of the HPC architectures and scientific codes [2], [3].
Analytical performance models, which use closed-form mathematical expressions for predicting performance metrics, have enjoyed significant success in the compiler optimization community for accelerating serial codes. However, this approach is limited by the quality and extrapolatory power of the mathematical model, which often fails to capture complex interactions between the code, runtime systems, and architecture. Moreover, developing a complex mathematical model requires a wide range of expertise in the target system architecture, programming models, and scientific applications. Consequently, analytical models are less well suited for highly specialized kernels and libraries for scientific applications that require portability, scalability, and performance [4], [5], [6].
When analytical performance models become too restrictive for a given scientific workload and HPC architecture, empirical performance modeling is an effective alternative [6], [7]. In this approach, a small subset of parameter configurations (code variants) is evaluated on the target machine to measure the required performance metrics, and a predictive model is built by using statistical approaches. We refer to these approximate models as surrogate models. Our focus in this paper is on empirical-based modeling, where we predict the outputs of new configurations on a single target machine, a task that we distinguish from performance modeling aimed at predicting outputs of a single configuration on new architectures.
Surrogate models for HPC workloads and kernels on CPU-based architectures have been based primarily on machine learning approaches [8], [9], [10]. Similar models for scientific kernels on multicore architectures have also been examined [3], [4], [11]. Artificial neural networks have been used to model power draw, execution time, and energy usage [5]. Boosted regression trees were adopted in [6] for obtaining online surrogate models for a GPU implementation of a spatial image filtering kernel. In all these works, the focus is the deployment of various algorithms for performance prediction. In a cluster computing environment, a naïve way to build any of these surrogate models is to sample a large number of configurations uniformly at random, evaluate them in parallel, and fit a model. Although such an asynchronous, embarrassingly parallel approach may seem ideal in terms of work scalability, it can result in poor resource utilization: the model quality obtained is poor relative to the work expended.
Faced with computationally expensive evaluations, a customary approach for developing surrogate models consists of sequential evaluation of parameter configurations. At each iteration, a model is fitted to all previously seen evaluations and is then used to decide which parameter configuration to evaluate next. In performance modeling, the sequential approach has been used to build surrogate models for resource allocation on networked utilities [12]. That work used a Plackett-Burman design [13], whose general applicability is limited because its linear-regression-based models are not appropriate for nonlinear response functions. Note that the sequential approach has been studied in the fields of statistics, applied mathematics, and machine learning, where it falls under the umbrella of design of experiments [14], simulation optimization [15], and active learning [16], respectively. Such an approach cannot take advantage of massively parallel environments.
The main difficulty of deploying active learning in parallel environments is a posteriori importance: given a model and a set of unevaluated parameter configurations, the active learning approach can query the model to select a subset (rather than a singleton) of the most informative parameter configurations for parallel evaluation. However, as soon as one parameter configuration in the subset is evaluated, the other configurations in the subset can become significantly less informative. Consequently, evaluating such configurations will likely not improve prediction accuracy and may therefore represent wasted evaluations.
In this paper we propose an iterative parallel algorithm that builds surrogate models for performance modeling. The novelty of the proposed approach consists of evaluating the parameter configurations in a sequence of batches to make use of multicore and multinode cluster computing environments in order to reduce the overall time required to obtain high-quality, fitted surrogate models. We address the a posteriori importance issue by tailoring the active learning to select a configuration based on prediction accuracy informed by other configurations to be evaluated in the same batch.
Active learning as a data acquisition scheme in surrogate modeling is still in its infancy. The closest related work [17] is from the design of experiments literature, where active learning is used to model the multiple outputs from a computational fluid dynamics simulator. This approach uses a nonlinear, treed Gaussian process (GP) modeling approach and takes into account asynchronous and batch mode evaluations. The approach we adopt uses dynamic regression trees, which have recently been shown to be more effective than treed GPs [18].
From the performance-modeling perspective, the main contributions of the paper are as follows. Previous work on surrogate models focused primarily on the adoption of machine learning approaches. To our knowledge, this is the first work on the design of a data acquisition strategy for building performance models with the objective of efficiently using HPC systems and minimizing the number of expensive evaluations on the target machine. Furthermore, most existing work in surrogate modeling deals with single-node architectures. In this paper, we show that our surrogate models can predict performances on massively parallel, leadership-class machines.

II. DYNAMIC TREES
Dynamic trees [18] can be seen as regression trees combined with Bayesian inference. The former is a classical nonlinear regression approach that recursively partitions a multidimensional input space into a number of hyper-rectangles such that inputs with similar output values fall within the same hyper-rectangle. This partitioning scheme gives rise to a set of if-then-else rules that can be represented as a tree. Bayesian regression trees [19] are specified by a prior distribution on how the input space can be recursively partitioned and a likelihood comprising a product of simple, independent regression models applied in each partition. Together, these define a posterior distribution on the output space. Samples from the posterior distribution may then be obtained by simulation schemes such as Markov Chain Monte Carlo.
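The recursive partitioning underlying regression trees can be sketched in a few lines. The following is an illustrative Python sketch (not the dynaTree implementation, whose trees are grown stochastically under a Bayesian prior): it greedily splits a one-dimensional input space by variance reduction, yielding exactly the if-then-else rules described above. All function names here are hypothetical.

```python
def sse(ys):
    """Sum of squared errors around the mean of ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_tree(xs, ys, min_leaf=3, depth=0, max_depth=4):
    """Return a leaf mean, or a ('split', threshold, left, right) node."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(min_leaf, len(xs) - min_leaf + 1):
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            thr = 0.5 * (xs[order[k - 1]] + xs[order[k]])
            best = (score, thr, order[:k], order[k:])
    # stop if no admissible split, depth exhausted, or no improvement
    if best is None or depth >= max_depth or best[0] >= sse(ys) - 1e-12:
        return sum(ys) / len(ys)          # leaf: constant model
    _, thr, li, ri = best
    return ('split', thr,
            fit_tree([xs[i] for i in li], [ys[i] for i in li],
                     min_leaf, depth + 1, max_depth),
            fit_tree([xs[i] for i in ri], [ys[i] for i in ri],
                     min_leaf, depth + 1, max_depth))

def predict(tree, x):
    while isinstance(tree, tuple):        # walk the if-then-else rules
        _, thr, left, right = tree
        tree = left if x <= thr else right
    return tree
```

On a step function with a jump at 0.5, the fitted tree recovers a single split near 0.45 with constant leaves on either side.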
Dynamic trees specify a similar process for how trees evolve as new data arrive. At time t, after having seen data (x, y)^t ≡ (x_1, y_1), ..., (x_t, y_t) and inferred a tree T_t | (x, y)^t, a simple set of stochastic rules defines which T_{t+1} may be considered when (x_{t+1}, y_{t+1}) arrives. In this process, the new T_{t+1} must be identical to the old T_t except near the leaf node η(x_{t+1}) containing x_{t+1}. The process stochastically chooses from three local modifications based on support from y_{t+1} in the posterior distribution: keep η(x_{t+1}) unchanged in T_{t+1}; grow a new split, making η(x_{t+1}) a parent of two new leaves in T_{t+1}; or prune the tree to make the parent of η(x_{t+1}) a leaf in T_{t+1}. A particle approach, essentially applying these rules independently to many similar trees grown stochastically on the same data, can reduce Monte Carlo error (via averaging) and lead to more accurate uncertainty quantification (by studying the spread of trees). Taking a sequential Monte Carlo, or "filtering," approach that appropriately couples the particles/trees can offer further statistical efficiency gains. Dynamic tree models have been shown to be competitive with several related nonparametric regression schemes in out-of-sample prediction exercises, both on batch data (with random data orderings) [18] and in online settings [20]. They have also proven useful for variable selection and input sensitivity analysis in contexts where the amount of data available traditionally swamps other comparators, such as GP models [21]. Software is available through the R package dynaTree [22].
The greatest potential of dynamic trees may lie in sequential design contexts, where the model fit is allowed to recommend the future inputs x_{t+1} (and corresponding outputs) on which it is trained. The original dynamic trees paper [18] suggested a heuristic called active learning-Cohn (ALC), which has been used to approximate maximum information designs in serial applications [23]. The ALC method involves choosing among new potential inputs x the one that gives the largest reduction in predictive variance averaged over the input space. For most response surface models (e.g., GPs), calculating this aggregated statistic requires numerical methods. However, conditional on the tree structure, it is analytic for dynamic trees, representing a large computational savings. The resulting designs allocate a heavier concentration of points in areas where the response surface is changing rapidly and put correspondingly fewer points in areas of the input space that are easier to predict.
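To make the ALC idea concrete, consider a deliberately simplified setting (our construction, not the paper's): conditional on a fixed tree, take the predictive variance in a leaf holding n training points to be σ²/n. Adding a point to leaf j would shrink that leaf's variance to σ²/(n_j + 1), so the reduction averaged over leaves is analytic, and ALC selects the candidate in the leaf where uncertainty is highest.

```python
# Hypothetical ALC sketch: leaf_counts maps leaf id -> number of
# training points; candidates are (index, leaf) pairs. The reduction
# in predictive variance, averaged over leaves, from adding one point
# to leaf j is (sigma2/n_j - sigma2/(n_j + 1)) / num_leaves, which is
# analytic, as the text notes for dynamic trees.

def alc_pick(leaf_counts, candidates, sigma2=1.0):
    def avg_reduction(leaf):
        n = leaf_counts[leaf]
        return (sigma2 / n - sigma2 / (n + 1)) / len(leaf_counts)
    return max(candidates, key=lambda c: avg_reduction(c[1]))[0]

# Leaf 0 holds 2 points, leaf 1 holds 10: the sparse, hard-to-predict
# region wins, mirroring how ALC-style designs concentrate points
# where the surface is least certain.
best = alc_pick({0: 2, 1: 10}, [(0, 0), (1, 1), (2, 1)])  # -> 0
```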
Unfortunately, this methodology is tailored to one-at-a-time sequential design and model updating. That is, in a general iteration t, one recommends a new x_{t+1} based on an N-particle approximation {T_t^{(i)}}_{i=1}^N; obtains y_{t+1}; and updates the particle approximation to {T_{t+1}^{(i)}}_{i=1}^N. Many computer experiments are a hybrid of batch and sequential (see, e.g., [17]), which means that this scheme requires modification for the types of experiments we have in mind.

III. PARALLEL ACTIVE LEARNING WITH DYNAMIC TREES
The main obstacle that precludes direct adoption of any active learning scheme for parallel environments, whether based on dynamic trees or otherwise, is a posteriori importance. An active learning method, say ALC, can easily suggest a batch of parameter configurations for parallel evaluation. However, most appropriate spatial models for computer experiments (e.g., dynamic trees) facilitate learning a correlation structure among all outputs. Therefore, once any configuration in the batch finishes evaluation and the model fit is updated, uncertainty may be greatly reduced for the other points in the batch, possibly destroying their utility in improving the fit in a subsequent update (once their simulations have completed).
In order to address this issue, the active learning algorithm should first identify the single best candidate location x_i, say by ALC, and then identify other potential candidates whose predictive uncertainties are likely to be substantially reduced when x_i gets selected for the parallel evaluation. This latter set should be avoided when selecting the second candidate for the batch, and so on. One way this can be achieved is by treating the active learning model's prediction for x_i as "correct" as soon as x_i is selected for evaluation. We suggest using the fitted surrogate model to predict the output ỹ_i of x_i and updating the model with (x_i, ỹ_i). This approach will reduce the predictive variance of configurations that depend on (x_i, y_i), which may be all configurations for stationary GPs or just those sharing a leaf with x_i in the dynaTree particle approximation. Since the predicted values from the fitted surrogate model may not be accurate (in particular during the initial iterations or when the algorithm moves to a previously unseen part of the input space), when the configurations in the batch are evaluated, the model has to unlearn (x_i, ỹ_i) and relearn the value from the original evaluation (x_i, y_i).
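This imputation step can be sketched under the same simplified per-leaf variance model σ²/n used above (our illustration, not the dynaTree internals): after each pick, the predicted output is treated as correct, which lowers the predictive variance of that pick's leaf and steers subsequent picks elsewhere.

```python
# Sketch of batch selection with imputation (hypothetical names).
# After each pick we pretend its predicted output is correct: the
# pick's leaf count is incremented, which lowers that leaf's
# predictive variance and steers later picks to other leaves.
# Without the imputation (delete the counts update), every pick in
# the batch would come from the same leaf.

def select_batch(leaf_counts, candidates, n_b, sigma2=1.0):
    counts = dict(leaf_counts)      # imputed copy; the real model is untouched
    batch = []
    pool = list(candidates)         # (index, leaf) pairs
    for _ in range(n_b):
        idx, leaf = max(pool, key=lambda c: sigma2 / counts[c[1]]
                                            - sigma2 / (counts[c[1]] + 1))
        batch.append(idx)
        pool = [c for c in pool if c[0] != idx]
        counts[leaf] += 1           # "unlearned" later, once the true y arrives
    return batch

# Two leaves with equal counts, two candidates per leaf: the second
# pick moves to the other leaf only because of the imputation.
picks = select_batch({0: 2, 1: 2}, [(0, 0), (1, 0), (2, 1), (3, 1)], 2)
```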
Algorithm 1 (ab-dynaTree, where "a" and "b" stand for active learning and batch mode, respectively) summarizes the new iterative active learning algorithm. The symbols ∪, −, and | · | denote the set union, difference, and cardinality operators, respectively. The algorithm takes a set X_p of unevaluated configurations and a maximum number of code-variant evaluations n_max as input. The two parameters of the algorithm are the subsample size n_s and the batch size n_b, with n_b ≤ n_s. The initialization phase (lines 1-4) consists of sampling n_s configurations at random from X_p, obtaining a set of corresponding outputs, and using them as a training set to build a dynamic tree model M. At each iteration, a configuration x_i that maximizes the ALC statistic is selected from the subsample set X_s and added to the batch set X_b. The a posteriori importance issue within the same batch is addressed by using two models, M and M_imp. The output ỹ_i of x_i is predicted by using the imputed model M_imp, which is then updated with (x_i, ỹ_i) (lines 11-12). As soon as the batch evaluation is over, the relearning phase is realized by copying the current active learning model M to M_imp (line 16). We can also use the training points (X_out, Y_out) obtained from active learning as a training set to build surrogate models using other regression approaches.
Algorithm 1 ab-dynaTree. Input: pool of configurations X_p, max evaluations n_max, batch size n_b ≤ |X_p|, subsample size n_s ≥ n_b.

Before selecting the candidate configurations for a given batch, one can optionally select a subsample set according to a user-defined criterion (lines 6-9). Given a limited number of evaluations as a budget, and depending on the complexity of the unknown response function (for example, the number of disjoint local minima), there is a tradeoff between the number of training points and the prediction accuracy. In performance modeling for autotuning, one typically desires surrogates with higher prediction accuracy for configurations with good performance rather than for configurations with poor performance. For this purpose, in the results described in the next section we take advantage of the optional procedure in Algorithm 1 to bias the sampling toward high-quality parameter configurations. At each iteration, instead of considering the entire pool of unevaluated configurations X_p for the ALC statistic computation, the algorithm first predicts the performance metric of each configuration in the pool, assigns a weight to each configuration that is inversely proportional to its performance metric, and selects a subset X_s by weighted sampling. The poor-quality configurations (e.g., those with longer run times or higher power consumption) are thus queried less frequently even if they are deemed to improve the overall prediction accuracy.
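As a concrete illustration of Algorithm 1's structure, the following Python sketch runs the full loop end-to-end on a toy problem. The dynamic tree model is replaced by a hypothetical piecewise-constant ToyModel over ten fixed bins of [0, 1), with predictive variance 1/(n + 1) per bin; evaluate and all other names are our stand-ins, not the paper's implementation. Instead of copying M to M_imp after the batch (line 16), the sketch equivalently unlearns each imputed pair and learns the true evaluation.

```python
import random

N_BINS = 10
def leaf(x):                       # fixed partition standing in for tree leaves
    return min(int(x * N_BINS), N_BINS - 1)

class ToyModel:
    """Piecewise-constant stand-in for the dynamic tree model."""
    def __init__(self):
        self.n = [0] * N_BINS      # points per bin
        self.s = [0.0] * N_BINS    # output sums per bin
    def update(self, x, y):
        j = leaf(x); self.n[j] += 1; self.s[j] += y
    def unlearn(self, x, y):
        j = leaf(x); self.n[j] -= 1; self.s[j] -= y
    def mean(self, x):
        j = leaf(x)
        return self.s[j] / self.n[j] if self.n[j] else 0.5
    def var(self, x):              # predictive-variance proxy
        return 1.0 / (self.n[leaf(x)] + 1)

def ab_dynatree(pool, evaluate, n_max, n_s=30, n_b=5, seed=1):
    rng = random.Random(seed)
    M, M_imp = ToyModel(), ToyModel()
    X_out, Y_out = [], []
    for x in rng.sample(pool, n_s):             # lines 1-4: initialization
        y = evaluate(x)
        X_out.append(x); Y_out.append(y)
        M.update(x, y); M_imp.update(x, y)
    pool = [x for x in pool if x not in set(X_out)]
    while len(X_out) < n_max and pool:
        # lines 6-9 (optional): subsample weighted inversely to the
        # predicted metric, biasing toward good configurations
        w = [1.0 / max(M.mean(x), 1e-9) for x in pool]
        X_s = list(dict.fromkeys(rng.choices(pool, weights=w,
                                             k=min(n_s, len(pool)))))
        batch = []
        for _ in range(min(n_b, len(X_s))):     # assemble the batch
            x_i = max(X_s, key=M_imp.var)       # ALC stand-in
            y_i = M_imp.mean(x_i)               # impute the prediction
            M_imp.update(x_i, y_i)              # a posteriori fix
            batch.append((x_i, y_i))
            X_s.remove(x_i)
        for x_i, y_i in batch:                  # parallel in reality
            M_imp.unlearn(x_i, y_i)             # unlearn the imputed value
            y = evaluate(x_i)
            X_out.append(x_i); Y_out.append(y)
            M.update(x_i, y); M_imp.update(x_i, y)
        chosen = {x for x, _ in batch}
        pool = [x for x in pool if x not in chosen]
    return X_out, Y_out
```

With a pool of 100 points, an initial sample of 30, and batches of 5, the sketch collects exactly n_max = 50 distinct evaluations.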

IV. EXPERIMENTAL RESULTS
We now examine the effectiveness of ab-dynaTree in developing surrogate models of empirical performance data. Our goal is to determine whether active learning provides significant benefits over random search in parallel environments and whether the active learning approach is inherently tied to the dynaTree surrogate model.
To analyze the quality of evaluations performed by ab-dynaTree, we build models using three regression algorithms for the given problem. First we use the dynaTree algorithm (dT) with 10 repetitions, as recommended by the package authors, taking the prediction at each x as the mean of the 10 predictions. We compare this algorithm with two others: random forest (rf), a state-of-the-art and robust tree-based regression approach, and neural networks (nn), which have been shown to be effective for surrogate modeling [5]. For each algorithm, we consider two variants: al, in which the points obtained from ab-dynaTree are used for training the model, and rs, in which points are selected at random. The inclusion of rf and nn allows us to assess whether the obtained training points can also benefit other learning algorithms.
The parameter values of all these regression algorithms have a significant impact on the prediction accuracy. To avoid bias due to parameter tuning, we use the default parameter values (as in [22]) for dT and rf. Since our exploratory studies showed that the prediction accuracy of nn variants is poor, we use the tuned parameter values as suggested in [5]. We note that the results are biased toward nn variants because of this parameter tuning.
As a measure of prediction accuracy, we use the root-mean-squared error (RMSE). We repeat each variant 10 times to reduce the impact of randomness, and we take the prediction accuracy of a variant to be its RMSE averaged over the 10 repetitions. We also use a t-test to check whether the observed differences in the prediction accuracy of the variants are significant.
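For reference, the RMSE and a Welch-style t statistic for comparing two sets of repeated RMSE measurements can be computed as follows (a sketch; the paper does not specify which t-test variant was used).

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared prediction error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
                     / len(y_true))

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances;
    with 10 repetitions each, |t| above roughly 2.1 suggests a
    significant difference at the 5% level."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
```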

A. Modeling run times on serial codes
The first set of experiments was carried out on dedicated nodes of Fusion, a 320-node cluster at Argonne National Laboratory, comprising 2.6 GHz Intel Xeon processors with 36 GB of RAM, under the stock Linux kernel v2.6.18.
We build surrogate models for problems from SPAPT [24], a collection of portable search problems in automatic performance tuning. Each problem in SPAPT is defined by a kernel, input size, set of tunable parameters, feasible set of possible parameter values, and default/initial configuration of these parameters. The kernels in SPAPT include elementary dense linear algebra, dense linear algebra solver, stencil code, and elementary statistical computing kernels. SPAPT problems also include binary parameters; however, here we set these to their nominal value (false) and consider only those parameters that can take several (integer) values, since these are primarily responsible for the size of search spaces in SPAPT. The resulting number of parameters ranges between 8 and 38, and the size of the search spaces ranges between 5.31 × 10^10 and 1.24 × 10^53.
Given a problem, our goal is to build a surrogate model that can predict the mean run time of a parameter configuration x. The mean is computed over 35 generated code runs. For ab-dynaTree, we set the subsample size n_s = 100 and the batch size n_b = 50. We generated 100,000 unevaluated configurations for X_p and set the maximum number of evaluations n_max = 5,000.
To allow cross-comparison of prediction accuracy between the problems, we scale the run time values for each problem: each y_i is divided by y_max, where y_max is the maximum run time in Y_out. For the active learning variants, we consider the first 2,500 points from (X_out, Y_out) as the training set to build the surrogate model. We derive two test sets from the remaining 2,500 points: (i) the subset of points T_25% whose mean run times are within the lower 25% quantile of the empirical distribution of the 5,000 run times in Y_out; and (ii) a set T_100% of 1,000 randomly selected points. For the random sampling variants, we use the same test sets T_25% and T_100% but a different training set, with the 2,500 training points being randomly chosen from (X_out, Y_out) − T_25%. Since the training points of the random sampling variants are not entirely random, their perceived effectiveness may be artificially higher than one could expect in reality. Table I-A shows RMSEs averaged over 10 replications for 2,500 training points and tested on T_25%. The results show that, except for bicgkernel, dT(al) obtains a lower average RMSE than does dT(rs). The trend is similar for the nn variants. We can also observe that the dT variants completely dominate the nn variants despite the latter using tuned parameter values. The key advantage of dT(al) comes from its requiring relatively fewer evaluations to achieve a smaller RMSE. This is illustrated in Fig. 1, which shows the RMSE as a function of the number of training points. Since the results of the nn variants are poor, we did not include them in the figure. The results show that dT(al) achieves a lower RMSE than does dT(rs) with relatively fewer training points. In Fig. 2, we compare the number of evaluations required by the variants to reach the RMSE obtained by dT(rs) with 2,500 evaluations. On 5 out of 10 problems, dT(al) reaches the RMSE of dT(rs) within 1,000 training points. Only for bicgkernel and mvt is there no significant difference in the number of training points.

B. Modeling power consumption of serial code kernels
In this section, we focus on modeling the power consumption of serial code kernels. We obtained the component-level (CPU and DIMM) power consumption data used in [5] for mm, stencil, and lu computations. The data was collected on an Intel Xeon E5530 workstation with two quad-core processors, where each core has 32KB L1 cache and 256KB L2 cache; see [5] for further details.
The mm, stencil, and lu kernels have 7, 5, and 11 parameters and search space sizes of 6.5 × 10^5, 5.6 × 10^10, and 8.3 × 10^10 configurations, respectively. The parameters are tiles, unroll, input size, and clock frequency. The data set comprises 8,285, 4,900, and 9,700 randomly sampled configurations for mm, stencil, and lu, respectively. We stored these configurations and their corresponding outputs in a lookup table and simulated the batch mode for ab-dynaTree, where the randomly sampled configurations were given as the configuration pool.
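The lookup-table simulation of batch mode amounts to replacing each empirical evaluation with a dictionary lookup over the pre-collected configurations. A minimal sketch (the configuration encoding and wattage values are invented for illustration):

```python
# Pre-collected (configuration -> measured power) pairs back the
# table; keys and values here are made up for illustration.
table = {(16, 4): 41.2, (16, 8): 39.7, (32, 4): 44.0}

def evaluate(config):
    """Stand-in for an empirical run: just a table lookup."""
    return table[config]

batch = [(16, 4), (32, 4)]              # a batch chosen by active learning
results = [evaluate(c) for c in batch]  # "parallel" batch evaluation
```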
The results are shown in Fig. 3. We observe that al variants obtain lower RMSE when compared with rs variants, suggesting that the active learning approach is beneficial. The results from Table I-B show that dT(al) obtains significantly lower RMSE than do other variants except for stencil, where nn variants obtain slightly lower RMSE than does dT(al). The computational savings are shown in Fig. 4, where we can see that dT(al) requires fewer than 1,000 training points to reach the RMSE of dT(rs). The correlation between observed and predicted values of dT(al) is shown in Fig. 5.

C. Modeling run time and FLOPS of MPI code
In this section, we model the run time and FLOPS of miniFE, a mini application that comprises the most significant performance characteristics of an implicit finite element method using a simple unpreconditioned conjugate-gradient (cg) solver; see [25] for details. We model the run times and FLOPS of the cg kernel in miniFE. For each evaluation, in addition to the run time, we record the FLOPS taken by cg, as well as the run time and FLOPS taken by the dot product (dot), matrix-vector product (matvec), and vector-scalar product (waxpy) kernels within cg. We run ab-dynaTree only once, with the objective of building a model for the run time and with a maximum of 2,500 evaluations. In this study, we use the same evaluations for modeling both the FLOPS and run time metrics. For the validation experiments, the al variants use the first 1,500 configurations obtained from ab-dynaTree. The results are shown in Fig. 6, Fig. 7, and Table I-D. The differences in mean RMSE between dT(al) and dT(rs) are small (in the range of 0.001 to 0.003), and the t-test shows that, out of the 8 problems, only on 4 does dT(al) obtain an RMSE significantly lower than that of dT(rs). On the remaining four problems, we do not have significant evidence to say that one is better than the other. However, the differences between the al and rs variants of nn and rf are larger and clearly show that adoption of ab-dynaTree reduces RMSE. The overall low RMSE values and the illustrative scatter plots of cg and dot in Fig. 8 show that the training points, though obtained with the goal of modeling the run times of cg, are highly effective for obtaining surrogate models for the other 7 metrics.

D. Experiments on batch size
In this section, we study the impact of the batch size n_b in ab-dynaTree on the RMSE. We used the same experimental setting and data as described in Section IV-B. For n_b, we consider values in {1, 50, 100, 200}; note that n_b = 1 corresponds to the sequential version. Each iteration of ab-dynaTree thus consists of n_b parallel evaluations. The subsample size is set to 2 × n_b, except for the sequential version, in which it is set to 100. The training points obtained from ab-dynaTree are given to dT(al), and all these versions start with an initial sample size of 100. We assume each evaluation takes the same amount of time. The results are given in Fig. 9.
The results show that the batch-mode ab-dynaTree versions are one to two orders of magnitude faster than the sequential version. The RMSE of the sequential version after 2,500 evaluations is worse than that of the batch-mode versions. On lu (cpu), lu (dimm), and mm (dimm), setting n_b = 50 allows ab-dynaTree to reach a lower RMSE than n_b = 100 or 200; on mm (cpu), however, n_b = 200 is better. These observations indicate a tradeoff between n_b and the RMSE: a large n_b allows ab-dynaTree to reduce the RMSE rapidly, but because it permits only a few iterations, the final RMSE can be worse; smaller values of n_b increase the number of iterations and allow ab-dynaTree to learn the response function more effectively.

V. SUMMARY AND OUTLOOK
We have proposed an algorithm that adaptively selects inputs for parallel evaluation in order to build efficient surrogate models over the input space. We augmented a popular active learning scheme for the sequential design of experiments to ensure that each simulation node works on disparate yet useful inputs whose outputs, taken collectively, lead to updates nearly as informative as those of the one-at-a-time schemes previously deployed in serial environments. Our experiments show that this scheme can be effective for building surrogates for empirical autotuning metrics such as run time, power consumption, and FLOPS on a variety of architectures. For the surrogate model types tested, including our preferred dynamic tree model (leading to ab-dynaTree) and comparators such as random forests and neural networks, our active learning approach yields input points that produce better surrogates than ones derived from random sampling. Naturally, the significance of this benefit, especially as it pertains to the benefit of one surrogate model (e.g., ab-dynaTree) over others, depends on the output characteristics over the input space.
There are a number of straightforward extensions that would improve the power of ab-dynaTree for practical autotuning. For example, ab-dynaTree can be made completely asynchronous by overwriting M_imp with M whenever new data become available, rather than only when the batch has been fully evaluated. Perhaps a scheme similar to that deployed with TGP for a rocket booster experiment [17] would be appropriate. In that work, the authors simply substituted predictive means for the outputs of running simulations, which is less sophisticated than the a posteriori calculations we propose here. Similarly, it may be advantageous to design several related experiments (e.g., for run time, power consumption, and FLOPS) at once, that is, to have multiple surrogates (one for each) sharing a single design. One option is to deploy a round-robin scheme with independent surrogates [17], which in the batch setting is equivalent to allocating a fixed portion of new runs to be chosen by each surrogate. Although that represents a sensible first step, it ignores the correlation structure between obviously related outputs; for example, power consumption is likely to be negatively correlated with run time. The two-stage model proposed in [26] could capture such structure, although it remains to be seen whether an efficient ALC-like calculation could be derived in that context.
• ab-dynaTree can incorporate existing code variant evaluations to build surrogates with fewer (new) evaluations.
• Capturing/modeling metrics at a finer granularity (e.g., run times of individual kernels or input sizes composing a full-scale application code) to exploit additional structure for whole application modeling and tuning.