Custom Cost, Loss, and Reward Functions to Train Regression Models for Estimating Execution Resources Using HARP
Description
High-performance computing (HPC) is crucial for executing resource-intensive scientific workflows such as genome sequencing, weather prediction, and deep neural network (DNN) training. However, achieving optimal resource utilization during job execution requires selecting the right execution endpoint for the job's specific resource requirements. Different clusters, such as those at OSC or TACC, have diverse architectures, allocation guidelines, and billing policies, making it challenging for users to fine-tune resource allocations for their applications. To address this issue, we introduce HARP (HPC Application Resource [runtime] Predictor), an AI-based solution for HPC developed as part of the ICICLE project under the AI4CI initiative.
HARP enables application profiling and recommends optimal resource-allocation configurations, reducing execution costs without compromising workflow execution times. In our previous papers, we presented the architecture of HARP and its principal components. We are now exploring extensions to this preliminary framework along the following dimensions:
1. Loss Function: Traditional regression models are fit with error functions such as mean square error (MSE) or mean absolute error (MAE). For resource allocation, however, under-estimations are more expensive than over-estimations because they can lead to job termination. We propose a loss function that penalizes under-predictions more heavily than over-predictions, steering models toward overestimating resource requirements and thereby avoiding job failures (a sketch of such an asymmetric loss appears after this list).
2. Metric System: Common evaluation metrics such as MSE, root mean square error (RMSE), or mean absolute percentage error (MAPE) treat under- and over-estimations equally and can therefore rank estimators inaccurately. We propose a metric that favors estimators with fewer under-predictions while still penalizing excessive over-estimation, offering a more balanced ranking (an illustrative weighted metric is sketched after this list).
3. Cost Function: To penalize the regression model for underestimating resources, we propose a reward system based on job execution status and re-execution costs. We simulate a "black-box re-allocation policy" that doubles the requested execution time each time a job fails under the suggested allocation; the policy expands expandable resources such as cores, memory, or walltime until the job executes successfully (a simulation sketch follows this list).
4. Memory Modelling: Memory modelling is crucial when training deep neural network (DNN) models, since memory consumption depends on factors such as model training parameters, optimizers, and training data. The ZeRO paper introduces a straightforward formula for estimating the memory consumed by model states and the savings offered by its memory optimizations in distributed training. We aim to leverage this formula to estimate memory requirements and identify, from the available options, a system configuration capable of accommodating and effectively training the model (a sketch of the estimate follows this list).
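To make the bias in item 1 concrete, the following is a minimal sketch of an asymmetric squared-error loss; the function name `asymmetric_mse` and the `under_weight` value are illustrative assumptions, not HARP's actual implementation.

```python
import numpy as np

def asymmetric_mse(y_true, y_pred, under_weight=4.0):
    """Squared error that charges under-predictions (y_pred < y_true)
    `under_weight` times more than over-predictions, nudging a model
    fit with this loss toward overestimating resource needs.
    The weight value is illustrative, not HARP's actual setting."""
    residual = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    weights = np.where(residual < 0.0, under_weight, 1.0)  # residual < 0 => under-prediction
    return float(np.mean(weights * residual ** 2))
```

For a true walltime of 100 minutes, predicting 90 costs 4 x 10^2 = 400, while predicting 110 costs only 10^2 = 100, so training with this loss is pushed toward the safer over-estimate.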
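The ranking metric of item 2 could take a similar weighted form. The sketch below is one illustrative possibility, an under-prediction-weighted MAPE, and is not necessarily the metric adopted in HARP.

```python
import numpy as np

def weighted_mape(y_true, y_pred, under_penalty=10.0):
    """Mean absolute percentage error in which under-predicted samples
    are weighted `under_penalty` times more than over-predicted ones.
    Lower scores rank an estimator higher; the penalty value is illustrative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ape = np.abs(y_pred - y_true) / y_true          # per-sample percentage error
    weights = np.where(y_pred < y_true, under_penalty, 1.0)
    return float(np.mean(weights * ape))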
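The doubling policy of item 3 can be simulated directly. The sketch below assumes walltime is the expandable resource and that every attempt is billed for its full requested allocation, which is a simplification of real billing policies.

```python
def reexecution_cost(actual_need, predicted, growth_factor=2.0):
    """Simulate the black-box re-allocation policy: while the requested
    allocation is below the actual requirement, the job fails, the request
    is doubled, and the job is resubmitted. Returns the total charged
    allocation across all attempts and the number of attempts."""
    assert predicted > 0, "allocation request must be positive"
    request, total_cost, attempts = predicted, 0.0, 0
    while request < actual_need:       # under-allocated: job is terminated
        total_cost += request          # the failed attempt is still billed
        request *= growth_factor       # policy doubles the request
        attempts += 1
    total_cost += request              # successful attempt
    return total_cost, attempts + 1
```

For example, an estimator that predicts 30 minutes for a job needing 100 minutes pays 30 + 60 + 120 = 210 minutes of allocation over three attempts; this re-execution cost is the penalty the reward system feeds back to the model.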
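For item 4, a rough per-device estimate of model-state memory can be derived from the (2 + 2 + K)Ψ accounting in the ZeRO paper: 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and K = 12 bytes of Adam optimizer state per parameter. The helper below is our sketch of how HARP could apply that formula; it ignores activation memory and exact partitioning details.

```python
def zero_model_state_gb(num_params, num_devices=1, k_optimizer=12, stage=0):
    """Estimate per-device model-state memory (GB, 1e9 bytes) for
    mixed-precision Adam training using the ZeRO accounting:
    2 bytes fp16 parameters + 2 bytes fp16 gradients + K bytes optimizer state
    per parameter. `stage` mimics ZeRO-1/2/3: 1 partitions optimizer states
    across devices, 2 additionally partitions gradients, 3 additionally
    partitions parameters. Activation memory is not included."""
    params_b = 2.0 * num_params
    grads_b = 2.0 * num_params
    optim_b = float(k_optimizer) * num_params
    if stage >= 1:
        optim_b /= num_devices
    if stage >= 2:
        grads_b /= num_devices
    if stage >= 3:
        params_b /= num_devices
    return (params_b + grads_b + optim_b) / 1e9
```

With 1.5e9 parameters and no partitioning (stage=0), this gives about 24 GB of model states, in line with the GPT-2 example discussed in the ZeRO paper; HARP can compare such estimates against the memory available on each candidate endpoint.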
The HARP code is available for download and installation on Linux-based systems and supports standalone execution. HARP is also integrated with TAPIS, allowing profiling and walltime estimation via HTTP API calls. We have submitted a portal to Science Gateways that demonstrates, using existing profiled datasets, how application hyperparameters influence the choice of execution endpoint, along with a demo of HARP profiling a new workflow. The portal also visualizes recommendations for users' configurations against the pre-identified DNN hyperparameters and system configurations.
Files
Gateways2023_paper_51.pdf (41.3 kB)
md5:11b7d556e23d04cb6baa612be82cb230