Simulation output analysis
Introduction
This notebook analyzes the results of the resource management experimental campaign of the article “Light-weight prediction for improving energy consumption in HPC platforms” (Sections 6.4 and 6.5), published at Euro-Par 2024. For the full context of this experiment, please refer to the article preprint, which is available at [hal long-term open-access link].
Summary of the experiment. 30 different 1-day workloads have been extracted at random points in time from the Marconi 100 trace. Each workload has been replayed in simulation with the Batsim simulator [thesis link] [long-term software heritage code permalink]. A constraint is set on the power that the whole platform can use during the first 3 hours of each simulation. This constraint is enforced by our EASY backfilling implementation with powercap support [gitlab code link] [software heritage long-term code permalink], which uses a prediction of each job's power consumption to make its decisions.
The goal of this notebook is to determine the impact of the job power predictor on the schedules produced by this scheduling algorithm. The notebook takes an aggregation of all the simulation executions as input. It outputs the image files used as Figures 4 and 5 of the article, and also provides additional analyses (images + short text analyses) that could not fit within the article's page limit.
Power predictor naming difference w.r.t. article
- `upper_bound` is the predictor named `naive` in the article. It assumes that all the nodes allocated to the job are at full power during the whole job execution. This is an upper bound on the job power consumption that can be used safely from the scheduler's point of view.
- `real_mean` is the predictor that uses the real mean power of each job (a perfect oracle, unfeasible in practice, but it shows the best we would get with a perfect predictor).
- `real_max` is the predictor that uses the real maximum power of each job (a perfect oracle, unfeasible in practice, but it shows the best we would get with a perfect predictor).
- `mean` is the history-based (light-weight) mean job power predictor described in Section 4.1 of the article.
- `max` is the history-based (light-weight) maximum job power predictor described in Section 4.1 of the article.
- `zero` assumes that all jobs consume 0 W. This is strictly equivalent to EASY backfilling without powercap support, and is used as the baseline for scheduling metrics.
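As an illustration, here is a minimal sketch of the `upper_bound` estimate, reconstructed from the description above (this is not the campaign code; the function name and `job_nb_nodes` are hypothetical, and the 2100 W default matches the `max_power_per_node` constant defined in the next section):

# Sketch of the upper_bound / naive job power estimate, reconstructed from
# the description above (NOT the campaign code).
# job_nb_nodes is a hypothetical input: the number of nodes the job requests.
upper_bound_power_estimate = function(job_nb_nodes, watts_per_node = 2100.0) {
  # every allocated node is assumed to draw its maximum power for the whole
  # job execution, hence a safe over-estimation
  job_nb_nodes * watts_per_node
}
upper_bound_power_estimate(16) # 33600 W for a hypothetical 16-node job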
Code to read and prepare data
set.seed(1)
suppressMessages(library(tidyverse))
suppressMessages(library(viridis))
library(knitr)
# data extracted from the analysis of the M100 real trace from 2022-01 to 2022-09
nb_nodes = 980
max_observed_total_power = 955080
max_power_per_node = 2100.0
min_power_per_node = 240.0
max_dynamic_power = max_observed_total_power - min_power_per_node * nb_nodes
# data from the simulation campaign definition
constrained_time_window_duration_seconds = 60*60*3 # 3 hours
# read input data, fix types, reorder predictor and split predictors in categories
data = read_csv(params$simulation_aggregated_output, show_col_types = FALSE) %>% mutate(
start_dt_s = as.factor(start_dt_s),
job_power_estimation_field = as.factor(job_power_estimation_field)
)
data$predictor_name = factor(data$predictor_name,
levels=c('upper_bound', 'max', 'real_max', 'real_mean', 'mean', 'zero'))
data = data %>% mutate(
predictor_metrics = ifelse(predictor_name %in% c('real_max', 'max'), 'max',
ifelse(predictor_name %in% c('real_mean', 'mean'), 'mean',
'naive'
)),
predictor_method = ifelse(predictor_name %in% c('mean', 'max'), 'predicted', 'real')
)
data$predictor_metrics = factor(data$predictor_metrics, levels=c('naive', 'max', 'mean'))
data$predictor_method = factor(data$predictor_method, levels=c('predicted', 'real'))
# compute scheduling metrics against their matching EASY baseline
data_nz = data %>% filter(predictor_name != 'zero')
data_z = data %>% filter(predictor_name == 'zero' &
powercap_dynamic_value_ratio == max(data$powercap_dynamic_value_ratio))
data_z_joinable = data_z %>% transmute(
start_dt_s = start_dt_s,
zero_mean_utilization = mean_utilization,
zero_max_utilization = max_utilization,
zero_mean_turnaround_time = mean_turnaround_time,
)
data_nz = inner_join(data_nz, data_z_joinable, by='start_dt_s') %>% mutate(
mean_turnaround_time_minus_zero = mean_turnaround_time - zero_mean_turnaround_time,
) %>% mutate(
mean_turnaround_time_increase_ratio = mean_turnaround_time_minus_zero / zero_mean_turnaround_time
)
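The rest of the notebook relies on the aggregated CSV containing at least the following columns. This defensive check is our addition: the column list is inferred from the code used throughout this notebook, not taken from the campaign definition.

# Columns this notebook relies on, inferred from the code in this notebook.
# The aggregated file may contain more columns than listed here.
needed_cols = c(
  "start_dt_s",                   # workload identifier (trace extraction start)
  "predictor_name",               # job power predictor used by the scheduler
  "job_power_estimation_field",
  "powercap_dynamic_value_ratio", # powercap, as a proportion of the dynamic power range
  "powercap_dynamic_watts",       # powercap, in watts
  "mean_utilization", "max_utilization",
  "mean_turnaround_time",         # seconds
  "mean_power",                   # watts, averaged over the constrained window
  "energy_from_powercap",         # joules
  "nb_seconds_above_powercap",
  "max_power_from_powercap"       # watts
)
stopifnot(all(needed_cols %in% colnames(data)))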
Consistency checks
This section inspects the simulation data to make sure the values are consistent with our expectations on the algorithm.
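A preliminary check of our own: we assume that the powercap in watts is the powercap ratio applied to the maximum dynamic power range. Since this relation is an assumption on our side (it is how we read the campaign definition), we verify it on the data instead of relying on it silently:

# Assumed relation: powercap_dynamic_watts == powercap_dynamic_value_ratio * max_dynamic_power.
# If this assumption is wrong, stop the notebook loudly.
stopifnot(all(abs(
  data$powercap_dynamic_watts -
  data$powercap_dynamic_value_ratio * max_dynamic_power
) < 1e-6 * max_dynamic_power))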
During the constrained time window, is the utilization proportional to the powercap value for each (predictor, workload)? It should be.
data_nz %>% ggplot(aes(x=powercap_dynamic_value_ratio, y=mean_utilization / nb_nodes, color=predictor_name)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_abline(slope=1) +
theme_bw() +
theme(legend.position='top', legend.title=element_blank()) +
guides(color = guide_legend(nrow = 1)) +
scale_x_continuous(breaks=seq(0.1,0.7,0.2), labels = scales::percent) +
scale_y_continuous(breaks=seq(0,1,0.2), labels = scales::percent) +
scale_color_viridis(discrete=TRUE) +
expand_limits(x=0) +
facet_wrap(vars(start_dt_s)) +
labs(
y="Utilization (proportion of nodes)",
x="Powercap value (proportion of the maximum dynamic power range)"
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusion: Yes, the utilization is almost perfectly proportional to the powercap for all (workload, predictor) pairs, until the utilization becomes saturated. On 5/30 workloads the `max` predictor slightly jumps from one linear trend to another. This is consistent with the first-fit policy of the scheduling algorithm: we think that EASY becomes able to execute a job whose power consumption is overestimated by `max`, and that EASY then cannot backfill smaller jobs since it thinks that there is not enough available power.
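To make this concrete, here is a minimal sketch of the power-feasibility test that gates backfilling decisions (all names are hypothetical; the actual implementation is in the EASY-with-powercap scheduler linked in the introduction). When a running job's power is overestimated, `available_power` shrinks and small jobs that would actually fit get rejected:

# Minimal sketch of the power-feasibility test used when backfilling a job.
# Hypothetical names; NOT the actual scheduler code.
can_backfill_powerwise = function(job_power_estimate,
                                  running_jobs_power_estimates,
                                  powercap_watts) {
  # power the scheduler THINKS is in use: the sum of the (possibly
  # overestimated) predictions of the currently running jobs
  estimated_used_power = sum(running_jobs_power_estimates)
  available_power = powercap_watts - estimated_used_power
  job_power_estimate <= available_power
}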
During the constrained time window, is the utilization roughly proportional to the powercap value for each predictor regardless of the workload?
data_nz %>% ggplot(aes(x=powercap_dynamic_value_ratio, y=mean_utilization / nb_nodes, color=predictor_name)) +
geom_jitter(width=1/100, height=0) +
geom_smooth(method = "lm", se = FALSE) +
geom_abline(slope=1) +
theme_bw() +
theme(legend.position='top', legend.title=element_blank()) +
guides(color = guide_legend(nrow = 1)) +
scale_x_continuous(breaks=seq(0,0.7,0.1), labels = scales::percent) +
scale_y_continuous(breaks=seq(0,1,0.2), labels = scales::percent) +
scale_color_viridis(discrete=TRUE) +
expand_limits(x=0) +
labs(
y="Utilization (proportion of nodes)",
x="Powercap value (proportion of the maximum dynamic power range). Shown with horizontal jitter."
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusion: Yes, this is roughly proportional. For some workloads the utilization is saturated when using the `real_mean`/`mean` predictors for powercap values > 55 %.
Per-workload analysis
This section analyzes how the algorithm behaves on each workload. We believe that this is the most important analysis section of this notebook, as scheduling results must be looked at for each workload to make sense.
During the constrained time window, how much power is consumed on average for each (predictor, workload)?
data_nz %>% ggplot(aes(x=powercap_dynamic_value_ratio, y=mean_power/max_dynamic_power, color=predictor_name)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_abline(slope=1) +
theme_bw() +
theme(legend.position='top', legend.title=element_blank()) +
guides(color = guide_legend(nrow = 1)) +
scale_x_continuous(breaks=seq(0.1,0.7,0.2), labels = scales::percent) +
scale_y_continuous(breaks=seq(0,1,0.2), labels = scales::percent) +
scale_color_viridis(discrete=TRUE) +
expand_limits(x=0) +
facet_wrap(vars(start_dt_s)) +
labs(
y="Mean power (proportion of maximum observed M100 power)",
x="Powercap value (proportion of the maximum dynamic power range)"
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusions: The platform dynamic mean power consumption is linear in the powercap value (unless the platform is already saturated, in which case increasing the powercap has little effect), which is expected. Furthermore, we can clearly see that:
- This plot has the same shape as the corresponding utilization plot, which is expected. The main difference is that the maximum mean power consumption is around 70 % while the utilization goes up to 100 %.
- The `real_mean` > `real_max` > `upper_bound` predictor order in terms of mean power consumption holds for all workloads.
- The `mean` > `max` > `upper_bound` predictor order in terms of mean power consumption holds for all workloads.
- Using the `mean` history-based predictor instead of the real value `real_mean` (which cannot be used in practice as it is unknown at decision time, but which represents a perfect oracle estimator without error) has almost no impact on the power used during the constrained time window.
- Using the `max` history-based predictor instead of the real value `real_max` (same remark as above) decreases the mean power consumption. The decrease is very small on some workloads (e.g., almost no impact on workload 19389030), but quite strong on other workloads (e.g., on workload 10061708 with powercap=70 %, the mean power consumption moves from ~50 % with `real_max` to ~30 % with `max`).
How is the scheduling performance (as measured by mean turnaround time) impacted by each predictor, for all workloads?
data_nz %>% ggplot(aes(x=powercap_dynamic_value_ratio, y=mean_turnaround_time_minus_zero, color=predictor_name)) +
geom_point() +
#geom_smooth(method = "lm", se = FALSE) +
geom_hline(yintercept=0) +
theme_bw() +
theme(legend.position='top', legend.title=element_blank()) +
guides(color = guide_legend(nrow = 1)) +
scale_x_continuous(breaks=seq(0,0.7,0.2), labels = scales::percent) +
scale_y_continuous() +
scale_color_viridis(discrete=TRUE) +
facet_wrap(vars(start_dt_s)) +
expand_limits(x=0) +
labs(
y="Mean turnaround time difference against EASY without any powercap for each simulation (seconds)",
x="Powercap value (proportion of the maximum dynamic power range)"
)
Problem: The scheduling performance is expected to be degraded on most instances since 1. this algorithm does not directly change the job ordering and 2. the lower the powercap, the lower the utilization. This means the mean turnaround time difference should be positive. Workload 18474670 is clearly an outlier here, as its scheduling performance has been greatly improved by the powercap on most instances. We think that in most instances of this workload the powercap prevented big jobs (in number of requested resources and in area) from being executed, which enabled a lot of small jobs to be executed, which in turn improved the mean turnaround time metric.
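A toy example (made-up numbers) of this effect: delaying one big job worsens its own turnaround time, but can improve the mean if many small jobs stop waiting behind it.

# Toy example with made-up turnaround times (seconds): one big job and nine
# small jobs. Delaying the big job unblocks the small ones.
mean(c(10000, rep(9000, 9))) # big job first:   mean turnaround = 9100 s
mean(c(19000, rep(500, 9)))  # big job delayed: mean turnaround = 2350 s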
Here is a look at the data without the outlier workload.
outlier_workload_start_dt_s = 18474670 # sched metrics are strongly better than EASY on it
data_nz %>%
filter(start_dt_s != outlier_workload_start_dt_s) %>%
ggplot(aes(x=powercap_dynamic_value_ratio, y=mean_turnaround_time_minus_zero, color=predictor_name)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
geom_hline(yintercept=0) +
theme_bw() +
theme(legend.position='top', legend.title=element_blank()) +
guides(color = guide_legend(nrow = 1)) +
scale_x_continuous(breaks=seq(0,0.7,0.2), labels = scales::percent) +
scale_y_continuous() +
scale_color_viridis(discrete=TRUE) +
facet_wrap(vars(start_dt_s)) +
expand_limits(x=0) +
labs(
y="Mean turnaround time difference against EASY without any powercap for each simulation (seconds)",
x="Powercap value (proportion of the maximum dynamic power range)"
)
## `geom_smooth()` using formula = 'y ~ x'
Conclusions: The scheduling performance degradation is clearly linear in the powercap value on some workloads (e.g., 3079185 and 7934521), has a linear trend with noise on most workloads (e.g., 17539280), and is not linear on some workloads (e.g., 19389030 for predictors other than `upper_bound`). Additionally:
- The `mean` > `max` > `upper_bound` predictor order in terms of scheduling performance holds for most workloads.
- The `real_mean` > `real_max` > `upper_bound` predictor order in terms of scheduling performance holds for most workloads.
- The mean turnaround time difference metric spans the same range of values for most workloads, which means this metric can be aggregated over all workloads without a per-workload normalization step.
- Similarly to the mean power consumption during the constrained time window, using `mean` instead of `real_mean` seems to have very little impact on the mean turnaround time on most workloads.
- Using `max` instead of `real_max` has a small impact (a small performance degradation) on the mean turnaround time on most workloads.
Analysis aggregating all workloads together
While we think that the per-workload analysis is the most relevant, it obviously cannot fit in the 1.5 pages dedicated to the analysis in the article, as per-workload views of the data take a lot of space.
This section aggregates the results seen previously into smaller figures that can fit in the paper, and performs additional analyses on the whole dataset.
During the constrained time window, how far is the mean power compared to the powercap value for each predictor?
data_nz %>%
#filter(powercap_dynamic_value_ratio %in% powercap_ratios_values_to_show) %>%
mutate(powercap_label = sprintf("pcap=%g", powercap_dynamic_value_ratio)) %>%
ggplot() +
geom_hline(aes(yintercept=powercap_dynamic_value_ratio)) +
geom_boxplot(aes(y=mean_power/max_dynamic_power, fill=predictor_method, x=predictor_metrics)) +
theme_bw() +
theme(
legend.position=c(0.2, 0.9),
legend.direction='horizontal',
legend.title=element_blank(),
legend.background=element_rect(color='black'),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
) +
expand_limits(x=0) +
scale_y_continuous(breaks=seq(0,0.7,0.1)) +
facet_wrap(vars(powercap_label), nrow=1) +
labs(
y="Mean platform power consumption",
x="Job power estimator"
) +
scale_fill_grey(start=0.8, end=1)
Description. This figure is very similar to the per-workload mean power consumption plot of the previous section, but the per-workload mean power values are now aggregated into one boxplot per (powercap, predictor), and the predictor naming now uses the predicted metric (x axis) and whether it is the real value or the predicted one (fill color). Standard ggplot boxplots are used, which show the first (25 %), second (median, 50 %) and third (75 %) quartiles; values further than 1.5 × the interquartile range away from the box are shown as outlier points.
The final version seen in the article (Figure 4) is very similar, but for the sake of font readability only half of the powercap values are shown.
powercap_ratios_values_to_show = seq(0.1, 0.7, 0.1)
scale=0.9
width_scale=0.3
data_nz %>%
filter(powercap_dynamic_value_ratio %in% powercap_ratios_values_to_show) %>%
mutate(powercap_label = sprintf("pcap=%g", powercap_dynamic_value_ratio)) %>%
ggplot() +
geom_hline(aes(yintercept=powercap_dynamic_value_ratio), linewidth=width_scale) +
geom_boxplot(aes(y=mean_power/max_dynamic_power, fill=predictor_method, x=predictor_metrics), linewidth=width_scale, outlier.size=width_scale) +
theme_bw() +
theme(
legend.position=c(0.2, 0.9),
legend.direction='horizontal',
legend.title=element_blank(),
legend.background=element_rect(color='black'),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
) +
expand_limits(x=0) +
scale_y_continuous(breaks=seq(0,0.7,0.1)) +
facet_wrap(vars(powercap_label), nrow=1) +
labs(
y="Mean platform power consumption",
x="Job power estimator"
) +
scale_fill_grey(start=0.8, end=1)
Here is the code that produces the summarized power underutilization values seen in Section 6.5 of the article. The power difference to the powercap is normalized by the powercap value of each instance, so that aggregating the values makes sense (otherwise big powercap values would dominate the aggregation). The average value has been used in the article. Note that a negative underutilization ratio means that the mean power exceeds the powercap.
t = data_nz %>%
mutate(power_underutilization_ratio = (powercap_dynamic_watts - mean_power)/powercap_dynamic_watts) %>%
group_by(predictor_name) %>%
summarize(
average_power_underutilization_ratio = mean(power_underutilization_ratio),
median_power_underutilization_ratio = median(power_underutilization_ratio),
)
knitr::kable(t)
| predictor_name | average_power_underutilization_ratio | median_power_underutilization_ratio |
|---|---|---|
| upper_bound | 0.7392531 | 0.7381094 |
| max | 0.4356219 | 0.4404844 |
| real_max | 0.3457243 | 0.3519978 |
| real_mean | 0.0128441 | 0.0018967 |
| mean | -0.0254606 | -0.0322494 |
How is the scheduling performance degraded by each predictor?
Very similarly to the previous plot, here is how Figure 5 of the article is produced.
data_nz %>%
filter(start_dt_s != outlier_workload_start_dt_s) %>%
filter(powercap_dynamic_value_ratio %in% powercap_ratios_values_to_show) %>%
mutate(powercap_label = sprintf("pcap=%g", powercap_dynamic_value_ratio)) %>%
ggplot() +
geom_boxplot(aes(y=mean_turnaround_time_minus_zero, fill=predictor_method, x=predictor_metrics), linewidth=width_scale, outlier.size=width_scale) +
theme_bw() +
theme(
legend.position=c(0.16, 0.12),
legend.direction='horizontal',
legend.background=element_rect(color='black'),
legend.title=element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
) +
facet_wrap(vars(powercap_label), nrow=1) +
labs(
y="Mean turnaround time increase (s)",
x="Job power estimator"
) +
scale_fill_grey(start=0.8, end=1)
Here is the code that produces the summarized scheduling performance degradation values seen in Section 6.5 of the article. The scheduling performance has already been normalized by the performance of EASY without powercap for each instance. The `average_mtt_increase_ratio` value (average of the normalized mean turnaround time differences) has been used in the article.
t = data_nz %>%
filter(start_dt_s != outlier_workload_start_dt_s) %>%
group_by(predictor_name) %>%
summarize(
average_mtt_increase = mean(mean_turnaround_time_minus_zero),
average_mtt_increase_ratio = mean(mean_turnaround_time_increase_ratio),
median_mtt_increase = median(mean_turnaround_time_minus_zero),
)
knitr::kable(t)
| predictor_name | average_mtt_increase | average_mtt_increase_ratio | median_mtt_increase |
|---|---|---|---|
| upper_bound | 8691.197 | 0.1540434 | 8855.663 |
| max | 6109.214 | 0.1087719 | 6364.189 |
| real_max | 5331.574 | 0.0959661 | 5484.921 |
| real_mean | 3419.712 | 0.0625572 | 2996.683 |
| mean | 3373.779 | 0.0599860 | 3228.024 |
How much energy is consumed during the time window, compared to the energy that would be consumed by staying exactly at the powercap for the whole window duration?
data_nz %>% ggplot() +
geom_hline(yintercept=0) +
geom_violin(aes(x=predictor_name, y=energy_from_powercap / 1e9)) +
geom_jitter(aes(x=predictor_name, y=energy_from_powercap / 1e9), alpha=0.1) +
geom_boxplot(aes(x=predictor_name, y=energy_from_powercap / 1e9), width=0.025, outlier.shape=NA) +
theme_bw() +
labs(
x="Power predictor",
y="Distribution of the energy consumed (GJ)"
)
Conclusions: Energy values are consistent with the previous power plots. Only `mean` and `real_mean` used more energy than the powercap allows on the analyzed workloads, and `mean` frequently leads to more energy being used than the powercap allows.
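For reference, here is how we understand `energy_from_powercap` to be defined, reconstructed from the question above (an assumption on our side, not the campaign code; `power_per_second` is a hypothetical per-second platform power trace of the constrained window):

# Assumed definition of energy_from_powercap (joules): energy actually consumed
# during the constrained window, minus the energy budget of a platform sitting
# exactly at the powercap for the whole window.
energy_from_powercap_sketch = function(power_per_second, powercap_watts) {
  sum(power_per_second) - powercap_watts * length(power_per_second)
}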
How is the powercap exceeded during the time window for each predictor?
Whether the powercap has been exceeded or not has been computed for each second of each simulation.
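Under the same hypothetical per-second power trace as in the previous sketch, the metric plotted below would be computed as follows (our reconstruction, not the campaign code):

# Assumed computation of nb_seconds_above_powercap from a hypothetical
# per-second power trace of the constrained window.
nb_seconds_above_powercap_sketch = function(power_per_second, powercap_watts) {
  sum(power_per_second > powercap_watts)
}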
data_nz %>% ggplot() +
geom_hline(yintercept=0) +
geom_violin(aes(x=predictor_name, y=nb_seconds_above_powercap/constrained_time_window_duration_seconds)) +
geom_jitter(aes(x=predictor_name, y=nb_seconds_above_powercap/constrained_time_window_duration_seconds), alpha=0.1) +
geom_boxplot(aes(x=predictor_name, y=nb_seconds_above_powercap/constrained_time_window_duration_seconds), width=0.025, outlier.shape=NA) +
theme_bw() +
labs(
x="Power predictor",
y="Proportion of time above powercap"
) +
scale_y_continuous(labels = scales::percent)
Conclusions: Only `real_mean` and `mean` exceed the powercap on the analyzed workloads. They both exceed it frequently, but `mean` breaks the powercap more frequently than `real_mean`.
Here is the code that produces the summarized maximum instantaneous powercap break values seen in Section 6.5 of the article. The maximum instantaneous powercap break (in watts) is used (named `max_power_from_powercap` in the code). For each simulation, the powercap break (current power minus powercap) has been computed for each second during the time window, and `max_power_from_powercap` is the maximum of all these values. The value is normalized by the powercap so that the aggregation makes sense (otherwise the difference would be distorted and big powercap values would dominate the result). The average and median values have been used in the article.
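The following sketch mirrors that textual definition on the same hypothetical per-second power trace used earlier (our reconstruction, not the campaign code):

# Assumed computation of max_power_from_powercap (watts). Note that the value
# can be negative when the powercap is never reached, which is why the
# aggregation code below clamps it with pmax(..., 0).
max_power_from_powercap_sketch = function(power_per_second, powercap_watts) {
  max(power_per_second - powercap_watts)
}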
t = data_nz %>%
mutate(powercap_break = pmax(max_power_from_powercap, 0)) %>%
mutate(powercap_break_ratio = powercap_break / powercap_dynamic_watts) %>%
group_by(predictor_name) %>%
summarize(
mean_powercap_break_ratio = mean(powercap_break_ratio),
median_powercap_break_ratio = median(powercap_break_ratio),
)
knitr::kable(t)
| predictor_name | mean_powercap_break_ratio | median_powercap_break_ratio |
|---|---|---|
| upper_bound | 0.0000000 | 0.0000000 |
| max | 0.0000000 | 0.0000000 |
| real_max | 0.0000000 | 0.0000000 |
| real_mean | 0.1022596 | 0.0815692 |
| mean | 0.1472294 | 0.1419211 |
Similarly, here is the code that computes, for each predictor, in how many instances the powercap is exceeded. Predictors that never exceed the powercap are dropped by the inner join, which is why only `real_mean` and `mean` appear in the resulting table.
nb_simus = data_nz %>%
group_by(predictor_name) %>%
summarize(
total_count = n()
)
breaks = data_nz %>%
mutate(powercap_break = pmax(max_power_from_powercap, 0)) %>%
filter(powercap_break > 0) %>%
group_by(predictor_name) %>%
summarize(
break_count = n()
)
t = inner_join(nb_simus, breaks, by="predictor_name") %>%
mutate(break_ratio = break_count / total_count)
knitr::kable(t)
| predictor_name | total_count | break_count | break_ratio |
|---|---|---|---|
| real_mean | 390 | 372 | 0.9538462 |
| mean | 390 | 368 | 0.9435897 |
Erratum note: The first submitted version of the article states that `mean` breaks the powercap in 38 % of instances. That computation was wrong: `mean` breaks the powercap in 94 % of instances, and `real_mean` breaks the powercap in 95 % of instances.