Marconi100 node power consumption analysis and modeling

Introduction

This notebook analyzes the power consumption of the Marconi100 nodes in 2022 (from 2022-01 to 2022-09), as available in the M100 ExaData trace. It is part of the work conducted for the article “Light-weight prediction for improving energy consumption in HPC platforms”, published at Euro-Par 2024. For the full context of this work, please refer to the article preprint, which is available on [hal long-term open-access link].

The goal of this notebook is to model how the nodes behave in terms of power consumption.

Read the aggregated data

set.seed(1)
suppressMessages(library(tidyverse))
suppressMessages(library(viridis))
library(knitr)

data = read_csv(params$m100_node_power_aggregation, show_col_types = FALSE) %>% transmute(
  power = as.integer(power),
  node = as.integer(node),
  nb_occ = as.integer(nbocc)
)

First look, consistency check, filtering

knitr::kable(summary(data))
    power            node            nb_occ
Min.   :   0    Min.   :  0.0    Min.   :      1
1st Qu.: 700    1st Qu.:248.0    1st Qu.:    704
Median :1080    Median :493.0    Median :   5859
Mean   :1086    Mean   :492.6    Mean   :  15073
3rd Qu.:1460    3rd Qu.:738.0    3rd Qu.:  19952
Max.   :2100    Max.   :979.0    Max.   :1105690
nb_nodes = nrow(data %>% select(node) %>% unique())
nb_measures = sum(data$nb_occ)
sprintf("There are %d nodes and %d measures", nb_nodes, nb_measures)
## [1] "There are 980 nodes and 1129106456 measures"

There are 980 nodes, which is consistent with the Marconi100 platform description. Every Marconi100 node should comprise one IBM Power9 AC922 @3.1GHz 32-core CPU and 4x NVIDIA Volta V100 GPUs. NVIDIA specifications state that the maximum power consumption of each GPU should be 250-300 W. The 2100 W maximum power value therefore seems reasonable to us for such computing nodes.
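As a rough sanity check of this claim, the following sketch compares the power budget of the GPUs with the observed maximum. It only uses the numbers quoted above; the variable names are ours and do not appear elsewhere in this notebook.

```r
# Values taken from the text above: 4 GPUs per node, 250-300 W each.
nb_gpus_per_node = 4
gpu_max_power_w = c(low = 250, high = 300)
node_max_power_w = 2100  # maximum power value observed in the trace

# Power attributable to the GPUs when they all run at their maximum.
gpu_budget_w = nb_gpus_per_node * gpu_max_power_w
# What is left for the CPU, memory, fans and other components.
remaining_w = node_max_power_w - gpu_budget_w

print(gpu_budget_w)
print(remaining_w)
```

The GPUs alone account for 1000 to 1200 W, which leaves 900 to 1100 W for the rest of the node.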

However, the minimum power value of 0 W is unexpected, as even a fully idle node consumes more than 0 W.

Let us visualize the distribution of the power measures (regardless of nodes).

all_nodes_agg = data %>%
  group_by(power) %>%
  summarize(total_nb_occ = sum(nb_occ))

cumulated_all_nodes = all_nodes_agg %>% arrange(power) %>% mutate(
  cum_nb_occ = cumsum(total_nb_occ)
)

all_nodes_agg %>% ggplot(aes(x=power, y=total_nb_occ / nb_measures)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Node power (W)",
    y = "Proportion of measures at this value"
  )

cumulated_all_nodes %>% ggplot() +
  geom_step(aes(x=power, y=cum_nb_occ), show.legend = FALSE, direction = 'hv') +
  theme_bw() +
  labs(
    x = "Power (W)",
    y = "Number of occurrences"
  )

We can see that the high end of the power distribution has a long tail, which is expected as applications rarely use the maximum power of the nodes.

Let us visualize the distribution (via eCDF) of each node.

cumulated_data = data %>% group_by(node) %>% arrange(power) %>% mutate(
  cum_nb_occ = cumsum(nb_occ),
  node = as.factor(node)
)
cumulated_data %>% ggplot() +
  geom_step(aes(x=power, y=cum_nb_occ, colour=node), show.legend = FALSE, direction = 'hv') +
  scale_colour_manual(values=rep("#00000020", nb_nodes)) +
  theme_bw() +
  labs(
    x = "Power (W)",
    y = "Number of occurrences for each node"
  )
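The plot above shows cumulative counts rather than proportions, so nodes with more measures reach higher values on the y axis. If a normalized eCDF is preferred, each node's cumulative count can be divided by its total. Here is a minimal sketch on made-up data with the same columns as `data`; the `toy` and `toy_ecdf` names are ours.

```r
suppressMessages(library(dplyr))

# Toy aggregated data: one row per (node, power) with an occurrence count.
toy = tibble(
  node = c(1L, 1L, 1L, 2L, 2L),
  power = c(400L, 500L, 900L, 380L, 1000L),
  nb_occ = c(2L, 5L, 3L, 9L, 1L)
)

toy_ecdf = toy %>%
  arrange(power) %>%
  group_by(node) %>%
  # Proportion of the node's measures at or below each power value.
  mutate(ecdf = cumsum(nb_occ) / sum(nb_occ)) %>%
  ungroup()

print(toy_ecdf)
```

With this normalization every node's curve ends at 1, which makes nodes with different numbers of measures directly comparable.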

We can see that only a single node seems to have small power values. Let us look at the data directly.

knitr::kable(all_nodes_agg)
power total_nb_occ
0 28249
240 5
260 800
280 40395
300 187926
320 676913
340 3417625
360 12869948
380 35668867
400 46233144
420 35889046
440 26661988
460 24679706
480 34520539
500 49192042
520 56276899
540 56652677
560 52290739
580 42634373
600 38241782
620 36501465
640 39144887
660 39825940
680 39815332
700 36367055
720 32887778
740 28969042
760 23559668
780 20170272
800 17025598
820 16134383
840 15269585
860 14878933
880 14597411
900 14492918
920 15224539
940 15791759
960 17501427
980 18072944
1000 18105463
1020 17374155
1040 14793233
1060 13115983
1080 10619106
1100 9221044
1120 7870502
1140 6684057
1160 5844072
1180 5179276
1200 4719445
1220 4174020
1240 4049973
1260 3705400
1280 3346327
1300 3133928
1320 2797036
1340 2764806
1360 2454752
1380 2319723
1400 2167781
1420 1970040
1440 1772095
1460 1533820
1480 1286163
1500 1028745
1520 868331
1540 682438
1560 557477
1580 470627
1600 385155
1620 338039
1640 291116
1660 247254
1680 209168
1700 170660
1720 136078
1740 105877
1760 80391
1780 53869
1800 38514
1820 21960
1840 12190
1860 6273
1880 2701
1900 1350
1920 646
1940 364
1960 174
1980 119
2000 90
2020 33
2040 8
2060 7
2100 3

We can see that the power values are discrete, with a 20 W granularity. This is consistent with the ExaData documentation, which states that the values have been obtained via IPMI from a BMC. This measurement system is primarily meant for failure control and is not intended for high precision.
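The 20 W granularity can also be checked programmatically instead of by reading the table. The sketch below does so on made-up data with the same columns as `data`; the `toy`, `step_w`, `all_on_grid` and `smallest_gap` names are ours.

```r
# Toy aggregated data with the same columns as `data` (values are made up).
toy = data.frame(
  power = c(0L, 240L, 260L, 300L),
  node = c(155L, 1L, 1L, 2L),
  nb_occ = c(3L, 5L, 8L, 2L)
)

# Every power value should be a multiple of the assumed 20 W step.
step_w = 20L
all_on_grid = all(toy$power %% step_w == 0L)

# Smallest gap between two consecutive distinct power values.
smallest_gap = min(diff(sort(unique(toy$power))))

print(all_on_grid)
print(smallest_gap)
```

Replacing `toy` with `data` would run the same check on the real trace.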

We can also see that all the measurements below 240 W are at 0 W. Let us see which nodes these measurements come from.

nodes_with_0w_measures = data %>%
  filter(power == 0) %>%
  group_by(node) %>%
  summarize(total_nb_occ = sum(nb_occ))
knitr::kable(nodes_with_0w_measures)
node total_nb_occ
155 28249

All 0 W measures come from node 155! As the 0 W values are unexpected and all of them come from the same node, we have decided to filter out 0 W values for the rest of this analysis and power modeling.

data = data %>% filter(power > 0)
cumulated_data = data %>% group_by(node) %>% arrange(power) %>% mutate(
  cum_nb_occ = cumsum(nb_occ),
  node = as.factor(node)
)
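To get a sense of how little data this filter removes, the share of 0 W measures can be computed from the two counts reported earlier in this notebook. The variable names below are ours.

```r
nb_zero_measures = 28249        # 0 W measures, all from node 155
nb_measures_total = 1129106456  # total number of measures before filtering

removed_share = nb_zero_measures / nb_measures_total
print(sprintf("%.4f %% of the measures are removed", 100 * removed_share))
```

This is about 0.0025 % of all measures, so the filter should have a negligible effect on the overall distribution.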

Node power modeling

On the previous per-node eCDF power plot, we could see that most nodes have a wide range of power values but that some of them were idle most of the time. This is more clearly shown on the following figure (nodes that have a median power value lower than 450 W are classified as lazy).

nb_measures_per_node = data %>%
  group_by(node) %>%
  summarize(nb_node_measures = sum(nb_occ)) %>%
  mutate(
    node = as.factor(node)
  )

median_power_value_per_node = inner_join(cumulated_data, nb_measures_per_node, by="node") %>%
  filter(cum_nb_occ > nb_node_measures / 2) %>%
  group_by(node) %>% summarize(
    median_power_value = min(power)
  ) %>% mutate(
    lazy = median_power_value < 450
  )


inner_join(cumulated_data, median_power_value_per_node, by="node") %>%
  mutate(facet_label = sprintf("lazy node ? %d", lazy)) %>%
  ggplot() +
  geom_step(aes(x=power, y=cum_nb_occ, colour=node), show.legend = FALSE, direction = 'hv') +
  scale_colour_manual(values=rep("#00000020", nb_nodes)) +
  theme_bw() +
  facet_wrap(vars(facet_label), ncol=1) +
  labs(
    x = "Power (W)",
    y = "Number of occurrences for each node"
  )
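The median used for the lazy classification is computed from aggregated counts rather than from raw measures: for each node, it is the smallest power value whose cumulative count exceeds half of the node's measures. Here is a minimal sketch of that technique on made-up data, so that the result can be checked by hand; the `toy` and `toy_median` names are ours.

```r
suppressMessages(library(dplyr))

# Toy aggregated data: one row per (node, power) with an occurrence count.
toy = tibble(
  node = c(1L, 1L, 1L, 2L, 2L),
  power = c(400L, 500L, 900L, 380L, 1000L),
  nb_occ = c(2L, 5L, 3L, 9L, 1L)
)

toy_median = toy %>%
  arrange(power) %>%
  group_by(node) %>%
  mutate(
    cum_nb_occ = cumsum(nb_occ),
    nb_node_measures = sum(nb_occ)
  ) %>%
  # Keep the rows whose cumulative count exceeds half of the node's measures,
  # then take the smallest power value among them.
  filter(cum_nb_occ > nb_node_measures / 2) %>%
  summarize(median_power_value = min(power)) %>%
  mutate(lazy = median_power_value < 450)

print(toy_median)
```

Node 1 has 10 measures (2 at 400 W, 5 at 500 W, 3 at 900 W), so its median is 500 W and it is not lazy. Node 2 has 9 of its 10 measures at 380 W, so its median is 380 W and it is classified as lazy.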

SimGrid’s power model for computation hosts requires three power values: the minimum power of a powered-on node (this is typically a CPU sleep state), the power when a tiny amount of work is done, and the power when the node is at full capacity.
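To illustrate how three such values can define a power curve, here is a minimal sketch that assumes a linear interpolation between the tiny-work value and the full-capacity value. This is our assumption for illustration; the SimGrid documentation is the reference for the actual formula. The `node_power_w` helper and the numeric values are hypothetical and are not part of SimGrid or of this notebook.

```r
# Hypothetical helper (not part of SimGrid or of this notebook).
# load is the fraction of the node's computing capacity in use, in [0, 1].
node_power_w = function(load, idle_w, epsilon_w, full_w) {
  stopifnot(all(load >= 0 & load <= 1))
  # Idle value when nothing runs; otherwise linear between epsilon and full.
  ifelse(load == 0, idle_w, epsilon_w + load * (full_w - epsilon_w))
}

# Made-up values, for illustration only.
example_power = node_power_w(
  c(0, 0.5, 1),
  idle_w = 240, epsilon_w = 260, full_w = 2100
)
print(example_power)
```

Under this assumption, the idle value only applies to a node with no work at all, while any non-zero load lies on the line between the other two values.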

As we are not doing a controlled experiment but using existing traces with little information about the applications that ran, we propose to use the minimum and maximum values of each node to instantiate this model.

Here is an unbiased visualization of the minimum and maximum power values of all nodes: both axes include 0 W and the 2100 W maximum, and they use the same linear scale. Each node is a point.

minmax_per_node = data %>%
  group_by(node) %>%
  summarize(
    min_power = min(power),
    max_power = max(power)
  ) %>% mutate(
    node = as.factor(node)
  )

p = minmax_per_node %>%
  ggplot() +
  geom_jitter(aes(x=min_power, y=max_power)) +
  theme_bw() +
  labs(
    x = "Node minimum power value (W)",
    y = "Node maximum power value (W)"
  )
p +
  expand_limits(x=0, y=0) +
  expand_limits(x=2100, y=2100)

Here is a zoomed view.

p

Vertical bands are expected since the x axis range is small and the power values are discrete with a 20 W granularity. We can check whether the laziness of nodes impacts their minimum and maximum power values.

inner_join(minmax_per_node, median_power_value_per_node, by="node") %>%
  ggplot() +
  geom_jitter(aes(x=min_power, y=max_power, shape=lazy, color=lazy)) +
  scale_color_viridis(end=0.8, discrete=TRUE) +
  theme_bw() +
  labs(
    x = "Node minimum power value (W)",
    y = "Node maximum power value (W)"
  )

Conclusion. Yes, we can clearly see that the maximum power of lazy nodes is smaller than the maximum power of non-lazy nodes. The minimum power of both groups seems similar though.

There seem to be many more non-lazy nodes than lazy nodes. How many exactly?

knitr::kable(median_power_value_per_node %>%
  group_by(lazy) %>%
  summarize(
    nb_nodes_in_group = n()
  ) %>% mutate(
    nb_nodes_in_group_ratio = nb_nodes_in_group / nb_nodes
  )
)
lazy nb_nodes_in_group nb_nodes_in_group_ratio
FALSE 960 0.9795918
TRUE 20 0.0204082

As the number of lazy nodes is small (2 % of all nodes), and as remaining close to the minimum power value half of the time does not seem to be the normal behavior of HPC nodes, we have decided to filter out lazy nodes for the rest of this analysis and modeling.

Here is a non-zoomed view on the non-lazy nodes.

non_lazy_minmax_nodes = inner_join(minmax_per_node, median_power_value_per_node, by="node") %>%
  filter(!lazy)

non_lazy_minmax_nodes %>%
  ggplot() +
  geom_jitter(aes(x=min_power, y=max_power), size=1/16, alpha=1/4) +
  theme_bw() +
  labs(
    x = "Non-lazy node minimum power value (W)",
    y = "Non-lazy node maximum power value (W)"
  ) +
  expand_limits(x=0, y=0) +
  expand_limits(x=2100, y=2100)

Conclusions

We would have liked to run controlled applications on M100 nodes to obtain reliable values of the power consumption of each node.

Here, using only the ExaData M100 traces, we think that the minimum and maximum power values of non-lazy nodes can be used to generate the power model of nodes similar to M100 nodes. However, these values cannot be used safely, since we cannot be sure that the maximum value came from the same application execution. We also realized that the “Usage trace replay” Batsim profile type introduced in Batsim-4.2.0, which we planned to use, had poor performance on large platforms such as Marconi100. These two limitations led us to simply replay the power traces of each job a posteriori, instead of using SimGrid to compute power values during the simulation.

The following code produces the power model file needed by the script that generates the SimGrid platform.

write_csv(minmax_per_node, params$output_power_model_file)