Marconi100 node power consumption analysis and modeling

Introduction

This notebook analyzes the power consumption of the Marconi100 nodes during 2022 (from 2022-01 to 2022-09), as available in the M100 ExaData trace. It is part of the work conducted for the article “Light-weight prediction for improving energy consumption in HPC platforms”, published at Euro-Par 2024. For the full context of this work, please refer to the article preprint, which is available on [hal long-term open-access link].

The goal of this notebook is to model how the nodes behave in terms of power consumption.

Read the aggregated data

set.seed(1)
suppressMessages(library(tidyverse))
suppressMessages(library(viridis))
library(knitr)

data = read_csv(params$m100_node_power_aggregation, show_col_types = FALSE) %>% transmute(
  power = as.integer(power),
  node = as.integer(node),
  nb_occ = as.integer(nbocc)
)

First look, consistency check, filtering

knitr::kable(summary(data))

power	node	nb_occ
Min. : 0	Min. : 0.0	Min. : 1
1st Qu.: 700	1st Qu.:248.0	1st Qu.: 704
Median :1080	Median :493.0	Median : 5859
Mean :1086	Mean :492.6	Mean : 15073
3rd Qu.:1460	3rd Qu.:738.0	3rd Qu.: 19952
Max. :2100	Max. :979.0	Max. :1105690

nb_nodes = nrow(data %>% select(node) %>% unique())
nb_measures = sum(data$nb_occ)
sprintf("There are %d nodes and %d measures", nb_nodes, nb_measures)

## [1] "There are 980 nodes and 1129106456 measures"
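
The total number of measures fits in R's 32-bit integer type, but it is already about half of the largest representable value. Since `nb_occ` is cast with `as.integer`, `sum` works on integers and returns `NA` (with a warning) as soon as the result exceeds `.Machine$integer.max`. Here is a minimal sketch of the issue and of a defensive variant, using made-up counts rather than the trace.

```r
# Hypothetical counts, chosen so that their sum exceeds the integer range.
counts <- c(.Machine$integer.max, 1L)

# Integer summation overflows: R returns NA and emits a warning.
overflowed <- suppressWarnings(sum(counts))

# Summing as double avoids the overflow.
safe_total <- sum(as.numeric(counts))

# The total reported above still fits in an integer.
fits <- 1129106456 <= .Machine$integer.max
```

With a longer trace, summing `as.numeric(data$nb_occ)` would be the safer choice.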

There are 980 nodes, which is consistent with the Marconi100 platform description. Every Marconi100 node should comprise 1 IBM Power9 AC922 @3.1GHz 32 cores CPU and 4x NVIDIA Volta V100 GPUs. NVIDIA specifications state that the maximum power consumption of each GPU should be 250-300 W. The 2100 W maximum power value therefore seems reasonable to us for such computing nodes.
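
A quick back-of-the-envelope check of this claim, using only the figures quoted above. How the remainder is split between the CPU, memory and other components cannot be known from the trace, so the sketch only computes the GPU share.

```r
# Figures taken from the text above.
gpus_per_node <- 4
gpu_max_power_w <- c(low = 250, high = 300)  # per-GPU maximum, in W
node_max_power_w <- 2100                     # largest value in the trace

# Power that the four GPUs alone can draw at their rated maximum.
gpu_budget_w <- gpus_per_node * gpu_max_power_w

# What is left for the CPU, memory, storage, fans, etc.
remaining_w <- node_max_power_w - gpu_budget_w
```

The GPUs account for 1000 to 1200 W, which leaves 900 to 1100 W for the rest of the node.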

However, the minimum power value of 0 W is unexpected, as a fully idle node usually consumes more than 0 W.

Let us visualize the distribution of the power measures (regardless of nodes).

all_nodes_agg = data %>%
  group_by(power) %>%
  summarize(total_nb_occ = sum(nb_occ))

cumulated_all_nodes = all_nodes_agg %>% arrange(power) %>% mutate(
  cum_nb_occ = cumsum(total_nb_occ)
)

all_nodes_agg %>% ggplot(aes(x=power, y=total_nb_occ / nb_measures)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Node power (W)",
    y = "Proportion of measures at this value"
  )

cumulated_all_nodes %>% ggplot() +
  geom_step(aes(x=power, y=cum_nb_occ), show.legend = FALSE, direction = 'hv') +
  theme_bw() +
  labs(
    x = "Power (W)",
    y = "Number of occurrences"
  )

We can see that the high end of the power distribution is long-tailed, which is expected as applications rarely use the maximum power of the nodes.

Let us visualize the distribution (via eCDF) of each node.

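# Per-node cumulative counts, by increasing power value.
cumulated_data = data %>%
  arrange(power) %>%
  group_by(node) %>%
  mutate(cum_nb_occ = cumsum(nb_occ)) %>%
  ungroup() %>%
  mutate(node = as.factor(node))  # factor, so that each node gets its own colour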
cumulated_data %>% ggplot() +
  geom_step(aes(x=power, y=cum_nb_occ, colour=node), show.legend = FALSE, direction = 'hv') +
  scale_colour_manual(values=rep("#00000020", nb_nodes)) +
  theme_bw() +
  labs(
    x = "Power (W)",
    y = "Number of occurrences for each node"
  )
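
The plot above shows cumulative counts, so nodes with more measures reach higher values. Dividing each curve by the node's total number of measures yields a proper eCDF in [0, 1]. Here is a minimal base R sketch on a made-up node (the `toy` data frame is hypothetical and only mimics the `power` and `nb_occ` columns).

```r
# Toy aggregated data for one hypothetical node: power values and their counts.
toy <- data.frame(power = c(440, 400, 460, 420),
                  nb_occ = c(40, 10, 20, 30))

# Sort by power, then accumulate the counts (as in the plot above).
toy <- toy[order(toy$power), ]
toy$cum_nb_occ <- cumsum(toy$nb_occ)

# Dividing by the node's total number of measures gives an eCDF in [0, 1].
toy$ecdf <- toy$cum_nb_occ / sum(toy$nb_occ)
```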

We can see that only a single node seems to have small power values. Let us take a look at the data directly.

knitr::kable(all_nodes_agg)

power	total_nb_occ
0	28249
240	5
260	800
280	40395
300	187926
320	676913
340	3417625
360	12869948
380	35668867
400	46233144
420	35889046
440	26661988
460	24679706
480	34520539
500	49192042
520	56276899
540	56652677
560	52290739
580	42634373
600	38241782
620	36501465
640	39144887
660	39825940
680	39815332
700	36367055
720	32887778
740	28969042
760	23559668
780	20170272
800	17025598
820	16134383
840	15269585
860	14878933
880	14597411
900	14492918
920	15224539
940	15791759
960	17501427
980	18072944
1000	18105463
1020	17374155
1040	14793233
1060	13115983
1080	10619106
1100	9221044
1120	7870502
1140	6684057
1160	5844072
1180	5179276
1200	4719445
1220	4174020
1240	4049973
1260	3705400
1280	3346327
1300	3133928
1320	2797036
1340	2764806
1360	2454752
1380	2319723
1400	2167781
1420	1970040
1440	1772095
1460	1533820
1480	1286163
1500	1028745
1520	868331
1540	682438
1560	557477
1580	470627
1600	385155
1620	338039
1640	291116
1660	247254
1680	209168
1700	170660
1720	136078
1740	105877
1760	80391
1780	53869
1800	38514
1820	21960
1840	12190
1860	6273
1880	2701
1900	1350
1920	646
1940	364
1960	174
1980	119
2000	90
2020	33
2040	8
2060	7
2100	3

We can see that the power values are discrete with a 20 W precision. This is consistent with the ExaData documentation, which states that values have been obtained via IPMI from a BMC. This measurement system is mostly meant for failure control and is not intended for high precision.
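
The 20 W granularity can also be checked programmatically. In the minimal sketch below, `observed` reproduces the power values listed in the table above; in the notebook it could be taken from `all_nodes_agg$power` instead. The other names are only illustrative.

```r
# Power values listed in the table above: 0, then 240 to 2060 by 20, then 2100.
observed <- c(0, seq(240, 2060, by = 20), 2100)

# Every observed value is a multiple of 20 W.
all_multiples_of_20 <- all(observed %% 20 == 0)

# Levels of the 20 W grid that never appear in the trace.
expected <- seq(min(observed), max(observed), by = 20)
missing_levels <- setdiff(expected, observed)
```

Here `missing_levels` contains the values from 20 to 220 W, as well as 2080 W.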

We can also see that all the measurements below 240 W are at 0 W. Let us see which nodes these measurements come from.

nodes_with_0w_measures = data %>%
  filter(power == 0) %>%
  group_by(node) %>%
  summarize(total_nb_occ = sum(nb_occ))
knitr::kable(nodes_with_0w_measures)

node	total_nb_occ
155	28249

All 0 W measures come from node 155! As the 0 W values are unexpected and as they all come from the same node, we have decided to filter out 0 W values for the rest of this analysis and power modeling.

data = data %>% filter(power > 0)
# Recompute the per-node cumulative counts on the filtered data.
cumulated_data = data %>%
  arrange(power) %>%
  group_by(node) %>%
  mutate(cum_nb_occ = cumsum(nb_occ)) %>%
  ungroup() %>%
  mutate(node = as.factor(node))

Node power modeling

On the previous per-node eCDF power plot, we could see that most nodes have a wide range of power values but that some of them were idle most of the time. This is more clearly shown on the following figure (nodes that have a median power value lower than 450 W are classified as lazy).

nb_measures_per_node = data %>%
  group_by(node) %>%
  summarize(nb_node_measures = sum(nb_occ)) %>%
  mutate(
    node = as.factor(node)
  )

median_power_value_per_node = inner_join(cumulated_data, nb_measures_per_node, by="node") %>%
  filter(cum_nb_occ > nb_node_measures / 2) %>%
  group_by(node) %>% summarize(
    median_power_value = min(power)
  ) %>% mutate(
    lazy = median_power_value < 450
  )


inner_join(cumulated_data, median_power_value_per_node, by="node") %>%
  mutate(facet_label = sprintf("lazy node ? %d", lazy)) %>%
  ggplot() +
  geom_step(aes(x=power, y=cum_nb_occ, colour=node), show.legend = FALSE, direction = 'hv') +
  scale_colour_manual(values=rep("#00000020", nb_nodes)) +
  theme_bw() +
  facet_wrap(vars(facet_label), ncol=1) +
  labs(
    x = "Power (W)",
    y = "Number of occurrences for each node"
  )
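
The per-node median above is computed from aggregated counts rather than from raw measures: it is the smallest power value whose cumulative count exceeds half of the node's measures. Here is a minimal base R sketch of that rule on two made-up nodes (the `weighted_median` helper and its inputs are hypothetical, not part of the notebook).

```r
# Median of a distribution given as (value, count) pairs: the smallest value
# whose cumulative count exceeds half of the total count.
weighted_median <- function(power, nb_occ) {
  ord <- order(power)
  cum <- cumsum(nb_occ[ord])
  power[ord][which(cum > sum(nb_occ) / 2)[1]]
}

# Hypothetical node that stays near its minimum power most of the time.
idle_node <- weighted_median(c(400, 420, 900), c(60, 30, 10))
# Hypothetical node that is loaded most of the time.
busy_node <- weighted_median(c(400, 700, 900), c(10, 50, 40))

# Same 450 W threshold as in the notebook.
lazy <- c(idle = idle_node, busy = busy_node) < 450
```

The first node has a median of 400 W and is classified as lazy; the second has a median of 700 W and is not.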

SimGrid’s power model for computation hosts requires three power values: the minimum power of a powered-on node (typically a CPU sleep state), the power when a tiny amount of work is done, and the power when the node is at full capacity.
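
To give an intuition of how three such values can define a power model, here is a minimal sketch. It assumes a linear interpolation between the second and third values as the load grows, which is a simplification for illustration and should be checked against the SimGrid documentation. The `node_power` function, its parameter names and the numeric values are all hypothetical.

```r
# Hypothetical helper: power drawn by a node for a CPU load in [0, 1].
# Assumption: p_idle when the node does nothing, and a linear interpolation
# between p_epsilon and p_full when it is working.
node_power <- function(load, p_idle, p_epsilon, p_full) {
  stopifnot(all(load >= 0 & load <= 1))
  ifelse(load == 0, p_idle, p_epsilon + load * (p_full - p_epsilon))
}

# Illustrative values only.
example <- node_power(c(0, 0.5, 1), p_idle = 300, p_epsilon = 320, p_full = 2000)
```

With these values, the node draws 300 W when idle, 1160 W at half load and 2000 W at full load.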

As we are not doing a controlled experiment but using existing traces with little information about the applications that ran, we propose to use the minimum and maximum values of each node to instantiate this model.

Here is an unbiased visualization (the origin and the maximum value are included on both axes, which share the same linear scale) of the minimum and maximum power values of all nodes. Each node is a point.

minmax_per_node = data %>%
  group_by(node) %>%
  summarize(
    min_power = min(power),
    max_power = max(power)
  ) %>% mutate(
    node = as.factor(node)
  )

p = minmax_per_node %>%
  ggplot() +
  geom_jitter(aes(x=min_power, y=max_power)) +
  theme_bw() +
  labs(
    x = "Node minimum power value (W)",
    y = "Node maximum power value (W)"
  )
p +
  expand_limits(x=0, y=0) +
  expand_limits(x=2100, y=2100)

Here is a zoomed view (axes limited to the observed range of values).

Vertical bands are expected since the x axis range is small and the power values are discrete with a 20 W precision. We can check whether the laziness of nodes impacts their minimum and maximum power values.

inner_join(minmax_per_node, median_power_value_per_node, by="node") %>%
  ggplot() +
  geom_jitter(aes(x=min_power, y=max_power, shape=lazy, color=lazy)) +
  scale_color_viridis(end=0.8, discrete=TRUE) +
  theme_bw() +
  labs(
    x = "Node minimum power value (W)",
    y = "Node maximum power value (W)"
  )

Conclusion. Yes, we can clearly see that the maximum power of lazy nodes is smaller than the maximum power of non-lazy nodes. The minimum power of both groups seems similar though.

There seem to be many more non-lazy nodes than lazy nodes. How many exactly?

knitr::kable(median_power_value_per_node %>%
  group_by(lazy) %>%
  summarize(
    nb_nodes_in_group = n()
  ) %>% mutate(
    nb_nodes_in_group_ratio = nb_nodes_in_group / nb_nodes
  )
)

lazy	nb_nodes_in_group	nb_nodes_in_group_ratio
FALSE	960	0.9795918
TRUE	20	0.0204082

As the number of lazy nodes is small (2 % of all nodes), and as remaining close to the minimum power value half of the time does not seem to be the normal behavior of HPC nodes, we have decided to filter out lazy nodes for the rest of this analysis and modeling.

Here is a non-zoomed view of the non-lazy nodes.

non_lazy_minmax_nodes = inner_join(minmax_per_node, median_power_value_per_node, by="node") %>%
  filter(!lazy)

non_lazy_minmax_nodes %>%
  ggplot() +
  geom_jitter(aes(x=min_power, y=max_power), size=1/16, alpha=1/4) +
  theme_bw() +
  labs(
    x = "Non-lazy node minimum power value (W)",
    y = "Non-lazy node maximum power value (W)"
  ) +
  expand_limits(x=0, y=0) +
  expand_limits(x=2100, y=2100)

Conclusions

We would have liked to run controlled applications on M100 nodes to obtain reliable values of the power consumption of each node.

Here, using only the ExaData M100 traces, we think that the minimum and maximum power values of non-lazy nodes can be used to generate the power model of nodes similar to M100 nodes. However, these values cannot be used safely since we cannot be sure that the maximum value came from the same application execution. In addition to this limitation, we realized that the “Usage trace replay” Batsim profile type introduced in Batsim-4.2.0, which we planned to use, had poor performance on large platforms such as Marconi100. Both issues led us to simply replay the power traces of each job a posteriori, instead of using SimGrid to compute power values during the simulation.

The following code produces the power model file needed by the script that generates the SimGrid platform.

write_csv(minmax_per_node, params$output_power_model_file)