Marconi100 node power consumption analysis and modeling
Introduction
This notebook analyzes the power consumption of the Marconi100 nodes during 2022 (from 2022-01 to 2022-09), as available in the M100 ExaData trace. It is part of the work conducted for the article “Light-weight prediction for improving energy consumption in HPC platforms”, published at Euro-Par 2024. For the full context of this work, please refer to the article preprint, which is available on [hal long-term open-access link].
The goal of this notebook is to model how the nodes behave in terms of power consumption.
Read the aggregated data
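The data-loading code is not shown in this extract. Here is a minimal sketch of what it could look like, assuming the aggregated trace is stored in a hypothetical CSV file (`m100_aggregated_power.csv`) with one row per (node, power value) pair and its number of occurrences:

library(tidyverse)  # dplyr, ggplot2 and readr pipelines are used throughout this notebook
library(viridis)    # scale_color_viridis is used below

# Hypothetical file name: one row per (node, power) pair, with nb_occ the
# number of times this power value was measured on this node in the trace.
data = read_csv("m100_aggregated_power.csv")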
First look, consistency check, filtering
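The summary table below was presumably produced with R's built-in `summary()`:

summary(data)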
power | node | nb_occ |
---|---|---|
Min. : 0 | Min. : 0.0 | Min. : 1 |
1st Qu.: 700 | 1st Qu.:248.0 | 1st Qu.: 704 |
Median :1080 | Median :493.0 | Median : 5859 |
Mean :1086 | Mean :492.6 | Mean : 15073 |
3rd Qu.:1460 | 3rd Qu.:738.0 | 3rd Qu.: 19952 |
Max. :2100 | Max. :979.0 | Max. :1105690 |
nb_nodes = nrow(data %>% select(node) %>% unique())
nb_measures = sum(data$nb_occ)
sprintf("There are %d nodes and %d measures", nb_nodes, nb_measures)
## [1] "There are 980 nodes and 1129106456 measures"
There are 980 nodes, which is consistent with the Marconi100 platform description. Each Marconi100 node is an IBM Power AC922 system with 32 IBM POWER9 cores @3.1 GHz and 4 NVIDIA Volta V100 GPUs. NVIDIA's specifications state that the maximum power consumption of each GPU is 250-300 W, so the four GPUs alone can draw up to about 1200 W. The 2100 W maximum power value therefore seems reasonable to us for such compute nodes.
However, the minimum power value of 0 W is unexpected, as a fully idle node usually consumes more than 0 W.
Let us visualize the distribution of the power measures (regardless of nodes).
all_nodes_agg = data %>%
group_by(power) %>%
summarize(total_nb_occ = sum(nb_occ))
cumulated_all_nodes = all_nodes_agg %>% arrange(power) %>% mutate(
cum_nb_occ = cumsum(total_nb_occ)
)
all_nodes_agg %>% ggplot(aes(x=power, y=total_nb_occ / nb_measures)) +
geom_bar(stat = "identity") +
theme_bw() +
scale_y_continuous(labels = scales::percent) +
labs(
x = "Node power (W)",
y = "Proportion of measures at this value"
)
cumulated_all_nodes %>% ggplot() +
geom_step(aes(x=power, y=cum_nb_occ), show.legend = FALSE, direction = 'hv') +
theme_bw() +
labs(
x = "Power (W)",
y = "Number of occurrences"
)
We can see that the distribution has a long tail towards high power values, which is expected as applications rarely use the maximum power of the nodes.
Let us visualize the distribution (via eCDF) of each node.
# Per-node cumulative number of occurrences, in increasing power order
cumulated_data = data %>% group_by(node) %>% arrange(power) %>% mutate(
  cum_nb_occ = cumsum(nb_occ),
  node = as.factor(node)
)
cumulated_data %>% ggplot() +
geom_step(aes(x=power, y=cum_nb_occ, colour=node), show.legend = FALSE, direction = 'hv') +
scale_colour_manual(values=rep("#00000020", nb_nodes)) +
theme_bw() +
labs(
x = "Power (W)",
y = "Number of occurrences for each node"
)
We can see that only a single node seems to have small power values. Let us take a look at the data directly.
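The table below was presumably produced from the aggregated distribution computed earlier, with something like:

knitr::kable(all_nodes_agg %>% arrange(power))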
power | total_nb_occ |
---|---|
0 | 28249 |
240 | 5 |
260 | 800 |
280 | 40395 |
300 | 187926 |
320 | 676913 |
340 | 3417625 |
360 | 12869948 |
380 | 35668867 |
400 | 46233144 |
420 | 35889046 |
440 | 26661988 |
460 | 24679706 |
480 | 34520539 |
500 | 49192042 |
520 | 56276899 |
540 | 56652677 |
560 | 52290739 |
580 | 42634373 |
600 | 38241782 |
620 | 36501465 |
640 | 39144887 |
660 | 39825940 |
680 | 39815332 |
700 | 36367055 |
720 | 32887778 |
740 | 28969042 |
760 | 23559668 |
780 | 20170272 |
800 | 17025598 |
820 | 16134383 |
840 | 15269585 |
860 | 14878933 |
880 | 14597411 |
900 | 14492918 |
920 | 15224539 |
940 | 15791759 |
960 | 17501427 |
980 | 18072944 |
1000 | 18105463 |
1020 | 17374155 |
1040 | 14793233 |
1060 | 13115983 |
1080 | 10619106 |
1100 | 9221044 |
1120 | 7870502 |
1140 | 6684057 |
1160 | 5844072 |
1180 | 5179276 |
1200 | 4719445 |
1220 | 4174020 |
1240 | 4049973 |
1260 | 3705400 |
1280 | 3346327 |
1300 | 3133928 |
1320 | 2797036 |
1340 | 2764806 |
1360 | 2454752 |
1380 | 2319723 |
1400 | 2167781 |
1420 | 1970040 |
1440 | 1772095 |
1460 | 1533820 |
1480 | 1286163 |
1500 | 1028745 |
1520 | 868331 |
1540 | 682438 |
1560 | 557477 |
1580 | 470627 |
1600 | 385155 |
1620 | 338039 |
1640 | 291116 |
1660 | 247254 |
1680 | 209168 |
1700 | 170660 |
1720 | 136078 |
1740 | 105877 |
1760 | 80391 |
1780 | 53869 |
1800 | 38514 |
1820 | 21960 |
1840 | 12190 |
1860 | 6273 |
1880 | 2701 |
1900 | 1350 |
1920 | 646 |
1940 | 364 |
1960 | 174 |
1980 | 119 |
2000 | 90 |
2020 | 33 |
2040 | 8 |
2060 | 7 |
2100 | 3 |
We can see that the power values are discrete with a 20 W precision. This is consistent with the ExaData documentation, which states that the values were obtained via IPMI from the nodes' BMCs. This measurement system is primarily meant for hardware monitoring and failure control, not for high-precision power measurement.
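This 20 W granularity can be checked programmatically (a quick sanity check, not part of the original analysis):

# All distinct power values should be multiples of 20 W.
stopifnot(all(unique(data$power) %% 20 == 0))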
We can also see that all the measurements below 240 W are at 0 W. Let us see which nodes these measurements come from.
nodes_with_0w_measures = data %>%
filter(power == 0) %>%
group_by(node) %>%
summarize(total_nb_occ = sum(nb_occ))
knitr::kable(nodes_with_0w_measures)
node | total_nb_occ |
---|---|
155 | 28249 |
All the 0 W measures come from node 155! As 0 W values are unexpected and as they all come from the same node, we decided to filter out the 0 W values for the rest of this analysis and power modeling.
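The filtering itself boils down to a one-liner, sketched here:

# Drop the 0 W measures (they all come from node 155).
data = data %>% filter(power > 0)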
Node power modeling
On the previous per-node eCDF power plot, we could see that most nodes cover a wide range of power values, but that some of them were idle most of the time. This is shown more clearly in the following figure (nodes whose median power value is lower than 450 W are classified as lazy).
# Total number of measures per node
nb_measures_per_node = data %>%
  group_by(node) %>%
  summarize(nb_node_measures = sum(nb_occ)) %>%
  mutate(
    node = as.factor(node)
  )

# Weighted median per node: the smallest power value whose cumulative number
# of occurrences exceeds half of the node's measures
median_power_value_per_node = inner_join(cumulated_data, nb_measures_per_node, by="node") %>%
  filter(cum_nb_occ > nb_node_measures / 2) %>%
  group_by(node) %>% summarize(
    median_power_value = min(power)
  ) %>% mutate(
    lazy = median_power_value < 450
  )
inner_join(cumulated_data, median_power_value_per_node, by="node") %>%
  mutate(facet_label = sprintf("lazy node? %s", lazy)) %>%
ggplot() +
geom_step(aes(x=power, y=cum_nb_occ, colour=node), show.legend = FALSE, direction = 'hv') +
scale_colour_manual(values=rep("#00000020", nb_nodes)) +
theme_bw() +
facet_wrap(vars(facet_label), ncol=1) +
labs(
x = "Power (W)",
y = "Number of occurrences for each node"
)
SimGrid’s host power model requires 3 power values: the minimum power of a powered-on node (typically a CPU sleep state), the power when a tiny (epsilon) amount of work is done, and the power when the node runs at full capacity.
As we are not running a controlled experiment but using existing traces with little information about the applications that ran, we propose to use the minimum and maximum power values of each node to instantiate this model.
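As an illustration, here is a hypothetical mapping (not taken verbatim from the article) from per-node extrema to SimGrid's 3-value `wattage_per_state` host property, assuming the observed minimum is used for both the sleep and epsilon states:

# Hypothetical example values; the real per-node extrema are computed below.
simgrid_example = tibble(
  node = c(0L, 1L),
  min_power = c(240L, 260L),   # W, observed minimum
  max_power = c(1900L, 2100L)  # W, observed maximum
) %>% mutate(
  # SimGrid format: "Idle:Epsilon:AllCores" (Idle == Epsilon is our assumption)
  wattage_per_state = sprintf("%d:%d:%d", min_power, min_power, max_power)
)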
Here is an unbiased visualization (the minimum and maximum possible values lie within the plot ranges, and both axes use the same linear scale) of the minimum and maximum power values of all nodes. Each node is a point.
minmax_per_node = data %>%
group_by(node) %>%
summarize(
min_power = min(power),
max_power = max(power)
) %>% mutate(
node = as.factor(node)
)
p = minmax_per_node %>%
ggplot() +
geom_jitter(aes(x=min_power, y=max_power)) +
theme_bw() +
labs(
x = "Node minimum power value (W)",
y = "Node maximum power value (W)"
)
p +
expand_limits(x=0, y=0) +
expand_limits(x=2100, y=2100)
Here is a zoomed view.
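Since `p` was built without expanded limits, the zoomed view is presumably just `p` rendered with its default, data-driven axis ranges:

p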
Vertical bands are expected, since the x-axis range is small and the power values are discrete with a 20 W precision. We can check whether the laziness of nodes impacts their minimum and maximum power values.
inner_join(minmax_per_node, median_power_value_per_node, by="node") %>%
ggplot() +
geom_jitter(aes(x=min_power, y=max_power, shape=lazy, color=lazy)) +
scale_color_viridis(end=0.8, discrete=TRUE) +
theme_bw() +
labs(
x = "Node minimum power value (W)",
y = "Node maximum power value (W)"
)
Conclusion: yes, we can clearly see that the maximum power of lazy nodes is smaller than the maximum power of non-lazy nodes. The minimum power of both groups seems similar, though.
There seem to be many more non-lazy nodes than lazy ones. How many exactly?
knitr::kable(median_power_value_per_node %>%
group_by(lazy) %>%
summarize(
nb_nodes_in_group = n()
) %>% mutate(
nb_nodes_in_group_ratio = nb_nodes_in_group / nb_nodes
)
)
lazy | nb_nodes_in_group | nb_nodes_in_group_ratio |
---|---|---|
FALSE | 960 | 0.9795918 |
TRUE | 20 | 0.0204082 |
As the number of lazy nodes is small (2 % of all nodes), and as remaining close to the minimum power value half of the time does not seem to be the normal behavior of HPC nodes, we decided to filter out lazy nodes for the rest of this analysis and modeling.
Here is a non-zoomed view of the non-lazy nodes.
non_lazy_minmax_nodes = inner_join(minmax_per_node, median_power_value_per_node, by="node") %>%
filter(!lazy)
non_lazy_minmax_nodes %>%
ggplot() +
geom_jitter(aes(x=min_power, y=max_power), size=1/16, alpha=1/4) +
theme_bw() +
labs(
x = "Non-lazy node minimum power value (W)",
y = "Non-lazy node maximum power value (W)"
) +
expand_limits(x=0, y=0) +
expand_limits(x=2100, y=2100)
Conclusions
We would have liked to be able to run controlled applications on the M100 nodes, to obtain reliable power consumption values for each node.
Here, using only the ExaData M100 traces, we think that the minimum and maximum power values of non-lazy nodes can be used to generate a power model for nodes similar to M100's. However, these values cannot be used safely, since we cannot be sure that the maximum value came from a single application execution. This limitation, in addition to the poor performance on large platforms such as Marconi100 of the “Usage trace replay” Batsim profile type (introduced in Batsim-4.2.0) that we had planned to use, led us to simply replay the power traces of each job a posteriori, instead of using SimGrid to compute power values during the simulation.
The following code produces the power model file needed by the script that generates the SimGrid platform.
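That code is not included in this extract; here is a minimal sketch of what it could look like (the exact format expected by the platform-generation script is an assumption here):

# One row per non-lazy node with its observed power extrema.
non_lazy_minmax_nodes %>%
  select(node, min_power, max_power) %>%
  write_csv("m100_power_model.csv")  # hypothetical output path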