Exascale potholes for HPC: Execution performance and variability analysis of the flagship application code HemeLB

Performance measurement and analysis of parallel applications is often challenging, despite many excellent commercial and open-source tools being available. Currently envisaged exascale computer systems exacerbate matters by requiring extremely high scalability to effectively exploit millions of processor cores. Unfortunately, significant application execution performance variability arising from increasingly complex interactions between hardware and system software makes this situation much more difficult for application developers and performance analysts alike. This work considers the performance assessment of the HemeLB exascale flagship application code from the EU HPC Centre of Excellence (CoE) for Computational Biomedicine (CompBioMed) running on the SuperMUC-NG Tier-0 leadership system, using the methodology of the Performance Optimisation and Productivity (POP) CoE. Although 80% scaling efficiency is maintained to over 100,000 MPI processes, disappointing initial performance with more processes and corresponding poor strong scaling was identified to originate from the same few compute nodes in multiple runs, which later system diagnostic checks found had faulty DIMMs and lacklustre performance. Excluding these compute nodes from subsequent runs improved performance of executions with over 300,000 MPI processes by a factor of five, resulting in 190× speed-up compared to 864 MPI processes. While communication efficiency remains very good up to the largest scale, parallel efficiency is primarily limited by load balance, found to be largely due to core-to-core and run-to-run variability from excessive stalls for memory accesses, which affect many HPC systems with Intel Xeon Scalable processors. The POP methodology for this performance diagnosis is demonstrated via a detailed exposition with widely deployed ‘standard’ measurement and analysis tools.


I. INTRODUCTION
Current HPC computer systems have many compute nodes comprising multi-socket multi-core hyperthreaded vectorised turbo-boosted power-throttled CPU processors often with attached accelerators (GPUs and sometimes FPGAs), with multiple levels of caches for accesses to local and NUMA memory of limited size, connected via hierarchical networks with multiple switches. Whereas homogeneous systems were once common, as heterogeneity is incorporated in all aspects to improve energy efficiency, they are now the exception for the largest (and exascale-candidate) systems [1]. Highly scalable HPC applications running on these systems, however, are mostly still programmed with explicit message-passing using MPI between compute nodes (or sockets) and multithreading using OpenMP within shared-memory domains [2], while exploitation of accelerators covers a wide gamut from standardised OpenMP target offload down to GPU-specific languages such as CUDA and ROCm [3].
Application file I/O is a common scalability bottleneck, while shared parallel filesystems such as GPFS and Lustre typically introduce substantial execution time variability between and within compute jobs. Other major sources of variability arise from system daemon processes that periodically take CPU resources [4], as well as turbo-boost and power throttling that dynamically adjust processing speed of CPUs. All of these sources of run-to-run, node-to-node, core-to-core and iteration-to-iteration execution variability typically don't affect correctness 1, but can have substantial impact on performance. Since the variability is so complex, AI data analytics techniques are proposed to facilitate their analysis [5].
The challenges for application developers presented by this relentlessly increasing complexity of HPC computer systems (and the applications themselves) are troublesome, as efficient scalable codes are predicated on a clear understanding of current and future application requirements (including hardware and software roadmaps) and often take large teams many years. Many nations have recognised this and provide targeted funding to assist: for application developers in Europe, in addition to PRACE/SHAPE 2 (specifically supporting SMEs to parallelise their existing application codes) and related programmes, Horizon2020/EuroHPC funds a variety of HPC Centres of Excellence (CoEs) 3.
Most of the HPC CoEs address specific application domains; however, the transversal Performance Optimisation and Productivity (POP) CoE 4 supports all fields of HPC, with emphasis on supporting the HPC CoEs as well as European industry (and particularly SMEs since they tend not to have the necessary resources and skills in house). POP provides training in parallel performance analysis, using the partners' own open-source tools as well as third-party tools, in addition to parallel performance assessment and proof-of-concept prototype services which are free-of-charge for application developers from European institutions. This paper examines a POP performance assessment of an HPC CoE exascale flagship application code on one of the European Tier-0 leadership HPC systems at its full scale. The basic POP analysis methodology is demonstrated with the "pretty standard" Scalasca toolset, which identifies compute nodes with severely degraded performance that need to be avoided and then investigates performance variability in widely-deployed commodity Intel Xeon Scalable processors. This variability is subsequently found to substantially impact the achieved performance of this code and several others on many top-tier computer systems, motivating the need for widespread adoption of such methodology and tools to identify and ultimately circumvent these issues.

A. Subject application HemeLB
The open-source HemeLB software 5 developed by University College London and others within the EU HPC CoE for Computational Biomedicine (CompBioMed 6) is their flagship solver for high-performance parallel lattice-Boltzmann simulations of large-scale three-dimensional haemodynamic flow in vascular geometries. It supports a range of collision kernels and boundary conditions, and is optimised for sparse, patient-specific geometries. HemeLB has traditionally been used to model cerebral blood flow, and is now being applied to simulating the fully-coupled human arterial and venous trees with high fidelity [6].
HemeLB has been previously demonstrated to scale exceptionally well up to 100,000 cores [7]. Furthermore, the code was recently found to scale with 80% efficiency on 288,000 AMD 6276 Bulldozer-based Interlagos processor cores of 18,000 Cray XE nodes of NCSA Blue Waters 7 . More recent development focusses on SuperMUC-NG which is based on newer processors with more cores.

B. Execution environment
The Lenovo ThinkSystem SD650 supercomputer SuperMUC-NG 8 at Leibniz-Rechenzentrum (LRZ, Germany) comprises 6480 compute nodes with dual 24-core Intel Xeon Platinum 8174 @ 3.10 GHz ('Skylake') processors [8]. 144 'fat' compute nodes each have 768 GB memory, compared to only 96 GB memory for the remaining 6336 'thin' compute nodes bundled into eight domains (known as 'islands'). The internal interconnect is an Intel OmniPath network, with a fat-tree topology within islands and 1:4 pruned connection between islands. A high-performance 50 PB parallel filesystem is provided by IBM Spectrum Scale (GPFS), with SUSE Linux Enterprise Server (SLES) 12 SP3 operating system.
HemeLB was built with Intel 19.0.4.243 compilers and MPI library. It was configured to use MPI-3 shared-memory windows within each compute node to reduce memory requirements when loading the initial lattice data. For scalability testing, a cerebral arterial circle of Willis geometry dataset of 21.15 GiB was used (corresponding to a lattice spacing of approximately 6.4 microns) [9]. This comprises 1,138,236,832 blocks (1376×1087×761), of which 20,740,240 are non-empty, and there are a total of 10,154,448,502 lattice sites. After reading and distributing this dataset, the time to simulate blood flow for 5000 lattice time steps (without writing intermediate or final state) was recorded for this strong scaling benchmark on SuperMUC-NG.
48 MPI processes were executed on each compute node (i.e., one per core, not using additional hardware threads per core) with processes bound to cores and socket-local memory. 9 Memory requirements as reported by the HemeLB code are shown in Figure 1(a). With 768 GiB 'fat' compute nodes, executions ran with 864 to 6,144 MPI processes (on 18 to 128 compute nodes), whereas executions with 12,288 and more MPI processes ran on regular 'thin' compute nodes (with 96 GB memory). Since executions with 9,216 MPI processes on 192 compute nodes require more than 2 GiB per process, this configuration was not possible on SuperMUC-NG. 10
Generally only a single execution was done at each scale, during regular operation of SuperMUC-NG, apart from the very largest runs requiring more than half of the total compute nodes, which could only be done in a special dedicated session after system maintenance. Since the initial runs requiring more than 3,072 compute nodes repeatedly performed poorly, measurements were taken and analysed to identify the cause, such that runs could be done which delivered performance in line with expectations.
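The dataset and job sizes quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch using only the figures stated in the text (the 6,452-node case is the largest configuration reported later in the efficiency analysis); the function names are illustrative and not part of HemeLB.

```python
# Figures quoted in the text for the circle of Willis dataset on SuperMUC-NG.
BLOCK_GRID = (1376, 1087, 761)   # blocks along each axis
NONEMPTY_BLOCKS = 20_740_240
LATTICE_SITES = 10_154_448_502
PROCS_PER_NODE = 48              # one MPI process per physical core


def total_blocks(grid):
    """Number of blocks in the bounding grid (empty and non-empty)."""
    nx, ny, nz = grid
    return nx * ny * nz


def processes(nodes, procs_per_node=PROCS_PER_NODE):
    """MPI processes for a given number of fully populated compute nodes."""
    return nodes * procs_per_node


def mean_sites_per_process(nodes):
    """Average lattice sites per MPI process, ignoring any imbalance."""
    return LATTICE_SITES / processes(nodes)


if __name__ == "__main__":
    blocks = total_blocks(BLOCK_GRID)
    print(f"blocks in grid:   {blocks:,}")
    print(f"non-empty blocks: {NONEMPTY_BLOCKS / blocks:.2%}")
    for nodes in (18, 128, 6452):
        print(f"{nodes:5d} nodes -> {processes(nodes):7,d} processes, "
              f"{mean_sites_per_process(nodes):14,.0f} sites each")
```

Fewer than 2% of the blocks in the bounding grid are non-empty, which illustrates why the code is optimised for sparse geometries.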

C. Scalability of simulation time and efficiency
Simulation time for different numbers of compute nodes is plotted in Figure 1(b), along with a dotted line representing perfect linear scaling. Simulation time speedup relative to the smallest execution configuration (18 compute nodes) is more than 190-fold obtained for 360-fold increase in compute nodes, with more than 80% scaling efficiency to over 100,000 processes.
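The relation between speed-up and strong-scaling efficiency used here can be written down directly. This is a small sketch, assuming efficiency is defined as observed speed-up divided by the increase in compute nodes; the 6,452-node count is taken from the largest configuration reported later in the paper, and the function names are illustrative.

```python
def speedup(t_ref, t):
    """Speed-up of a run taking time t relative to a reference time t_ref."""
    return t_ref / t


def scaling_efficiency(observed_speedup, n_ref, n):
    """Strong-scaling efficiency: observed speed-up over ideal speed-up n/n_ref."""
    return observed_speedup / (n / n_ref)


if __name__ == "__main__":
    # 190-fold speed-up going from 18 to 6,452 compute nodes (roughly 360-fold).
    eff = scaling_efficiency(190.0, n_ref=18, n=6452)
    print(f"scaling efficiency at full scale: {eff:.2f}")
```

With the quoted 190-fold speed-up this gives a full-scale efficiency a little above one half, compared with more than 80% up to 100,000 processes.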
With 792 compute nodes bundled within island domains, and islands connected via an additional switch, it is no- reduced. Small numbers of failed compute nodes throughout SuperMUC-NG can therefore conveniently be avoided when allowing full flexibility in allocating partitions.
HemeLB executions have been audited on different computer systems by the POP CoE. These performance assessments using a methodology [10] based on measurements taken with the highly scalable open-source Scalasca/Score-P toolset [11] found very good computation scaling and communication efficiencies, while identifying memory consumption and load balance as key issues.
Score-P was used to prepare a HemeLB executable where (by default) application routines are instrumented by the compiler and special measurement libraries are linked interposing on MPI library routines. Initial measurements suffered from excessive overheads due to very frequent executions of MPI_Comm_size, MPI_Comm_rank and lots of small C++ methods (particularly from the standard libraries). A custom installation of Score-P was therefore made to avoid instrumentation of the uninteresting MPI routines, and the Intel compiler directed to only instrument key application routines specified via a file listing their signatures. While this brought measurement dilation down to an acceptable level of around 3%, execution tracing was still prohibitive due to the huge amount of MPI operations during HemeLB initialisation. The application code was therefore manually annotated with directives to pause recording when executing the initialisation phase, allowing detailed measurement and analysis of the subsequent simulation phase.
Scalasca analysis reports can be interactively explored using the CUBE GUI. Using an execution with 13,824 MPI processes as an example, Figure 2 shows efficiency metrics calculated by CUBE for the selected callpath, in this case RunSimulation where 32% of execution time is spent: most of the execution time is initialisation, captured here as MEASUREMENT OFF.
CUBE presents the detailed measurements and analyses from Scalasca/Score-P in various ways. A three-pane presentation of metrics, callpaths and processes/threads in hierarchical trees is shown in Figure 2, where selections determine the values shown in panels from left to right, and expanding tree nodes reveals the next level of internal detail, to naturally support exploration of huge amounts of execution performance data.

D. Localisation of degraded performance
For the initial measurement with 300,000 MPI processes the initial CUBE assessment of execution efficiency shown in Figure 3 already highlighted very poor load balance of 0.49 (whereas communication efficiency remained a very good 0.97). This was confirmed by examining the computation time per process distribution statistics, whereas the number of instructions completed (from the PAPI_TOT_INS counter captured in the measurement) showed very little variation, confirming that all processes did roughly the same amount of useful computation.
However, another hardware counter metric for resource stall cycles (PAPI_RES_STL) shown in Figure 4 provided insight into the imbalance problem. On one compute node, all 24 of the MPI processes with odd-numbered ranks in the MPI_COMM_WORLD global communicator suffered 12 times the mean number of resource stall cycles. Closer examination also revealed a second compute node with elevated resource stall cycles for its even-numbered MPI ranks compared with all of its peer processes. These same two compute nodes were also the culprits in the other measurements with degraded performance, and after explicitly excluding them from subsequent executions 5 times faster simulation performance was obtained.
Diagnostic checks run on those nodes by the system administrators after they had been taken out of production identified that they had faulty DIMMs and lacklustre performance, needing the DIMMs to be replaced.

Fig. 2. Scalasca/CUBE analysis report explorer presentation of HemeLB execution on SuperMUC-NG with 13,824 MPI processes. From the hierarchy of metrics in the left panel, which includes metrics determined from analysis of the execution trace, time exclusively for computation has been selected, and this metric shown in panels to the right. From the call-tree hierarchy in the middle panel, the HandleActors part of RunSimulation has been selected, and computation time exclusively for this callpath is presented in the right panel for each MPI process. The times for each MPI process have been sorted and shown as percentages of that with the largest value, rank 3474 in MPI_COMM_WORLD. Next to each numeric value in the trees is a small box coloured from white through yellow and orange to red according to its percentage of the total metric value to facilitate identification of those which are most significant. Below the main window is an identically coloured topology map, where each column corresponds to the 48 processes (increasing top to bottom) on a compute node (increasing to the right), from which processes 3474 and 7782 stand out from the others with their much higher computation times. Overlaid on the main window are two additionally detached tabs: one showing the distribution of computation times of the processes, and the other an assessment of RunSimulation execution identifying load balance efficiency of only 0.49 as the critical factor.

For this combination, the metric values for all processes, as percentages of the largest peer value, are shown in the System tree panel and separately in the (detached) Statistics chart and Process topology panel (folded such that each row has the processes of ten compute nodes). It is immediately clear that all of the 24 processes with odd MPI ranks in MPI_COMM_WORLD on node i02r08c05s07 have more than 14x the resource stall cycles of any others, and closer examination further reveals node i04r01c05s10 also has values for its even MPI ranks considerably elevated beyond all others.

Global scaling efficiency is the product of parallel efficiency and computation scaling.
Parallel efficiency is the ratio of mean computation time to total runtime of all processes.
Load balance efficiency is the mean/maximum ratio of computation time outside of MPI.
Communication efficiency is the ratio of maximum computation time to total runtime.
Serialisation efficiency is estimated from idle time within communications where no data is transferred.
Transfer efficiency relates to essential time spent in data transfers.
Computation scaling is the relative total time in computation (outside of MPI).
Instructions scaling is the relative total number of instructions executed (outside of MPI).
IPC scaling is the relative value of instructions executed (outside of MPI) per CPU cycle.
(Scaling efficiencies are relative to a serial execution or the smallest parallel execution configuration.)
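The localisation step described in this section, spotting a compute node whose odd or even ranks show strongly elevated resource stall cycles, can be approximated by grouping per-rank counter values. The sketch below is illustrative only: it assumes ranks are placed consecutively with 48 per node, uses rank parity as the grouping within a node as observed in the measurements, and the threshold factor is a hypothetical choice rather than anything prescribed by Scalasca or CUBE.

```python
from collections import defaultdict
from statistics import median


def flag_slow_groups(stalls, procs_per_node=48, factor=3.0):
    """Return (node_index, rank_parity) groups with elevated stall cycles.

    stalls: per-rank stall-cycle counts (e.g. PAPI_RES_STL), indexed by MPI rank.
    A group is flagged when its median exceeds `factor` times the global median
    (the factor is a hypothetical threshold, to be tuned per system).
    """
    baseline = median(stalls)
    groups = defaultdict(list)
    for rank, value in enumerate(stalls):
        # Assumption: consecutive ranks fill each node before the next one.
        groups[(rank // procs_per_node, rank % 2)].append(value)
    return sorted(key for key, values in groups.items()
                  if median(values) > factor * baseline)


if __name__ == "__main__":
    # Synthetic data: four nodes, with the odd ranks of node 2 stalling 12x more.
    stalls = [100.0] * (4 * 48)
    for rank in range(2 * 48, 3 * 48):
        if rank % 2 == 1:
            stalls[rank] = 1200.0
    print(flag_slow_groups(stalls))
```

Flagged node indices would then be mapped to hostnames from the job's node list so that they can be excluded from later allocations.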

E. Efficiency analysis
For the updated Scalasca/Score-P measurements of HemeLB executions on SuperMUC-NG with up to 6,452 compute nodes (309,696 MPI processes), Table I summarises their performance assessment. Computational instructions retired per clock cycle (IPC) was a reasonable 1.9, compared to 1.4 for the smaller execution configurations, suggesting better cache efficiency as the lattice partitions get smaller. Perfect instruction scaling up to 768 compute nodes thereafter deteriorates as there is more processing of lattice block boundaries compared to their interiors. Since these two effects counteract each other, very good computation scaling above 0.87 is sustained. Efficient non-blocking communication to exchange fluid particles between neighbouring lattice blocks maintains excellent communication efficiency above 0.97. The most significant inefficiency at all scales tested is load balance, generally around 0.80 but dropping to 0.72 in some larger execution configurations. While this is still fairly good, it presents the largest opportunity for performance improvement and warrants more in-depth investigation.
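The multiplicative efficiency model behind this assessment can be sketched from the definitions given in the paper: load balance is mean over maximum computation time, communication efficiency is maximum computation time over runtime, and their product is the parallel efficiency. The code below is a minimal illustration with synthetic inputs; it is not the CUBE implementation, and the function names are assumptions.

```python
def pop_efficiencies(compute_times, runtime):
    """POP-style efficiencies from per-process computation times (outside MPI)
    and the total runtime of the analysed region."""
    mean_c = sum(compute_times) / len(compute_times)
    max_c = max(compute_times)
    load_balance = mean_c / max_c
    communication = max_c / runtime
    return {
        "load_balance": load_balance,
        "communication": communication,
        # Equivalent to mean computation time divided by runtime.
        "parallel": load_balance * communication,
    }


def global_efficiency(parallel, computation_scaling):
    """Global scaling efficiency as the product of its two factors."""
    return parallel * computation_scaling


if __name__ == "__main__":
    # Synthetic example with three processes.
    print(pop_efficiencies([80.0, 100.0, 60.0], runtime=125.0))
    # Illustrative combination of values quoted in the text
    # (load balance 0.80, communication 0.97, computation scaling 0.87).
    print(f"{global_efficiency(0.80 * 0.97, 0.87):.3f}")
```

Because the factors multiply, the smallest one (here load balance) bounds what any improvement to the others can achieve.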

F. Investigation of load balance
Investigation of the breakdown of (MPI) communication versus computation time of each process showed that one or a few processes spend essentially no time waiting for the non-blocking point-to-point MPI communication to complete during the simulation phase. Since these processes are fully occupied with computation for longer than the others, by the time they reach the MPI_Wait the incoming messages for them have already been received into the corresponding buffers. This is characteristic of computational load imbalance.
The HemeLB 'basic decomposition' method for splitting the lattice grid into parts for each process was already known to result in a notable imbalance, with many of the higher-ranked processes given disconnected sections that result in additional computation for them. The decomposition is expected to be deterministic, however, allowing comparison between runs. This was verified by comparing the numbers of lattice blocks and sites allocated to each process, which matched exactly. Similarly, the number of messages and the amount of data sent and received by each process also matched exactly, verifying that the pattern of communication was also identical.
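The determinism check described above amounts to an exact per-rank comparison of the work and communication assigned in two runs. A minimal sketch follows, assuming each run has been summarised as one tuple per MPI rank; the tuple layout (blocks, sites, messages sent, bytes sent) is hypothetical and would come from whatever the application or measurement reports.

```python
def differing_ranks(run_a, run_b):
    """Return the MPI ranks whose per-process summaries differ between two runs.

    run_a, run_b: sequences indexed by rank, each entry a tuple such as
    (blocks, sites, messages_sent, bytes_sent)  # hypothetical layout
    """
    if len(run_a) != len(run_b):
        raise ValueError("runs used different numbers of processes")
    return [rank for rank, (a, b) in enumerate(zip(run_a, run_b)) if a != b]


if __name__ == "__main__":
    first = [(10, 4800, 26, 1 << 20), (12, 5900, 26, 1 << 20)]
    second = list(first)
    print(differing_ranks(first, second))  # an empty list means identical
```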

G. Computation time variation
In different runs with the same number of processes the communication time can be expected to vary, both due to the mapping of processes to different compute nodes distributed throughout SuperMUC-NG (in some cases requiring additional communication inter-island hops via top-level switches) and also contention on those communication paths shared with other applications concurrently executing on the system [12]. 11 While the first aspect can be addressed by forcing execution on identical machine partitions, e.g. doing multiple runs consecutively within a single job (Figure 5), avoiding possible communication interference from other applications can only be achieved via dedicated use of islands or the entire computer system (i.e., single user execution). Figure 5 shows the amount of HemeLB work per MPI process, represented by the number of blocks of lattice sites they were assigned (which is identical in both runs), and the corresponding measured computation time (which varies only slightly apart from outliers, the worst of which are circled in each run). While the amount of computation and communication are not perfectly balanced, they do not change between runs. The mean computation time in both cases is 810 seconds; however, it is more relevant to compare the outliers to the usual maximum of approximately 1000 seconds (i.e. from processes with no significant memory stall cycles): the outliers of 1144 and 1132 seconds therefore result in a degradation of around 14%. Of course, MPI waiting time of each process also depends on when the corresponding partner processes are ready to initiate point-to-point communication, which can be delayed when they require longer for their computational tasks.
11 There is no isolation of job partitions as on IBM Blue Gene systems.
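The degradation figure above follows directly from the quoted computation times, and the same numbers show the effect on load balance efficiency (mean over maximum computation time). This is a worked sketch using only values stated in the text; the function names are illustrative.

```python
MEAN_TIME = 810.0          # mean computation time per process, seconds
USUAL_MAX = 1000.0         # approximate maximum without memory-stall outliers
OUTLIERS = (1144.0, 1132.0)  # worst process in each of the two runs


def degradation(outlier, usual_max=USUAL_MAX):
    """Relative slowdown caused by an outlier process."""
    return outlier / usual_max - 1.0


def load_balance(mean, maximum):
    """Load balance efficiency: mean over maximum computation time."""
    return mean / maximum


if __name__ == "__main__":
    for t in OUTLIERS:
        print(f"outlier {t:.0f} s: degradation {degradation(t):.1%}, "
              f"load balance {load_balance(MEAN_TIME, t):.2f} "
              f"(vs {load_balance(MEAN_TIME, USUAL_MAX):.2f} without outlier)")
```

A single slow process therefore lowers load balance from about 0.81 to about 0.71, consistent with the range reported in the efficiency analysis.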
Comparison of the number of instructions executed by each process (from the PAPI_TOT_INS=INSTRUCTIONS_RETIRED hardware counter) 12 showed little variation between the processes that took longer and the others, suggesting that each process did very similar amounts of work.
Comparison of the number of cycles executed by each process (PAPI_TOT_CYC=CPU_CLK_THREAD_UNHALTED) to the number of reference cycles executed (PAPI_REF_CYC=UNHALTED_REFERENCE_CYCLES) showed that they correlated perfectly, indicating that (dynamic) variation in clock frequency, whether due to turbo-boost, throttling due to thermal constraints, and/or the use of AVX512 vector instructions, did not occur. This was to be expected since AVX512 vector instructions were not used 13, and the compute nodes were specifically configured for reproducible performance with turbo-boost and energy-conserving optimisation 14 disabled, but naturally needed to be verified.
12 not counting those when within MPI routines, which are not useful computation
13 this is planned as future optimisation work, following the current performance assessment
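One simple way to perform the frequency check described above is to compare the ratio of core cycles to reference cycles across processes: if the clock frequency never changed, that ratio should be essentially the same everywhere. The sketch below assumes per-process totals of PAPI_TOT_CYC and PAPI_REF_CYC are available; requiring a constant ratio is a stricter simplification of the correlation used in the paper, and the tolerance is a hypothetical choice.

```python
def clock_ratio_spread(tot_cyc, ref_cyc):
    """Relative spread of the per-process ratio of core to reference cycles."""
    ratios = [t / r for t, r in zip(tot_cyc, ref_cyc)]
    return (max(ratios) - min(ratios)) / min(ratios)


def frequency_was_stable(tot_cyc, ref_cyc, tolerance=0.01):
    """True when all processes ran at (nearly) the same effective frequency.

    tolerance: hypothetical threshold on the relative spread of the ratios.
    """
    return clock_ratio_spread(tot_cyc, ref_cyc) <= tolerance


if __name__ == "__main__":
    # Synthetic counts: every process has the same cycles-to-reference ratio.
    tot = [3100.0, 3410.0, 3720.0]
    ref = [3000.0, 3300.0, 3600.0]
    print(frequency_was_stable(tot, ref))
```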
Computation can also be disrupted by system noise and jitter from daemon processes running on compute nodes; however, this could impact different compute nodes in every execution. This was eliminated as being unlikely, since the processes which were slow in any execution were verified to be slow throughout the Simulate phase. Figure 6 shows the HemeLB execution trace collected by Scalasca/Score-P in a timeline visualisation by Vampir 15 showing the same process(es) taking longer for each and every timestep, as necessary for them to have essentially no MPI waiting time. And when those processes perform as expected in a consecutive execution within the same job, there is nothing to indicate that the processor (or indeed individual core) is deficient.

H. Core performance
So far it had been verified that the computation and communication were completely deterministic, yet the time required for them varied somewhat from run to run due to very pronounced variations between processes which change in each run. A few processes require considerably longer for the same amount of computation, such that they had no need to wait for communication with the others which had correspondingly large amounts of waiting time.
To investigate the processor (core) execution behaviour more deeply, further measurements were done incorporating additional hardware counters in a top-down fashion as recommended by Intel [13]. It was immediately clear that slow processes correlated with larger numbers of resource stall cycles (PAPI_RES_STL=RESOURCE_STALLS:ANY), which cover both front-end and back-end pipelines. Drilling down, they also correlated with back-end stall cycles (CYCLE_ACTIVITY:STALLS_TOTAL), and ultimately most strongly with stall cycles waiting for memory (CYCLE_ACTIVITY:STALLS_MEM_ANY).
Notably, counters for the various levels of cache (L1D/L2/L3), DRAM and NUMA accesses were all very low and not strongly correlated with the slow processes.
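The top-down comparison above boils down to asking which hardware counter tracks the per-process computation time most closely. A minimal sketch using the Pearson correlation coefficient follows; the data are synthetic, the "L3_MISSES" label is a hypothetical stand-in for the cache counters, and the paper does not state which correlation measure was actually used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient (0.0 if either series is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_counters(compute_time, counters):
    """Sort counters by how strongly they correlate with computation time."""
    scored = [(name, pearson(values, compute_time))
              for name, values in counters.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Synthetic per-process data: the third process is the slow one.
    compute_time = [1000.0, 1000.0, 1140.0, 1000.0]
    counters = {
        "CYCLE_ACTIVITY:STALLS_MEM_ANY": [10.0, 10.0, 80.0, 10.0],
        "L3_MISSES": [5.0, 6.0, 5.0, 6.0],  # hypothetical counter label
    }
    for name, r in rank_counters(compute_time, counters):
        print(f"{name:32s} r = {r:+.2f}")
```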

I. Comparison with other HPC computer systems
Lacking further ideas for progress on SuperMUC-NG, it was decided to see whether other similar computer systems would be able to reproduce this variability issue. The JUWELS regular cluster nodes have almost identical dual 24-core Intel Xeon Platinum 8168 'Skylake' processors, but a ConnectX-4 EDR InfiniBand interconnect from Mellanox (as opposed to Intel OmniPath), and are running a CentOS 7 Linux kernel. JUWELS accelerated cluster nodes only differ in that they have dual 20-core Intel Xeon Gold 6148 'Skylake' processors plus four Nvidia V100 'Volta' GPUs. 240 large-memory compute nodes have 192 GiB, with the others having only 96 GiB (as on SuperMUC-NG). Although the same Intel compiler and MPI were available, HemeLB executions failed to run fully to completion due to a bug handling MPI windows, therefore ParaStationMPI was used instead. 16 While the HemeLB communication on JUWELS standard nodes was somewhat slower than on SuperMUC-NG, it showed the very same characteristic pattern with one or a few processes having very low MPI waiting time, yet varying from run to run. Similarly, the computation itself was generally slightly faster than on SuperMUC-NG but took notably longer for those processes with little waiting time, correlating with the corresponding hardware counters, particularly that for stall cycles waiting for memory.
On the JUWELS accelerated nodes, running 20 MPI processes per socket (instead of 24) 17, and on the DEEP-EST prototype Data Analytics Module (DAM) with dual 24-core Intel Xeon Platinum 8260M 'CascadeLake' processors and Extoll Tourmalet interconnect, the same pattern of performance variation between cores was also observed.
16 configured to avoid excessive memory requirements for MPI message buffers: PSP_UCP=1, UCX_TLS=sm,self,ud
17 therefore still one process per physical core, and not using the GPGPUs

A. Origin of performance variability
Discussion with the system administrators at this point brought to light the fact that significant run-to-run performance variability had been recently reported with the xPic particle-in-cell application [14] on JURECA-Booster (Intel Xeon Phi 'Knights Landing (KNL)') which seemed to match symptoms that had been reported when benchmarking HPL [15]. TACC Stampede2 comprises both Intel Xeon 'Skylake' and Intel Xeon Phi 'KNL' processors, which both showed significant HPL and DGEMM performance variations from run to run that are due to elevated snoop filter conflicts arising from the mapping of physical memory pages into the L3 cache based on a proprietary hashing mechanism, as also documented by Intel [17]. The off-core memory controller hardware counters which can be used to confirm this are privileged and therefore not accessible to user codes, as well as not being directly related to cores or application processes. Further investigation and analysis [16] also determined that, while the 24-core Skylake and 68-core KNL are apparently the most seriously impacted, Intel Xeon Scalable processors with non-power-of-two numbers of cores are all apparently affected to varying extents by this issue. Huge memory pages 18 of 1 GiB, which are Intel's workaround to optimise HPL performance, were determined to be effective; however, these are not yet available on Stampede2, SuperMUC-NG, JUWELS, or other production systems with similar processors. On HPC clusters where other proposed workarounds such as disabled hardware threading or sub-NUMA clustering (SNC) 19 are configured, identical performance variability was encountered.

B. Impact on HPC applications
While the occurrence of this issue can be considered to be rare, with only one in every ten thousand to one hundred thousand cores being affected by a slowdown of more than 10%, the impact on large-scale parallel applications is very significant. Since these are characterised by occasional (and often more frequent) synchronisations, whether via global collective operations or indirectly via interactions with neighbours, all processes will be held back waiting on the slowest. The end result is that an additional 10% or more of the entire computing time is required for busy waiting, with corresponding impact on energy consumption and system throughput. Performance profiles of executions of the SPECFEM3D seismic wave propagation simulation application 20 taken on the Skylake compute nodes of the Irène Joliot-Curie supercomputer also show the same behaviour and impact on performance at large scale, therefore the issue is likely to be widespread but perhaps only identified when explicitly looked for.
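Why a rare per-core event matters at scale can be made concrete with a simple probability estimate. The sketch below assumes, purely for illustration, that cores are affected independently with a fixed per-core rate between one in ten thousand and one in one hundred thousand; the paper does not claim independence, and the function names are illustrative.

```python
from math import expm1, log1p


def prob_any_affected(cores, rate):
    """Probability that at least one of `cores` is affected, i.e.
    1 - (1 - rate)**cores, assuming independence (an illustrative assumption)."""
    return -expm1(cores * log1p(-rate))


def expected_affected(cores, rate):
    """Expected number of affected cores."""
    return cores * rate


if __name__ == "__main__":
    cores = 309_696  # largest HemeLB configuration on SuperMUC-NG
    for rate in (1e-4, 1e-5):
        print(f"rate {rate:.0e}: expected {expected_affected(cores, rate):5.1f} "
              f"affected cores, P(at least one) = {prob_any_affected(cores, rate):.3f}")
```

Under these assumptions a full-scale run is almost certain to include at least one slow core, and since every process waits on the slowest, the whole job inherits that slowdown.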
A wide variety of commercial/vendor and open-source performance tools are available which could be used to identify such situations: the applicability and capabilities of many are summarised in [18]. The tools used and demonstrated in this paper were found to be particularly convenient, supporting (very) large-scale measurements and analyses; however, the performance variability and variation by process/core were also verified using Intel Parallel Studio XE 21 (notably Application Performance Snapshot and Amplifier/VTune), although limited to only modest scale.
System-level job-reporting and monitoring tools operate on leadership HPC systems, including PerSyst [19] at LRZ and LLview [20] at JSC; however, these currently don't capture and present the RAS data required to identify the performance issues encountered in this work. Since the performance issues are related to application-specific access patterns to large amounts of memory, for processes/threads executing on individual processor sockets/cores and only during particular execution phases, this will require more fine-grained data collection and analysis.

GHP = Gigabyte Huge Pages enabled providing backing for application address space
Uncore = (privileged) access available to Intel system agent non-core hardware counters such as memory controller and snoop agent
xPic = worst observed slowdown relative to best (minimum) execution on the same processor
Other codes = multi-node MPI applications that manifest similar non-deterministic slowdown (5%+) for individual processes/cores

C. Application malleability and reactivity
Avoidance of suspect processors when submitting jobs cannot be done when the affected processors/cores are not determinable in advance, necessitating dynamic load re-balancing after an application has already started execution. Fortunately, once affected processes have been identified and have all had their effective load re-balanced (which might require more than one iteration), it is expected that efficient execution will be unimpeded thereafter (unless additional or different memory is subsequently accessed).
Chameleon [21] and DLB [22] are examples of libraries being developed for reactive load balancing. Both rely on applications incorporating hybrid task parallelism via OpenMP to allow dynamic migration of work (tasks) between threads in SMP compute nodes after self-introspection and analysis, which can be more efficient than standard OpenMP dynamic loop scheduling but similarly damages data locality when work is migrated. For persistent load imbalances, as encountered in this study, a customised lightweight migration strategy needs to be developed.

IV. CONCLUSION
Performance analysis of (prospective) exascale applications has long been recognised to present significant challenges, particularly with extreme scale and ever increasing complexity.
Inherent performance variability has always been an issue for HPC parallel applications, from network contention, shared filesystems, intrusive system daemons, energy policies, etc., for which corresponding amelioration strategies are well documented in best practices. However, application-specific memory and processor core performance variability exacerbates this, and now (large-scale) executions have neither predictable, repeatable nor reproducible performance, even for immediately following runs on the exact same hardware! The task for performance analysts is thereby made much more difficult.
For robust exa-scalability, applications are likely to need to dynamically adapt to both hardware failures and performance degradations that would otherwise cripple executions. Performance tools will therefore also need to become more dynamic in not only measuring parallel executions, but also providing analyses of execution inefficiencies directly to applications on the fly so that they can be addressed promptly.

A. Artefact Description
The HemeLB application was run with a 6.4 μm resolution circle of Willis geometry dataset on LRZ's SuperMUC-NG supercomputer, and its execution performance and scalability to over 300,000 MPI processes measured and analysed via runtime summaries and execution event traces from the system's 6,480 dual 24-core compute nodes.
a) Software Artefact Availability: Some author-created software artefacts are NOT maintained in a public repository.

C. Modifications made for the paper
a) Score-P configuration: A custom installation specifying MPI_CPPFLAGS=-DSCOREP_MPI_NO_MINI, to avoid generating wrappers for MPI_Comm_rank/size (required for trace collection/analysis), was configured with the PAPI library and for Intel MPI and compilers.
b) Instrumentation: The HemeLB application dependencies were built without modification. Source code of the HemeLB application src/main.cc was annotated with Score-P measurement control API macros (SCOREP_RECORDING_OFF/ON) to pause event recording during the initialisation phase, and configured using CXX=scorep-icpc with environment settings to direct the Score-P instrumenter
c) Measurement: Job execution via SLURM scripts specified --ear=off to have access to hardware counters in measurements and set the processor 'performance' profile disabling dynamic frequency changes.
--ntasks-per-node=48 and --cpus-per-task=1 allocated one MPI process to each physical core for the specified number of compute nodes. --cpu-bind=cores and --mem-bind=local bound each process to a dedicated core and local (socket) memory.
Exclusion clauses were also added to get allocations avoiding various compute nodes as necessary, e.g., --exclude=i02r08c05s07,i04r01c05s10.
Runtime environment variables for Score-P measurement:
  SCOREP_DEVELOPMENT_MEMORY_STATS=aggregated
  SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_REF_CYC,PAPI_RES_STL
  SCOREP_TIMER=gettimeofday
  SCOREP_TRACING_USE_SION=true
  SCOREP_TRACING_MAX_PROCS_PER_SION_FILE=576
SCOREP_TOTAL_MEMORY was also set as required based on the Score-P memory usage statistics reported, e.g., to 200MB for the runtime summary execution configurations with 300,000 and more processes. Simulation execution dilation (compared to uninstrumented reference execution) was less than 4%.
d) Analysis: For Scalasca automated trace analysis, event timestamp consistency correction was applied via SCAN_ANALYZE_OPTS="--time-correct".
Analysis report exploration with CUBE GUI and execution trace exploration with Vampir was done on local clusters and a notebook computer.