Understanding the Impact of Synchronous, Asynchronous, and Hybrid In-Situ Techniques in Computational Fluid Dynamics Applications

High-Performance Computing (HPC) systems provide input/output (IO) performance that grows relatively slowly compared to peak computational performance, and they have limited storage capacity. Computational Fluid Dynamics (CFD) applications aiming to leverage the full power of Exascale HPC systems, such as the solver Nek5000, will generate massive amounts of data for further processing. These data need to be efficiently stored via the IO subsystem. However, limited IO performance and storage capacity may create bottlenecks for performance and thus for scientific discovery. In comparison to traditional post-processing methods, in-situ techniques can reduce or avoid writing and reading the data through the IO subsystem, promising to be a solution to these problems. In this paper, we study the performance and resource usage of three in-situ use cases: data compression, image generation, and uncertainty quantification. We furthermore analyze three approaches in which these in-situ tasks and the simulation are executed synchronously, asynchronously, or in a hybrid manner. In-situ compression can be used to reduce the IO time and storage requirements while maintaining data accuracy. Furthermore, in-situ visualization and analysis can save terabytes of data from being routed through the IO subsystem to storage. However, the overall efficiency depends crucially on the characteristics of both the in-situ task and the simulation. In some cases, the overhead introduced by the in-situ tasks can be substantial. Therefore, it is essential to choose the proper in-situ approach (synchronous, asynchronous, or hybrid) to minimize overhead and maximize the benefits of concurrent execution.


I. INTRODUCTION
Computational fluid dynamics (CFD) is a branch of engineering physics that analyzes fluid flow problems using numerical methods. It is used to solve a wide range of problems in both research and industry. Examples include nuclear reactor flow analysis [24], biological flows, e.g., the Food and Drug Administration (FDA) nozzle benchmark [28], and flow simulations around a wing for modern civil aircraft design [15]. Analyzing the flow around objects in realistic problems entails the analysis of turbulence, which is a flow regime characterized by complex, non-linear, and seemingly random fluid motions across multiple scales. Because the energy in turbulence is dissipated through viscosity at small scales, the discrete domains used to solve such problems must be large enough to capture the large-scale motions and fine enough to capture the smallest ones. Direct Numerical Simulations (DNS) allow for full flow resolution but require domains on the order of hundreds of millions of grid points and hundreds of thousands of time steps [32].
In this paper, we focus on Nek5000 [2], a spectral element method-based code with excellent scalability [25], partially due to the weak element coupling and its "matrix-free" formulation. This enables the solution of large problems without the need to explicitly construct any matrix operator, which would have restrictive sizes due to the large number of grid points in turbulence simulations. The code stores and processes information on a "local domain" basis, which means that each element of the discretization is handled separately from the others, and a conciliation operation known as direct-stiffness summation is performed regularly to ensure continuity. This feature provides great flexibility in preparing the data for post-processing, as such data analysis can often be performed locally without the need for additional communication.
CFD applications can fully utilize the computational power of Exascale High Performance Computing (HPC) systems with optimized data structures and parallelization. However, not only is the computational cost of CFD high, but so is the amount of data generated, which grows with the size of the problem and needs to be stored via the input/output (IO) subsystem for subsequent analysis (visual or numerical) as well as for checkpoint/restart operations. While the computational performance of HPC systems is rapidly increasing, the corresponding IO performance grows more slowly, and storage capacity is also limited. As a result, the large IO operations typical of standard CFD workflows may result in significant overheads.
An alternative that reduces or prevents this overhead is to perform the analysis while the simulation is running and the data resides in the HPC system's memory. This type of approach is known as an in-situ approach [11]. In-situ techniques can reduce IO throughput and storage requirements, while improving overall simulation and data analysis performance. However, because computational resources must be shared between simulation and in-situ tasks, in-situ approaches may introduce new overheads. As a result, before deploying in-situ methods, the trade-off between reduced IO requirements and increased workloads must be carefully considered.
Three main approaches can be identified based on how resources are shared and synchronized between simulation and in-situ tasks. In-situ tasks can be performed in a synchronous, asynchronous, or hybrid manner, the latter combining synchronous and asynchronous approaches as shown in Fig. 1.
In the synchronous in-situ approach, the simulation stops at well-defined intervals, allowing the in-situ task to run, and resumes after the end of the in-situ task (cf. Fig. 1(a)). The in-situ task typically uses the same resources as the simulation, and by using appropriate data layouts, data copying can often be avoided. Ideally, the data will still reside in the cache, making the in-situ processing very efficient. However, if the in-situ task requires a different data layout than the simulation, cache locality may be lost and the performance subsequently reduced. Because of overheads in the in-situ task, using all resources may be suboptimal, depending on its scalability. It can also be difficult to decouple the in-situ task from the simulation because it relies on simulation functions and data structures, making it tightly coupled with the simulation and thus requiring synchronous execution.
As shown in Fig. 1(b), the asynchronous in-situ approach uses separate computational resources for the simulation and in-situ task, allowing the simulation to transfer the data to the in-situ resources and then continue with the simulation while the in-situ task runs concurrently on the other resources. This allows the simulation and in-situ task to be decoupled, with a suitable amount of resources assigned to the in-situ task. However, it usually necessitates additional data copying, and determining how to divide available resources between simulation and in-situ tasks can be difficult. A suboptimal choice could result in load imbalances and consequently unnecessary waiting times.
The hybrid in-situ approach (cf. Fig. 1(c)) combines the two, with the in-situ task divided into two parts. The first part is typically executed synchronously, followed by an asynchronous second part. In this approach, the synchronous part is executed on the same resources as the simulation (which pauses for the duration of this part), and then the required data are sent to the separate resources for the asynchronous part, after which the simulation can resume. While it is more difficult to design, it can overcome the disadvantages of the previous approaches by allowing each subtask to run in the mode, synchronous or asynchronous, that suits it best.
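The control flow of the three approaches can be illustrated with a minimal Python sketch. It is not the Nek5000/ADIOS2 implementation: a worker thread and a queue stand in for the separate in-situ resources and the data transfer, and all function names (simulate_step, part_one, part_two, run) are hypothetical placeholders.

```python
import queue
import threading


def simulate_step(step):
    """Stand-in for one solver step; returns a tiny hypothetical 'field'."""
    return [float(step + i) for i in range(4)]


def part_one(field):
    """Cheap, simulation-coupled part (e.g. a local reduction of the field)."""
    return field[:2]


def part_two(reduced):
    """Remaining part that does not need simulation data structures."""
    return sum(reduced)


def start_worker(stage, results):
    """Start a consumer thread that plays the role of the in-situ resources."""
    inbox = queue.Queue()

    def consume():
        while True:
            item = inbox.get()
            if item is None:  # sentinel: the simulation has finished
                break
            results.append(stage(item))

    thread = threading.Thread(target=consume)
    thread.start()
    return inbox, thread


def run(mode, n_steps, interval):
    results = []
    if mode == "synchronous":
        inbox, thread = None, None
    elif mode == "asynchronous":
        inbox, thread = start_worker(lambda f: part_two(part_one(f)), results)
    elif mode == "hybrid":
        inbox, thread = start_worker(part_two, results)
    else:
        raise ValueError(mode)
    for step in range(1, n_steps + 1):
        field = simulate_step(step)
        if step % interval != 0:
            continue
        if mode == "synchronous":
            # The simulation pauses; the whole task runs in place on live data.
            results.append(part_two(part_one(field)))
        elif mode == "asynchronous":
            # Hand over a copy of the full field and continue immediately.
            inbox.put(list(field))
        else:
            # Hybrid: run part one in place, ship only the reduced data.
            inbox.put(part_one(field))
    if thread is not None:
        inbox.put(None)
        thread.join()
    return results
```

All three modes produce the same results here; they differ only in where the work runs and in how much data crosses the queue, which is the trade-off discussed above.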
As previously discussed, these three in-situ approaches have different advantages and disadvantages, and the best choice depends on the characteristics of the simulation and the in-situ tasks, which makes it difficult to select.
In this paper, we investigate the impact of these three approaches on three common, yet very different in-situ tasks in CFD (compression of checkpoint/restart files, visualization, and uncertainty quantification) using a real CFD use case at scale (turbulent flow in a bent pipe [16]). The following are the paper's specific contributions: 1) it proposes both structures and examples for combining synchronous, asynchronous, and hybrid in-situ tasks and simulation; 2) it presents three novel real-world case studies of in-situ tasks applied to large-scale simulation on CPU systems, including a new physics-based lossy data compression method; 3) it critically analyzes which in-situ approach adds the least overhead to the simulation and achieves the best overall performance, generating experimental evidence for future model-based approaches. The rest of the paper is organized as follows: Section II contains a summary of related works on in-situ techniques and case studies; Section III introduces the paper's selected in-situ workflows and use cases; Section IV contains information about the experimental setups; Section V presents results and analyses; Section VI summarizes and discusses this paper.

II. RELATED WORK
In-situ processing is gaining popularity, particularly in visualization and analysis, and several in-situ systems have been developed. VisIt with Libsim [10], [19] and ParaView with Catalyst [6] are two in-situ systems for synchronous data visualization. SENSEI [7] is a generic in-situ interface that provides adaptors for connecting simulation to other in-situ systems like VisIt with Libsim and ParaView with Catalyst. It supports synchronous as well as asynchronous in-situ data analysis. Because these systems rely on the Visualization Toolkit (VTK) data format [29], they are poorly suited to tasks other than visualization.
Originally designed as a higher-level IO abstraction, the Adaptable IO System (ADIOS) [14], [21] can also be used for in-situ processing. It is not dependent on VTK and supports arbitrary data formats, making it an excellent candidate for a generic in-situ framework. It also supports both synchronous and asynchronous in-situ tasks. As a result, we selected ADIOS as the framework for the work described in this paper.
Several papers discuss the use of in-situ processing (mostly for visualization purposes) in CFD applications. Maulik et al. [23] evaluated the performance and scalability of three cases of OpenFOAM simulation with PythonFOAM performing Python-based synchronous data analysis. Ayachit et al. [5] visualized the simulation results from the PHASTA science proposed a hybrid in-situ approach for large-scale data analysis in a massively parallel turbulent combustion code (S3D) and compared the performance of their hybrid approach with a synchronous approach.
In contrast to these studies, we examine the suitability of all three in-situ approaches, synchronous, asynchronous, and hybrid, on large HPC systems using real large-scale CFD use cases and critically discuss their suitability based on the characteristics of the in-situ tasks.

III. METHODOLOGY
We use the incompressible spectral-element Navier-Stokes solver, Nek5000, and three common in-situ tasks to investigate the impact of in-situ tasks on CFD simulations: lossy and lossless compression, data visualization with image generation, and data analysis for uncertainty quantification. In this section, we present our synchronous, asynchronous, and hybrid in-situ workflows and introduce these three use cases.

A. In-Situ Workflow
Both asynchronous and hybrid in-situ approaches require data transfer from the simulation to the in-situ tasks. Rather than using custom communication approaches, we use the ADIOS2 [14] framework, which provides APIs in Fortran, C/C++, and Python as well as several in-situ functions, including the "insituMPI" engine for MPI programs. It thus provides a stable framework for similar approaches in other contexts. As previously stated, ADIOS2 does not rely on the VTK data format, allowing for easier integration and avoiding the need for additional data copying to the VTK format.
The first technical issue for in-situ approaches is compiling and linking the simulation codes with the in-situ tasks; this is frequently difficult because the simulation and in-situ tasks may be programmed in different languages. To address this issue, we use adaptor functions in our workflow design (Figs. 2-4), which are wrappers written in the programming languages used to code the simulation and/or in-situ task. To enable in-situ processing, three groups of functions must be implemented in the simulation: initialization, check, and finalization. These functions connect the in-situ tasks to the simulation solver, Nek5000. The initialization is implemented with the *_init functions, which are called by the simulation solver and the in-situ task during the initialization phase. These functions set necessary parameters, such as the size of the data to be processed or transferred during the in-situ step, based on information from the simulation solver and the in-situ task. The finalization is accomplished with the *_end functions. These finalize the in-situ setup safely, free all used memory, and print profiling information on the in-situ tasks. The check group consists of the *_check functions, which are called in each step in which the in-situ task is executed.
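The life cycle of these three function groups can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Fortran/C++ adaptors used with Nek5000: the class name InSituAdaptor, its attributes, and the choice of max as the in-situ task are hypothetical.

```python
import time


class InSituAdaptor:
    """Hypothetical adaptor exposing the three groups: init, check, end."""

    def __init__(self, task, interval):
        self.task = task          # callable applied to the field
        self.interval = interval  # run the task every `interval` steps
        self.n_points = None
        self.calls = 0
        self.seconds = 0.0

    def init(self, n_points):
        """Initialization: record the size of the data handled per step."""
        self.n_points = n_points

    def check(self, step, field):
        """Called in every step; runs the task only on in-situ steps."""
        if step % self.interval != 0:
            return None
        if len(field) != self.n_points:
            raise ValueError("field size differs from the size given to init()")
        start = time.perf_counter()
        result = self.task(field)
        self.seconds += time.perf_counter() - start
        self.calls += 1
        return result

    def end(self):
        """Finalization: release state and return simple profiling data."""
        profile = {"calls": self.calls, "seconds": self.seconds}
        self.task = None
        self.n_points = None
        return profile


# Usage: 100 solver steps with an in-situ task every 50 steps.
adaptor = InSituAdaptor(task=max, interval=50)
adaptor.init(n_points=3)
outputs = [adaptor.check(step, [0.0, float(step), 1.0]) for step in range(1, 101)]
profile = adaptor.end()
```

Keeping the interval test inside check means the solver loop only needs a single unconditional call per step, which is one plausible way to keep the coupling to the solver small.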
In the synchronous approach, the simulation results are passed to the in-situ task via the adaptor functions, as shown in Fig. 2. The simulation is stopped during in-situ execution, thus data consistency is guaranteed automatically. If the in-situ task and simulation use the same data structures, no additional data copying is required. If, on the other hand, the in-situ task requires a different data structure, the adaptor functions may need to include a deep copy. Because the simulation solver and the in-situ task share computing resources, the core used to perform the in-situ task already has the simulation results. As a result, we do not need to use the ADIOS2 library to transfer data between cores.
In the asynchronous approach, the simulation results are sent to the in-situ task via the writer and reader pair based on the "insituMPI" engine from ADIOS2 (Fig. 3). The simulation solver and the in-situ task need to be launched concurrently in a multiple-program multiple-data (MPMD) mode. The simulation and in-situ task workloads are distributed across separate computational resources. Given the total number of resources (e.g., cores) available, N, they can be assigned in various chunks to the simulation, p_sim, and the in-situ task, p_insitu, such that p_sim + p_insitu = N. The simulation sends the required data to the ADIOS2 writer via adaptor functions. To ensure data consistency, the simulation waits for the end of the MPI communication. The in-situ task is executed concurrently with the simulation after receiving the data from the ADIOS2 reader. If the simulation solver and the data processor have different structures, the adaptor functions that connect the data processor and the reader perform the necessary adaptations.
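As a small worked example of the constraint p_sim + p_insitu = N, the sketch below enumerates per-node splits. The helper split_cores is hypothetical; the node and core counts are those of the Raven configuration used in the experiments (72 cores per node, 24 nodes), and the per-node splitting follows the experimental setup.

```python
def split_cores(cores_per_node, insitu_per_node, n_nodes):
    """Return the totals (p_sim, p_insitu) for a per-node split of the cores."""
    if not 0 < insitu_per_node < cores_per_node:
        raise ValueError("simulation and in-situ task each need at least one core")
    p_insitu = insitu_per_node * n_nodes
    p_sim = (cores_per_node - insitu_per_node) * n_nodes
    return p_sim, p_insitu


# 72 cores per node on 24 nodes, i.e. N = 1728 cores in total.
N = 72 * 24
for k in (2, 4, 9, 18, 36):
    p_sim, p_insitu = split_cores(72, k, 24)
    assert p_sim + p_insitu == N
    print(k, p_sim, p_insitu)
```

Splitting on each node, rather than dedicating whole nodes to the in-situ task, keeps the data transfer inside the node.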
The hybrid in-situ approach depicted in Fig. 4 is divided into synchronous and asynchronous components. The adaptor functions, as in the synchronous approach, pass the simulation results to the first, synchronous part of the in-situ task. Following the synchronous portion, intermediate data is sent to the second portion of the in-situ task via ADIOS2, as in the asynchronous approach. The simulation solver is directly compiled and linked with the synchronous part of the in-situ task in this approach, and it is launched in MPMD mode with the asynchronous part of the in-situ task.

B. Use Cases
We consider three use cases with various characteristics from large CFD simulations to assess the efficiency of our in-situ approaches:
a) Lossy and lossless compression: CFD simulations are frequently long-running, producing potentially large amounts of output data for post-processing or checkpoint/restart mechanisms. Compressing the data before storing it is one way to reduce the storage requirements of a simulation. According to Li et al. [20], many types of compression can be applied to data sets, but in this study we only distinguish between lossless compression, where no information from the original data is discarded, and lossy compression, where the reconstructed data set is not required to exactly match the original one, introducing errors but allowing for higher compression ratios. Ideally, lossless compression would be preferred in all cases, as scientists prefer to have undisturbed data for any necessary analysis. However, turbulence is characterized by seemingly random motions, which add a level of complexity for lossless encoders that typically rely on finding patterns in the data, reducing their ability to perform significant compression. As a result, lossy compression is widely regarded as a viable alternative. Turbulence is a chaotic multi-scale phenomenon in which there are motions at various frequencies, but only a few of them ultimately possess the majority of the energy in the flow. It is possible to keep only the data associated with the most energetic motions in the flow while discarding the rest using a method proposed by Otero et al. [27]. This allows for the data to be reconstructed with reasonable accuracy. Another advantage of this physics-based method is that the user can specify the allowed error in the reconstructed data set in advance, and compression will take place element-wise accordingly.
This physics-based method is inspired by the JPEG compression standard [31], with the exception that it employs the Discrete Legendre Transform (DLT) rather than the Discrete Cosine Transform (DCT). This specific transformation is chosen in order to benefit from the Gauss-Lobatto-Legendre (GLL) points that are used in the spectral element discretization [12]. The DLT is used to transform the original data into spectral space, and low-energy spectral coefficients are systematically discarded in an inherently lossy step known as truncation. While the truncation is taking place, we use the orthogonality and other properties of the Legendre basis to evaluate the error incurred on the original data set without transforming back into the physical domain, lowering the computational cost of the method. Lossy compression is completed by applying lossless Huffman encoding or another suitable method and writing the data to disk. For the latter task, we use the functions for lossless compression and IO from the ADIOS2 library as part of the in-situ data processor even in the synchronous approach, because data compression is a special in-situ task whose output must still be written to disk. In the other use cases, the ADIOS2 library is not used in the synchronous approach.
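The idea of estimating the truncation error in spectral space can be illustrated with a one-dimensional Python sketch. It is a simplification under stated assumptions, not the method of [27]: it uses Gauss-Legendre rather than GLL points (so that the discrete transform is exactly orthogonal), treats a single 1D element instead of 3D tensor-product elements, measures the error in the L2 norm, and uses hypothetical function names throughout.

```python
import math


def gauss_legendre(n):
    """Nodes and weights of the n-point Gauss-Legendre rule (Newton iteration)."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess
        for _ in range(100):
            p_prev, p = 1.0, x
            for k in range(2, n + 1):
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = n * (x * p - p_prev) / (x * x - 1.0)  # derivative of P_n
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def legendre_values(n_modes, x):
    """P_0(x) ... P_{n_modes-1}(x) from the three-term recurrence."""
    values = [1.0, x][:n_modes]
    for k in range(2, n_modes):
        values.append(((2 * k - 1) * x * values[k - 1] - (k - 1) * values[k - 2]) / k)
    return values


def forward_transform(samples, nodes, weights):
    """Legendre coefficients of the polynomial interpolating the samples."""
    table = [legendre_values(len(samples), x) for x in nodes]
    return [
        (2 * k + 1) / 2.0 * sum(w * f * row[k] for w, f, row in zip(weights, samples, table))
        for k in range(len(samples))
    ]


def truncate(coeffs, eps):
    """Zero the lowest-energy modes while the predicted L2 error stays <= eps."""
    energy = [a * a * 2.0 / (2 * k + 1) for k, a in enumerate(coeffs)]
    kept, discarded = list(coeffs), 0.0
    for k in sorted(range(len(coeffs)), key=lambda j: energy[j]):
        if discarded + energy[k] > eps * eps:
            break
        discarded += energy[k]
        kept[k] = 0.0
    return kept, math.sqrt(discarded)  # error known without an inverse transform


def inverse_transform(coeffs, nodes):
    return [sum(a * p for a, p in zip(coeffs, legendre_values(len(coeffs), x))) for x in nodes]


nodes, weights = gauss_legendre(8)
samples = [math.exp(x) * math.sin(3.0 * x) for x in nodes]
coeffs = forward_transform(samples, nodes, weights)
kept, predicted = truncate(coeffs, eps=1e-2)
restored = inverse_transform(kept, nodes)
measured = math.sqrt(sum(w * (f - g) ** 2 for w, f, g in zip(weights, samples, restored)))
```

Because the Legendre polynomials are orthogonal, the error predicted from the discarded coefficients agrees with the error measured after reconstruction, so the inverse transform is needed only for verification, not during compression.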
Ideally, the data compression should take relatively little time, since it is a fully local operation, and should compress the data to a useful degree while maintaining sufficient accuracy. For this reason, we have chosen this case as an example of an in-situ task with low computational cost. Because the lossy compression reuses simulation functions, this use case is also an example of an in-situ task that is partly tightly coupled with the simulation.
b) Visualization with ParaView/Catalyst: In-situ visualization can eliminate the need for intermediate data storage for the visualization (often post-mortem), improving the overall simulation and visualization efficiency. However, because of the required collective communications, the visualization task frequently scales much worse than the simulation. As a result, synchronous in-situ approaches can suffer from the cost of MPI collective communication, as shown by Atzori et al. [4]. Using the asynchronous in-situ approach, it is thus advantageous to assign a smaller set of resources to the in-situ task than to the simulation. We use ParaView/Catalyst as an image generator. The general image generation workflow can be expressed as follows: the VTK grid for ParaView-based visualization is generated during the initialization phase, and a customized ParaView Pipeline Python script is read. This Python script defines how the ParaView/Catalyst coprocessor renders the output image using information such as camera position, image size, and slice position.
Given the bottleneck diagnosed in the previous study [4], image generation serves as an example of an in-situ task that scales worse than the simulation.
c) Data analysis - uncertainty quantification (UQ): Uncertainty quantification is important for assessing the reliability of computed turbulence statistics, which are required to understand the relevant physics and for the formal analysis of turbulent flow simulations [9]. The UQ data analysis is the in-situ task in this case and is divided into two portions. The first portion updates the sample-estimated autocorrelation function at a series of time lags, known as training lags. The second portion uses the sample-estimated values to model the autocorrelation function and calculates the uncertainty in a sample mean. The first portion is executed more frequently than the second but has a lower computational cost, i.e., it takes only negligible time compared to the second portion.
Because the individual portions of the uncertainty quantification differ in frequency and computational cost, it is an example of a complex in-situ task whose portions are suited to different in-situ techniques.

IV. EXPERIMENTAL SETUP
We introduce the system setups, CFD case and evaluation metrics of our use case evaluation in this section.
a) System setup: We used two HPC systems, the Raven supercomputer at the Max Planck Computing and Data Facility (MPCDF) [3] and the Dardel supercomputer at the PDC Centre for High-Performance Computing (PDC-HPC) at the Royal Institute of Technology (KTH) [1]. One Raven node contains two Intel Xeon IceLake-SP 8360Y processors with 36 cores each and 256 GB RAM. Each Dardel node contains two AMD EPYC processors, each with 64 cores, and 256 GB RAM.
The MPMD configuration file defines how cores are allocated to simulation solvers and in-situ tasks for the asynchronous and hybrid in-situ approaches. On each node, one set of cores is dedicated to simulation, while the rest are dedicated to the in-situ task. In this way, data transfer is only required within the node.
b) CFD Case: We chose the turbulent flow inside a bent pipe, which is an internal flow, i.e., a flow bounded by walls, and exhibits many of the most critical turbulence characteristics. Additionally, the bent pipe exhibits low-frequency dynamics in a phenomenon known as swirl switching [16], which makes it interesting to determine the effect of our in-situ techniques in more complex problems. We took precautions in all cases to ensure that turbulence has already developed when we apply the in-situ techniques. The CPU-based simulation uses a discretization of the physical domain with 459000 elements with accuracy of order seven, i.e., 512 data points per element. A true Exascale simulation of turbulence would comprise on the order of tens of millions of elements; however, the work per processing element would remain similar to that in the current simulations, so we expect the behaviour to be transferable to larger cases.
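The problem size implied by these numbers can be checked with a short calculation. The sketch assumes that order seven corresponds to eight points per direction in each three-dimensional element, which is consistent with the 512 points per element stated above, and that each value is stored as an 8-byte double; the latter is an assumption for illustration only.

```python
n_elements = 459_000
order = 7

# Order-7 polynomials use 8 points per direction; 3D elements give 8**3 points.
points_per_element = (order + 1) ** 3
total_points = n_elements * points_per_element

# Assumption: one 8-byte double precision value per grid point and field.
bytes_per_field = total_points * 8

print(points_per_element)     # 512
print(total_points)           # 235008000
print(bytes_per_field / 1e9)  # about 1.88 (GB per scalar field)
```

Under these assumptions, a single snapshot of pressure and three velocity components amounts to roughly 7.5 GB before compression.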
c) Evaluation metrics: To evaluate the performance of the simulation with the in-situ tasks, we measure the execution time, perform profiling, and analyze scalability. As the performance metric, we measure the execution of 1000 simulation steps and evaluate the average execution time of one simulation step. For the synchronous approach, we perform a strong scalability test; for the asynchronous and hybrid approaches, we first perform the configuration tests on a fixed number of nodes, and then repeat these tests on different numbers of nodes.
We analyze the compression ratios obtained as well as the properly weighted root mean squared error (RMSE) of the reconstructed data set to evaluate the compression. We also investigate whether the reconstructed data fields could be used to create meaningful visualizations. To visually verify the results, we compare the images generated both synchronously and asynchronously with the images generated in post-processing from VTK files. To demonstrate the storage savings from in-situ techniques, we measure the size of the VTK files required for post-processing image generation and uncertainty quantification, which are not needed in the in-situ case.
We repeat each experiment three times, and the arithmetic average of the obtained evaluation metrics is reported.
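The timing metrics described above can be computed as in the following sketch. The wall-clock values are hypothetical placeholders, not measurements from this paper, and the helper names are illustrative; the efficiency definition (relative to the smallest run) is one common convention and is assumed here.

```python
from statistics import mean


def time_per_step(total_seconds, n_steps=1000):
    """Average execution time of one simulation step."""
    return total_seconds / n_steps


def strong_scaling_efficiency(base_cores, base_time, cores, time):
    """Efficiency relative to the smallest run; ideal strong scaling gives 1.0."""
    return (base_time * base_cores) / (time * cores)


# Hypothetical wall-clock times in seconds for 1000 steps, three repetitions.
runs = {864: [410.0, 400.0, 390.0], 1728: [212.0, 208.0, 205.0]}

# Arithmetic average over the repetitions, then normalized per step.
avg = {cores: time_per_step(mean(times)) for cores, times in runs.items()}
eff = strong_scaling_efficiency(864, avg[864], 1728, avg[1728])
```

With these placeholder values, doubling the core count from 864 to 1728 yields an efficiency of 0.96.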

V. USE CASE EVALUATIONS
In this section, we evaluate the CFD simulation with the three use cases described using synchronous, asynchronous, and hybrid in-situ executions.

A. In-situ data compression
We first study synchronous and hybrid in-situ data compression applied to the Nek5000 simulation of turbulent flow in a bent pipe.
a) Implementation: For synchronous data compression, we reuse the Fortran functions from Nek5000 to execute the lossy physics-based truncation mentioned in the previous section, and use C/C++-based adaptor functions to pass the output to the C/C++ ADIOS2 writer, which subsequently performs lossless BZIP2 compression and IO operations. The Nek-proc adaptor functions in Fig. 2 are written in Fortran in this case, and no communication is required because the cores already hold the data to perform lossless compression and to write out locally.
Because lossy compression is tightly coupled with the simulation, fully asynchronous compression is difficult to achieve. As a result, we tested hybrid data compression, in which we use the same functions, workflow, and Nek-proc adaptor functions (Fig. 4) as in the synchronous case, but perform the lossless compression asynchronously. Through runtime configuration, we can use the C/C++ writer in the ADIOS2 writer-reader pair instead of the ADIOS2 file writer with lossless compression. The data is truncated in a lossy manner synchronously with the simulation and then passed through Proc-writer adaptor functions, which are a group of functions in Fortran and C/C++, to the ADIOS2 writer. Lossless data compression is then performed asynchronously; it is entirely programmed in C/C++ and runs on a different set of cores from the original simulation. For this case, C/C++-based Reader-proc adaptor functions connect a reader in the ADIOS2 writer-reader pair to a separate file writer with lossless compression from ADIOS2. We should point out that the workload on each core for lossless compression is distributed evenly, which is not necessarily the case in the simulation.
b) Evaluation: To show that compressed data sets, even at high compression ratios, are still relevant and meaningful for analysis, we present Fig. 5, where we show a slice of one reconstructed velocity component for compression with input error ϵ = 10^-2, which corresponds to a file with a compression ratio of 51, i.e., 98% of the data has been discarded. In the figure we observe that all the features of the turbulent flow are preserved even at these rates.
We observe that most compression artifacts occur at the element boundaries, mostly because, for spectral element methods, continuity among elements is enforced weakly by using direct stiffness summation, and the compression scheme we use truncates information locally at the element level without regard for neighbouring data. However, this property allows for minimal communication and computation, which ultimately produces the performance shown in Fig. 7. We note that the compressed fields have been shown to produce correct statistics and modal decompositions even in the presence of such artifacts [22], [27].
In Fig. 6, we show the post-computed RMSE between the original and reconstructed fields in physical space for compression with the maximum allowed error of ϵ = 10^-2. As expected, higher compression ratios can be attained by allowing higher errors, and even though the errors are not explicitly calculated at run time with physical variables, they always remain within the appropriate bounds in the reconstructed data sets.
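The relation between the compression ratio and the discarded fraction quoted above, and one possible form of a weighted RMSE, are sketched below. The exact weighting used in the paper is not specified here, so normalizing by the sum of per-point weights (for example quadrature weights) is an assumption, and the function names are hypothetical.

```python
import math


def discarded_fraction(compression_ratio):
    """Fraction of the original data volume removed by the compression."""
    return 1.0 - 1.0 / compression_ratio


def weighted_rmse(original, reconstructed, weights):
    """RMSE in which each point contributes according to its weight."""
    num = sum(w * (a - b) ** 2 for w, a, b in zip(weights, original, reconstructed))
    return math.sqrt(num / sum(weights))


# A compression ratio of 51 corresponds to about 98% of the data discarded.
print(round(discarded_fraction(51), 3))  # 0.98
```

Weighting matters on non-uniform grids such as those of spectral elements, where points cluster near element boundaries and an unweighted mean would over-represent those regions.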
Having confirmed that no relevant artifacts are introduced due to the in-situ implementation of compression on both Dardel and Raven supercomputers, we analyze the performance of the implementation.
For this purpose, we ran a strong scalability test for the simulation with synchronous in-situ data compression every 50 simulation steps, which is a high frequency for writing checkpoint/restart files, using 12, 16, 20, and 24 nodes on Raven (i.e., 864, 1152, 1440, and 1728 cores). As shown in the left graph of Fig. 7, the execution time of Nek5000 with this configuration decreases as the number of cores increases, and it achieves excellent strong scalability. The execution of Nek5000 consumes the majority of the time, while the compression and data output consume a negligible part (1.5% of the total execution time).
We further profiled the performance of lossless compression and data writing in the synchronous approach with TAU [30].We find that ADIOS2 lossless compression takes nearly the same amount of time in all cases, while the time to write out compressed data decreases as the maximum allowed error increases.This is expected as the total compression ratio rises, requiring us to write out less data via the IO subsystem.
For the analysis of the hybrid data compression, we evaluate the execution time on 24 nodes when 1, 9, 18, and 36 core(s) out of the 72 cores on each of the used nodes on the Raven supercomputer are allocated for the asynchronous part of the data compression, i.e., the lossless compression done by ADIOS2. Because there are fewer cores for simulation, the execution time increases with the number of cores assigned to the in-situ data compression, as shown in the right graph of Fig. 7. Furthermore, similar to the synchronous in-situ data compression, the execution of the simulation consumes the majority of the time. However, even the best hybrid approach takes longer than the synchronous approach, as additional MPI communication is required.

B. In-situ image generation
Next, we study synchronous and asynchronous in-situ image generation for the Nek5000 simulation of turbulent flow in a bent pipe.
a) Implementation: In the synchronous image generation case, we used the in-situ adaptor functions from the in-situ package repository developed by Atzori et al. [4] as the Nek-proc adaptor functions in Fig. 2 to connect the Fortran-based Nek5000 and the C/C++-based in-situ task. Because of the different grid used by Nek5000, the pressure scalar and velocity vector fields are mapped into the VTK grid, with a deep copy of the simulation results in the adaptor functions.
In the asynchronous image generation case, we construct two groups of adaptor functions for the simulation solver and in-situ task. The Nek-writer adaptor functions shown in Fig. 3 are a group of Fortran and C/C++ functions. They connect the simulation solver and the writer in the ADIOS2 writer-reader pair and pass the pressure and velocity data using the Nek5000 data structure. C/C++-only Reader-proc adaptor functions connect the reader in the ADIOS2 writer-reader pair and the image generator. The VTK unstructured grid is generated during image generator initialization based on the number of elements in one core dedicated to the image generator. The image generator's adaptor functions also perform a deep copy to convert the fields to VTK format.
b) Evaluation: We ran the same strong scalability test on Raven with 12, 16, 20, and 24 nodes as before. The left graph in Fig. 8 depicts the performance of the simulation with synchronous in-situ image generation every two steps. Although Nek5000 scales well, the execution time to generate images with ParaView/Catalyst does not scale and remains nearly constant. The MPI collective communication was identified as the bottleneck in the previous study [4]. This is consistent with the poor scaling of the overall execution time that we observe with image generation as the number of cores increases.
We also evaluated the execution time when 2, 4, 9, 18, and 36 of the 72 cores on each of 24 Raven nodes (i.e., 48, 96, 216, 432, and 864 cores in total) are used for asynchronous image generation every two simulation steps. To better understand the performance of the simulation with asynchronous image generation, we measured the total execution time, the Nek5000 simulation time, and the in-situ image generation time.
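As an illustration only, the per-node core split described above can be sketched in Python. The block placement of ranks (node by node) and the choice of the last cores on each node for the in-situ task are assumptions made for this sketch, and the function name `split_ranks` is hypothetical; the actual placement used on Raven is not specified here.

```python
def split_ranks(nodes, cores_per_node, insitu_per_node):
    """Return one label per rank: 'sim' or 'insitu'.

    Assumption: ranks are placed node by node, and the last
    `insitu_per_node` cores of every node run the in-situ task.
    """
    if not 0 <= insitu_per_node <= cores_per_node:
        raise ValueError("insitu_per_node must be between 0 and cores_per_node")
    labels = []
    for rank in range(nodes * cores_per_node):
        local_core = rank % cores_per_node  # core index within its node
        is_insitu = local_core >= cores_per_node - insitu_per_node
        labels.append("insitu" if is_insitu else "sim")
    return labels


# The five configurations evaluated on 24 Raven nodes with 72 cores each.
insitu_totals = {}
for k in (2, 4, 9, 18, 36):
    labels = split_ranks(nodes=24, cores_per_node=72, insitu_per_node=k)
    insitu_totals[k] = labels.count("insitu")
    print(k, insitu_totals[k], labels.count("sim"))
```

The in-situ totals reproduce the 48, 96, 216, 432, and 864 cores quoted in the text. In an MPI program, such a label would typically serve as the color argument when splitting the global communicator into a simulation group and an in-situ group.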
The right graph in Fig. 8 shows that the time to generate images every two simulation steps scales poorly with the number of cores, while simulation efficiency decreases as more cores are devoted to image generation (and thus fewer to the simulation). The total execution time is the maximum of the simulation time and the image generation time. Thus, as the number of cores for in-situ image generation increases, the total execution time decreases until image generation no longer scales and the negative effect on simulation time dominates. The total execution time is minimal when one quarter of the cores on the 24 nodes are used for in-situ image generation, at which point the asynchronous image generation and the simulation take the same amount of time.
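The selection rule implied by this paragraph can be written as a short sketch: the total time of an asynchronous run is the maximum of the two overlapping parts, and the best configuration minimizes that maximum. The timings below are hypothetical placeholders chosen only to mimic the qualitative trend (they are not measurements from this study), and the function names are illustrative.

```python
def total_time(sim_time, insitu_time):
    """Asynchronous run: both parts overlap, so the slower one dominates."""
    return max(sim_time, insitu_time)


def best_allocation(measurements):
    """Pick the in-situ core count per node with the smallest total time.

    `measurements` maps in-situ cores per node -> (sim_time, insitu_time).
    """
    return min(measurements, key=lambda k: total_time(*measurements[k]))


# Hypothetical timings in seconds, NOT taken from the paper.
# Simulation time grows as cores are taken away from it, while the
# image generation time shrinks slowly because it scales poorly.
timings = {
    2: (100.0, 400.0),
    4: (103.0, 260.0),
    9: (110.0, 190.0),
    18: (135.0, 136.0),
    36: (200.0, 120.0),
}
best = best_allocation(timings)
print(best, total_time(*timings[best]))
```

With these placeholder numbers the minimum falls where the two times nearly coincide, which is the balance point the text describes.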
To study how the best-performing configuration changes with the total number of cores, we repeated the configuration evaluation with 12, 16, 20, and 24 Raven nodes. Fig. 9 compares the total execution time of synchronous image generation with that of asynchronous image generation every two simulation steps. The asynchronous in-situ approach outperforms the synchronous approach. The best performance on 12, 16, 20, and 24 nodes is obtained with 2, 4, 9, and 18 cores per node for in-situ image generation, respectively. The best total execution times with the asynchronous approach are approximately 60% shorter than those of the synchronous approach, and scalability is improved, although it is still not ideal because of the cost of the MPI collective communication.
We repeated the whole set of experiments with image generation every five simulation steps to investigate the influence of the in-situ task frequency. As shown in Fig. 10, the asynchronous in-situ approach also outperforms the synchronous approach at this frequency. The best performance on 12, 16, 20, and 24 nodes is in all cases obtained when two cores on each node are used to generate the images. With asynchronous in-situ image generation every five simulation steps, the in-situ workload and MPI communication cost are lower than with every two simulation steps, so the run shows strong scalability over the node counts we used.

C. In-situ uncertainty quantification
We also study synchronous, asynchronous, and hybrid in-situ uncertainty quantification (UQ) for a Nek5000 simulation of turbulent flow in a bent pipe.
a) Implementation: In the synchronous UQ, the Nekproc adaptor functions shown in Fig. 2 pass the data from Fortran to C/C++ and use C/C++ functions as bridge functions to embed Python, since the UQ analyzer is programmed in Python. The simulation results are passed as a one-dimensional NumPy array to the data analyzer to update the training lags.
Two groups of adaptor functions are used in the asynchronous UQ, one for the simulation solver and one for the data processor. The Nek-writer adaptor functions in Fig. 3 attached to the simulation solver, and the resulting workflow, are similar to the mixed Fortran and C/C++ adaptor functions in the asynchronous image generator. The Reader-proc adaptor functions connected to the UQ data analyzer are programmed in Python. To simplify the workflow, the ADIOS2 Python APIs are used to build the reader in the ADIOS2 writer-reader pair.
In the hybrid UQ, three groups of adaptor functions shown in Fig. 4 [...] uncertainty from 50 training lags. This frequency is rather high for UQ; we used it as a stress test. We also examined the performance with a standard UQ frequency (i.e., updating one training lag every 20 simulation steps and estimating the uncertainty from 25 training lags every 500 simulation steps).
b) Evaluation: We investigated the scalability of the simulation with synchronous in-situ UQ on 12, 16, 20, and 24 nodes. The total execution time of the stress test and the execution time of the simulation scale well, as shown in the left graph in Fig. 11. However, UQ takes longer than the simulation, and the profiling reports show that the execution time of UQ varies from core to core. Because the estimation portion of UQ includes model estimations involving regression, the workload needed to calculate the uncertainty is not known in advance. As a result, the load balance depends on the simulation results and is frequently suboptimal.
We used 9, 18, 24, and 36 of the 72 cores on each of 24 Raven nodes for asynchronous UQ at the stress-test frequency. The right graph in Fig. 11 shows that the asynchronous UQ performs even worse than the synchronous approach. In this approach, the in-situ UQ takes longer than the Nek5000 simulation. Although its performance improves as the number of in-situ cores increases, the total execution time with 36 cores for asynchronous in-situ is still twice as long as that of the synchronous approach. This is due to the large workload difference between the UQ steps. To ensure data consistency, the simulation cores need to communicate with the in-situ UQ cores at each simulation step, which keeps the simulation and UQ from running concurrently.
We used 9, 12, 18, and 36 of the 72 cores on each of 24 Raven nodes for a hybrid setup, in which updating the training lags is done synchronously and model estimation and uncertainty calculation are done asynchronously. By lowering the communication frequency, the hybrid approach enables concurrency: communication between simulation and in-situ cores is only required before the asynchronous section of the UQ. The middle graph in Fig. 11 shows that the total execution time on 24 nodes decreases with the number of cores for the asynchronous portion until the simulation takes almost the same amount of time as the asynchronous UQ portion. Beyond that point, the total execution time increases, because the simulation then dominates and its time grows with the share of cores given to in-situ tasks. Compared to the synchronous approach, the hybrid approach improves the cache hit ratio and balances the workload better. Compared to the asynchronous approach, the data transfer is also optimized: the total amount of data to be transferred is significantly smaller, and the infrequent transfer of larger chunks results in lower latency than the asynchronous approach's frequent transfer of small chunks.
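The control flow of such a hybrid split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: a worker thread and a queue stand in for the dedicated in-situ cores and the ADIOS2 transfer, `estimate_uncertainty` is a placeholder for the real model estimation, and the update and estimation intervals are the standard UQ frequency quoted earlier (one lag every 20 steps, an estimate from 25 lags every 500 steps).

```python
import queue
import threading
from collections import deque

UPDATE_EVERY = 20     # simulation steps between training-lag updates
ESTIMATE_EVERY = 500  # simulation steps between uncertainty estimates
NUM_LAGS = 25         # training lags used per estimate


def estimate_uncertainty(lags):
    """Placeholder for the model estimation; here just the spread of the lags."""
    return max(lags) - min(lags)


def async_worker(inbox, results):
    """Asynchronous part: consume lag batches until a None sentinel arrives."""
    while True:
        batch = inbox.get()
        if batch is None:
            break
        results.append(estimate_uncertainty(batch))


def run_hybrid(num_steps):
    inbox = queue.Queue()
    results = []
    worker = threading.Thread(target=async_worker, args=(inbox, results))
    worker.start()

    lags = deque(maxlen=NUM_LAGS)
    handoffs = 0
    for step in range(1, num_steps + 1):
        sample = float(step)  # stand-in for one simulation step's result
        if step % UPDATE_EVERY == 0:
            lags.append(sample)    # synchronous: done on the simulation side
        if step % ESTIMATE_EVERY == 0:
            inbox.put(list(lags))  # one larger transfer per estimate
            handoffs += 1

    inbox.put(None)  # tell the worker to stop
    worker.join()
    return handoffs, results


handoffs, results = run_hybrid(1000)
print(handoffs, results)
```

Only two hand-offs occur in 1000 steps, compared with one communication per step in the fully asynchronous variant, which is the reduction in communication frequency that the hybrid approach relies on. Python threads do not run compute-bound code in parallel, so the sketch shows the structure of the exchange rather than any speedup.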
We repeated the evaluation with 12, 16, 20, and 24 Raven nodes, and performed the UQ at both the stress-test frequency and the common frequency, to study how the total number of cores and the in-situ task problem size influence the best-performing configuration. Because the common frequency leads to a relatively lower computational cost, we added one core per Raven node as an additional configuration in these tests. Fig. 12 compares the total execution time of the synchronous UQ with the best total execution time of the hybrid UQ for both tests. The hybrid in-situ UQ outperforms the synchronous approach. In the cases presented here, the best performance in the common case is obtained with one core per node for in-situ, and the best performance in the stress test with twelve cores per node. In the stress test, the total execution time of the simulation with hybrid UQ is approximately 50% shorter than with the synchronous approach, and scalability is improved. In the common case, the hybrid approach still slightly outperforms the synchronous approach.

VI. DISCUSSION AND CONCLUSIONS
In this paper, we focus on the resource distribution for the synchronous, asynchronous, and hybrid in-situ approaches on homogeneous, multicore-CPU-based HPC systems.
We can conclude from our in-situ data compression study that synchronous execution is preferable, in terms of performance, for comparably small in-situ tasks. Asynchronous or hybrid execution not only introduces additional communication overhead but also reduces the resources available for the simulation, while the resources dedicated to the in-situ task are underutilized.
From our in-situ image generation study, we can conclude that for larger in-situ tasks that do not scale well, an asynchronous approach is preferable to a synchronous one, because fewer resources can be assigned to the in-situ task to limit the effects of its poor scalability. The sweet spot for how many resources are assigned to the in-situ task (and thus how many remain for the simulation) must be determined, and it may shift with the total number of resources because the simulation and the in-situ task scale differently.
We can conclude from our in-situ data compression and uncertainty quantification studies that hybrid execution is the preferred model when parts of the in-situ task depend strongly on functions from the main solver or are performed frequently, while other parts can overlap with the execution of the simulation and thus benefit from asynchronous execution. When the scalability of the simulation and in-situ tasks is similar, the sweet-spot resource distribution is stable. With this property, when a larger number of resources is used, the ideal resource distribution could be predicted from the performance with fewer resources, without constructing and analyzing complex performance models.
In general, we explored synchronous, asynchronous, and hybrid in-situ approaches and compared three use cases with different characteristics in this paper. First, we reduced the amount of data and the corresponding IO time by using in-situ lossy and lossless compression. Then, we performed in-situ visualization, and finally, we performed uncertainty quantification in an in-situ manner. For each of these use cases, we analyzed the benefits of the three in-situ approaches. Owing to its comparably low workload and good scaling behavior, compression performs best in synchronous mode; asynchronous or hybrid approaches only add overhead without significant benefits. For visualization, the asynchronous approach performed best, as it allows the computing resource allocation to be tuned to minimize the overhead from the MPI collective communication. For uncertainty quantification, the synchronous approach outperforms the asynchronous one: because the frequencies and computational costs of the two sections of the UQ differ, the simulation and data analysis are not executed concurrently in the asynchronous approach. However, UQ consists of two portions that can be split and thus performed in a hybrid in-situ mode. This results in a lower total amount of data transferred, lower latency owing to the larger chunks transferred per communication, a better data access pattern in simulation and data analysis, and thus the best performance. We can thus conclude from these case studies that in-situ tasks with high frequency, low computational cost, and/or low communication overhead may perform better with the synchronous approach, whereas in-situ tasks with low frequency, high computational cost, high communication overhead, and/or low complexity to decouple from the simulation could benefit from the asynchronous approach.
In future work, we plan to derive models from our experimental findings that will help to choose among the synchronous, asynchronous, and hybrid in-situ approaches. Furthermore, we will investigate in-situ approaches on hybrid computational nodes consisting of GPUs and CPUs. NekRS [13] and NEKO [17] are promising simulation solvers with GPU support. Current approaches for scientific simulations often use only the GPUs for the simulation, leaving the CPUs underutilized; these CPUs are thus a perfect target for in-situ tasks. Moreover, we also plan to extend our study of the in-situ techniques to exascale simulations with billions of data points (or, equivalently, millions of elements in Nek5000) and to verify whether the optimal resource distribution for the exascale case can be predicted from the performance data of large cases (such as the ones in this paper). Finally, we plan to continue working with ADIOS2 towards a generic in-situ framework.

ACKNOWLEDGMENT
Dr. Saleh Rezaeiravesh and Christian Gscheidle are gratefully acknowledged for providing the UQ case. This work is partially funded by the "Adaptive multi-tier intelligent data manager for Exascale (ADMIRE)" project, which is funded by the European Union's Horizon 2020 JTI-EuroHPC research and innovation program under grant agreement number 956748. The authors would like to express their gratitude to the Max Planck Computing and Data Facility (MPCDF) for providing compute time on the Raven supercomputer. Furthermore, part of the computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC), partially funded by the Swedish Research Council through grant agreement no. 2018-05973.

Fig. 1. Illustration of a simulation with synchronous, asynchronous and hybrid in-situ tasks.

Fig. 4. Illustration of the workflow of a Nek5000 simulation with a hybrid in-situ task.

Fig. 5. Slice of the velocity magnitude downstream from the bent section. a) is the original data set, while b) is the reconstruction of a field compressed with a maximum allowed error of $10^{-2}$.

Fig. 6. RMSE of a slice of the 3D field for a maximum allowed error of $10^{-2}$. The error is shown per spectral element.

Fig. 8. Execution time of Nek5000 with synchronous in-situ image generation every two steps on Raven supercomputer (left) and asynchronous in-situ image generation every two simulation steps on 24 Raven nodes (right).

Fig. 10. Logarithmic execution time of Nek5000 with synchronous and asynchronous in-situ image generation every five steps on Raven supercomputer. (The configuration of the asynchronous in-situ approach is two cores per node for in-situ task.)

Fig. 12. Logarithmic execution time of Nek5000 with synchronous and hybrid in-situ uncertainty quantification.
Fig. 9. Logarithmic execution time of Nek5000 with synchronous and asynchronous in-situ image generation every two steps on Raven supercomputer. (Due to the memory limitation, the test cannot be done on 12 and 16 nodes with 36 cores per node for asynchronous in-situ.)