From Knights Corner to Landing: a Case Study Based on a Hodgkin-Huxley Neuron Simulator

Brain modeling has presented significant challenges to the world of high-performance computing (HPC) over the years. The field of computational neuroscience has been developing a demand for physiologically plausible neuron models, which feature increased complexity and thus require greater computational power. We explore Intel's newest generation of Xeon Phi computing platforms, named Knights Landing (KNL), as a way to match the need for processing power and as an upgrade over the previous generation of Xeon Phi models, Knights Corner (KNC). Our neuron simulator of choice features a Hodgkin-Huxley-based (HH) model which has been ported to both generations of Xeon Phi platforms and aggressively draws on both platforms' computational assets. The application uses the OpenMP interface for efficient parallelization and the Xeon Phi's vector processing units for Single-Instruction Multiple-Data (SIMD) processing. In this study we offer insight into the efficiency with which the application utilizes the assets of the two Xeon Phi generations and we evaluate the merits of utilizing the KNL over its predecessor. In our case, an out-of-the-box transition to Knights Landing offers on average a 2.4× speed-up while consuming 48% less energy than the KNC.


Introduction
In recent years, neuroscientists have gradually been revealing the details of neuron operation. Using this knowledge, there is wide research interest in studying the behaviour of a single neuron, of networks of neurons and, eventually, of brain-wide populations of neurons. Simulating these neuronal networks on various platforms is an active field of research [ , ].
A major challenge is the sheer computational complexity that many of these neuron models entail. Even the less complex model types impose significant demands, both in terms of computation and of data transfer and storage, as the studied neuronal network increases in size. Traditionally in the domain of neuroscience, the most common methods for simulating neuron models and studying their behaviour have been either widely-known mathematical software suites such as MATLAB [ ] or specialized neuromodeling tools like NEURON [ ] and Brian [ ]. It has become clear that these methods are not suitable for simulating neuronal networks of realistic sizes and high detail within a reasonable timeframe for brain research. High-Performance Computing (HPC) has recently been recognized as a viable domain for providing a variety of solutions to cope with this limitation [ , , , , , ].
In our current case study we feature a simulator for biophysically plausible neuron models, targeting a part of the human brain named the Inferior Olivary Nucleus, which specializes in the coordination and learning of motor function [ ]. The modeling accuracy is at the cell-conductance level (Hodgkin and Huxley models [ ]), belonging to an analytical and complex class of models which allows us to expose fine details of the neuron's mechanisms. This workload is an excellent candidate for parallelization on HPC architectures, such as the Intel Xeon Phi system [ ], due to the large inherent parallelism of the models. Additionally, it constitutes a realistic worst-case scenario in terms of model complexity and, hence, a benchmark for neuron-modeling workloads.
In order to explore whether Intel's newest generation of the Xeon Phi computing platform, named Knights Landing (KNL), is a suitable platform for neuroscientific workloads, in the current paper we evaluate its performance and energy consumption compared to the previous version, Knights Corner (KNC). We utilize the aforementioned Inferior Olivary Nucleus simulator, named InfOli, which was developed for the KNC generation of Xeon Phi [ ]. This comparison will highlight how the evolution of Intel's Xeon Phi architecture can improve the performance of a challenging application in the field of computational neuroscience. Since the application is fine-tuned to the previous version of Xeon Phi processors, we will, accordingly, explore the behaviour of an "out-of-the-box" application on the KNL.
In this paper, we shall first discuss the nature and parallelization method of our simulator. We will then briefly present the architecture of the two generations of Xeon Phi HPC platforms and highlight their significant hardware differences. Furthermore, we will present our experimental methodology and evaluate the results. Finally, we will conclude with remarks on the merits and shortcomings of each platform.

Software
The InfOli simulator, depicted in Figure , is a transient simulator; brain activity is calculated in simulation steps, with each step set to represent a fixed number of microseconds of activity. The steps are calculated sequentially, until the entirety of the requested brain activity has been computed. In each simulation step, the simulator has the task of updating the status of each neuron in a pre-defined network. The neurons are based on an elaborate, realistic model of the human neuron, derived from the work of Hodgkin and Huxley [ ]. As such, each neuron is composed of compartments, each modelling a different part of the cell. The dendritic compartment holds the important task of communicating with the rest of the network; it forms connections with other neurons, modelled as electrical synapses named Gap Junctions (GJ) [ ]. The somatic compartment is the main body of the neuron, where most calculations for the neuron's inner state take place. Finally, the axonal compartment acts as the output port of the neuron (specifically, in our application, of the Inferior Olivary neuron) to other parts of the brain, such as the cerebellum. In each step, the simulator processes the current flow in the GJs of the network and then re-calculates the states of the three compartments of each neuron. This is achieved by solving the Ordinary Differential Equations (ODEs) governing the model via the forward Euler method [ ]. Each neuron may also receive an external stimulus from its environment in each step of the simulation.
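The per-compartment update amounts to a forward-Euler step. The following is a minimal illustrative sketch, assuming a leak-only current balance; the constants, names and step size are ours, not the simulator's (the real model solves coupled HH ODEs for three compartments plus gating variables):

```c
/* Illustrative forward-Euler update for a single compartment.
 * All constants and identifiers below are assumptions for exposition. */

#define DT      0.05   /* assumed step size (ms); the paper's value is not stated here */
#define G_LEAK  0.3    /* illustrative leak conductance (mS/cm^2) */
#define E_LEAK -65.0   /* illustrative leak reversal potential (mV) */
#define C_M     1.0    /* membrane capacitance (uF/cm^2) */

typedef struct {
    double v;          /* membrane potential (mV) */
} compartment_t;

/* One forward-Euler step: V(t+dt) = V(t) + dt * (I_ext - g_l*(V - E_l)) / C_m */
static void euler_step(compartment_t *c, double i_ext)
{
    double dvdt = (i_ext - G_LEAK * (c->v - E_LEAK)) / C_M;
    c->v += DT * dvdt;
}
```

The explicit (forward) scheme is cheap per step but conditionally stable, which is why such simulators use small, fixed time steps.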
In order to boost simulation speed, OpenMP [ ] has been employed to parallelize the application. Figure  relays how the simulator utilizes OpenMP threads. The network is divided into equal parts and assigned to different OpenMP threads, ensuring a balanced distribution of workload.
In each step, the threads read from the Xeon Phi's shared memory in order to calculate the state of their assigned neurons' GJs. This task requires that each thread accesses other threads' data concerning the dendritic compartments of their assigned neurons; these shared-memory accesses enforce the flushing of cache lines that hold invalid data from previous simulation steps. After all relevant dendritic data is refreshed, the state of each neuron can be calculated independently from the rest of the network. When the simulation step is completed, biological data that needs to be tracked, such as the voltage levels of the somatic membrane, is collected from each thread and recorded in the simulation's output file.
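The two-phase step described above can be sketched with OpenMP as follows. The array names, the all-to-all coupling and the placeholder update formula are simplifying assumptions rather than InfOli code; compile with the OpenMP flag (e.g. -qopenmp on icc, -fopenmp on gcc) to enable the pragmas:

```c
/* Sketch of the per-step parallel structure: a shared-read phase for
 * gap-junction currents, then an independent per-neuron update phase. */

#define N 1000                  /* network size (assumed for illustration) */

static double v_dend[N];        /* shared dendritic voltages, read across threads */
static double i_gj[N];          /* per-neuron gap-junction current */

static void simulation_step(void)
{
    /* Phase 1: each thread reads the shared dendritic voltages of the
     * neurons its own neurons couple to (all-to-all here for brevity);
     * these are the cross-thread shared-memory accesses described above. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        double acc = 0.0;
        for (int j = 0; j < N; j++)
            acc += v_dend[j] - v_dend[i];    /* placeholder GJ coupling term */
        i_gj[i] = acc;
    }

    /* Phase 2: with GJ currents known, every neuron's state is updated
     * independently of the rest of the network. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        v_dend[i] += 1e-6 * i_gj[i];         /* stand-in for the ODE solve */
}
```

Separating the coupled read phase from the independent update phase avoids data races without per-element locking, at the cost of a barrier between the two parallel loops.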
The simulator has been ported and tested primarily on the Intel KNC. An analytic methodology has been followed to boost vector processing unit (VPU) usage, in order to optimally utilize the platform's assets [ ]. Data transformations (structs to arrays), data alignment to cache lines and loop transformations have been tested on the KNC, with the help of Intel's profiling tools (Intel VTune Amplifier). As such, the simulator is not expected to be optimal for the second generation of Xeon Phi, the KNL. However, due to the similar architectural design of the two platforms, the application is a good candidate for porting on both.

Hardware
The first commercial generation of Xeon Phi products is named Knights Corner. This version of Xeon Phi is an Intel accelerator platform arranged in a host-and-coprocessor fashion and features up to 61 cores, each with four instruction streams [ ]. It supports traditional parallel-programming paradigms, such as MPI [ ] and OpenMP [ ], in contrast to Graphics Processing Units (GPUs), which require platform-specific programming paradigms [ ]. After the Xeon host boots a KNC-specific software stack on the Phi, named Intel Manycore Platform Software Stack (MPSS), the latter may be used independently, for native workload execution. The KNC accelerator features vector processing units (VPUs) [ ], which can parallelize multiple floating-point (FP) operations.
Intel's second generation of Xeon Phi processors introduced several architectural differences with respect to its predecessor. KNL is a standard Intel Many-Integrated Core (MIC) Architecture standalone processor that can boot stock operating systems and connect to a network directly via common interconnects such as Infiniband, Ethernet, etc. This is a significant differentiation over Knights Corner, which is a PCIe-connected device and, therefore, could only be used when connected to a separate host processor. In KNL, cores are integrated in pairs into structures named tiles, in which they share a 1 MB L2 cache. Each core is connected to two vector processing units, as opposed to the single VPU per core present in KNC models, making vectorization a key aspect of this platform's computational power. KNL processors can have up to 36 tiles for a total of 72 cores, each capable of hyperthreading with up to 4 threads per core, and 144 VPUs. Communication between those tiles is achieved through a cache-coherent 2D mesh interconnect, which replaces the bidirectional ring bus used on the KNC coprocessor. This on-die interconnect allows for different clustering modes of operation, which offer various degrees of address affinity to improve performance in HPC applications.
In addition to these features, KNL introduced a new memory architecture to provide both large memory capacity and high memory bandwidth. To do so, traditional DDR memory is complemented with what Intel named Multi-Channel Dynamic Random Access Memory (MCDRAM). This on-package memory does not achieve higher single-access performance than main memory, but supports a higher bandwidth [ ]. As with the mesh clustering modes, MCDRAM can be configured in different memory modes: i) to serve as cache for the DDR memory (cache mode), ii) to be mapped as regular memory into the system's address space (flat mode) or, iii) to work as hybrid memory, where part of the MCDRAM acts as cache and the rest is allocated to the address space (hybrid mode). KNL's characteristics and its high degree of customization make it a suitable platform for high-performance computing applications like the Inferior Olive simulator.

Experimental Setup
The measurements presented in this section have been carried out using two different generations of Intel Xeon Phi. The Knights Corner co-processor's model is P, featuring  cores at . GHz, each supporting up to 4 threads running concurrently via multithreading technology. The co-processor has a  W thermal design power (TDP). The application is designed to run natively on the co-processor, thus excluding any impact from its Intel Xeon host on its measured performance. Specifically, after compiling and transferring via Secure Copy Protocol (scp) all necessary binaries to the co-processor, the host remains idle throughout the experiment.
The Knights Landing processor's model is , with  cores at . GHz and similar multithreading capabilities. Its TDP is noticeably lower, at  W. The KNL's MCDRAM was set to cache mode, as this setting is completely transparent to software and allows "out-of-the-box" codes, like the neuron simulator being tested, to take advantage of the high-bandwidth-memory technology. As for the clustering mode, the quadrant configuration was chosen, since the cache-quadrant combination has been shown to offer performance gains to HPC applications [ , ].
Finally, in order to get a better grasp of the performance offered by the two generations of accelerator platforms, we include performance curves from an Intel Xeon E --v , a -core server-grade processor utilizing  threads concurrently. This processor's simulation speed acts as a baseline, with the added benefit that codebases developed for Xeon Phi accelerators are compatible with Xeon (or any generic x86) processors.
For the power measurements in this section, different methodologies have been followed on the two platforms. For the Knights Landing processor, power consumption was sampled via the Intelligent Platform Management Interface (IPMI) [ ], using a script running concurrently with each experiment's execution. The polling frequency was set to approximately  Hz. The energy consumption of each experiment was then calculated by integrating the power samples over the simulation's duration. On the other hand, power measurements on the Knights Corner co-processor are obtained by accessing the host's logs of information and errors regarding the co-processor. These logs are produced by a built-in tool named micrasd, which can track the KNC's power in intervals of  milliseconds. The reports are generated from the beginning of the simulation and, by summing the reports until the end of the experiment, an accurate estimation of the total energy consumption can be attained.
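The KNL-side energy accounting described above (integrating discrete power samples over the run's duration) can be sketched as follows. The trapezoidal rule is our assumption, since the exact integration scheme is not stated in the text:

```c
/* Integrate discrete power samples (W) over their timestamps (s) to
 * obtain energy (Wh). Trapezoidal integration is an assumed choice;
 * the paper only states that samples are integrated over the run. */
static double energy_wh(const double *power_w, const double *t_s, int n)
{
    double joules = 0.0;
    for (int i = 1; i < n; i++)
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * (t_s[i] - t_s[i - 1]);
    return joules / 3600.0;    /* 1 Wh = 3600 J */
}
```

At the ~1 Hz sampling rates typical of IPMI polling, the choice between trapezoidal and rectangular summation matters little for runs lasting minutes, but recording real timestamps (rather than assuming a fixed period) guards against jitter in the polling loop.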
In each experiment, a network of neurons connecting to each other via the Gap Junction mechanism, explained in Subsection , is generated. The connections are formed randomly, with each pair of neurons given a chance to form a bond regardless of their position on the neuronal grid. This chance is derived from the number of connections each neuron is meant to have in each experiment and the total neuronal-network size; dividing the two yields the network's average connection density, which directly translates to the probability of a pair of neurons forming a bond.
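A minimal sketch of this connectivity-generation scheme follows; the dense adjacency matrix and the C standard library RNG are illustrative assumptions, not the simulator's actual data layout:

```c
#include <stdlib.h>

/* Every ordered pair of distinct neurons forms a gap junction with
 * probability p = avg_connections / n, i.e. the network's average
 * connection density, as described in the text. */
static int *generate_connectivity(int n, double avg_connections, unsigned seed)
{
    double p = avg_connections / (double)n;      /* average density */
    int *adj = calloc((size_t)n * (size_t)n, sizeof *adj);
    if (adj == NULL)
        return NULL;
    srand(seed);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && (double)rand() / RAND_MAX < p)
                adj[i * n + j] = 1;              /* neuron j feeds neuron i */
    return adj;
}
```

With this scheme the expected in-degree of each neuron is exactly the requested average, while the realized degree varies binomially around it.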
Compilation for the KNC has been carried out using Intel's compiler icc version . . , whereas on the KNL, icc version . . . has been used. On both platforms, the options used for vectorized code are -O for the best available compiler optimizations, -vec-report for a detailed analysis of the vectorized code generated, -opt-subscript-in-range to inform the compiler that no integer subscript in the main loop exceeds the value of 2^31, allowing more loop transformations, and -lm to access the math libraries needed throughout the model's calculations. For measurements that use unvectorized code, the options -no-vec -no-simd -no-qopenmp-simd have been utilized to ensure the compiler avoids all SIMD commands.

Experimental Results
In Figure , we can observe the simulation speed obtained by each platform for networks of varying connectivity density. The measurements explore varying network sizes, where each neuron has a fixed average number of connections to the rest of the network.
In the experiments of Figure , the maximum speed-up observed for the KNL over the KNC is 6×, while in some cases the KNC comes out ahead, with up to . × speed-up over the KNL. More specifically, we can observe that, in cases of low connectivity density, which translate to low amounts of workload per thread, the KNL shows superior performance to the KNC.
In cases of small workloads, the efficiency of the parallelization assets is diminished, thus single-threaded performance becomes much more important for overall simulation speed. The KNL demonstrates considerably stronger single-threaded processing power and overtakes the KNC by a fair margin. For both the KNL and the KNC, we can observe that the difference between vectorized and unvectorized code is minimal when connectivity density is low; Gap Junctions represent a significant portion of the total workload and, thus, when they are few or completely absent, vectorization fails to boost application performance. We can also observe that the Xeon processor, which excels at handling mostly serial code, may even surpass the KNC accelerator for small-scale simulations.
On the other hand, as the computational workload assigned to each thread increases for denser networks, the KNC performs significantly better. The performance gap between the two platforms lessens as the KNC can use its assets with increasing efficiency, since the application has been optimized with the KNC architecture in mind. The gap between vectorized and unvectorized code widens significantly for the KNC, whereas there is a more stable difference in the case of the KNL. Better usage of VPUs leads to the KNC outperforming the KNL; indeed, for workloads of more than , neurons, each forming approximately , synapses, the KNL is surpassed by the KNC. As expected, both platforms perform significantly better than the baseline Xeon processor; the KNL and the KNC simulate networks of , neurons, each with , synapses, approximately . × and . × faster than the server-grade processor, respectively.
It should be noted that, in terms of performance predictability, the KNL is heavily favoured: its performance scales linearly and is easy to anticipate. On the contrary, the KNC's performance is harder to predict when operating with vectorization enabled. The platform's capability to take advantage of its computational resources (threads, VPUs) increases with the supplied workload. Because of this behaviour, its performance curve forms a "plateau", during which the simulation time for larger networks remains stable, or even decreases, due to better usage of the SIMD commands generated by compiler directives.
Beyond a certain network size, which differs based on how dense the network is, the aforementioned "plateau" ceases to exist and the KNC's performance curve resumes its linear nature. The existence of such "plateaus" impacts the performance predictability of the KNC, whereas the KNL does not exhibit similar behaviour; this can be attributed to the less efficient usage of vectorized code in the KNL's case. For both platforms, unvectorized code, which omits the usage of VPUs, displays very predictable behaviour.
In Figure , we present the energy required by each computing fabric in order to simulate a second of brain activity, measured in mWh. This Figure is directly linked to Figure , since energy consumption depends on the execution time needed to simulate each second of brain activity. As such, we can observe similar patterns between the two Figures. On average, the KNL consumes 48% less energy than the KNC. Because of the KNL's lower TDP and better performance for light workloads, there is a significant reduction in energy consumption when computing small networks. To put this claim into perspective, whereas the simulation of one second of brain activity in a network of  neurons, with a density of  synapses per neuron, requires over 1200 mWh on the KNC, the KNL consumes under 300 mWh for the same workload, improving on energy efficiency by a factor of 4×.
On the contrary, due to the KNC's shorter execution times for larger, denser networks, it is preferable to the KNL from an energy-consumption standpoint for such workloads. A network of , neurons, each forming , synapses with the rest of the network, requires 27% less energy on the KNC (1600 mWh per second of brain activity).
In Figure , we display the efficiency with which each platform manages its OpenMP threads. In HPC, the efficiency with which an application utilizes the underlying platform's resources can be calculated as the speedup yielded by employing said resources, compared to single-threaded performance, divided by the amount of resources used, such as the number of processors running the application or the number of threads it spawns. In our case, we calculated the efficiency metric by dividing the execution speedup by the number of OpenMP threads spawned, with the number of OpenMP threads ranging from  to  on both platforms. In each subfigure, the network density has been set to , synapses per neuron and we explore networks of different sizes. For the KNL, we can observe that efficiency remains at satisfactory levels when utilizing up to approximately 50 threads. In these cases, each core spawns either one or two threads (due to the selected balanced thread affinity) and, in contrast to the KNC, the KNL's cores operate significantly better when running only one thread [ ]. The KNL maintains a reliable efficiency for low degrees of threading regardless of the simulated network's size, whereas the KNC's efficiency suffers for small workloads, such as networks of  neurons. Larger networks, however, offer better opportunities for the KNC to utilize its computational assets efficiently, maintaining a speedup-to-threads ratio above 70% even for  threads. The KNL's threading efficiency sharply declines when employing massive degrees of parallelism, dropping below 40% when using more than 140 threads.
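The efficiency metric described above reduces to a one-line computation; for clarity, a minimal sketch (timings would come from wall-clock measurements of the single-threaded and parallel runs):

```c
/* Parallel efficiency: speedup over the single-threaded run, divided by
 * the number of OpenMP threads employed. A value of 1.0 means perfect
 * linear scaling; values well below 1.0 indicate under-utilized threads. */
static double parallel_efficiency(double t_single, double t_parallel, int threads)
{
    return (t_single / t_parallel) / (double)threads;
}
```

For example, a run that is 25× faster on 50 threads than on one thread has an efficiency of 0.5, matching the 70%/40% thresholds discussed above.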
The application's inability to utilize the entirety of KNL's assets efficiently to tackle demanding simulations explains the performance gap between the two platforms for larger workloads. This inability is mostly attributed to the fact that the simulator has been fine-tuned to the KNC environment and has been tested "out-of-the-box" on the KNL.

Conclusion and Outlook
In this paper, a computationally demanding application from the field of computational neuroscience, which had previously been extensively developed and optimized for the Intel KNC, has been tested "out-of-the-box" on the second generation of Xeon Phi, the KNL. The InfOli biophysically-accurate simulator's performance was tested using a range of workloads, from small, unconnected neuronal populations to larger, dense networks. The results were evaluated from both a simulation-speed and a power-efficiency standpoint. On average, the KNL offers a speed-up of 2.4× while consuming 48% less energy. Smaller workloads, by taking advantage of the KNL's superior single-threaded performance, exhibit very significant gains in both speed and, even more so, energy consumption, with specific experiments demanding 75% less energy (in mWh) per second of simulated brain activity on the KNL. On the other hand, without further fine-tuning of the application to the architectural details of the KNL, OpenMP-thread efficiency suffers when running on the KNL, causing the simulator to handle more demanding networks poorly, relative to the optimized KNC version. Furthermore, throughout the whole range of experiments, it has been shown that the KNL offers a more robust, dependable performance curve with little variability.
These findings are promising enough to warrant further optimization of the simulator for the new generation of the Xeon Phi. As future work, we would suggest running an optimized version of the simulator on a cluster of KNL processors, in order to simulate neuronal networks of much larger sizes and to take advantage of Intel's Omni-Path technology for inter-node communication [ ].