Pythia: Scheduling of Concurrent Network Packet Processing Applications on Heterogeneous Devices

Modern commodity computing systems are composed of a number of heterogeneous processing units, each with its own performance and energy characteristics. However, the majority of current network packet processing frameworks target only one device (either the CPU or an accelerator), leaving the remaining computational resources underutilized or even idle. In this paper, we propose an adaptive scheduling approach for network packet processing applications that exploits the heterogeneous devices found in a commodity high-end hardware setup. Our scheduler not only distributes the workloads to the appropriate devices in the system to achieve the desired performance results, but also enables the multiplexing of diverse, concurrently executed network packet processing applications, eliminating the interference effects introduced at run-time. The evaluation results show that our scheduler is able to tackle interference in the shared hardware resources as well as to respond quickly to dynamic fluctuations (e.g., application overloads, traffic bursts, and infrastructural changes) that may occur in real time.


I. INTRODUCTION
The advent of high-end commodity heterogeneous systems (i.e., systems that utilize multiple processing units, typically CPUs and/or GPUs) has motivated the networking community to exploit alternative architectures [1]-[4]. Yet, the majority of those works target a single device, usually underutilizing the rest. Developing a network packet processing framework that can exploit any available device efficiently and consistently, across a wide range of diverse workloads running concurrently, is highly challenging. First, interference between different devices needs to be minimized in an automated way [5]. Second, support should be provided for multiple concurrently running applications, which is typical in networking middleboxes. Third, data heterogeneity needs to be considered, as traffic variability can significantly affect the system's utilization and performance [6]-[8].
In this paper, we propose a scheduling approach tailored for network packet processing workloads executed concurrently in a heterogeneous system. Specifically, our proposed solution is designed to explicitly tackle the heterogeneity that is introduced by the underlying hardware architectures, the applications and the network traffic rate. The scheduler dynamically adapts to performance fluctuations that may occur, such as traffic bursts or overloads. The contributions of this work are the following: (i) A characterization of the performance and power consumption of several typical network applications concurrently executed on heterogeneous, commodity multi-device systems. (ii) A software-based energy profiling tool that reports live power consumption measurements for any device in a commodity system setup, by exploiting the corresponding hardware registers. (iii) A scheduling approach that can efficiently select the best device(s) to execute one or more typical packet processing applications, based on current system and network conditions, using a predefined policy goal.
II. SYSTEM SETUP
a) Hardware Setup: Our hardware setup consists of an Intel Core i7-8700K CPU packed with an integrated UHD Graphics 630 GPU and a high-end NVIDIA GeForce GTX 1080 Ti GPU. Our setup presents interesting trade-offs: even though the integrated GPU has fewer and less powerful resources than a high-end discrete GPU or the CPU, it consumes much less power. It is also directly connected to the system's main memory via a fast on-chip ring bus, which results in fewer data transfers and hence lower processing latency than a discrete GPU. Yet, the CPU is considered the best option for latency-aware setups, as it can sustain very small processing times compared to the batch-oriented processing followed by both GPU types. Our machine is also equipped with a 40-Gbps NIC (4x 10 Gbps ports).
b) Applications: We have implemented the following packet processing applications, using the OpenCL 2.1 SDK:
Deep Packet Inspection (DPI): A very common operation when processing network traffic. We use the Aho-Corasick algorithm, which offers multi-pattern searching, and feed it with 10,000 fixed-string patterns from Snort IDS [9].
Packet Hashing (MD5): Typically used in redundancy elimination and in-network caching systems [10]. We implement the MD5 algorithm, which offers low collisions and is mainly used for checking data integrity or deduplication.
Encryption (AES): We use the Cipher Block Chaining (CBC) mode of operation with a 128-bit key per connection. Due to its nature, this encryption technique is a representative form of computationally intensive packet processing.
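To make the multi-pattern matching used by the DPI application more concrete, the following is a minimal Python sketch of the Aho-Corasick automaton. It is an illustration only, not the OpenCL kernel described in this paper: the function names `build_automaton` and `search` and the tiny pattern set are assumptions introduced for the example, whereas the actual kernel is fed with 10,000 Snort patterns.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a list of byte-string patterns."""
    goto = [{}]    # goto[state][byte] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # indices of the patterns that end at each state
    for idx, pattern in enumerate(patterns):
        state = 0
        for byte in pattern:
            nxt = goto[state].get(byte)
            if nxt is None:
                nxt = len(goto)
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][byte] = nxt
            state = nxt
        out[state].add(idx)
    # Breadth-first traversal: failure links of shallow states are final
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def search(payload, automaton):
    """Return (end_offset, pattern_index) pairs for every match in payload."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for pos, byte in enumerate(payload):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for idx in sorted(out[state]):
            matches.append((pos, idx))
    return matches


# Illustrative pattern set (hypothetical, not taken from Snort).
patterns = [b"he", b"she", b"his", b"hers"]
automaton = build_automaton(patterns)
matches = search(b"ushers", automaton)
```

The payload is scanned once, regardless of the number of patterns, which is what makes the algorithm attractive for inspecting every packet against a large rule set.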
III. IMPLEMENTATION
Each of the three applications is implemented as a separate kernel. In OpenCL, an instance of a kernel is called a work-item and a set of multiple work-items is called a work-group.
Typically, GPUs contain a very fast thread scheduler, so it is recommended to spawn a large number of work-groups. In contrast, CPUs perform more efficiently when the number of work-groups is close to the number of available cores. Discrete GPUs have a dedicated memory space, meaning that an explicit data transfer from the host (i.e., CPU) to the device (i.e., GPU) must precede kernel execution. On top of that, a data buffer, which is required for the execution of a computing kernel, has to be created and associated with a specific OpenCL context. Even though different contexts cannot share data directly, the data transfers (host-device-host) and the GPU execution are performed asynchronously, which significantly improves parallelism. After careful evaluation, we notice that data transfer requirements differ per application. For instance, the DPI and MD5 kernels do not change packet headers or payloads, so there is no need to transfer them back to the host after the execution. On the other hand, the AES kernel changes the packet contents, making backward transfers inevitable. Still, when the processing is performed on the main processor or an integrated GPU, expensive data transfers are not required (both devices have direct access to the host memory), as long as the corresponding memory buffers are explicitly mapped via the clEnqueueMapBuffer() function.
a) Batch Processing: A typical approach is to place packets into batches in exactly the same order they are received through each NIC. However, when using multiple processing devices, packets can be reordered. To prevent reordering, devices can be synchronized using a barrier, which forces them to execute in lockstep. This approach has a major performance drawback, though, as fast devices have to wait for the slow ones. To bypass this problem, we pre-classify incoming packets into flows, using the typical 5-tuple, before creating the batches, and then enqueue all packets of a flow in the same batch.
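The flow-based batching described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual implementation: the `Packet` record, the `build_batches` helper and the rule that places a new flow on the currently shortest batch are all hypothetical choices made for the example. The only property taken from the text is that all packets of a 5-tuple flow end up in the same batch, in arrival order.

```python
from collections import namedtuple

# Hypothetical packet record; only the 5-tuple fields matter for classification.
Packet = namedtuple("Packet", "src_ip dst_ip src_port dst_port proto payload")


def flow_key(pkt):
    """Return the 5-tuple that identifies the flow of a packet."""
    return (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)


def build_batches(packets, num_batches):
    """Place all packets of a flow in the same batch, keeping arrival order."""
    batches = [[] for _ in range(num_batches)]
    flow_to_batch = {}
    for pkt in packets:
        key = flow_key(pkt)
        if key not in flow_to_batch:
            # Assumption: a new flow goes to the currently shortest batch.
            flow_to_batch[key] = min(
                range(num_batches), key=lambda i: len(batches[i])
            )
        batches[flow_to_batch[key]].append(pkt)
    return batches


# Two interleaved flows (addresses and payloads are made up).
a = lambda n: Packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", n)
b = lambda n: Packet("10.0.0.3", "10.0.0.2", 5678, 443, "tcp", n)
batches = build_batches([a(0), b(0), a(1), b(1), a(2)], num_batches=2)
```

Because no flow is split across batches, each device can process its batch independently without a barrier, and per-flow packet order is still preserved.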
b) Performance Measurements: We now present the performance achieved in our hardware setup. We use netmap [11] to generate and transmit network packets to our machine. (Even though our machine is equipped with a 40 Gbps NIC, the overall throughput achieved is not higher than 30 Gbps; the reason is that both our NIC and discrete GPU (GTX 1080 Ti) run at a reduced I/O bandwidth (PCIe x8), due to motherboard PCIe constraints.) Due to space constraints, we present only a fraction of the configurations, selected to clearly show the diversity of performance characteristics of each device and application. Each configuration was active for a 10-second window, during which the performance of the system was monitored every second. Tables I, II and III present both the individual and the aggregated performance achieved by the DPI, AES, and MD5 applications, when executed either standalone or while sharing the device with 1 or 3 co-workers. The same benchmark executions are repeated for all the available devices of the system, i.e., CPU, integrated GPU and discrete GPU. We note that the current implementation of our scheduler supports the concurrent execution of any combination of network packet processing applications; for simplicity, though, we present only the combination of two different applications on each device at a time.
We observe that the benefits of increasing the batch size stop at some point. However, different applications on different devices may require batch size optimizations within a specific range to reach maximum throughput. In the case of DPI, for example, increasing the batch size beyond 4096 packets has little impact on the throughput of a discrete GPU. Furthermore, we also notice that the sustained throughput is not consistent across diverse devices. For instance, an integrated GPU seems to be a reasonable choice when performing MD5 and DPI on large packet batches, compared to AES, where the same device results in low throughput. Overall, the CPU is the best option for latency-aware environments, especially when combined with small batch sizes and few interfering kernels (Figure 2). Due to its characteristics, AES is an exception to the rule, benefiting more if placed on the discrete GPU even in latency-critical scenarios, regardless of the number of co-workers.
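The diminishing returns of larger batches suggest a simple way to pick a batch size from profiling data: take the smallest batch whose throughput is within a small tolerance of the peak, since larger batches mostly add latency. The Python sketch below illustrates this idea; the helper name, the 5% tolerance and the throughput figures are assumptions made for the example and are not measurements from this paper.

```python
def smallest_saturating_batch(throughput_by_batch, tolerance=0.05):
    """Return the smallest batch size whose throughput is within
    `tolerance` (as a fraction) of the best observed throughput."""
    peak = max(throughput_by_batch.values())
    for size in sorted(throughput_by_batch):
        if throughput_by_batch[size] >= (1.0 - tolerance) * peak:
            return size


# Hypothetical profile: batch size (packets) -> throughput (Gbps).
profile = {256: 4.0, 1024: 11.0, 4096: 19.5, 16384: 20.0}
best = smallest_saturating_batch(profile)
```

With these made-up numbers the function returns 4096, because growing the batch to 16384 packets improves throughput by less than the tolerance.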
Apparently, there is no clear ranking between the devices, not even a clear winner. A device can actually be the best fit for some applications and the worst fit for some others.
When concurrently executing more than one network packet processing application on one device, we face the challenge of unknown interference effects, due to contention for hardware resources, contention for software resources, and false sharing of cache blocks. In the case of the GTX 1080 Ti, for example, we can see that a large batch size (16K) has negative effects when more than one application is being executed. The reason is that the discrete graphics card and the NIC share the same I/O interconnect (i.e., the PCI bus). Another interesting observation concerns multiple instances of AES on the same device: the aggregated performance is lower than that of every other kernel combination on a given device, as shown in Table II. The GTX 1080 Ti is an exception, as it is not affected by the compute-intensive nature of AES and is able to sustain peak performance even when four AES instances are concurrently executed. On that note, although the integrated GPU performs tolerably well on a single AES execution when combined with a large batch size (Figure 1), it is the least suitable device when the desired scenario requires the concurrent execution of an AES instance alongside an instance of any other application, as shown in the bottom part of Table II. Moreover, when DPI is coupled with MD5 (Tables I and III), the GTX 1080 Ti has a twofold drawback: its performance is poor, and it is the most energy-hungry device of the system. The CPU and the integrated GPU both achieve similar performance results, but in the case of the UHD graphics card, top performance can be sustained regardless of the batch size.
These observations lead us to conclude that, in the presence of those two applications, offloading the workload to the integrated graphics card sustains top performance while keeping latency low. It also keeps the CPU and the discrete GPU idle, which either improves the energy efficiency of the system or leaves room for the execution of at least one computation-intensive application, like AES, without sacrificing performance.

IV. REAL-TIME SCHEDULING
Our scheduler is based on a lock-free architecture model, as illustrated in Figure 3. Each worker is responsible for capturing the network traffic from a set of bound network interfaces, spawning the execution of a kernel on a target device and collecting performance metrics.
As a first phase, our scheduler uses an offline analysis tool that creates all the possible application combinations, tests them on every device combination and gathers the resulting performance statistics (as described in Section III-b). The scheduler uses these collected results when processing the incoming network traffic, to select the optimal configuration according to a user-specified policy. In particular, we create a special worker, called the monitor, that keeps track of the active configuration and manages how efficiently it distributes its resources to workers. The monitor executes periodically (using an ALARM signal) to (i) decide if the active configuration is still performing better than any other configuration, and to (ii) update the performance statistics of the current active configuration to match the most recent performance statistics of the system. By doing so, our scheduler is able to adapt to traffic rate changes quickly and also re-train itself over time.
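The offline phase described above can be illustrated with a short Python sketch that enumerates every non-empty set of applications against every non-empty set of devices and records one measurement per pair. The function names and the `measure` callback are hypothetical, and the sketch simplifies the real search space: it ignores multiple instances of the same application (co-workers) and per-device batch sizes.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple, smallest first."""
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)


def enumerate_configurations(apps, devices):
    """Pair every application combination with every device combination."""
    return [
        (app_set, dev_set)
        for app_set in nonempty_subsets(apps)
        for dev_set in nonempty_subsets(devices)
    ]


def profile_all(configurations, measure):
    """Run the (user-supplied, hypothetical) measure callback on each configuration."""
    return {cfg: measure(*cfg) for cfg in configurations}


configs = enumerate_configurations(
    ["DPI", "MD5", "AES"], ["CPU", "iGPU", "dGPU"]
)
# Placeholder measurement: a real one would run the kernels and time them.
stats = profile_all(configs, lambda apps, devs: len(apps) * len(devs))
```

Three applications and three devices already yield 7 x 7 = 49 configurations, which shows why the cost of the offline phase grows quickly with the number of applications and devices.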
A representative set of the policies that we have implemented so far includes: (a) throughput maximization, in which we seek to optimize the aggregated processing rate (typically at the cost of increased latency and power consumption); (b) latency minimization, which can be applied to real-time or latency-critical applications; and (c) energy consumption minimization.
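As an illustration of how a policy can drive the choice of configuration, the Python sketch below ranks profiled configurations under the three policies and shows one periodic monitor step. It is a sketch under stated assumptions rather than the scheduler itself: the dictionary layout, the configuration names, all numbers, and the rule that only configurations able to sustain the current input rate are considered (with a fallback to all of them) are hypothetical.

```python
# Each policy maps the statistics of a configuration to a score; lower is better.
POLICIES = {
    "max_throughput": lambda s: -s["gbps"],
    "min_latency": lambda s: s["latency_ms"],
    "min_energy": lambda s: s["watts"],
}


def select_configuration(profiles, policy, input_gbps):
    """Pick the best configuration under policy among those that can
    sustain input_gbps; fall back to all configurations if none can."""
    feasible = {n: s for n, s in profiles.items() if s["gbps"] >= input_gbps}
    candidates = feasible or profiles
    score = POLICIES[policy]
    return min(candidates, key=lambda name: score(candidates[name]))


def monitor_step(profiles, active, fresh_stats, policy, input_gbps):
    """One periodic monitor run: refresh the statistics of the active
    configuration, then re-evaluate which configuration is best."""
    profiles[active] = fresh_stats
    return select_configuration(profiles, policy, input_gbps)


# Hypothetical profiling results (not measurements from this paper).
profiles = {
    "cpu": {"gbps": 8.0, "latency_ms": 0.5, "watts": 60.0},
    "igpu": {"gbps": 10.0, "latency_ms": 2.0, "watts": 15.0},
    "igpu+dgpu": {"gbps": 30.0, "latency_ms": 4.0, "watts": 180.0},
}
low_rate_choice = select_configuration(profiles, "min_energy", input_gbps=8.0)
high_rate_choice = select_configuration(profiles, "min_energy", input_gbps=25.0)
```

With these made-up numbers, the energy policy selects the integrated GPU alone at 8 Gbps and activates the discrete GPU only when the input rate exceeds what a single device can sustain.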

Figures 4(a)-(d) show the performance of our scheduler for a representative subset of applications (i.e., AES and DPI) when (i) the network traffic rate fluctuates and (ii) policies change on-the-fly. For the former, we use a policy to handle all input traffic at the highest energy efficiency. The traffic rate is low enough for a single device to cope with it, which results in significantly low power consumption. This is not the case in the second experiment, in which we seek the maximum possible throughput before aggressively switching to an energy-efficient policy. For comparison, we also display the maximum power consumption when both devices are exhaustively used simultaneously. The observed variability in latency is the result of the dynamic scheduler decisions regarding batching and device selection. Overall, our scheduler is capable of adapting to highly diverse computational demands among different applications, producing live decisions that aim to maintain the maximum energy efficiency and to avoid excessive latency (in addition to the requested performance policy).
a) Throughput: As shown in Figures 4(c) and 4(d), our system is able to process a constant traffic rate of almost 20 Gbps when a single application is active and about 30 Gbps in the scenario of two active applications (0-15 seconds mark). When the traffic rate varies, our scheduling scheme manages to cope with up to 10 Gbps of input traffic per application, as shown in Figures 4(a) and 4(b) (0-20 seconds mark). When the traffic rate changes, such as the increase from 20 to 40 Gbps (Figure 4(b)), a second device (in this case the GTX 1080 Ti) is enabled to increase the computational capacity of the system. An interesting time interval exists between the 20th and the 30th second mark of Figure 4(a), when the discrete GPU is activated but immediately deactivated, as the monitoring reveals that the integrated graphics card alone can still cope with the incoming traffic.
The GTX 1080 Ti is only re-activated when the traffic rate is doubled to 40 Gbps.
b) Energy Consumption: We observe (Figures 4(a) and 4(b)) that only when an increase in the traffic rate occurs and more computational capacity is needed does the system activate an extra device, at the cost of greater energy expenditure.
c) Latency: An increase of the batch size usually results in higher throughput, but also in increased latency. Overall, we try to minimize latency up to the point where it does not interfere with the requested policy. For example, even when the goal is to maximize the overall throughput of the system (Figure 4(b)), during the second 20-second interval latency remains considerably low, despite the fact that the discrete GPU is active because the traffic characteristics demand it. The reason behind this is not only the presence of an extra device, but mainly that the system recognizes that an even larger batch size would not result in extra performance gains.

VI. RELATED WORK
A comparison of Pythia to the most relevant state-of-the-art tools is shown in Table IV. GASPP [3] takes an extreme approach that delivers all packets directly to a high-end GPU for processing, while APUNet [4] utilizes GPUs that are integrated in the CPU die, to alleviate the overhead of extra memory copies. Papadogiannaki et al. [8] propose an adaptive scheduling approach that uses performance policies to determine the appropriate combination of devices for efficient execution of network packet processing applications. In this work, we extend this solution by enabling the multiplexing of different network functions across heterogeneous devices. Finally, there is ongoing work on providing performance predictability [12] and fair queuing [13] when running a diverse set of applications that contend for shared resources.

VII. CONCLUSIONS
We proposed an adaptive scheduling solution that enables real-time application multiplexing across heterogeneous processors, and is able to respond quickly to network fluctuations or system changes. As part of our future work, we plan to reduce the complexity of the offline analysis phase by taking into consideration, a priori, the specific characteristics of the available processing devices.