Efficient Execution of Graph Algorithms on CPU with SIMD Extensions

Existing state-of-the-art CPU graph frameworks take advantage of multiple cores, but not the SIMD capability within each core. In this work, we retarget an existing GPU graph algorithm compiler to obtain the first graph framework that uses SIMD extensions on CPUs to efficiently execute graph algorithms. We evaluate this compiler on 10 benchmarks and 3 graphs on 3 different CPUs and also compare to the GPU. Evaluation results show that on an 8-core machine, enabling SIMD on a naive multi-core implementation achieves an additional 7.48x speedup, averaged across 10 benchmarks and 3 inputs. Applying our SIMD-targeted optimizations improves the plain SIMD implementation by 1.67x, outperforming a serial implementation by 12.46x. On average, the optimized multi-core SIMD version also outperforms the state-of-the-art graph framework GraphIt by 1.53x, averaged across the 5 common benchmarks. SIMD execution on CPUs closes the gap between the CPU and GPU to 1.76x, but the CPU's virtual memory performs better when graphs are much bigger than available physical memory.


I. INTRODUCTION
Modern CPUs and GPUs are increasingly similar. While CPUs are not intrinsically SIMD, they commonly support SIMD execution through SIMD extensions, or short vector instructions. Intel's recently added AVX512 extensions enable CPUs to process 16 32-bit integers or floats in one instruction, which is half the size of an NVIDIA GPU warp.
Until recently, using these CPU SIMD extensions meant using SIMD intrinsics or relying on compiler autovectorization. With the insight, attributed to Tim Foley, that "autovectorization is not a programming model" [1], and that scalar programming models like CUDA were more suitable for generating SIMD code, there is now a third alternative for using the SIMD extensions on CPUs. This programming model, implemented by the Intel SPMD Compiler [2], and referred to as ISPC in this work, allows programmers to write mostly scalar code with some annotations that lead to the generation of multi-threaded SIMD code. This is both portable and highly controllable.
While state-of-the-art parallel CPU graph frameworks [3]-[18] have successfully increased the efficiency of running on multiple cores, they do not make use of SIMD-style execution. The availability of ISPC allows the use of GPU programming model paradigms on CPUs. This prompts our work, where we take a state-of-the-art GPU compiler for graph algorithms - the IrGL compiler [19] - and retarget it to ISPC. Not only does this allow us to generate native SIMD versions of graph algorithms, it also allows us to study the effectiveness of the optimizations originally developed for GPUs and to directly compare CPUs and GPUs, as we run the same graph algorithms with the same optimizations applied on both devices. Since GPUs now also support virtual memory, e.g. NVIDIA's Unified Virtual Memory (UVM), we can also compare the virtual memory systems of the CPU and the GPU. This paper makes the following contributions:
• We present the first framework for SIMD CPU graph algorithms based on a recent GPU compiler for graph algorithms. On average, compared to the state-of-the-art CPU graph frameworks Ligra, GraphIt, and Galois, the optimized multi-core SIMD implementation is 3.06x, 1.53x, and 1.78x faster, respectively.
• We show that GPU-oriented optimizations also work for SIMD on CPUs, delivering an additional 1.67x improvement over the naive SIMD implementation.
• We evaluate the effects of different SIMD variants (AVX, AVX2, and AVX512), finding that wider SIMD extensions may not always be better.
• Our work enables direct comparison of CPU SIMD and GPU graph algorithms, which shows that the GPU only outperforms our fastest CPU by 1.52x once SIMD is used. Furthermore, for large graphs that do not fit in GPU memory, we find that the GPU's virtual memory subsystem can cause dramatic slowdown (more than 5000x) on our GPU, making it totally unusable.
The rest of this paper is organized as follows. Section II provides necessary background and briefly discusses related work. Section III discusses the challenges and solutions of efficiently executing graph algorithms on CPUs in SIMD fashion. Section IV presents the performance evaluation results and compares CPUs and GPUs. Finally, Section V concludes the paper.

II. SIMD GRAPH ALGORITHMS ON CPUS
We describe the challenges in writing SIMD graph algorithms, introduce the ISPC programming model, and provide a brief review of related work.

A. SIMD Graph Algorithms
Graph algorithms raise significant challenges for SIMD implementations. Manually writing SIMD code -either via assembly language or through the use of SIMD intrinsics -is mechanical, tedious, and ultimately results in code that lacks portability. If the SIMD instruction set changes then code must be rewritten. On the x86, which has gone through at least six different short vector extensions so far -MMX, SSE, SSE2, AVX, AVX2, and AVX512 -manually targeting SIMD can be particularly unproductive.
Auto-vectorization, which parallelizes scalar code and generates SIMD code, does not suffer from this portability problem. Although most modern compilers such as GCC and LLVM support auto-vectorization, the sparse data structures used by graph algorithms (e.g. compressed sparse row [CSR]) lead to heavy use of indirect memory accesses in the code. These indirect memory accesses and the accompanying irregular loop structures hinder most auto-vectorization algorithms. Recent work has made considerable progress in tackling sparse data structures [20]-[22], but these techniques remain unavailable in production compilers.
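The indirect access pattern that defeats auto-vectorizers can be seen in a minimal C++ sketch of a CSR traversal (not the paper's code; the struct and function names are ours for illustration). The inner loop's trip count and load addresses both depend on data, so the compiler cannot prove a fixed-stride vectorizable form:

```cpp
#include <cstdint>
#include <vector>

// Minimal CSR graph: row_offsets[v]..row_offsets[v+1] indexes into col_indices.
struct CsrGraph {
    std::vector<int32_t> row_offsets; // size = num_nodes + 1
    std::vector<int32_t> col_indices; // size = num_edges
};

// Sum the degrees of the neighbors of `node`. The accesses below are the
// indirect a[b[i]] pattern: the trip count and the addresses loaded both
// depend on the input graph, which hinders auto-vectorization.
int64_t neighbor_degree_sum(const CsrGraph& g, int32_t node) {
    int64_t sum = 0;
    for (int32_t e = g.row_offsets[node]; e < g.row_offsets[node + 1]; ++e) {
        int32_t nbr = g.col_indices[e];                      // gather #1
        sum += g.row_offsets[nbr + 1] - g.row_offsets[nbr];  // gather #2
    }
    return sum;
}
```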
GPU SIMD implementations do not have to deal with these problems since they use a mostly scalar, explicitly parallel programming model. When combined with the GPU's hardware support for handling control and memory divergence, this programming model requires neither SIMD intrinsics nor auto-vectorization support. CPU SIMD instruction sets have recently added similar hardware functionality. The Intel Haswell architecture's AVX2 instruction set includes dedicated gather load instructions. The AVX512 instruction set, introduced in the Xeon Phi x200 and Skylake-X server CPUs, added eight opmask registers that allow individual SIMD lanes to be masked (i.e. predicated) for most instructions, simplifying the handling of control divergence. It also added dedicated scatter store instructions. This leads to a new model for SIMD algorithms: implicit SPMD.

B. Implicit SPMD
Traditionally, data-parallel programs that use multi-process or multi-threaded execution can be written in the single program multiple data (SPMD) style. In SPMD, parallel tasks are created to independently execute a single program, with data decomposition [23] used to distribute data to each parallel task. SIMD, on the other hand, operates at the instruction level and therefore each SIMD "task" belonging to the same instruction executes in lockstep. GPU programming models, like CUDA and OpenCL, showed that it is possible to write SIMD programs that execute in a SPMD fashion while ostensibly writing scalar code.
The Intel Implicit SPMD Program Compiler (ISPC) [2] is a compiler for a C-like language that implements a similar SPMD+SIMD programming model for CPUs. SPMD execution is achieved using multiple threads, while SIMD execution is obtained by using the SIMD extensions on CPUs. Since the CPU also supports purely scalar execution, unlike most GPUs, ISPC differs from GPU programming models and natively supports scalar execution. In contrast, efficient scalar execution of GPU programming models usually requires sophisticated analyses [24], [25]. The primary challenge for ISPC is to achieve SIMD execution from a primarily SPMD code, while maintaining the illusion that multiple SIMD lanes are executing independently.
ISPC programs are explicitly parallel. The basic building block is a program instance that is mapped to a SIMD lane. Multiple program instances form an ISPC task, which is usually executed by an OS thread. Table I links these to their CUDA counterparts; note there is no ISPC equivalent of a CUDA thread block. Listing 1 sketches a simple ISPC program that calculates the sum of an array of integers. It is spread across two source files: a main program written in C/C++ and an ISPC source file containing the computation kernels. The ISPC kernels are started by calls from the main program, similar to calling a regular function. To take advantage of multi-threading (in addition to SIMD), ISPC also supports launching multiple tasks through the launch statement. The launch statement invokes an underlying tasking system that creates multiple OS threads onto which tasks are mapped.
The ISPC language used for writing kernels is C-like, except that all variables are implicitly declared as varying - meaning each program instance sees a different value for the variable. Thus, they behave like vectors and are naturally handled by SIMD instructions. Scalar variables, which are shared by all program instances in the same task, are declared as uniform. In Listing 1, all arguments to the ISPC kernels are marked as uniform. The range of array indices that will be summed by each task is then calculated by splitting up the total work among all the tasks launched (size_per_task), and using that to identify the start of the task's block of work. The four built-in variables - programIndex (identifies the program instance), programCount (SIMD width), taskIndex (task identifier), and taskCount (total tasks launched) - can be used by ISPC tasks to carve out work using data decomposition.
The variable sum is varying, and is local to each program instance. The foreach statement indicates a parallel loop. In this case, the loop compiles to vector instructions that add each vector element of sum to a corresponding element of array. As all accesses to array are consecutive, a standard vector load can be used; in the general case, however, a gather instruction is generated to load values from array. Once the foreach loop has completed, the ISPC library function reduce_add reduces the contents of the vector sum into a scalar, which is then added to the scalar variable out using the ISPC library function atomic_add_global to perform the addition atomically across all executing tasks.
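Since Listing 1 is not reproduced here, the same kernel structure can be sketched as a scalar C++ analogue (our own code, with an assumed lane count; in real ISPC the per-lane loop is a single vector instruction): a "varying" partial sum per lane, a reduce_add-style fold, and one atomic add to publish the task's result.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr int kSimdWidth = 8; // plays the role of ISPC's programCount

// C++ analogue of the ISPC sum kernel: each "program instance" (lane)
// accumulates a private partial sum, a reduce_add-style step folds the
// lanes, and one atomic add (atomic_add_global in ISPC) publishes the
// task's contribution to the shared output.
void sum_task(const int32_t* array, size_t begin, size_t end,
              std::atomic<int64_t>& out) {
    int64_t sum[kSimdWidth] = {0};                    // the "varying" variable
    for (size_t i = begin; i < end; i += kSimdWidth)  // the foreach loop
        for (int lane = 0; lane < kSimdWidth; ++lane)
            if (i + lane < end) sum[lane] += array[i + lane];
    int64_t total = 0;                                // reduce_add
    for (int lane = 0; lane < kSimdWidth; ++lane) total += sum[lane];
    out.fetch_add(total);                             // atomic_add_global
}
```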
Although an ISPC-like programming model could have been built before, several recent hardware features have made such a compiler viable. The primary causes of "divergence" in SIMD programs are memory accesses to non-consecutive locations and divergent branches. Hardware support for gathers and scatters deals with the former, whereas support for predication/masking deals with the latter. Of course, ISPC can always generate purely scalar code where hardware support is missing.
C. Related Work
1) SIMD Graph Algorithms on CPUs: Vectorized garbage collection [26] used a vectorized breadth-first search (BFS)-based garbage collection algorithm on the Cray-2. A more recent proposal [27] describes a set of graph processing APIs that express SIMD parallelism on the Intel Xeon Phi. Another proposal [28] explores manually vectorizing and optimizing BFS on the Phi, achieving about 1.25x speedup over the non-SIMD implementation. In addition, set intersection accelerated with AVX intrinsics [29] was proposed for a number of graph algorithms, including triangle counting, clique detection, and subgraph matching, achieving 3.6x, 12.7x, and 3.3x speedups, respectively. A variety of other work also discusses vectorized set intersection operations [30]-[32].
2) Non-SIMD Graph Algorithms on CPUs: A large number of graph processing frameworks and systems have been proposed for CPUs [3]-[18], [33]-[38], which focus on thread-level parallelism rather than SIMD. Notably, Ligra [3], Ligra+ [39], Julienne [40], and the most recent GBBS [41] form a line of state-of-the-art work representing one type of graph framework - C++ template libraries. Galois [42] is also a state-of-the-art graph library implemented in C++. In particular, it has a large collection of graph problems and a variety of algorithms for each problem. GraphIt [4], [43], [44] is another line of state-of-the-art work, representing a second type of graph framework - a domain-specific language plus an optimizing compiler. It also allows each benchmark to be fine-tuned by "schedules" supplied by the programmer.
3) Graph Algorithms on GPUs and Accelerators: A large number of graph systems have been proposed for GPUs and accelerators [45]-[64]. They, too, are available as libraries and as domain-specific languages plus compilers. In particular, our paper extends the IrGL compiler for irregular graph algorithms on GPUs [19] to CPUs. Figure 1 highlights the contribution of this work, which we term EGACS (Efficient Graph Algorithms on CPU SIMD). Shaded boxes represent the new components added to the IrGL compiler, and dashed boxes represent components that are already in IrGL but were modified to incorporate our new ISPC backend. Our work keeps the IrGL DSL frontend, but adds a new backend for Intel ISPC, allowing IrGL programs to be executed on CPUs using SIMD extensions. The graph benchmarks and APIs are also taken from GPU IrGL.
The use of the ISPC compiler allows our work to automatically support all targets that Intel ISPC supports, including SSE and AVX on x86, and NEON on both 32-bit and 64-bit ARM processors. However, in this work, we focus only on AVX extensions on x86 processors, and leave evaluation of ARM NEON to future work.

III. CPU SIMD CHALLENGES AND SOLUTIONS
In this section, we examine several performance issues in efficient utilization of SIMD units on CPUs, and show that some of them can be tackled by adapting optimizations first described for the GPU while others require new optimizations.

A. Task Launching
Many graph algorithms are iterative, re-executing a computation a large number of times until convergence. For BFS, the minimum number of iterations is the diameter of the graph, which for some graphs can be thousands of iterations.
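The iterative structure can be made concrete with a minimal scalar worklist BFS over a CSR graph (our own C++ sketch, not the EGACS BFS-WL kernel): the outer while loop runs once per BFS level, so it executes at least as many times as the graph's diameter, and each such iteration corresponds to one round of task launches in ISPC.

```cpp
#include <cstdint>
#include <vector>

// Level-synchronous worklist BFS over a CSR graph (scalar sketch).
// Each outer-loop iteration processes one frontier; in an ISPC/GPU
// setting, every iteration would require launching tasks/kernels.
std::vector<int32_t> bfs_levels(const std::vector<int32_t>& row_offsets,
                                const std::vector<int32_t>& col_indices,
                                int32_t source) {
    std::vector<int32_t> level(row_offsets.size() - 1, -1);
    std::vector<int32_t> worklist{source};
    level[source] = 0;
    while (!worklist.empty()) {       // one iteration per BFS level
        std::vector<int32_t> next;
        for (int32_t v : worklist)
            for (int32_t e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
                int32_t nbr = col_indices[e];
                if (level[nbr] == -1) {   // first visit: push to next frontier
                    level[nbr] = level[v] + 1;
                    next.push_back(nbr);
                }
            }
        worklist.swap(next);
    }
    return level;
}
```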
In ISPC, each iteration requires tasks to be launched, equivalent to launching kernels on GPUs. While spawning CPU threads or assigning computation to worker threads is not as expensive as launching GPU kernels, it is not free either. Tasks are an ISPC construct and have different implementations that can be selected at compile time. On Linux, ISPC supports a standard pthread-based [65] task system with 3-D task launching, a simplified version called "pthread_fs" that only supports 1-D task launching, as well as task systems based on Cilk [66], OpenMP [67], and Intel Thread Building Blocks (TBB) [68].
To characterize the overheads of task launches in ISPC using different task systems, we execute a microbenchmark that only launches tasks that do nothing (i.e. "empty" tasks). Table II compares the time per launch, averaged over 10000 continuous launches, for all supported task systems. For this test, the number of tasks launched is equal to the total number of hardware threads on the machine. We observe that the pthread-based task system shows the highest launch time, while Cilk shows the lowest. Table III shows the results of repeating this experiment with a real benchmark, BFS-WL on the USA-Road graph, with and without the optimization called Iteration Outlining (IO, described later) that removes task launching overhead. These results show that in real programs, OpenMP has the lowest overhead, but all task systems still impose significant overheads. We then observe that the optimization successfully removes that overhead, making the total execution time nearly the same regardless of task system. Because task launching is on the critical path, lowering its overhead also improves scalability: BFS-WL on USA-Road with 16 tasks on an 8-core SMT-enabled machine shows a 4.59x speedup only when the optimization is applied, compared to no speedup over a single-thread SIMD version otherwise.
To tackle both launch overhead and lack of scalability, we retarget the IrGL optimization called Iteration Outlining (IO) that reduces the number of task launches to one. Listing 2 illustrates how this optimization is applied to BFS-WL in ISPC. In the input IrGL code, the iterative loops are represented by a language construct called a Pipe [19]. The default translation of Pipe results in a C++ loop that launches tasks for each iteration. To reduce the number of task launches, however, we transform the iterative loop so that it is "outlined" to ISPC code. The transformed code launches the ISPC bfs loop kernel, which in turn uses a loop to call the original bfs kernel. There is only one task launch in this case. To maintain the original semantics of task launches, we insert barriers after each original kernel invocation.
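Since Listing 2 is not reproduced here, the shape of the transformation can be sketched in C++ (our own illustration; `launches` stands in for ISPC task launches, and `kernel` for one iteration of the original bfs kernel returning whether work remains). Before outlining, every iteration pays a launch; after outlining, the loop moves inside a single launched kernel, with a barrier per iteration in the real ISPC code preserving the original semantics.

```cpp
#include <cstdint>
#include <functional>

// Before Iteration Outlining: the host-side loop launches tasks on
// every iteration, putting launch overhead on the critical path.
int64_t run_unoutlined(const std::function<bool()>& kernel,
                       int64_t& launches) {
    int64_t iters = 0;
    while (true) {
        ++launches;            // one task launch per iteration
        if (!kernel()) break;  // kernel reports whether work remains
        ++iters;
    }
    return iters;
}

// After Iteration Outlining: a single launch; the iterative loop is
// "outlined" into the kernel itself (the real ISPC code inserts a
// barrier after each inner kernel call to preserve launch semantics).
int64_t run_outlined(const std::function<bool()>& kernel,
                     int64_t& launches) {
    ++launches;                // exactly one task launch
    int64_t iters = 0;
    while (kernel()) ++iters;  // barrier after each call in real code
    return iters;
}
```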

B. Improving SIMD Lane Utilization
Many graph algorithms contain nested loops, where the outer loop iterates over all nodes of a graph while the inner loop iterates over the edges of a particular node (e.g. Listing 3). In ISPC, each outer loop iteration gets mapped to an ISPC program instance (i.e. a SIMD lane). All iterations of the inner loop are then executed by a single SIMD lane. So unlike regular algorithms like dense matrix multiply, the amount of SIMD parallelism available is dynamic and depends on the input graph. This can lead to SIMD lane underutilization, since the number of inner loop iterations usually differs across SIMD lanes. Table IV shows that SIMD lane utilization when executing the inner loop (Line 8 in Listing 3) is around 64% for the USA-Road graph and 32% for the RMAT22 graph. The USA-Road graph has low average degree and is relatively uniform, whereas RMAT22 has higher average degree and is highly skewed. After applying the optimizations detailed later in this section, lane utilization increases to 82% and 84% for the inner loop on the two inputs, respectively. We also observe that applying these optimizations reduces the number of dynamic instructions significantly (e.g. 18x for RMAT22) despite the additional code for scheduling. The primary goal of these optimizations is to increase the amount of work per program instance (Fibers) and to address load imbalance (Nested Parallelism/NP).
1) Fibers: Unlike CUDA, ISPC has no scheduling construct corresponding to thread blocks or shared memory. Since graph algorithms have low arithmetic intensity, we found it advantageous to emulate thread blocks in ISPC to increase the amount of work per ISPC program instance. Thread block emulation is done through fibers, which emulate multiple ISPC tasks on the same OS thread. We implement this by inserting additional loops around the actual work loop. The number of loop iterations determines the number of fibers, and data for each virtual task is stored in local, per-task arrays.
This is somewhat similar to thread-coarsening in GPU kernels [69].
Fibers enable emulation of CUDA shared memory and of the CUDA thread-block-level __syncthreads barrier. Shared memory is emulated by declaring variables before any of the fiber loops, so that all fibers have access to that location. The fiber loops must be partitioned at each __syncthreads [70].
When fibers are enabled, each ISPC task now corresponds to a CUDA thread block, while the fibers correspond to CUDA warps. CUDA threads are iterations of the fiber loop, which we refer to as virtual program instances.
To choose the number of fibers per ISPC task, we must balance fixed scheduling overheads against resource consumption for fiber-specific state arrays. We opt for a dynamic calculation:
NumFibersPerTask = MIN(MaxNumFibersPerTask, NumOfItemsInWL / (SIMDWidth × NumOfTasks))
Here, MaxNumFibersPerTask is the maximum number of fibers per task, set empirically to 256 to limit resource consumption while maximizing average speedup. NumOfItemsInWL is the current number of worklist items. SIMDWidth is 8 for AVX2 or 16 for AVX512. NumOfTasks is the number of launched ISPC tasks, usually the number of hardware threads. Each task computes NumFibersPerTask before beginning the actual computation. This way, we launch only one ISPC task per hardware thread, which then iterates over all the work.
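The fiber-count heuristic can be written directly in C++ (a sketch under our own assumptions: integer division, and no clamping of the result to a minimum of one fiber, which a real implementation would likely add):

```cpp
#include <algorithm>
#include <cstdint>

// Dynamic fiber-count heuristic: cap fibers per task at a fixed maximum
// (256 in the paper, to bound per-fiber state arrays), otherwise size
// them to the work currently available in the worklist.
int64_t num_fibers_per_task(int64_t num_items_in_worklist,
                            int64_t simd_width,  // 8 for AVX2, 16 for AVX512
                            int64_t num_tasks,   // usually hardware threads
                            int64_t max_fibers_per_task = 256) {
    return std::min(max_fibers_per_task,
                    num_items_in_worklist / (simd_width * num_tasks));
}
```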
2) Nested Parallelism: Nested parallelism (NP) redistributes inner loop iterations across ISPC program instances to reduce load imbalance, which in turn improves SIMD lane utilization. The edges of high-degree nodes are redistributed across all virtual program instances of a fiber. Medium-degree node edges (C, D) are executed by real program instances. The edges of the remaining low-degree nodes (e.g., A, E) are packed using a prefix sum and redistributed across all program instances. Our compiler inserts the proper inspector-executor scheduler code to achieve this. This design closely follows the CUDA backend, with the ISPC fiber, task, and fine-grained schedulers mirroring the CUDA thread block, warp, and fine-grained schedulers.
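A simple back-of-the-envelope model (our own simplification, assuming consecutive nodes map to lanes and ignoring scheduling overhead) illustrates why skewed degree distributions hurt inner-loop lane utilization and why redistribution helps:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Estimate inner-loop SIMD lane utilization when consecutive nodes are
// mapped to lanes: a group of `width` nodes runs for max-degree steps,
// and a lane sits idle once its own edges are exhausted.
double lane_utilization(const std::vector<int>& degrees, int width) {
    long long busy = 0, total = 0;
    for (size_t i = 0; i < degrees.size(); i += width) {
        int steps = 0;
        for (size_t j = i; j < std::min(i + width, degrees.size()); ++j) {
            busy += degrees[j];                  // cycles a lane does work
            steps = std::max(steps, degrees[j]); // group runs to max degree
        }
        total += static_cast<long long>(steps) * width;
    }
    return total ? static_cast<double>(busy) / total : 1.0;
}
```

With one hub node of degree 8 next to seven degree-1 nodes, utilization drops below 25%, while a uniform-degree group achieves 100%; redistributing the hub's edges across lanes (as NP does) removes the imbalance.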

C. Atomic Operations
Work-efficient graph algorithms use a worklist to track active nodes in the graph. Since this worklist is updated concurrently by program instances, the use of atomics is necessary.
Unlike GPUs, CPU hardware does not yet provide native vector atomic instructions. However, ISPC provides built-in atomic functions that can act as vector atomic instructions. These atomic functions come in local and global variants. The local atomic functions only provide atomicity within a single task. Since program instances within a task execute in lockstep, these functions are implemented by a simple loop over all active program instances and do not require a hardware atomic.
In contrast, the global atomic functions provide atomicity guarantees across tasks and can be classified into three types depending on the source and destination operands. For the case of atomic operations on scalar memory locations with scalar input value, the translation to hardware is direct. For atomic vector-to-vector operations, a loop uses hardware scalar atomic instructions to update each memory location for each active program instance. The final case involves atomic updates from a vector value to a scalar location. These are implemented by first performing a vector reduction on the input vector within a task to obtain a scalar, and then issuing a scalar atomic operation to the scalar memory location. Since atomics are executed serially in hardware, reducing the number of atomics can improve performance.
To reduce the number of atomics executed, the Cooperative Conversion optimization transforms code so that multiple program instances aggregate their worklist pushes locally and eventually perform one atomic operation for the aggregated push. In the original IrGL GPU compiler, aggregation is performed at the warp and thread block levels. In the CPU compiler, we also aggregate atomics unconditionally at the task level across program instances. This is done by replacing a worklist push with an aggregated version, as follows. First, the ISPC built-in expression popcnt(lanemask()) obtains the number of pushes by counting the number of active program instances. Then, the ISPC built-in atomic function atomic_add_global reserves space for all pushes using a single task-level scalar atomic add. Lastly, the ISPC built-in function packed_store_active takes the values to be pushed from active program instances, packs them, and writes them to consecutive memory locations in the reserved space.
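These three steps can be sketched in scalar C++ (our own emulation, not the compiler's generated code; `aggregated_push` and `tail` are illustrative names): instead of one atomic per active lane, the active lanes are counted, one atomic add reserves a contiguous range, and the active values are packed into it.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

constexpr int kSimdWidth = 8;

// Task-level cooperative conversion, emulated lane by lane:
//   1. count active lanes       -- popcnt(lanemask()) in ISPC
//   2. one atomic reserves room -- atomic_add_global on the worklist tail
//   3. pack active values       -- packed_store_active
void aggregated_push(const int32_t value[kSimdWidth],
                     const bool active[kSimdWidth],
                     std::vector<int32_t>& worklist,
                     std::atomic<int64_t>& tail) {
    int64_t count = 0;                     // step 1: popcnt(lanemask())
    for (int lane = 0; lane < kSimdWidth; ++lane) count += active[lane];
    int64_t base = tail.fetch_add(count);  // step 2: one atomic for all lanes
    for (int lane = 0; lane < kSimdWidth; ++lane)  // step 3: packed store
        if (active[lane]) worklist[base++] = value[lane];
}
```

With four of eight lanes active, this issues one atomic instead of four; the saving grows with SIMD width and, with fiber-level aggregation, with the number of fibers.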
Since ISPC has no concept corresponding to a CUDA thread block, we instead implement cooperative conversion on fibers, illustrated in Figure 3. For some algorithms, the total number of items pushed into the worklist can be calculated in advance. In such cases, only one atomic add is required, with the help of fibers. Unlike on the GPU, since fibers belong to the same ISPC task, lockstep execution is guaranteed, permitting a more efficient implementation of fiber-level cooperative conversion. Table V shows how the number of atomic worklist pushes is reduced by cooperative conversion. We always enable nested parallelism, since it increases the number of active program instances and thus the effectiveness of cooperative conversion. Cooperative conversion significantly reduces the number of atomic pushes over the unoptimized version for all benchmarks where it is applicable. Fiber-level aggregation is applicable in only two benchmarks, BFS-CX and BFS-HB, where the number of atomic pushes is further reduced by 36.5x for BFS-CX (for a total of 125x).

D. Gather Loads and Scatter Stores
In early iterations of SIMD extensions, vectorized memory accesses to and from non-consecutive addresses were not supported. Gather instructions, which load from non-consecutive addresses, were first introduced in AVX2, while scatter instructions, which do the same for stores, were introduced in AVX512. These instructions take a 64-bit general-purpose register as the base address and a vector register that provides offsets. AVX2 supports eight 32-bit offsets or four 64-bit offsets; AVX512 doubles this. Without these instructions, the ISPC compiler must generate scalar loops to perform gathers and scatters.
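Semantically, a gather is equivalent to the scalar fallback loop the compiler would otherwise emit: one independent load per lane, at the base address plus a per-lane offset. A minimal sketch (our own code, with an assumed lane count and element-index offsets rather than byte offsets):

```cpp
#include <cstdint>

constexpr int kSimdWidth = 8;

// What a gather computes, written as the scalar loop ISPC falls back to
// when no hardware gather instruction is available: one independent load
// per lane from base + per-lane offset. AVX2's vgatherdps-style
// instructions perform all kSimdWidth loads in a single instruction.
void gather32(const int32_t* base, const int32_t offsets[kSimdWidth],
              int32_t out[kSimdWidth]) {
    for (int lane = 0; lane < kSimdWidth; ++lane)
        out[lane] = base[offsets[lane]];
}
```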
Gather and scatter operations are crucial for SIMD graph algorithms. Indirect memory accesses (i.e. a[b[i]]) are common in many graph formats, such as CSR, where they are used to locate the edges of each node. Although AVX2 and AVX512 provide specialized instructions to handle gather and scatter operations, we found they have performance issues compared to their scalar equivalents. We use a microbenchmark that loads random cache lines from a pre-allocated array using either scalar loads or gathers. In the gather, each SIMD lane accesses a different cache line. We size the array to correspond to a particular level of cache and make random accesses. This makes it very likely that most accesses are satisfied from that particular cache level.
Table VI summarizes the results. Each row represents a vector or scalar load configuration. Scalar1 through Scalar32, or simply Scalarm, consecutively load from m scalar addresses into different scalar registers. For Scalar16, this introduces register spills; the latency shown incorporates them and is not the load instruction latency alone. The AVX2 and AVX512 rows give the average load-to-use latency of a single 32-bit word using AVX2 and AVX512 gather loads, respectively. Multiplying by the SIMD width yields the load-to-use latency of the entire instruction.
We observe that the average per-word latency for AVX2 (1.02ns under L1 hit) is much higher than for Scalar8 (0.30ns under L1 hit), and this holds on both our Intel and AMD machines. The AVX2 gather cannot complete until all eight 32-bit words have been loaded, while out-of-order execution allows the Scalar8 load instructions to proceed as soon as each individual memory word is fetched.
Only on the Phi processor is the per-word latency of AVX512 (0.98ns) lower than Scalar16 (1.51ns) for L1 hits. The Phi is a less aggressive out-of-order core than the Intel or AMD processors, and hence it is unable to hide the scalar load latency. Gather loads are also a problem on GPUs, where they cause memory divergence. Current GPUs use a very high degree of simultaneous multithreading (SMT) to hide load latencies and do not implement out-of-order execution. For example, most NVIDIA GPUs run 64 warps per streaming multiprocessor. CPUs do not implement such a high degree of SMT, but we find that 2-way and 4-way SMT help alleviate some of this issue.

IV. EVALUATION
In this section, we evaluate the performance of our SIMD graph algorithms. We also compare our implementation against state-of-the-art proposals, including Ligra, GraphIt, and Galois. We mainly use two machines, one with an Intel Xeon Silver 4108, the other with an AMD EPYC 7502P. We also run our SIMD graph kernels on a machine with an Intel Xeon Phi 7290 processor. Table VII summarizes the machine configurations. Table VIII lists the kernels used in our evaluation. Note that for ISPC, we run multiple variants of BFS. When comparing against other proposals, we use BFS-WL. For SSSP-NF, we use the same input-specific DELTA across all proposals. To make sure our kernels execute correctly, we collect their outputs and check them against the reference output. We also modify Ligra and GraphIt to make sure they produce the same outputs. We carefully place the timers in all kernels to make sure they measure the same type of useful work across the different proposals. We run each kernel 5 times on Phi and GPU, and 20 times on Intel and AMD, then report the average execution time, excluding graph loading and output writing time. We use a pthread-based ISPC tasking system, derived from the stock ISPC pthread tasking system, with added capability to pin tasks to specific hardware threads. Unless otherwise specified, on the Intel machine we use the 16-wide AVX512 target and launch 16 ISPC tasks; on the AMD machine we use the 8-wide AVX2 target and launch 64 tasks.
All frameworks use 32-bit node and edge indices with 64-bit pointers. We compile Ligra, GraphIt, and Galois with GCC using -O3, which enables auto-vectorization by default. Disabling auto-vectorization using -fno-tree-vectorize had no noticeable performance impact. Ligra and GraphIt use Cilk as their underlying task system. For EGACS, we pin threads to physical CPUs. This is for studying scalability and SMT, not for performance: we found pinning alone speeds up EGACS by only 2% on average. Pinning was not used for the other systems. We use three input graphs: a uniform-degree planar graph, USA-Road (23M nodes, 46M edges); a scale-free graph, RMAT22 (4M nodes, 33M edges); and a uniformly random graph, Random (8M nodes, 33M edges).

A. Comparison with Scalar Graph Frameworks
We compare the performance of EGACS to other CPU graph processing frameworks. We also evaluate the contribution of each optimization to overall performance.
1) Overall Performance: Figure 4 shows the speedup of EGACS (SIMD kernels with all optimizations enabled), Ligra, GraphIt, and Galois over the serial version. Execution times are in Appendix A. The serial version is derived from our ISPC code by marking all variables uniform, setting task_count and program_count to 1, and recompiling.
On the Intel machine, compared to the other state-of-the-art proposals, across 21 total benchmark-input comparisons our EGACS ranks fastest in 11 configurations, second fastest in 9, and is slowest only for some inputs of MIS and MST. Using the USA-Road graph, EGACS is on average 10.87x faster than Ligra, 3.77x faster than GraphIt (excluding MST and MIS), and 1.95x faster than Galois. EGACS is also the fastest when using the Random graph as input, except on BFS and SSSP. Ligra performs very well on the RMAT22 and Random graphs; for example, Ligra on BFS-WL with the RMAT22 graph is 2.08x faster than EGACS.
On the AMD machine, a similar trend is observed. Across 21 total benchmark-input comparisons, EGACS ranks fastest in 9 configurations and second fastest in 8. Averaged across all common benchmarks and inputs, EGACS is 2.18x faster than Ligra, 1.54x faster than GraphIt, and 1.55x faster than Galois.
Ligra and GraphIt implement direction-optimizing BFS [71], which is fundamentally faster than the BFS-WL implemented in EGACS on these graphs (but not on the USA-Road graph). Our implementations lack the algorithmic optimizations found in the other three graph systems, including not only direction optimization but also bitvector representations, blocking or vertex scheduling, and asynchronous execution. These algorithmic optimizations make them fundamentally faster than EGACS. In addition, PR and MST are affected by the extensive use of cmpxchg in EGACS. However, our focus in this work is on better using machine features, i.e. SIMD, rather than algorithmic improvements.
2) Effect of Throughput Optimizations: Figure 5 shows the effect of individual optimizations on the Intel machine. Task-level cooperative conversion (CC) is always applied in conjunction with nested parallelism (NP). Among our benchmarks, only BFS supports fiber-level CC.
On average, applying all optimizations yields the highest speedup over the unoptimized SIMD version. On the Intel machine, when all optimizations are applied, the average speedup is 2.10x for the USA-Road graph, 1.77x for RMAT22, and 1.24x for Random. For each individual kernel and input, the speedup ranges from 0.62x to 6.13x. A similar trend is observed on AMD and Phi [72].
It is known that individual optimizations can hurt performance on GPUs [73], and this seems to hold on the CPU as well. For example, Fibers works best for BFS-CX and BFS-HB, due to the significant reduction in atomic worklist pushes. However, in BFS-WL and SSSP, the performance benefit is not enough to cancel out the overhead of executing the additional Fibers code, slowing down these programs.
Some optimizations change the order of work, affecting memory locality. For example, IO+CC+NP shows a noticeable slowdown for the USA-Road and Random graphs on the CC benchmark. In this case, the generated loop iterates over items in an order that happens to suffer from poor locality, leading to high cache and TLB miss rates. Interestingly, when Fibers is enabled on top of IO+CC+NP for CC, the slowdown disappears because Fibers changes the iteration order, restoring memory locality.

B. Impact of SIMD
We evaluate the contributions of SIMD, multi-tasking, and throughput optimizations. Then, we evaluate the effects of SIMD width and AVX version.
1) SIMD vs. Multi-tasking: Speedup over serial comes both from SIMD execution within each task and from running multiple independent tasks. We evaluate the individual contributions of these two sources. Figure 6 summarizes the contributions of multiple tasks and of SIMD. Enabling SIMD alone is beneficial, resulting in 1.37x speedup for USA-Road, 1.45x for RMAT22, and 2.15x for Random over the serial version. Similarly, enabling multi-tasking alone is beneficial on average (1.47x for USA-Road, 4.56x for RMAT22, and 4.26x for Random). However, +MT slows down BFS-CX and SSSP with USA-Road, since the additional atomic operations and barriers required for multi-tasking hurt scalability.
With both SIMD and multi-tasking enabled (+SIMD+MT), the number of atomics is reduced due to local task-level aggregation, and we see a significant performance boost (3.84x for USA-Road, 7.96x for RMAT22, and 13.71x for Random). Adding throughput optimizations (+MT+SIMD+Opt) yields the best performance on average (8.06x for USA-Road, 14.08x for RMAT22, and 17.02x for Random).
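The effect of task-level aggregation on atomic traffic can be sketched as follows (a simplified C++ illustration, not the generated EGACS code): each task buffers its pushes locally and claims a contiguous block of the shared worklist with a single fetch_add, so one atomic operation covers many pushes.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Shared worklist tail. In a naive version, every push would perform one
// fetch_add here; with aggregation, each task does one fetch_add per batch.
std::atomic<std::size_t> wl_tail{0};

void push_aggregated(std::vector<int>& worklist,
                     const std::vector<int>& local_items) {
    // One atomic per task-level batch instead of one per item.
    std::size_t start = wl_tail.fetch_add(local_items.size());
    for (std::size_t i = 0; i < local_items.size(); ++i)
        worklist[start + i] = local_items[i];
}
```

The atomic count drops roughly by the aggregation factor (the batch size), which is why the combined +SIMD+MT configuration outperforms either technique alone.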
2) SIMD Width: Figure 7 summarizes the effect of SIMD width. Solid lines represent the speedup of a particular AVX target over AVX1-4; dotted lines represent the number of dynamic instructions normalized to AVX1-4. To avoid contributions from barriers, task launching, and compare-and-swap retries, we measure the number of dynamic instructions using a single-task SIMD version. For speedup, we still launch multiple ISPC tasks, i.e., 16 tasks on the Intel machine. Since all benchmarks show a similar trend for a given input, each line represents the geomean across all benchmarks for a particular input graph.
On average, wider SIMD is beneficial for USA-Road and Random: compared to AVX512-8, AVX512-16 yields 1.15x and 1.07x speedup, respectively. However, AVX512-16 and AVX512-8 have roughly the same execution time for RMAT22. Since RMAT22 tends to fill all SIMD lanes due to its high average degree, a slow 16-wide gather load can prevent future dependent instructions from being issued.
For AVX2, AVX2-16 yields 1.17x and 1.61x speedup over AVX2-8 and AVX2-4, respectively, for USA-Road. For RMAT22, AVX2-16 and AVX2-8 have roughly the same execution time. Since AVX2 natively supports only 8-wide 32-bit integer operations, ISPC simulates a 16-wide target by issuing two consecutive 8-wide vector instructions. Therefore, gather loads do not affect AVX2 as much, since the two consecutive instructions are guaranteed to be independent and expose higher instruction-level parallelism (ILP). However, we observe that AVX2-16 requires more dynamic instructions than AVX2-8, which reduces the benefit of the higher ILP for RMAT22, leading to no speedup for AVX2-16 over AVX2-8.
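The double-pumping scheme can be pictured in scalar C++ (an illustration of the instruction schedule, not the generated code): a 16-wide operation becomes two back-to-back 8-wide halves that touch disjoint lanes and are therefore independent.

```cpp
#include <cassert>

// Illustration of ISPC's emulated 16-wide target on 8-wide AVX2 hardware:
// the two 8-wide halves write disjoint lanes, so the corresponding vector
// instructions carry no dependence and can issue back to back (higher ILP).
void add16(const float* a, const float* b, float* out) {
    for (int lane = 0; lane < 8; ++lane)    // first 8-wide half
        out[lane] = a[lane] + b[lane];
    for (int lane = 8; lane < 16; ++lane)   // second, independent half
        out[lane] = a[lane] + b[lane];
}
```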
AVX512-16 executes fewer dynamic instructions than AVX512-8. For AVX2 and AVX1, the 8-wide version executes the fewest instructions: their 4-wide targets under-utilize the 8-wide SIMD units, while their 16-wide targets require additional instructions, and both result in a higher instruction count than the 8-wide version.
3) AVX Version: Figure 7 also summarizes the effect of different AVX versions. Newer AVX versions significantly reduce the total number of dynamic instructions (including scalar instructions) over older versions. For USA-Road, AVX512-16 executes 1.41x fewer instructions than AVX2-16, which in turn executes 1.59x fewer instructions than AVX1-16. These reductions are due to new instructions (such as gathers and scatters) and predication.
The reduction does not always translate into higher performance. For example, AVX512-16 is slower than AVX2-16 for Random and RMAT22 due to the microarchitectural implementation: our Intel processor has only one 16-wide SIMD unit, composed of two 8-wide units. An AVX512-16 instruction occupies the whole unit, whereas AVX2-16 uses two independent 8-wide instructions. This means AVX2-16 can benefit from both higher ILP and lower gather latency.
C. Scalability
Figure 8 shows the speedup over the serial version as the number of tasks (and hence cores used) increases. Tasks are pinned to cores, and we do not use SMT, which is studied later (Section IV-D1). We show the geomean across all three inputs since they all have a similar scaling trend. As can be seen, the benchmarks scale well, showing nearly linear scaling on Intel, on AMD with 16 or fewer cores, and on Phi with 18 or fewer cores. Beyond that, scaling stops for a number of benchmark-input combinations. However, on average, we still see a significant speedup with 32 cores over 16 cores (1.27x) on AMD. On Phi, the 36-core version on average still shows further speedup (1.28x) over the 18-core version; however, when all 72 cores are enabled, we see a slowdown over the 36-core version (0.89x). Notably, the use of SIMD contributes additional scaling, leading to maximum speedups of 65.12x on Intel (8 cores), 131.55x on AMD (32 cores), and 111.85x on Phi (36 cores).

D. CPU vs. GPU
Our experimental setup allows us to compare the CPUs against the GPU, running the same kernels generated by the same compiler using its CUDA backend. To ensure a fair comparison, we include the time for data transfers to and from the GPU. We use a Quadro P5000, which has 20 32-wide streaming multiprocessors. Figure 9 shows the speedup over EGACS on the Intel machine.
On average, the GPU is the fastest. The GPU speedup ranges from 0.06x to 6.81x for USA-Road, 0.30x to 3.51x for RMAT22, and 0.31x to 3.13x for Random, compared to the Intel CPU. However, there are three exceptions where the Intel CPU is notably faster: BFS-TP (USA-Road) and MST (RMAT22 and Random). For BFS-WL (USA-Road), BFS-CX (USA-Road), and SSSP (USA-Road), AMD is slightly faster than CUDA. Although the specifications of Phi look close to those of our GPU, this does not always translate into performance: on average, Phi is slower than the GPU. However, Phi is faster in 8 out of 30 benchmark-input combinations, with speedups ranging from 1.31x to 16.80x.
The GPU No Data Transfer results do not include the time to copy data to and from the GPU. Without this overhead, which is unnecessary on the CPU, the GPU is the fastest except for a few benchmarks: Phi is faster than the GPU for TRI (USA-Road and Random) and MIS (USA-Road and Random) by 1.10x, 1.38x, 1.14x, and 1.29x, and Intel is faster than the GPU for MST (RMAT22 and Random) by 3.22x and 3.23x, respectively.
1) Effect of Virtual Memory: Recent GPUs support virtual memory [74], the lack of which previously limited graph sizes; this permits us to evaluate the virtual memory capabilities of GPUs for graph algorithms.
Since all our frameworks use 32-bit addressing for graph data structures, we add a larger OSM-EUR graph (174M nodes, 696M edges, 3.89 GB) and limit the amount of physical memory available to a benchmark. This methodology also allows us to study the effect of multiple simulated physical memory sizes using the same input. On the CPU, we use cgroups to limit physical memory, while on the GPU we run a separate process that allocates GPU memory using cudaMalloc (whose allocations are non-pageable), making it unavailable to the benchmarks. Table IX shows the memory footprint of the graph analyses on the CPU and GPU versions without memory limitations. For our experiments, we limit the physical memory to 75% and 50% of this footprint and report normalized execution time. BFS-WL, SSSP, and PR end up with dramatic slowdowns on the GPU, making the system totally unusable, but only moderate slowdowns on the CPU. When limiting physical memory to 75% of the footprint, the CPU versions in general show higher slowdowns than the GPU, except for BFS-WL, SSSP, and PR. When limiting physical memory to 50% of the footprint, only MIS shows a higher slowdown on the CPU than on the GPU. Except for the two OOM errors in CC and MST, benchmarks on the CPU do not suffer catastrophic slowdowns.
2) Effect of SMT: We evaluate SMT by launching as many ISPC tasks as hardware threads and pinning each task to a dedicated hardware thread. For our no-SMT experiments, we pin tasks to ensure that only one task runs per core. Figure 10 shows results for Intel, AMD, and Phi. Solid lines represent SMT speedups over no-SMT with the same number of cores enabled. Dotted lines represent speedups over the serial version with and without SMT, for a given number of cores. Since all three inputs show almost identical scaling trends, we show the geomean across all inputs.
On average, SMT benefits our benchmarks. For example, with 2 cores, enabling SMT gives 1.61x, 1.93x, and 3.54x speedup over no-SMT on Intel, AMD, and Phi, respectively. However, as the number of cores increases, the benefits decrease: speedups from SMT are only 1.52x and 1.07x when all cores are enabled on Intel (8 cores) and AMD (32 cores). Indeed, on Phi, when all 72 cores are used, we end up with a slowdown (0.58x). Preliminary experimental results show that the L3 latency on the AMD machine for MIS/USA-Road increases by 2.30x from the 16-thread to the 32-thread configuration, pointing to increased memory contention as the cause of the poor scaling.
V. CONCLUSION
This work has presented the first compiler for CPU SIMD graph algorithms, built by extending an earlier GPU graph algorithm compiler to generate ISPC code. We have shown that the fastest SIMD implementations outperform the state-of-the-art non-SIMD CPU graph frameworks Ligra, GraphIt, and Galois by 3.06x, 1.53x, and 1.78x, respectively. Since SIMD can be added to these frameworks, our work is complementary to these proposals. The GPU optimizations have been adapted for CPUs, resulting in an additional 1.67x speedup over plain SIMD. Comparing the CPU and GPU versions of the same algorithms, we found that, even accounting for data transfers, the GPU is 1.24x (Phi) to 1.76x (Intel) faster, though its virtual memory subsystem needs improvement.

Table X lists the absolute execution times in milliseconds comparing EGACS, Ligra, GraphIt, and Galois on both the Intel and AMD machines. We use AVX512 on the Intel machine and AVX2 on the AMD machine. Times in bold are the lowest execution time for a particular benchmark-input configuration.

A. Abstract
Our artifact provides sources for all evaluated benchmarks and the extended IrGL compiler that compiles the benchmarks, along with a set of scripts to run experiments and generate execution data and speedup graphs. We also include the Ligra, GraphIt, and Galois sources that we used in writing the paper. The GPU versions of our benchmarks are also included. Using these sources and scripts, users should be able to run our experiments on any supported machine and generate the result figures and tables.
In addition, we include reference results and plot scripts. These are what we used in writing this paper. They can be used to generate the exact same plots and tables as in the paper.

B. Artifact Checklist
• Algorithm: AVX SIMD graph code generation.

2) Detecting ISPC Target: We provide a script that automatically detects the best ISPC AVX target on the current machine and sets up the Makefile.
$ cd $ARTIFACT_ROOT/ggc-ispc/skelapp-ispc
$ python detect_target.py

To specify a custom target other than the default one, please refer to the "ISPC target" paragraph in the "Experiment customization" section for details.

E. Experiment Workflow
The default experimental setup is intended for use on a single-socket multi-core 2-way SMT machine.
For a multi-socket machine, or a CPU with no SMT or more than 2-way SMT, the script may have to be modified to instruct EGACS how to pin threads on the machine. Please refer to the "Experiment customization" section for details.
After running the experiments, raw performance numbers, generated figures, and tables will be placed in the following directories, respectively:

$ARTIFACT_ROOT/ggc-ispc/eval/results
$ARTIFACT_ROOT/ggc-ispc/eval/plot/plots
$ARTIFACT_ROOT/ggc-ispc/eval/plot/tables

1) EGACS Results: Fig. 5.

2) Scalability Results: Partial Fig. 8 and 10. The original Fig. 8 and 10 consist of results from 3 different machines; the artifact scripts generate results for the machine on which EGACS is running. The default setup assumes an 8-core 16-thread single-socket Linux machine. Otherwise, please refer to the "Experiment customization" section on how to modify the script (Makefile.thread).

G. Experiment Customization
The experiments are fully customizable. For example, users can enable or disable the desired IrGL optimizations, set the number of threads and the pinning policy, and change the ISPC AVX target.
1) IrGL Optimizations: Predefined optimization combinations are in ggc-ispc/bmks/Makefile.ispc. Users can add a Makefile target following the existing ones as examples. Then, users can run those targets from ggc-ispc/eval/Makefile.general by appending the desired targets to ALL_CONFIGS.
2) Number of Threads and Pinning Policy: These can be changed in our evaluation Makefile by setting the TASK variable in ggc-ispc/eval/Makefile.general. The default value is 0-0, meaning EGACS determines the number of threads automatically. The variable has two fields: the first sets the total number of worker threads; the second sets the distance between the logical CPU IDs on which two consecutive worker threads are pinned. For example, 2-1 instructs EGACS to create 2 worker threads: thread #0 is pinned on CPU #0 and thread #1 on CPU #1. 4-2 instructs EGACS to create 4 worker threads: thread #0 is pinned on CPU #0, thread #1 on CPU #2, thread #2 on CPU #1, and thread #3 on CPU #3. Correctly setting this variable is important for the scalability results (Makefile.thread).
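The mapping implied by the two examples above can be reconstructed as follows (a hypothetical reading of the n-d encoding inferred from the 2-1 and 4-2 examples; the actual EGACS policy may differ): thread i lands at position i*d in a stride-d traversal of the n logical CPUs, wrapping around with an offset on each pass.

```cpp
#include <cassert>

// Hypothetical reconstruction of the TASK=<n>-<d> pinning rule, inferred
// from the 2-1 and 4-2 examples in the text; not the actual EGACS source.
int cpu_for_thread(int thread_id, int n_threads, int stride) {
    int pos = thread_id * stride;              // stride-d walk over CPU IDs
    return pos % n_threads + pos / n_threads;  // wrap, offset by 1 per pass
}
```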