There is a newer version of this record available.

Software Open Access

From Loop Fusion to Kernel Fusion: A Domain-specific Approach to Locality Optimization

Bo Qiao; Oliver Reiche; Frank Hannig; Jürgen Teich

This artifact describes the steps to reproduce the results for the CUDA code generation with kernel fusion in Hipacc (an image processing DSL and source-to-source compiler embedded in C++), as presented in the CGO19 paper "From Loop Fusion to Kernel Fusion: A Domain-specific Approach to Locality Optimization". We provide the original binaries as well as the source code to regenerate the binaries, which can be executed on x86_64 Linux system with CUDA enabled GPUs. Furthermore, we include two python scripts to run the application and compute the statistics as depicted in Figure 6 in the paper.

Hardware Dependencies: CUDA enabled GPUs are required. We used three Nvidia cards, as discussed in Section 5.1 in the paper: (a) Geforce GTX 745 facilitates 384 CUDA cores with a base clock of 1,033 MHz and 900 MHz memory clock. (b) Geforce GTX 680 has 1,536 CUDA cores with a base clock of 1,058 MHz and 3,004 MHz memory clock. (c) Tesla K20c has 2,496 CUDA cores with a base clock of 706 MHz and 2,600 MHz memory clock. For all three GPUs, the total amount of shared memory per block is 48 Kbytes, the total number of registers available per block is 65,536. GPUs with similar configurations are expected to generate comparable results.

Software Dependencies: Clang/LLVM (6.0), compiler_rt and libcxx for Linux (6.0). CMake (3.4 or later), Git (2.7 or later). Nvidia CUDA Driver (9.0 or later). OpenCV for producing visual output in the samples.

Files (49.0 MB)
Name Size
49.0 MB Download
All versions This version
Views 341305
Downloads 3023
Data volume 1.2 GB1.1 GB
Unique views 307285
Unique downloads 2621


Cite as