------------------------------------------------------------------------
INTRODUCTION
------------------------------------------------------------------------

This artifact contains the benchmarks for the loop fission algorithm
presented in the paper "Distributing and Parallelizing Non-canonical
Loops".

This is a curated benchmark suite: the included benchmarks have been
selected specifically because they offer opportunities for the loop
fission transformation. For each benchmark, we measure the difference
in wall-clock time after loop fission and parallelization using OpenMP.
We compare the original baseline programs to those generated by our
loop fission technique, and we also compare our technique to an
alternative loop fission technique implemented in the ROSE compiler.

------------------------------------------------------------------------
ABOUT THE ARTIFACT
------------------------------------------------------------------------

The artifact is a self-contained Docker image. Any system with a Docker
installation should be able to run it. However, it was built on
linux/amd64 for the linux/amd64 platform, and may perform sub-optimally
on other platforms.

We provide two evaluation strategies, which differ in scope and
duration. The small evaluation performs a partial evaluation and takes
about 10 minutes; it is intended for the test phase, to check that the
artifact is functional. The full evaluation runs all benchmarks with
all options and provides the basis for the claims in the paper. Its
duration depends on the number of available cores, but it takes
approximately 3-4 hours.

We expect our results to be reproducible on common hardware;
replication should be possible even on a laptop.

Minimum system requirements:

* Docker
* 4 cores, for parallelism
* 8 GB memory
* A linux/amd64 platform is recommended for accurate results

------------------------------------------------------------------------
GETTING STARTED
------------------------------------------------------------------------

1. Validate the integrity of the container by generating a checksum
   with the command-line tool for your platform:

   Linux:   sha256sum loop-fission.tar.gz
   Windows: CertUtil -hashfile loop-fission.tar.gz SHA256
   MacOS:   shasum -a 256 loop-fission.tar.gz

   The checksum should match:

   605c801b41a9938e35d6b8592200916baece2c56c2f54dbe85f0f1bb36072077

2. Load the artifact image:

   ```
   docker image load -i loop-fission.tar.gz
   ```

   Expected output: Loaded image: loop-fission:latest

3. Create a local directory for storing benchmark results and plots:

   ```
   mkdir artifact_eval
   ```

   This directory will contain the captured results and will persist
   after the container exits.

4. Start the container:

   ```
   docker run -v $(pwd)/artifact_eval:/loop-fission/eval -it --rm loop-fission /bin/bash
   ```

   Expected output: root@xxxxxxxxxxxx:/loop-fission#

------------------------------------------------------------------------
CONTENT OVERVIEW
------------------------------------------------------------------------

After the interactive shell has launched, you may explore the artifact
contents, which are organized as follows.

BENCHMARK DIRECTORIES

These directories contain the benchmarks to be evaluated. They are
organized according to the applied transformation (if any). All
directories contain the same number of files, with matching filenames;
we compare, e.g., original/3mm.c vs. fission/3mm.c vs. alt/3mm.c.
| Directory | Description                                                 |
|-----------|-------------------------------------------------------------|
| original  | (baseline) unmodified benchmarks                            |
| fission   | transformed and parallelized, using our method              |
| alt       | transformed and parallelized, using the alternative method  |

OTHER RELEVANT DIRECTORIES AND FILES

The remaining relevant files and directories are used to perform the
timing and to plot the results.

| File/Dir    | Description                                               |
|-------------|-----------------------------------------------------------|
| eval        | evaluation results are stored here                        |
| headers     | header files for the benchmarks                           |
| plot.py     | generates tables and plots from the results               |
| ref_eval    | reference results                                         |
| utilities   | helpers, e.g., the timing script implementation           |
| run.sh      | wrapper applying the timing script to benchmark dirs      |
| LICENSE.txt | software license                                          |

------------------------------------------------------------------------
PAPER CLAIMS
------------------------------------------------------------------------

The paper makes the following claims, which can be replicated with the
artifact:

1. Parallelizing while loops results in an appreciable gain,
   upper-bounded by the number of parallelizable loops produced by loop
   fission.
2. For other loops, our technique yields speedup potential comparable
   to that of other automatic loop transformation tools.

In addition, the following paper figures and tables are reproducible
with the artifact:

* Figure 6 and Table 3 can be reproduced after running the full
  evaluation, as described below.
* Figure 5 is a simplified version of original/bicg.c vs.
  fission/bicg.c (a rough sketch of this kind of transformation is
  shown below).
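To give a concrete (if simplified) picture of the difference between
the original/ and fission/ variants, the following is a minimal,
hypothetical sketch; it is not taken from any benchmark in the
artifact, and all names (kernel_original, kernel_fission, x, y, a, b,
N) are placeholders. It only illustrates the basic pattern: a loop
performing independent updates is split (fissioned) into separate
loops, each of which can then be parallelized with OpenMP.

```
/* Hypothetical sketch; compile with: cc -fopenmp -c sketch.c */
#define N 1000  /* placeholder problem size */

/* original/ style: one loop performing two independent updates */
void kernel_original(double x[N], double y[N],
                     const double a[N], const double b[N]) {
    for (int i = 0; i < N; i++) {
        x[i] = x[i] + a[i];
        y[i] = y[i] * b[i];
    }
}

/* fission/ style: the loop is split into two loops, and each
 * resulting loop is parallelized with an OpenMP pragma */
void kernel_fission(double x[N], double y[N],
                    const double a[N], const double b[N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = x[i] + a[i];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = y[i] * b[i];
}
```

The actual benchmarks, and in particular the non-canonical (e.g.,
while) loops targeted by the paper, require the analysis described in
the paper before such a split is legal; the sketch only shows the
general shape of the result.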
------------------------------------------------------------------------
SMALL EVALUATION (10 MINUTES)
------------------------------------------------------------------------

The intended use case for the small evaluation is to check that the
artifact functions as expected. The evaluation proceeds in two phases:
first capture the timing results, then generate plots and tables from
those results.

The small evaluation runs on a subset of data sizes (SMALL, MEDIUM,
LARGE), all optimization levels (-O0 to -O3), and 9 benchmarks. The
expected duration is about 10 minutes.

1. Measure running times:

   ```
   make small
   ```

   The expected output will look similar to the following:

   ✓ (11:15:11) done with (SMALL, -O0, original): bicg
   ✓ (11:15:12) done with (SMALL, -O0, original): colormap
   ✓ (11:15:12) done with (SMALL, -O0, original): conjgrad
   ✓ (11:15:12) done with (SMALL, -O0, original): deriche
   ✓ (11:15:13) done with (SMALL, -O0, original): fdtd-2d
   ✓ (11:15:13) done with (SMALL, -O0, original): gesummv
   ⋮

   The timestamp (NN:NN:NN) indicates the completion time of each
   benchmark and helps estimate the remaining time to completion.

2. Plot the results:

   ```
   make plots
   ```

   The results are stored in the plots directory (artifact_eval/plots
   on the host). To review the timing results directly in the container
   terminal, you can also run:

   ```
   python3 plot.py -d speedup -f md --digits 2 --show
   ```

   Note that these results are incomplete and therefore not fully
   comparable to the paper's results. They can be partially compared to
   Figure 6 and Table 3, for matching data sizes and benchmarks.

------------------------------------------------------------------------
FULL EVALUATION (2 HOURS)
------------------------------------------------------------------------

The full evaluation is performed similarly and proceeds in the same two
phases: first capture the timing results, then generate plots and
tables from those results.

The full evaluation runs on all data sizes (SMALL to EXTRALARGE), all
optimization levels (-O0 to -O3), and all benchmarks. The expected
duration is about 2 hours on a 4-core linux/amd64 host; based on our
experiments, this estimate may be inaccurate on other hosts or
architectures.

1. Measure running times:

   ```
   make all
   ```

   This command first times the original directory, then the fission
   directory, and lastly the alt directory. Wait for this timing phase
   to run to completion.

2. Plot the results:

   ```
   make plots
   ```

3. To generate Figure 6:

   ```
   python3 plot.py -d speedup -f plot --prog_filter bicg,gesummv,mvt --dir_filter original,fission
   ```

4. To generate Table 3:

   ```
   python3 plot.py -d speedup -f md --digits 2 --show
   ```

The results are expected to differ slightly due to containerization and
machine details, but the general claims of the paper, as stated
earlier, should still hold.

The plotting options are highly customizable. To generate additional
plots, review the available plotting arguments: `python3 plot.py --help`.

------------------------------------------------------------------------
EVALUATION OUTSIDE CONTAINER
------------------------------------------------------------------------

Containerization adds some overhead and can skew timing results. To
avoid this, run the benchmarks directly on the host. The easiest
approach is:

1. Clone the benchmarks from source:

   ```
   git clone https://github.com/statycc/loop-fission.git
   ```

2. Make sure the host has a compiler with OpenMP support, make, and
   Python 3.8 or higher.

3. Install the Python dependencies needed for plotting:

   ```
   python3 -m pip install -q -r requirements.txt
   ```

4. Repeat the evaluation, small or full, and plot the results.

------------------------------------------------------------------------
FUTURE REUSE AND EXTENSIONS
------------------------------------------------------------------------

This artifact is a collection of benchmarks and is suitable for
extension and reuse by modifying or adding benchmarks. That is, it is
possible to add (or remove) benchmarks and repeat the timing to obtain
new results.

The artifact extends the timing utilities of PolyBench/C, version 4.2,
and is compatible with all original PolyBench/C benchmarks, as well as
with any program that follows the PolyBench/C template. For
convenience, the template is provided in
utilities/template-for-new-benchmark.c (a rough structural sketch
appears at the end of this README).

New benchmarks can be added as follows:

1. Add the non-transformed benchmark to the original directory.
2. Add its .h file to the headers directory.
3. Add the benchmark, after transformation and parallelization as
   described in the paper, to the fission directory.
4. Add the comparative transformation output to the alt directory.

Alternative comparison targets can be evaluated by similarly
transforming and parallelizing the original benchmark into the
alternative target, then repeating the timing.

Note that we provide detailed instructions for transforming benchmarks
with the ROSE compiler in utilities/readme.md, along with the automated
script utilities/rose.sh. Reproducing those transformations requires
building ROSE from source (non-trivial) and is outside the scope of
this artifact. In this artifact, we assume all transformations have
already been applied; the concern is timing their performance
post-transformation.
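For orientation, here is a rough, hypothetical sketch of the overall
shape such a new benchmark can take. It is not copied from the
artifact: names such as kernel_example and the problem size N are
placeholders, and the artifact's own timing harness (derived from
PolyBench/C) is replaced here by omp_get_wtime(), only so that the
sketch compiles and runs on its own. The file
utilities/template-for-new-benchmark.c remains the authoritative
starting point.

```
/* Hypothetical standalone benchmark sketch; compile with:
 *   cc -O2 -fopenmp example.c -o example
 * All names and sizes are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 2000  /* placeholder size; real benchmarks take their data
                   sizes (SMALL ... EXTRALARGE) from their headers */

static void init_array(int n, double A[n], double B[n]) {
    for (int i = 0; i < n; i++) {
        A[i] = (double) i / n;
        B[i] = (double) (n - i) / n;
    }
}

/* The kernel to be measured; in PolyBench/C-style benchmarks the
 * region of interest is typically delimited by scop pragmas. */
static void kernel_example(int n, double A[n], double B[n]) {
#pragma scop
    for (int i = 1; i < n; i++)
        A[i] = (A[i] + B[i - 1]) / 2.0;
#pragma endscop
}

int main(void) {
    double *A = malloc(N * sizeof(double));
    double *B = malloc(N * sizeof(double));
    if (!A || !B) return 1;

    init_array(N, A, B);

    /* stand-in for the artifact's timing utilities */
    double start = omp_get_wtime();
    kernel_example(N, A, B);
    double end = omp_get_wtime();

    printf("%0.6f\n", end - start);

    free(A);
    free(B);
    return 0;
}
```

The fission/ counterpart of such a file would carry the transformed,
OpenMP-parallelized kernel, following the same pattern as the sketch
shown earlier in this README.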