------------------------------------------------------------------------
INTRODUCTION
------------------------------------------------------------------------

This artifact contains the benchmarks for the loop fission algorithm
presented in the paper "Distributing and Parallelizing Non-canonical
Loops".

This is a curated benchmark suite: the included benchmarks have been
selected specifically because they offer opportunities for the loop
fission transformation. For each benchmark, we measure the difference
in wall-clock time after loop fission and parallelization using OpenMP.
We compare the original baseline programs to those generated by our
loop fission technique, and we also compare our technique to an
alternative loop fission technique implemented in the ROSE compiler.

------------------------------------------------------------------------
ABOUT THE ARTIFACT
------------------------------------------------------------------------

The artifact is a self-contained Docker image. Any system with a Docker
installation should be able to run it. However, it was built on
linux/amd64 for the linux/amd64 platform, and may perform sub-optimally
on other platforms.

We provide two evaluation strategies, which differ in scope and
duration. The small evaluation performs a partial evaluation and takes
about 10 minutes; it is intended for the test phase, to check that the
artifact is functional. The full evaluation runs all benchmarks with
all options and provides the basis for the claims in the paper. Its
duration depends on the number of available cores, but it takes
approximately 3-4 hours.

We expect our results to be reproducible on common hardware;
replication should be possible even on a laptop.

Minimum system requirements:

* Docker
* 4 cores, for parallelism
* 8 GB memory
* A linux/amd64 platform is recommended for accurate results

------------------------------------------------------------------------
GETTING STARTED
------------------------------------------------------------------------

1. Validate the integrity of the container by generating a checksum
   with the command-line tool for your platform:

   Linux:   sha256sum loop-fission.tar.gz
   Windows: CertUtil -hashfile loop-fission.tar.gz SHA256
   MacOS:   shasum -a 256 loop-fission.tar.gz

   The checksum should match:

   605c801b41a9938e35d6b8592200916baece2c56c2f54dbe85f0f1bb36072077

2. Load the artifact image:

   ```
   docker image load -i loop-fission.tar.gz
   ```

   Expected output: Loaded image: loop-fission:latest

3. Create a local directory for storing benchmark results and plots:

   ```
   mkdir artifact_eval
   ```

   This directory will contain the captured results and will persist
   after the container exits.

4. Start the container:

   ```
   docker run -v $(pwd)/artifact_eval:/loop-fission/eval -it --rm loop-fission /bin/bash
   ```

   Expected output: root@xxxxxxxxxxxx:/loop-fission#

------------------------------------------------------------------------
CONTENT OVERVIEW
------------------------------------------------------------------------

After the interactive shell has launched, you may explore the artifact
contents, which are organized as follows.

BENCHMARK DIRECTORIES

These directories contain the benchmarks to be evaluated. They are
organized according to the applied transformation (if any). All
directories contain the same number of files, with matching filenames;
we compare, e.g., original/3mm.c vs. fission/3mm.c vs. alt/3mm.c.
| Directory | Description                                                 |
|-----------|-------------------------------------------------------------|
| original  | (baseline) unmodified benchmarks                            |
| fission   | transformed and parallelized, using our method              |
| alt       | transformed and parallelized, using the alternative method  |

OTHER RELEVANT DIRECTORIES AND FILES

The remaining relevant files and directories are used to perform the
timing and to plot the results.

| File/Dir    | Description                                               |
|-------------|-----------------------------------------------------------|
| eval        | evaluation results are stored here                        |
| headers     | header files for the benchmarks                           |
| plot.py     | generates tables and plots from the results               |
| ref_eval    | reference results                                         |
| utilities   | helpers, e.g., the timing script implementation           |
| run.sh      | wrapper applying the timing script to benchmark dirs      |
| LICENSE.txt | software license                                          |

------------------------------------------------------------------------
PAPER CLAIMS
------------------------------------------------------------------------

The paper makes the following claims, which can be replicated with the
artifact:

1. Parallelizing while loops results in an appreciable gain,
   upper-bounded by the number of parallelizable loops produced by loop
   fission.
2. For other loops, our technique yields speedup potential comparable
   to that of other automatic loop transformation tools.

In addition, the following paper figures and tables are reproducible
with the artifact:

* Figure 6 and Table 3 can be reproduced after running the full
  evaluation, as described below.
* Figure 5 is a simplified version of original/bicg.c vs.
  fission/bicg.c (a rough sketch of this kind of transformation is
  shown below).
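To give a concrete (if simplified) picture of the difference between
the original/ and fission/ variants, the following is a minimal,
hypothetical sketch; it is not taken from any benchmark in the
artifact, and all names (kernel_original, kernel_fission, x, y, a, b,
N) are placeholders. It only illustrates the basic pattern: a loop
performing independent updates is split (fissioned) into separate
loops, each of which can then be parallelized with OpenMP.

```
/* Hypothetical sketch; compile with: cc -fopenmp -c sketch.c */
#define N 1000  /* placeholder problem size */

/* original/ style: one loop performing two independent updates */
void kernel_original(double x[N], double y[N],
                     const double a[N], const double b[N]) {
    for (int i = 0; i < N; i++) {
        x[i] = x[i] + a[i];
        y[i] = y[i] * b[i];
    }
}

/* fission/ style: the loop is split into two loops, and each
 * resulting loop is parallelized with an OpenMP pragma */
void kernel_fission(double x[N], double y[N],
                    const double a[N], const double b[N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = x[i] + a[i];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = y[i] * b[i];
}
```

The actual benchmarks, and in particular the non-canonical (e.g.,
while) loops targeted by the paper, require the analysis described in
the paper before such a split is legal; the sketch only shows the
general shape of the result.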
------------------------------------------------------------------------
SMALL EVALUATION (10 MINUTES)
------------------------------------------------------------------------

The intended use case for the small evaluation is to check that the
artifact functions as expected. The evaluation proceeds in two phases:
first capture the timing results, then generate plots and tables from
those results.

The small evaluation runs on a subset of data sizes (SMALL, MEDIUM,
LARGE), all optimization levels (-O0 to -O3), and 9 benchmarks. The
expected duration is about 10 minutes.

1. Measure running times:

   ```
   make small
   ```

   The expected output will look similar to the following:

   ✓ (11:15:11) done with (SMALL, -O0, original): bicg
   ✓ (11:15:12) done with (SMALL, -O0, original): colormap
   ✓ (11:15:12) done with (SMALL, -O0, original): conjgrad
   ✓ (11:15:12) done with (SMALL, -O0, original): deriche
   ✓ (11:15:13) done with (SMALL, -O0, original): fdtd-2d
   ✓ (11:15:13) done with (SMALL, -O0, original): gesummv
   ⋮

   The timestamp (NN:NN:NN) indicates the completion time of each
   benchmark and helps estimate the remaining time to completion.

2. Plot the results:

   ```
   make plots
   ```

   The results are stored in the plots directory (artifact_eval/plots
   on the host). To review the timing results directly in the container
   terminal, you can also run:

   ```
   python3 plot.py -d speedup -f md --digits 2 --show
   ```

   Note that these results are incomplete and therefore not fully
   comparable to the paper's results. They can be partially compared to
   Figure 6 and Table 3, for matching data sizes and benchmarks.

------------------------------------------------------------------------
FULL EVALUATION (2 HOURS)
------------------------------------------------------------------------

The full evaluation is performed similarly and proceeds in the same two
phases: first capture the timing results, then generate plots and
tables from those results.

The full evaluation runs on all data sizes (SMALL to EXTRALARGE), all
optimization levels (-O0 to -O3), and all benchmarks. The expected
duration is about 2 hours on a 4-core linux/amd64 host; based on our
experiments, this estimate may be inaccurate on other hosts or
architectures.

1. Measure running times:

   ```
   make all
   ```

   This command first times the original directory, then the fission
   directory, and lastly the alt directory. Wait for this timing phase
   to run to completion.

2. Plot the results:

   ```
   make plots
   ```

3. To generate Figure 6:

   ```
   python3 plot.py -d speedup -f plot --prog_filter bicg,gesummv,mvt --dir_filter original,fission
   ```

4. To generate Table 3:

   ```
   python3 plot.py -d speedup -f md --digits 2 --show
   ```

The results are expected to differ slightly due to containerization and
machine details, but the general claims of the paper, as stated
earlier, should still hold.

The plotting options are highly customizable. To generate additional
plots, review the available plotting arguments: `python3 plot.py --help`.

------------------------------------------------------------------------
EVALUATION OUTSIDE CONTAINER
------------------------------------------------------------------------

Containerization adds some overhead and can skew timing results. To
avoid this, run the benchmarks directly on the host. The easiest
approach is:

1. Clone the benchmarks from source:

   ```
   git clone https://github.com/statycc/loop-fission.git
   ```

2. Make sure the host has a compiler with OpenMP support, make, and
   Python 3.8 or higher.

3. Install the Python dependencies needed for plotting:

   ```
   python3 -m pip install -q -r requirements.txt
   ```

4. Repeat the evaluation, small or full, and plot the results.

------------------------------------------------------------------------
FUTURE REUSE AND EXTENSIONS
------------------------------------------------------------------------

This artifact is a collection of benchmarks and is suitable for
extension and reuse by modifying or adding benchmarks. That is, it is
possible to add (or remove) benchmarks and repeat the timing to obtain
new results.

The artifact extends the timing utilities of PolyBench/C, version 4.2,
and is compatible with all original PolyBench/C benchmarks, as well as
with any program that follows the PolyBench/C template. For
convenience, the template is provided in
utilities/template-for-new-benchmark.c (a rough structural sketch
appears at the end of this README).

New benchmarks can be added as follows:

1. Add the non-transformed benchmark to the original directory.
2. Add its .h file to the headers directory.
3. Add the benchmark, after transformation and parallelization as
   described in the paper, to the fission directory.
4. Add the comparative transformation output to the alt directory.

Alternative comparison targets can be evaluated by similarly
transforming and parallelizing the original benchmark into the
alternative target, then repeating the timing.

Note that we provide detailed instructions for transforming benchmarks
with the ROSE compiler in utilities/readme.md, along with the automated
script utilities/rose.sh. Reproducing those transformations requires
building ROSE from source (non-trivial) and is outside the scope of
this artifact. In this artifact, we assume all transformations have
already been applied; the concern is timing their performance
post-transformation.
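For orientation, here is a rough, hypothetical sketch of the overall
shape such a new benchmark can take. It is not copied from the
artifact: names such as kernel_example and the problem size N are
placeholders, and the artifact's own timing harness (derived from
PolyBench/C) is replaced here by omp_get_wtime(), only so that the
sketch compiles and runs on its own. The file
utilities/template-for-new-benchmark.c remains the authoritative
starting point.

```
/* Hypothetical standalone benchmark sketch; compile with:
 *   cc -O2 -fopenmp example.c -o example
 * All names and sizes are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 2000  /* placeholder size; real benchmarks take their data
                   sizes (SMALL ... EXTRALARGE) from their headers */

static void init_array(int n, double A[n], double B[n]) {
    for (int i = 0; i < n; i++) {
        A[i] = (double) i / n;
        B[i] = (double) (n - i) / n;
    }
}

/* The kernel to be measured; in PolyBench/C-style benchmarks the
 * region of interest is typically delimited by scop pragmas. */
static void kernel_example(int n, double A[n], double B[n]) {
#pragma scop
    for (int i = 1; i < n; i++)
        A[i] = (A[i] + B[i - 1]) / 2.0;
#pragma endscop
}

int main(void) {
    double *A = malloc(N * sizeof(double));
    double *B = malloc(N * sizeof(double));
    if (!A || !B) return 1;

    init_array(N, A, B);

    /* stand-in for the artifact's timing utilities */
    double start = omp_get_wtime();
    kernel_example(N, A, B);
    double end = omp_get_wtime();

    printf("%0.6f\n", end - start);

    free(A);
    free(B);
    return 0;
}
```

The fission/ counterpart of such a file would carry the transformed,
OpenMP-parallelized kernel, following the same pattern as the sketch
shown earlier in this README.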