We claim that the results for Figures 8-12 are replicable (within reasonable expectations of performance variation).
We conducted the experiments on a machine with four Non-Uniform Memory Access (NUMA) nodes, each holding an 8-core Intel(R) Xeon(R) E5-4650 processor (32 cores in total) operating at 2.70 GHz, with a 32 KB L1 data cache and a 256 KB L2 cache per core, and an 80 MB LLC shared across the 4 NUMA nodes. The machine had 190 GB of RAM and 8 GB of swap space.
The artifact expects the machine to have at least 32 cores. Schedule generation requires ~120 GB of RAM; however, we have included the final schedules in the `test_schedules` directory so that the user does not need that much RAM. If you perform the full schedule generation, you will need an extra 10 GB of disk space. Since the tensors take about 10 GB, executing the artifact requires at least 20 GB of disk space in total.
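To check that your machine meets these requirements, standard Linux utilities are enough (these checks are our suggestion and are not part of the artifact scripts):

```bash
nproc      # core count; the artifact expects at least 32
free -h    # available RAM; ~120 GB is needed for full schedule generation
df -h .    # free disk space; at least 20 GB is needed
```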
The artifact requires Linux with cmake, gcc, g++, python3, python3-pip, python3-venv, git, wget, libomp-dev, zip, and unzip, as well as the Python packages z3, z3-solver, matplotlib, numpy, pandas, pillow, regex, and seaborn.
Most of the scripts must be run from the `tensor-schedules` directory.
The artifact is bundled as an OCI container image created with Docker (the Dockerfile is available as part of the tarball on Zenodo). The Docker image is tarred as `sparseauto-docker-image.tar`.
The image can be loaded into the local Docker store as follows:
docker load --input sparseauto-docker-image.tar
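Once the load completes, you can confirm that the image is present (a generic Docker check, not an artifact-specific step):

```bash
docker images | grep sparseauto
```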
For your convenience, we have also included a Dockerfile so that the reviewer can build the Docker image themselves (you need to run this from inside the `oopsla24-artefact` directory):
docker image build -t sparseauto-oopsla -f Dockerfile .
If you are running the Docker image on an ARM-based Mac, you can use the command in `dockerhelp.md`.
Once you have the image, start the session as follows:
docker run --name sparseauto-oopsla-instance -td sparseauto-oopsla
Then, log into the container:
docker exec -it sparseauto-oopsla-instance bash
Once you log into the Docker container, you should be inside the `/home/oopsla/tensor-schedules` directory.
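You can verify this with a quick check (our suggestion):

```bash
pwd   # expected: /home/oopsla/tensor-schedules
```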
If you prefer not to run Docker, please refer to the Step-by-step Guide / Building from the source section.
From inside the `tensor-schedules` directory, execute the commands below to confirm that everything is working properly.
# download matrices/tensors - these are already pre-bundled with the docker
# image/source artifact. Confirm all the tensors are downloaded by running this script.
./download_tensors.sh
# build the TACO/SparseLNR project
./build_taco.sh
# execute a simple script to generate a plot
./kick_the_tires.sh
Once the execution of `kick_the_tires.sh` finishes, there should be a plot at `tensor-schedules/plots/fig8/plot3.png`. The script should take less than 30 minutes to finish. This plot contains a subset of the tensors for the first subplot in Fig. 8 in the paper.
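To confirm that the plot was produced, you can list it from inside `tensor-schedules` (our suggestion, not an artifact step):

```bash
ls -lh plots/fig8/plot3.png
```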
If you're inside the Docker container, you can copy the image to your local machine to view it using the command below.
docker cp sparseauto-oopsla-instance:/home/oopsla/tensor-schedules/plots/fig8/plot3.png ~/<directory you want the image to be saved>/
After completing the Getting Started guide, execute the scripts below from inside the `tensor-schedules` directory to obtain the subplots of Figures 8-12 in the paper.
./figure8.sh
./figure9.sh
./figure10.sh
./figure11.sh
./figure12.sh
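Some of these scripts run for a long time (see the note on runtimes below), so you may want to run them in the background and capture a log; this is our suggestion rather than part of the artifact:

```bash
# run a figure script in the background and keep its output in a log file
nohup ./figure9.sh > figure9.log 2>&1 &
```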
Due to time constraints, we have reduced the number of iterations in some of the tests. For example, we originally used 32 iterations for the Figure 8 generation, but for this artifact we have reduced the number of iterations to 4 so that the reviewer does not have to wait too long to generate the plots. The number of iterations can be changed by passing the argument `--iterations <number of iterations>` to the `src.main_run_test_modified` Python script call in the figure script.
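For example, a hypothetical edit inside one of the figure scripts might look like the following (the surrounding flags are elided here; take them from the script itself):

```bash
# hypothetical excerpt from figure8.sh: restore the original iteration count
python3 -m src.main_run_test_modified ... --iterations 32
```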
The combined Figure 8 plot is saved in the `tensor-schedules/plots/fig8` directory under the name `plots-combined`, and the plots for Figures 9-12 are saved in the `tensor-schedules/plots/fig9-12` directories. The data points used to generate the plots are saved in a directory called `tensor-schedules/csv_results`, but we do not expect the reviewers to read them, as the generated plots are saved directly to the `tensor-schedules/plots` directory.
Testing is done by generating the schedule as a string and then substituting it into the file `sparseLNR/test/test-workspaces.cpp`. If a process hangs, it may leave this file in a state that causes compilation errors. For this scenario, we have included a `fixme.sh` script that copies a fresh file from `tensor-schedules/fixme/test-workspaces.cpp` to `sparseLNR/test/test-workspaces.cpp`.
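If you hit such compilation errors, running the script from inside `tensor-schedules` restores a clean copy:

```bash
./fixme.sh
```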
We expect that the user can finish all testing in less than a week.
You can also use the source directly, as it may give faster execution times. We assume that the user has access to a Linux/Ubuntu-based machine.
Please install the packages below (you can look at the Dockerfile for the exact steps).
# install packages
sudo apt-get -y update && sudo apt-get -y install cmake gcc g++ python3 python3-pip python3-venv git wget libomp-dev zip unzip
# from inside the oopsla24-artefact directory, create a python virtual environment and
# install python modules
pip install --no-cache-dir --upgrade pip
pip install --no-cache-dir virtualenv
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --no-cache-dir -r ./tensor-schedules/requirements.txt
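As a quick sanity check (our suggestion, not an artifact step), verify that the required Python modules import cleanly; note that the pillow package is imported as PIL:

```bash
python3 -c "import z3, matplotlib, numpy, pandas, PIL, regex, seaborn"
```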
Now move inside the `tensor-schedules` directory:
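Assuming you are at the artifact root (`oopsla24-artefact`):

```bash
cd tensor-schedules
```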
The artifact contains two main directories. The first, `sparseLNR`, contains the code generation component.
We have added tests to `sparseLNR/test/test-workspaces.cpp`. Should the user intend to add their own tests, they should follow one of the tests in that file. More information on writing TACO code can be found on the TACO website and the Documentation page.
The schedule generation framework is included in the directory called `tensor-schedules`. The generated schedule is compiled into an equivalent SparseAuto schedule string. We paste this schedule string into the `sparseLNR/test/test-workspaces.cpp` file between the `/* BEGIN <test name> */` and `/* END <test name> */` tags.
Schedule generation is driven by a config file in which the user provides all the constraints. Example config files are included in the `tensor-schedules/test_configs` directory.
Please refer to `tensor-schedules/test_configs/test3_config_kick_the_tires.json` for an example config and `tensor-schedules/kick_the_tires.sh` for the execution commands. The test corresponding to `test3_config_kick_the_tires.json` is given by the "test_name" attribute in the JSON file, here `sddmm_spmm_real`. The annotated config is shown below:
{
"accesses": { # A(i,l) = B(i,j) * C(i,k) * D(j,k) * E(j,l)
"A": ["i", "l"],
"B": ["i", "j"],
"C": ["i", "k"],
"D": ["j", "k"],
"E": ["j", "l"]
},
"tensor_idx_order_constraints": {
"B": [ # B: B(i,j) here is sparse, therefore j should appear after i
["j", "i"]
]
},
"output_tensor": "A", # output tensor in the computation is A
# defines files to save the configs
"test_json_file": "test3_without_z3_pruning.json",
"test_json_file_without_depth": "test3_without_depth_pruning.json",
"test_json_file_after_z3": "test3_with_z3_pruning.json",
"test_best_schedule_file": "test3_best_schedule.json",
# corresponding test name in sparseLNR/test/tests-workspaces.cpp file
# corresponding default test name is default_sddmm_spmm_real
"test_name": "sddmm_spmm_real",
# definition of z3 constraints for filter stage 3
"z3_constraints": [
"i >= 11000", "i <= 1000000",
"j >= 11000", "j <= 1000000",
"k >= 8", "k <= 256",
"l >= 8", "l <= 256",
"jpos >= 0", "jpos <= j",
"1000 * i * jpos < i * j",
"i * j < 1000000 * i * jpos"
],
# timing values after comparing against default TACO schedule
"output_csv_file": "test3.csv",
# matrices to evaluate on
"eval_files": ["bcsstk17.mtx", "cant.mtx", "consph.mtx", "cop20k_A.mtx"],
# actual bounds of the tensors for runtime pruning stages
"actual_values": {
"bcsstk17.mtx": {"i": 11000, "j": 11000, "k": 16, "l": 16, "jpos": 39},
"cant.mtx": {"i": 62000, "j": 62000, "k": 16, "l": 16, "jpos": 65},
"consph.mtx": {"i": 83000, "j": 83000, "k": 16, "l": 16, "jpos": 72},
"cop20k_A.mtx": {"i": 12000, "j": 12000, "k": 16, "l": 16, "jpos": 218}
}
}
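To see how a config file like this is consumed, look at the commands in `kick_the_tires.sh`. A call might look roughly like the following; the arguments shown here are hypothetical, so take the exact invocation from the script:

```bash
# hypothetical invocation of the schedule-generation entry point
python3 -m src.main_run_test_modified test_configs/test3_config_kick_the_tires.json --iterations 4
```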
There is a corresponding test in `sparseLNR/test/tests-workspaces.cpp` whose name contains `sddmm_spmm_real`. The template below describes the basic structure of the tests included in `tests-workspaces.cpp`.
TEST(workspaces, <test name>) {
[variable declarations]
[load tensor file for reading]
[tensor declarations and packing]
[index declarations]
[computation declaration]
...
/* BEGIN <test name> */
...
/* END <test name> */
[extra transformations]
...
[declare expected (no transformations)]
[declare timing variables]
for (int [var] = 0; [var] < [any integer]; [var]++) {
[time the computation]
[time computation without transformations]
...
}
...
}
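After building, an individual test can be run through GoogleTest's name filter. The binary path below is an assumption based on a typical TACO-style build tree; adjust it to your actual build directory:

```bash
# run only the tests whose names contain sddmm_spmm_real (hypothetical binary path)
./sparseLNR/build/bin/taco-test --gtest_filter='workspaces.*sddmm_spmm_real*'
```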
Should the user want to change the schedule generation algorithm, they should look at the `get_schedules_unfused` and other functions in `tensor-schedules/src/autosched.py`. Depth-based pruning logic is implemented in `src/prune.py`. POSET-based pruning logic is implemented in `src/solver_config.py`. Z3-based pruning logic is also implemented in the `prune_baskets` function in `src/solver_config.py`.
The user can either choose to change these functions or add new pruning stages
following the same implementation logic.
When we save schedules after the compile-time stages, we allocate them to baskets, where a basket holds schedules with the same time and memory complexities. Runtime pruning logic is implemented in `src/basket.py`.
The code generation logic is implemented in `src/generate_taco_schedule.py`.