Checking Data-Race Freedom of GPU Kernels, Compositionally

CAV’21 Artifact

by Tiago Cogumbreiro, Julien Lange, Dennis Liew Zhen Rong, and Hannah Zicarelli

This paper introduces Faial, a tool that checks data-race freedom (DRF) of CUDA kernels.

The artifact submission contains:

The structure of this document:

  1. Setting up the Container
  2. Proofs
  3. Reproducing Experimental Results
    1. Claim 1: Correctness
    2. Claim 2: Scalability
    3. Claim 3: Real-world usability
  4. Accessing Experimental Results
  5. Kernel Generation Framework
  6. Rebuilding this Artifact
    1. Building with Docker
    2. Building Faial
  7. Tutorial on Faial Usage

The structure of this container (also available on GitLab):

1. Setting up the Container

First, install and start Docker.

The Docker image is available both as a compressed tar archive and online. Choose one of the two following methods. In both cases, the container starts an interactive terminal session in the directory /artifact; the environment variable FAIAL_HOME points to this location.

Loading the Docker image from a tar archive

  1. Ensure you are in the root of this artifact. You should see the compressed tar archive artifact-354.tar.bz2.

  2. To load the Docker container from this archive, run:

    $ docker load < artifact-354.tar.bz2
  3. To enter the container with an interactive terminal session, run:

    $ docker run -it -p 8000:8000 faial-cav21

Loading the Docker image from the online registry

  1. To download the image from the web, run:

    $ docker pull registry.gitlab.com/umb-svl/faial-artifact-cav21/artifact:latest
  2. To enter the container with an interactive terminal session, run:

    $ docker run -it -p 8000:8000 registry.gitlab.com/umb-svl/faial-artifact-cav21/artifact:latest

2. Proofs

Mechanized proofs supporting theoretical results are available locally at faial-coq/ and online at GitLab.

To check the proofs, run make:

$ cd $FAIAL_HOME/faial-coq
$ make clean # Clean any already compiled proofs (optional)
$ make       # Compile all files, checking the proofs

The file _CoqProject lists all files that will be compiled and thus have their proofs checked. Below we list the file, line number, and name of each definition/theorem; e.g., Main.v:619 theorem drf corresponds to the theorem drf on line 619 of file faial-coq/src/Main.v. For your convenience, we also provide a hyperlink to each file in our GitLab repository (branch cav21).
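For example, to locate one of these statements without leaving the container, you can search the Coq sources with grep. This is only a convenience sketch; the exact output depends on how the statement is declared in the source file:

$ grep -n "drf" $FAIAL_HOME/faial-coq/src/Main.v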

Results

Figure 2

Figure 3

Figure 4

Noteworthy differences between the paper and the Coq mechanization

OCaml implementation

3. Reproducing Experimental Results

This section contains instructions on generating the data used in the paper.

The CSV data, logs, and plots used in the paper are already included in each of the respective directories. Rerunning the experiment will overwrite the results.

To visualise the generated data, the Docker container includes an HTTP server exposing $FAIAL_HOME on port 8000. To access the data, ensure the container is running and open the following URL in your favourite browser (on the host machine): localhost:8000.

See Accessing Experimental Results for more details.

Warning: Rerunning the experiment will overwrite the bundled logs/figures that support the paper with your own logs/figures! You can revert to the original logs/figures via a backup copy of:

3.1 Claim 1: Correctness

Expected runtime of this experiment: ~20 minutes.

This section details our experimental dataset, results, and procedure related to Table 1 in Claim 1: Correctness. This experiment requires manual processing! While we provide scripts to generate data, verifying the correctness of data requires manual examination.

  1. All files relating to Claim 1: Correctness are stored in the datasets/correctness directory.

    $ cd $FAIAL_HOME/datasets/correctness
    
  2. The dataset for Table 1 is split into Tests 1-5. Test 1 can be found at {TOOL}/real-world/transposeDiagonal.cu (one copy per tool), e.g., faial/real-world/transposeDiagonal.cu. Tests 2-5 can be found at {TOOL}/synthetic/{TEST}.cu (one copy per tool), e.g., gklee/synthetic/last-iter.cu. Each test has a DRF version and a racy version, which are distinguished by the filename. For instance, {TOOL}/synthetic/last-iter-drf.cu is DRF and {TOOL}/synthetic/last-iter.cu is racy.

    Automatic scripts are provided to rerun the tools against the dataset:

    $ python3 run.py --tool faial      # runtime: ~5s
    $ python3 run.py --tool gpuverify  # runtime: ~50s
    $ python3 run.py --tool pug        # runtime: ~3s
    $ python3 run.py --tool gklee      # runtime: ~7m  /!\ WARNING THIS MAY CRASH DUE TO GKLEE
    $ python3 run.py --tool sesa       # runtime: ~12m /!\ WARNING THIS MAY CRASH DUE TO SESA
    

    The above commands will generate logs and a timings-{TOOL}.csv file for each tool. This file contains tool exit statuses, running time and memory usage, paths to tool-specific kernels, paths to tool logs, and the DRF/racy results obtained by parsing the logs.

  3. A script is provided to generate a table with the data generated above:

    $ python3 table.py
    

    This table presents the results from the timing CSVs in a more readable format. The following is the output of the table script as observed in our experiment.

    example                   expected    faial    gpuverify    pug    gklee           sesa
    ------------------------  ----------  -------  -----------  -----  --------------  --------------
    transposeDiagonal         racy        racy     racy         drf    timeout         timeout
    transposeDiagonal-drf     drf         drf      racy         drf    timeout         timeout
    first-iter                racy        racy     racy         racy   timeout         timeout
    first-iter-drf            drf         drf      racy         racy   timeout         timeout
    last-iter                 racy        racy     racy         racy   timeout         timeout
    last-iter-drf             drf         drf      racy         drf    timeout         timeout
    last-iter-first-iter      racy        racy     racy         racy   timeout         timeout
    last-iter-first-iter-drf  drf         drf      racy         racy   timeout         timeout
    read-index-racy           racy        racy     racy         racy   no race alarms  no race alarms
    read-index                drf         racy     drf          racy   no race alarms  no race alarms
    

    Note that while this table displays some information regarding raciness, these results must be validated manually, as we explain below.

  4. To count and verify the correctness of data-races, the logs must be manually examined for each racy result. The objective of this manual analysis is to count the number of data-races reported and to determine whether the error traces raised by the tools reflect real data-races. All information provided about a race is considered, e.g., the state of local and global variables, the types of accesses (read/write), and the source code line numbers of the accesses.

    For DRF test components, it is only necessary to count the reported races, as they can be assumed to be invalid. For racy test components, it is additionally necessary to verify the correctness of each reported data-race.

    We include a file with this analysis for each racy tool log in our results. Each analysis file is a .txt file corresponding to the .log file with the tool output. For example, a data-race is reported by Faial in faial/synthetic/read-index-racy-1.log, and we provide an analysis of this race in faial/synthetic/read-index-racy-1.txt.

    To verify the data-races in the tool logs, a working understanding of the data-races in question is helpful. The paper provides context for these races through the respective memory access protocols:

3.2 Claim 2: Scalability

Expected runtime of this experiment: ~1 hour (with --repeat 1) or ~5 hours (with --repeat 5).

This section details our experimental dataset, results, and procedure related to Figure 8 in Claim 2: Scalability.

  1. All files relating to Claim 2: Scalability are stored in the datasets/micro-benchmarks directory.

    $ cd $FAIAL_HOME/datasets/micro-benchmarks
    
  2. Tool-specific versions of the synthetic dataset used for this experiment are stored in directories named after their respective tools. To run the tools against the dataset:

    $ python3 run.py --repeat 5 --tool faial      # runtime: ~27m
    $ python3 run.py --repeat 5 --tool pug        # runtime: ~17m
    $ python3 run.py --repeat 5 --tool sesa       # runtime: ~7m  /!\ WARNING THIS MAY CRASH DUE TO SESA
    $ python3 run.py --repeat 5 --tool gklee      # runtime: ~7m  /!\ WARNING THIS MAY CRASH DUE TO GKLEE
    $ python3 run.py --repeat 5 --tool gpuverify  # runtime: ~4hr
    

    The above commands were used to produce the results in the paper; we ran all tools 5 times on each problem. They generate a timings-{TOOL}.csv file for each tool, containing tool exit statuses, running time and memory usage, paths to tool-specific kernels, paths to tool logs, and the DRF/racy results obtained by parsing the logs (a quick way to inspect these CSVs is sketched after this list).

  3. To generate the graphs as in the paper, run the following command:

    $ python3 ../../benchmark/benchmark-graph.py -mb
    

    The generated graphs are named Micro-benchmark-time-1-50.pdf and Micro-benchmark-memory-1-50.pdf.
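To take a quick look at the raw timing data produced in step 2, standard shell tools suffice. A minimal sketch, assuming you have already generated timings-faial.csv in this directory:

$ head -n 3 timings-faial.csv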

3.3 Claim 3: Real-world usability

Expected runtime of this experiment: ~40 minutes (with --repeat 1) or ~3.5 hours (with --repeat 5).

This section details our experimental dataset, results, and procedure related to Figure 9 in Claim 3: Real-world usability.

  1. All files relating to Claim 3: Real-world usability are stored in the datasets/gpuverify-cav14 directory.

    $ cd $FAIAL_HOME/datasets/gpuverify-cav14
    
  2. Tool-specific versions of the dataset used for this experiment are found in directories named after their respective tools. To run each tool against the dataset:

    $ python3 run.py --repeat 5 --tool faial      # runtime: ~9m
    $ python3 run.py --repeat 5 --tool pug        # runtime: ~3m
    $ python3 run.py --repeat 5 --tool gpuverify  # runtime: ~3hr
    

    The commands above were used to produce the results in the paper; we ran all tools 5 times on each kernel. They generate a timings-{TOOL}.csv file for each tool, containing tool exit statuses, running time and memory usage, paths to tool-specific kernels, paths to tool logs, and the DRF/racy results obtained by parsing the logs.

  3. Lastly, to generate the graphs as in the paper, run the following command:

    $ python3 ../../benchmark/benchmark-graph.py -rw
    

    The 3 pie charts for Faial, GPUVerify, and PUG are named faial-stats.pdf, gpuverify-stats.pdf, and pug-stats.pdf respectively. The generated scatter graph is named time-relation-faial-scatter.pdf.

4. Accessing Experimental Results

The Docker container includes an HTTP server exposing $FAIAL_HOME on port 8000, which lets you download logs, plots, and other files from inside the container. To access the experimental results, ensure the container is running and navigate to: localhost:8000

Claim  Path                                                        See
-----  ----------------------------------------------------------  -----------
3.1    (steps 3 and 4)                                             Table 1
3.2    datasets/micro-benchmarks/Micro-benchmark-time-1-50.pdf     Fig 8 (lhs)
3.2    datasets/micro-benchmarks/Micro-benchmark-memory-1-50.pdf   Fig 8 (rhs)
3.3    datasets/gpuverify-cav14/faial-stats.pdf                    Fig 9.a
3.3    datasets/gpuverify-cav14/gpuverify-stats.pdf                Fig 9.b
3.3    datasets/gpuverify-cav14/pug-stats.pdf                      Fig 9.c
3.3    datasets/gpuverify-cav14/time-relation-faial-scatter.pdf    Fig 9.d
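For example, to download one of the plots above onto the host machine through the HTTP server, something like the following should work (a sketch; it assumes wget is installed on the host):

$ wget http://localhost:8000/datasets/gpuverify-cav14/faial-stats.pdf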

5. Kernel Generation Framework

Optional documentation of our kernel generation and benchmarking framework is provided in FRAMEWORK.md. Details include experiment configuration file parameters and the generation of tool-specific kernels from tool-agnostic templates.

6. Rebuilding this Artifact

This section covers reproducing the Docker container and building Faial from source.

6.1 Building with Docker

To reproduce the Docker container, first install and start Docker.

Building the tar archive

  1. Ensure you are in the root of the artifact. You should see the file Dockerfile.

  2. To build the image, run:

    $ docker build --tag faial-cav21 . 
  3. To save the image, run:

    $ docker save faial-cav21 | bzip2 > artifact-354.tar.bz2

Building without Docker

To reproduce this environment natively without Docker, follow along with the commands run by the provided Dockerfile. This is known to work on Ubuntu 20.04; other systems will require adapting the package names and commands to those provided by your system.
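For instance, one quick way to list the commands the Dockerfile runs is to filter its instructions with grep (a sketch; the set of instruction keywords in the actual Dockerfile may differ from this filter):

$ grep -E '^(RUN|ENV|WORKDIR)' Dockerfile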

6.2 Building Faial

The source for Faial is split across three repositories: faial, faial-infer, and c-to-json. Each repository is both available online and included with this artifact in directory source/. Note that the source used for the version of Faial in this artifact is located in branches named cav21.

See the Faial README for instructions on building from scratch.

Prebuilt Linux Binaries

We additionally provide prebuilt Linux binaries for Faial:

7. Tutorial on Faial Usage

As a next step, you may want to view our tutorial on using Faial to verify your own CUDA programs! This may be found locally at source/faial/tutorial/ or online in the Faial source repository.

Additionally, you can manually run a single kernel from Claim 3's CAV14 dataset by directly calling faial on it with the --parse-gv-args option. For example:

$ cd $FAIAL_HOME/datasets/gpuverify-cav14/
$ faial --parse-gv-args faial/CUDA20/scan/best/kernel.cu
  Program is data-race free!

The text editors vim and nano are included in the container so you may alter kernels and verify them. Please enjoy exploring verification with Faial.
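For example, a hypothetical edit-and-verify loop on the kernel used above might look like this (the output of the second command will depend on the edits you make):

$ nano faial/CUDA20/scan/best/kernel.cu                   # alter the kernel
$ faial --parse-gv-args faial/CUDA20/scan/best/kernel.cu  # verify it again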