# Checking Data-Race Freedom of GPU Kernels, Compositionally

Tiago Cogumbreiro<sup>1</sup>, Julien Lange<sup>2</sup>, Dennis Liew Zhen Rong<sup>1</sup>, and Hannah Zicarelli<sup>1</sup>

<sup>1</sup> University of Massachusetts Boston
<sup>2</sup> Royal Holloway, University of London

Abstract. GPUs offer parallelism as a commodity, but they are difficult to program correctly. Static analyzers that guarantee data-race freedom (DRF) are essential to help programmers establish the correctness of their programs (kernels). However, existing approaches produce too many false alarms and struggle to handle larger programs. To address these limitations we formalize a novel compositional analysis for DRF, based on access memory protocols. These protocols are behavioral types that codify the way threads interact over shared memory. Our work includes fully mechanized proofs of our theoretical results, the

first mechanized proofs in the field of DRF analysis for GPU kernels. Our
 theory is implemented in Faial (anonymized name), a tool that outper forms the state-of-the-art. Notably, it can correctly verify at least 1.41×
 more real-world kernels, and it exhibits a linear growth in 4 out of 5

<sup>20</sup> experiments, while others grow exponentially in all 5 experiments.

## <sup>21</sup> 1 Introduction

1

2

3

4

 $\mathbf{5}$ 

6

7

8

9

10

11

12

13

14

15

<sup>22</sup> GPUs are massively parallel devices that promise a great return on investment <sup>23</sup> at a cost: they are notably difficult to program. In GPU programming, hundreds <sup>24</sup> of lightweight threads share portions of arrays in parallel (without locks) — <sup>25</sup> very different from the programming model of multithreaded programs written <sup>26</sup> in C or Java with heavy-weight heterogeneous threads. Data-race freedom (DRF) <sup>27</sup> analysis aims to guarantee that for all possible executions, every array cell being <sup>28</sup> written by one thread cannot be concurrently accessed by another thread.

In the field of static analysis of DRF in GPU programs, there is a tension 29 between efficiency and correctness (no missed data-races and no false alarms) 30 that thus far is unresolved. Bug finding tools [25, 26, 33] favor correctness over 31 efficiency: they provide correct results at small scales, by simulating the program 32 execution. Such tools are incapable of handling certain parameters symbolically 33 (e.q., array size) and can easily exhaust users' resources (e.q., loops with long)34 iteration spaces or unknown bounds). Approaches based on Hoare logic [5,7,21] 35 can cope with medium-sized programs, do not miss data-races, and do not require 36 array size information; however, they suffer from a high-rate of false alarms and 37 require code annotations written by concurrency experts. Finally, tools that can 38



Fig. 1: Work-flow of the verification.

cope with larger programs and do not require array size information either miss
 data-races [23] or overwhelm the user with false alarms [38].

To appease this tension, we introduce a novel static DRF analysis that can 41 handle larger programs and produce fewer false alarms than related work, with- $^{42}$ out missing data-races. Additionally our analysis does not require code anno-43 tations or array size information. Our verification framework hinges on access 44 *memory protocols*, a new family of behavioral types [1] that codify the way  $^{45}$ threads interact through shared memory. Our behavioral types also make evi-46 dent two aspects of the analysis that can be made separate: concurrency analysis 47 (*i.e.*, could these two expressions run in parallel?) and data-race conflict detec-48 tion (*i.e.*, do these array indices match?).  $^{49}$ 

50 Contributions and synopsis This paper includes the following contributions.

(1) In §3, we formalize the syntax, semantics, and well-formedness conditions
for access memory protocols, which are behavioral types for GPU programs.
This behavioral abstraction results in a simpler yet more expressive theory than
previous works, *e.g.*, it does not require user-provided loop invariants.

(2) In §4, we show that our DRF analysis of access memory protocols can be
soundly and completely reduced to the satisfiability of an SMT formula, see
Theorems 1 and 3. Our theory and results on access memory protocols are fully
mechanized in Coq. To the best of our knowledge, this is the first mechanized
proof of correctness of a DRF analysis for GPU programs.

(3) We show that our DRF analysis of access memory protocols is compositional
 when protocols satisfy a structural property, see Theorem 2. Additionally, we
 show how to transform protocols when they do not meet this property.

(4) In §5 we present Faial, which infers access memory protocols from CUDA
 kernels and implements our theory. Our experiments show that Faial is more
 precise and scales better than existing tools.

(5) In §6, we present a thorough experimental evaluation of Faial against related work [5, 23, 25, 26], the largest comparative study of GPU verification (5 tools in 260 kernels, 3 tools compared in 487 kernels). Faial verified 217 out of 227 real-world kernels (at least  $1.42 \times$  more than other tools) and correctly verified more handcrafted tests than other tools (4 out of 5). In a synthetic benchmark suite (250 kernels), Faial is the only tool to exhibit linear growth in 4 out of 5 experiments, while others grow exponentially in all 5 experiments.

<sup>73</sup> Our paper is accompanied by an implementation (Faial, see § A), an evaluation <sup>74</sup> framework (inc. datasets), and proof scripts (in Coq) for each theorem. Should

<sup>75</sup> the paper be accepted, these will be submitted for artifact evaluation.

Listing 2.1: Examples of racy kernels, l.h.s. is from [34] and r.h.s. simplifies l.h.s. for clarity, with one-dimensional array and thread identifier, and 1-stride loops.

```
for (int r = 0; r < N; r++) {
   for (int r = 0; r < N; r++) {
1
     for (int i = 0; i<TILE DIM; i+=BLOCK ROWS)
                                                                      for (int i = 0; i < M; i++)
2
                                                                        { tile [ tid ] =
      { tile [ tid .y+i ][ tid .x] = idata [index in+i*width]; }
3
                                                                 3
                                                                                       ..:}
                                                                         syncthreads();
        syncthreads();
4
                                                                 4
     for (int j = 0; j<TILE DIM; j+=BLOCK ROWS)
                                                                      for (int j = 0; j < M; j + +)
5
                                                                 5
                                                                        \{\ldots = tile [tid+j];\}\}
     { odata[index_out+j*height] = tile [tid.x][tid.y+j];}}
6
                                                                 6
```

## 76 2 Overview

This section gives an overview of our approach by examining a data-race we 77 found in published work [16] and [34]. We discuss the challenges that such ex-78 amples pose to the state-of-the-art of DRF analysis. Then we introduce a veri-79 fication framework based on access memory protocols: behavioral types [1] that 80 codify the way threads interact via shared memory. Figure 1 gives an overview 81 of the verification pipeline. We start from CUDA kernels, from which we infer 82 access memory protocols. Protocols are then checked for well-formedness and 83 transformed in three steps into formulas that are verified by an SMT solver. 84

#### 85 2.1 Challenges of GPU Programming

GPU programming model The key component of GPU programming is the 86 kernel program, or just kernel, that runs according to the Single-Instruction-87 Multiple-Thread (SIMT) execution model, where multiple threads run a single 88 instruction concurrently. A kernel is parameterized by a special variable that 89 holds a thread identifier, henceforth named tid. In parallel, each member of a 90 group of threads runs an instantiated copy of the kernel by supplying its identifier 91 as an argument. Threads communicate via shared memory (arrays) and mediate 92 communication via barrier synchronization (an execution point where all threads 93 must wait for each other before advancing further). Writes are only visible to 94 other threads after a barrier synchronization, *i.e.*, there is no guarantee that a 95 write of a thread can be read by another before a barrier synchronization. 96

GPU programming platforms usually group threads hierarchically in multiple levels, across which no inter-groups synchronization is possible. In both the literature [6, 23] and this work, the focus is on intra-group communication as inter-group errors can be seen as a special case of intra-group errors.

**Challenges** We motivate the difficulty of analyzing data-races by studying a programming error found in the wild, reported in Listing 2.1 (left). This excerpt comes from a tutorial [34] on optimizing numeric algorithms for GPUs. The code listing transposes a matrix N-times with an outer loop indexed by variable r.

Remarkably, the tutorial [34] does not inform the readers that Listing 2.1 contains a subtle *data-race*: one transpose-operation starts (the writes to tile in line 3) without awaiting the termination of the previous transpose-operation Listing 2.2: Minimal representative example of an access memory protocol highlighting the data-race in Listing 2.1.

4

(the reads from tile in line 6), thus corrupting the data over time and possibly skewing the timing of the optimization to appear faster than it should be. We found a related data-race in [16], which reuses code from [34].

Our tool, Faial, successfully identifies the program state that triggers the 111 data-race in Listing 2.1: when r = 1 and N = 2. However, state-of-the-art tools 112 struggle to accurately analyze Listing 2.1, as evaluated in Section 6 (Claim 1: 113 Test 1). Symbolic execution tools, such as [25, 26], timeout for N > 1, and, in 114 general, cannot handle symbolic (unknown) bounds. GPUVerify [6], a tool based 115 on Hoare logic, reports a false alarm: a spurious data-race when r=0 and N=1. 116 And PUG [23] incorrectly identifies the example as DRF, as its analysis appears 117 to ignore data-races originating from different iterations of a loop. 118

#### <sup>119</sup> 2.2 Memory Access Protocols by Example

We now investigate the data-race in Listing 2.1 with an access memory proto-120 col. For presentation purposes, we focus our discussion on Listing 2.1 (r.h.s.), 121 that simplifies the l.h.s. whilst retaining the root cause of its data-race, which 122 stems from the interaction between both loops. We discuss how we support 123 multi-dimensional arrays, multi-dimensional thread identifiers, and arbitrary 124loop strides in Section 5. In our Coq formalism the notion of "accesses" (and 125their dimensions) is a parameter of the theory, thus orthogonal to the theory 126 presented here. 127

<sup>128</sup> Consider the execution of the end of the first iteration (r=0) and the beginning <sup>129</sup> of the second (r=1) iteration of the outer-loop. In this case, the execution of the <sup>130</sup> j-loop when r=0 is not synchronized with the execution of the i-loop when r=1 as <sup>131</sup> there is no call to \_\_syncthreads() in between.

The access memory protocol in Listing 2.2 captures this *partial* execution 132 from the viewpoint of array tile. By design access memory protocols over ap-133 proximate kernels by abstracting away what data is being written to/read from 134 an array, to focus on where data is being written. The protocol models the two 135problematic loops of Listing 2.1, *i.e.*, the j-loop when r=0 and the i-loop when r=1. 136 The first loop reads (rd[tid+j]) from the array, while the second writes (wr[tid]) 137 to it. Evaluation of a protocol follows the SIMT model: each thread evaluates 138 wr[tid] by instantiating tid with their unique identifier (hereafter, an integer), 139 e.g., thread 0 yields wr[0] and thread 1 yields wr[1]. 140

Analysis of unsynchronized protocols We say that a protocol is DRF when
all concurrent accesses are pair-wise DRF, *i.e.*, when issued by different threads
on the same index, then neither access is a write. For instance the respective
sets of concurrent accesses of threads 0 and 1 in Listing 2.2 is given below

$$\begin{array}{c} \operatorname{tid} = 0 & \operatorname{tid} = 1 \\ \{\operatorname{rd}[j] \mid 0 \leq j < M\} \cup \{\operatorname{wr}[0]\} \quad DRF \ with? \quad \{\operatorname{rd}[1+j] \mid 0 \leq j < M\} \cup \{\operatorname{wr}[1]\} \end{array}$$

<sup>145</sup> When M > 1, thread 0 (l.h.s) accesses rd[1] and thread 1 (r.h.s) accesses wr[1]. <sup>146</sup> Thus, there is a data-race on index 1 of the array.

A fundamental challenge of static DRF verification is how to handle loops. 147 Symbolic execution approaches that unroll loops, e.g., [25, 26], cannot handle 148large nor symbolic iteration spaces. Static approaches that use Hoare logic, 149 e.g., [5,7,21], require user-provided loop invariants. Another approach is to re-150 duce loops to verifying the satisfiability of a corresponding universally quantified 151formula, e.g., [24,30]. This has the advantage of being fast and not requiring in-152variants. However, its previous application to GPU programming, *i.e.*, PUG, 153 is unsound due to the interaction between barrier synchronizations and loops, 154e.g., PUG misses the data-race in Listing 2.1. We give more details in Section 6. 155

Our Approach A key contribution of our work is to identify conditions that allow a kernel to be reduced to a first-order logic formula, by precisely characterizing the effect of barrier synchronization in loops. To this end, the language of access memory protocols distinguishes syntactically between protocol fragments that synchronize from those that do not. For instance, the protocol in Listing 2.2 is identified as *unsynchronized*, as it does not include any synchronization.

In Section 4, we show that the DRF analysis of unsynchronized protocols can be precisely reduced to a first-order logic formula, where universally quantified formulae represent loops, thus obviating the need to unroll them explicitly. For instance, we reduce the verification of Listing 2.2 to asking whether for all M,  $t_1$ , and  $t_2$ , where  $t_1 \neq t_2$  are thread identifiers, the following holds:

$$\begin{aligned} \forall j_1, i_1, j_2, i_2 \colon 0 &\leq j_1 < M \land 0 \leq i_1 < M \land 0 \leq j_2 < M \land 0 \leq i_2 < M \\ \{ \mathsf{rd}[t_1 + j_1] \} \cup \{ \mathsf{wr}[t_1] \} \quad DRF \text{ with }? \quad \{ \mathsf{rd}[t_2 + j_2] \} \cup \{ \mathsf{wr}[t_2] \} \end{aligned}$$

This formula is *unprovable* since  $\mathsf{rd}[t_1 + j_1]$  races with  $\mathsf{wr}[t_2]$  when, e.g.,  $t_1 = 0$ , t<sub>2</sub> = 1,  $j_1 = 1$ , and M > 1. Hence, our technique flags Listing 2.2 as racy.

**Analysis of synchronized protocols** The protocol in Listing 2.3 (left) models *all* the interactions over the shared array tile from Listing 2.1. This protocol consists of one outer loop r that contains two inner loops separated by a barrier synchronization (sync). The first inner loop writes (wr[tid]) to the array, while the second reads (rd[tid + j]) from the array.

This protocol illustrates how our language syntactically differentiates between protocols fragments that synchronize from those that do not. Concretely, our language precludes an unsynchronized loop (for<sup>U</sup>  $x \in n..m \{u\}$ ) from calling sync anywhere in u, and it requires that a synchronized loop (for<sup>S</sup>  $x \in n..m \{p\}$ )  $\mathbf{6}$ 

Listing 2.3: Access memory protocol (left) of array tile from Listing 2.1 and its aligned version (right).

| <pre>1 for<sup>S</sup> r in 0N { 2     for<sup>U</sup> i in 0M { wr[tid] } 3     sync; 4     for<sup>U</sup> j in 0M { rd[tid + j] } 5 }</pre> | <pre>1 for<sup>U</sup> i in 0M { wr[tid] } 2 sync; 3 for<sup>S</sup> r in 1N { for<sup>U</sup> j in 0M { rd[tid + j] } 5 for<sup>U</sup> i in 0M { wr[tid] } 6 sync; } 7 for<sup>U</sup> j in 0M { rd[tid + j] }</pre> |
|------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

includes at least one occurrence of sync. The superscript U (resp. S) stands for synchronized (resp. *u*nsynchronized). This distinction can be inferred automatically and yields a compositional analysis, as we explain below.

The behavior of synchronized loops is difficult to analyse because they may contain data-races that span more than one iteration. For instance an instruction of iteration r in Listing 2.3 may race with an instruction of iteration r+1.

Our Approach In this work we show that the DRF analysis of synchronized 184 protocols can safely be reduced to a first-order logic formula when such loops 185 are *aligned*, *i.e.*, when there is a synchronization exactly before the loop and at 186 the end of its body. In Section 4.1 we show how to transform an arbitrary access 187 memory protocol into an aligned protocol using a syntax-driven transformation 188 technique called *barrier aligning*. Intuitively, barrier aligning normalizes loops 189 so that they do not "leak" accesses between iterations. The right-hand side of 190 Listing 2.3 shows the result of applying *barrier aligning* on the protocol from 191 Listing 2.3 (left). Observe that the fragment before the aligned loop (line 1) 192 corresponds to the unsynchronized part of the original loop (before sync). The 193 original loop itself is rearranged so that the part succeeding sync is moved to 194 the beginning of the aligned loop (lines 3–6). The fragment following the aligned 195loop (line 7) corresponds to the unsynchronized loop that appears after the sync 196 in the original protocol. 197

In Section 4.1 we show that aligned protocols enable *compositional* verification: protocol fragments between two barriers can be analyzed independently. This compositional analysis is possible because (i) there is no causality between instructions, except through sync and (ii) aligned protocols syntactically delimit the causality induced by sync. For instance, the aligned protocol in Listing 2.3 can be reduced to analyzing the following three protocol fragment independently:

$$\begin{array}{l} \text{for}^{\text{U}} \ i \in 0..M \ \{\text{wr[tid]}\} & \text{for}^{\text{U}} \ j \in 0..M \ \{\text{rd[tid}+j]\} \\ \text{for}^{\text{S}} \ r \in 1..N \ \{\text{for}^{\text{U}} \ j \in 0..M \ \{\text{rd[tid}+j]\}; \text{for}^{\text{U}} \ i \in 0..M \ \{\text{wr[tid]}\}; \text{sync}\} \end{array}$$

The first two protocols are handled like Listing 2.2 because they are unsynchronized. Representing a synchronized loop as a formula becomes possible when the protocol is *aligned*: both threads must share the same value for r at each iteration. Hence, we reduce the verification to asking whether for all  $N, M, t_1$ , and  $t_2$  where  $t_1 \neq t_2$  and the following holds:

$$\forall \mathbf{r}, j_1, i_1, j_2, i_2 \colon \mathbf{1} \le \mathbf{r} < \mathbf{N} \land \mathbf{0} \le j_1 < \mathbf{M} \land \mathbf{0} \le i_1 < \mathbf{M} \land \mathbf{0} \le j_2 < \mathbf{M} \land \mathbf{0} \le i_2 < \mathbf{M}$$
$$\implies \{ \mathsf{rd}[t_1 + j_1] \} \cup \{ \mathsf{wr}[t_1] \} \quad DRF \text{ with } ? \quad \{ \mathsf{rd}[t_2 + j_2] \} \cup \{ \mathsf{wr}[t_2] \}$$

Our technique identifies Listing 2.3 as racy since this formula is *unprovable*, *i.e.*, rd[ $t_1+j_1$ ] races with wr[ $t_2$ ] when r = 1,  $t_1 = 0$ ,  $t_2 = 1$ ,  $j_1 = 1$ , N > 1 and M > 1.

#### <sup>211</sup> 3 Access Memory Protocols

An access memory protocol describes the interaction between a group of threads 212and a single shared-memory location. A protocol records where in memory ac-213cesses take place, but abstracts away from *what* data is read from/written to 214 memory. The language of protocols distinguishes between an unsynchronized 215protocol fragment  $u \in \mathcal{U}$ , that disallows synchronization, from a synchronized 216fragment  $p \in S$  that must include a synchronization. The syntax and semantics 217of access memory protocols is given in Figure 2. Our operational semantics is in-218 spired by the synchronous, delayed semantics (SVD) from Betts et al. [6], where 219 threads execute independently and communicate upon reaching a barrier. 220

Hereafter, i, j, k are metavariables over non-negative integers picked from the set N. An arithmetic expression n is either: an integer variable x, an integer i, or a binary operation on integers that yields an integer. A boolean expression bis either a boolean literal, an arithmetic comparison  $\diamond$ , or a propositional logic connective  $\circ$ . We write  $n \downarrow i$  when expression n evaluates to integer i, where evaluation is defined in the natural way. We overload the notation for Boolean expressions, e.g.,  $b \downarrow true$  means that expression b evaluates to true.

Unsynchronized fragment A protocol  $u \in \mathcal{U}$  either does nothing (skip), accesses a shared memory location o[i] (reads from/writes to index *i*), performs sequential composition, or loops. Figure 2 gives the semantics of unsynchronized protocols, which is parameterized by a set of thread identifiers  $\mathcal{T} \subseteq \mathbb{N}$ , where  $|\mathcal{T}| \geq 2$ .

Evaluation of an unsynchronized protocol u by a thread identifier i, written 232  $u \downarrow_i P$ , yields a phase, *i.e.*, a set  $P \in \mathcal{P}$  of access values  $\alpha \in \mathbb{A}$ . Each access 233 value, or just access, notation i:o[j], consists of its issuing thread identifier i, 234 an access mode o (read/write), and an index j. Protocol skip produces no ac-235 cesses. A memory access o[n] evaluates the index and creates a singleton phase. 236 Sequencing, branching, and looping are standard. Similarly to SVD, Rule  $\mathcal{U}$ -par 237 executes a copy of the unsynchronized code u for each thread  $i \in \mathcal{T}$  by replacing 238 the special variable tid by the thread identifier, u[tid := i], which results in the 239 union of the accesses of all threads. 240

Synchronized fragment A protocol  $p \in S$  may perform barrier synchronization sync, run unsynchronized code u, perform sequential composition, and loop. Figure 2 gives the semantics of a protocol, notation  $p \downarrow H$ . Evaluation of a protocol p yields a *history* (ranged over by H), *i.e.*, a list of phases (P) that records how memory was accessed. We use :: as list constructor and  $\cdot$  for the usual list concatenation operator. Histories are concatenated using the special  $\odot$ -operator.

Syntax  

$$\begin{split} \mathbb{N} \ni i & ::= 0 \mid 1 \mid \cdots & o \quad ::= wr \mid rd \\ n & ::= x \mid i \mid n \star n & \mathbb{A} \ni \alpha \quad ::= i:o[i] \\ \mathcal{B} \ni b & ::= true \mid false \mid n \diamond n \mid b \diamond b & \mathcal{P} \ni P \quad ::= \{\alpha_1, \dots, \alpha_n\} \\ \mathcal{U} \ni u & ::= skip \mid o[n] \mid u; u \mid for^{\mathcal{V}} x \in n..m \{u\} & \mathcal{P} \ni P \quad ::= \{\alpha_1, \dots, \alpha_n\} \\ \mathcal{S} \ni p & ::= sync \mid u \mid p; p \mid for^S x \in n..m \{p\} & \mathcal{U} = [1 \mid P::H] \\ \end{split}$$
Big-step semantics for  $\mathcal{U}$ 

$$\begin{split} \mathcal{U}_{-SKIP} & \mathcal{U}_{-ACC} & \mathcal{U}_{-SEQ} & \mathcal{U}_{-FOR-1} \\ \frac{\mathcal{U}_{-SKIP}}{skip \downarrow_i \emptyset} & \frac{n \downarrow j}{o[n] \downarrow_i \{i:o[j]\}} & \frac{\mathcal{U}_{-SEQ}}{u_1; u_2 \downarrow_i P_1 \cup P_2} & \mathcal{U}_{-FOR-1} \\ \frac{\mathcal{U}_{-FOR-2} & (n < m) \downarrow true & u[x := n] \downarrow_i P_1 & for^{\mathbb{V}} x \in n + 1..m \{u\} \downarrow_i P_2 \\ \frac{\mathcal{U}_{-PAR} & \mathcal{U}_{-PAR} \end{split}$$

$$\frac{S = \bigcup \{u[\mathsf{tid} \coloneqq i] \downarrow_i P_i \mid i \in \mathcal{T}\}}{u \downarrow_{\mathcal{T}} S}$$

History concatenation and serialization

 $H\cdot H$  $H \odot H$ 

 $p \in \mathcal{W}$ 

safe(H)

 $[P_1 \dots P_n] \cdot [P_{n+1} \dots P_{n+k}] = [P_1 \dots P_{n+k}] \quad (H \cdot [P]) \odot ([P'] \cdot H') = H \cdot [P \cup P'] \cdot H'$  $p \downarrow H$ 

Big-step semantics for  $\mathcal{S}$ 

$$\begin{array}{c} \mathcal{S}\text{-SYNC} \\ \hline \mathbf{Sync} \downarrow [\emptyset, \emptyset] \end{array} \quad \begin{array}{c} \mathcal{S}\text{-PAR} \\ \frac{u \downarrow_{\mathcal{T}} P}{u \downarrow [P]} \end{array} \quad \begin{array}{c} \mathcal{S}\text{-SEQ} \\ \frac{p \downarrow H}{p; q \downarrow H \odot H'} \end{array} \quad \begin{array}{c} \mathcal{S}\text{-FOR-1} \\ \frac{(n+1=m) \downarrow \texttt{true}}{\texttt{for}^{\$} x \in n..m \ \{p\} \downarrow H \end{array} \\ \\ \begin{array}{c} \mathcal{S}\text{-FOR-2} \\ \frac{(n < m) \downarrow \texttt{true}}{\texttt{for}^{\$} x \in n..m \ \{p\} \downarrow H'} \end{array} \quad \begin{array}{c} \mathcal{S}\text{-FOR-1} \\ \frac{(n+1=m) \downarrow \texttt{true}}{\texttt{for}^{\$} x \in n..m \ \{p\} \downarrow H \end{array}$$

Well-formed protocols

$$u\,;\,\mathsf{sync}\in\mathcal{W}\qquad \frac{p\in\mathcal{W}\quad q\in\mathcal{W}}{p\,;\,q\in\mathcal{W}}\qquad \frac{p\in\mathcal{W}\quad \mathsf{tid}\notin fv(n)\cup fv(m)}{u_1\,;\,\mathsf{for}^{\mathrm{s}}\;x\in n..m\;\{p\,;\,u_2\}\in\mathcal{W}}$$

Data-race, safe phase, and safe history

$$\frac{\mathsf{wr} \in \{o, o'\} \quad i \neq j}{i:o[k] \# j:o'[k]} \qquad \qquad \frac{\forall \alpha, \beta \in P \colon \neg(\alpha \# \beta)}{safe(P)} \qquad \qquad \frac{\forall P \in H \colon safe(P)}{safe(H)}$$

 $\alpha \, \# \, \beta$ 

safe(P)

Fig. 2: Syntax, semantics, and properties of access memory protocols.

A barrier synchronization creates two empty phases, corresponding to phases 247before and after synchronization. Running an unsynchronized protocol yields a 248 single phase containing all accesses performed by that protocol. Sequencing two 249 synchronized protocols p with q merges the last phase of the former with the first 250phase of the latter, as these two phases run concurrently. Running one iteration 251of a synchronized loop sequences the execution of the first iteration with the 252rest of the loop, by merging the last phase of the first iteration with the first 253phase of the rest of the loop. Synchronized loops in access memory protocols are 254nonempty, hence the base case is when there is one iteration left. This additional 255requirement helps with the presentation of our theory as it implies that every 256synchronized loop always executes at least one synchronization. 257

A protocol is well-formed, written  $p \in W$ , if every unsynchronized fragment is followed by a barrier synchronization, every synchronized loop includes a barrier and is not branching on thread-local variables, *i.e.*, tid. We write fv(p) (resp. fv(n)) for the free variables of p (resp. n). We discuss how well-formedness is enforced in Section 5.

DRF is formalized at the bottom of Figure 2. Two accesses are in a data-race (or racy) when there exist two different threads that access the same index k, and one of these accesses is a write. Phase P is *safe* iff each pair of access it contains is not racy. History P is *safe* when all of its phases are safe.

#### <sup>267</sup> 4 DRF-Preserving Transformations of Protocols

This section presents the main steps of the DRF analysis summarized in Figure 1: barrier aligning (Section 4.1) and splitting (Section 4.2).

This section also includes our key theoretical results. We establish that these steps preserve and reflect data-races (*i.e.*, any and all data-races are found), see Theorem 1 and Theorem 3. We make precise the notion of compositionality that makes our approach scalable in Theorem 2.

#### 274 4.1 Aligning Protocols

The first transformation step normalizes protocols by aligning synchronized loops, which in turn enables a form of compositional verification. The goal of the transformation is to produce protocols which belong to  $\mathcal{A}$ , see top of Figure 3. *Barrier aligning* (or just aligning) is performed by function *align*, given in

the bottom half of Figure 3. The function returns a pair whose first element is an 279 aligned and synchronized protocol, and whose second element is an unsynchro-280 nized protocol. Intuitively, the pair represents a sequence which we delimitate 281syntactically. We note that the output of align, say (q, u), can be trivially made 282 into an aligned protocol: q; u; sync. The case for synchronization is simple, *align* 283 returns the input protocol as the first component of the pair and skip as the 284second component (the input protocol is already fully aligned). The case for 285sequence consists of the sequential composition of the pair aligned with unsyn-286 chronized code using operator (3). Sequencing two pairs  $(p, u) \in (q, u')$  amounts 287 to sequencing u to the outer-most piece of unsynchronized code present in q. 288

| Aligned protocols                                                                                |                                                                           | $p \in \mathcal{A}$                                                                                                       |
|--------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| $u$ ; sync $\in \mathcal{A}$ $\frac{p}{2}$                                                       | $\frac{\in \mathcal{A} \qquad q \in \mathcal{A}}{p  ; q \in \mathcal{A}}$ | $\frac{p \in \mathcal{A}  q \in \mathcal{A}}{p; for^{^{\mathrm{S}}} \; x \in nm \; \{q\} \in \mathcal{A}}$                |
| Sequencing aligned protocols                                                                     | $ {}_{9}^{\circ} : \mathcal{U} \to \mathcal{A} \to \mathcal{A} $          | ${}_{\S} \colon (\mathcal{A} \times \mathcal{U}) \to (\mathcal{A} \times \mathcal{U}) \to \mathcal{A} \times \mathcal{U}$ |
| $u_{\scriptscriptstyle 9}^{\scriptscriptstyle 0}(\boldsymbol{u}';sync)=(u;\boldsymbol{u}');sync$ | $u_{9}^{\circ}(p;q) = (u_{9}^{\circ}p);q$                                 | $q \qquad (p, u)_{_{9}}^{_{9}}(q, u') = (p; (u_{_{9}}^{_{9}}q), u')$                                                      |
| Aligning protocols                                                                               |                                                                           | $\boxed{\textit{align} \colon \mathcal{W} \to \mathcal{A} \times \mathcal{U}}$                                            |
| align(u; sync)                                                                                   | = (u; sync, skip)                                                         | $align(p;q) = align(p) \ \ align(q)$                                                                                      |
| align(p) = (q                                                                                    | $(u_3)$ $q_1 = u_1 \circ q[x]$                                            | $\coloneqq n$ ] $u = u_3; u_2$                                                                                            |
| $align(u_1; \text{for}^{s} x \in nm \{p; u_2\}$                                                  | $\}) = (q_1; for^{\mathtt{S}} \ x \in n+1$                                | $1m \ \{u[x \coloneqq x-1] \circ q\}, \ u[x \coloneqq m-1])$                                                              |

Fig. 3: Aligning protocols.

Dealing with synchronized loops is more involved. Given a loop  $u_1$ ; for<sup>S</sup>  $x \in$  $n.m \{p; u_2\}$ , we produce a protocol consisting of the fragment preceding the loop and the synchronized part of its first iteration  $(q_1)$ , an aligned loop starting n+1, and the unsynchronized part of its last iteration (u[x := m-1]). See Listing 2.3 for an example of protocol aligning. We note that we can always unroll the loop because the analysis only considers nonempty synchronized loops; we discuss how to enforce this assumption in Section 5.

We now state two fundamental properties of barrier aligning: preserving and reflecting DRF (Theorem 1), and enabling compositional verification (Theorem 2). Theorem 1 states that verifying DRF of a well-formed protocol p is equivalent to verifying DRF of its aligned counterpart.

Theorem 1 (Correctness of Align). Let align(p) = (q, u) and  $p \in W$ . If so  $p \downarrow H_1$  and  $q; u \downarrow H_2$ , then  $safe(H_1)$  if and only if  $safe(H_2)$ .

<sup>302</sup> To state our compositionality result, we introduce a language of contexts:

 $\mathcal{C} ::= [\_] \mid u; \mathsf{sync} \mid p; \mathcal{C} \mid \mathcal{C}; p \mid \mathcal{C}; \mathsf{for}^{\mathsf{S}} x \in n..m \{p\} \mid p; \mathsf{for}^{\mathsf{S}} x \in n..m \{\mathcal{C}\}$ 

The base cases correspond to a hole [\_] or an unsynchronized protocol (followed by sync). The other cases follow the structure of access memory protocols.

Theorem 2 (Compositionality). Let C be a context, s.t. C[skip; sync] is DRF, and  $C[\text{skip}; \text{sync}] \downarrow H$ . For all  $p \in A$ , if p is DRF,  $p \downarrow H'$ , and  $fv(p) \subseteq \{\text{tid}\}$ , then  $C[p] \in A$  and C[p] is also DRF. Syntax

 $\mathcal{L} \ni h ::= \mathsf{skip} \mid n:o[m] \mid h; h \mid \mathsf{var} \ x \ \mathsf{in} \ n..m; h$ 

Product of histories

$$H_1 \otimes H_2 = [P_1 \cup P_2 \mid (P_1, P_2) \in H_1 \times H_2]$$

**Big-step** semantics

$$\frac{n \downarrow i \quad m \downarrow j}{n:o[m] \Downarrow [\{i:o[j]\}]} \qquad \frac{h_1 \Downarrow H_1 \quad h_2 \Downarrow H_2}{h_1;h_2 \Downarrow H_1 \otimes H_2} \qquad \frac{(n \ge m) \downarrow \texttt{true}}{\mathsf{var} \ x \ in \ n...m; h \Downarrow [\emptyset]}$$
$$\frac{(n < m) \downarrow \texttt{true} \quad h[x := n] \Downarrow H_1 \quad \texttt{var} \ x \ in \ n + 1..m; h \Downarrow H_2}{\mathsf{var} \ x \ in \ n...m; h \Downarrow H_1 \cdot H_2}$$

Projection

$$trace(o[n]) = \text{tid:}o[n] \qquad \quad trace(\text{for}^{\texttt{U}} \ x \in n..m \ \{u\}) = \text{var} \ x \text{ in } n..m; trace(u)$$

$$trace(u_1; u_2) = trace(u_1); trace(u_2)$$
  $trace(skip) =$ 

Splitting protocols

 $split(p;q) = split(p) \cdot split(q)$ 

$$\frac{t_1, t_2 \text{ fresh } h_1 = trace(u)[\text{tid} := t_1] \quad h_2 = trace(u)[\text{tid} := t_2]}{split(u; \text{sync}) = [\text{var } t_1 \text{ in } 1..|\mathcal{T}|; \text{var } t_2 \text{ in } 0..t_1; h_1; h_2]}$$

$$split(p; \mathsf{for}^{\mathsf{s}} x \in n..m \{q\}) = split(p) \cdot [\mathsf{var} x \mathsf{ in} n..m; h \mid h \in split(q)]$$

Fig. 4: Syntax and semantics of symbolic traces, and splitting of protocols.

#### 4.2Splitting Protocols into Symbolic Traces 308

The second verification step, *splitting*, consists in transforming an aligned proto-309 col into symbolic traces, *i.e.*, symbolic representations of sets of memory accesses 310 which occur between two synchronizations. 311

Symbolic traces Intuitively, symbolic traces are a thin abstraction over an SMT 312 formula. We describe how to translate a symbolic trace to a formula in Section 5. 313

We give the syntax and semantics of symbolic traces in Figure 4. Expres-314sion skip terminates a trace. Expression n:o[m] states that thread n accesses 315index m with mode o. Expression  $h_1$ ;  $h_2$  composes two symbolic traces using 316 operator  $\otimes$ , also given in Figure 4. Expression var x in n.m; h binds variable x 317 in h, where variable x is an integer in the range induced from n and m. The 318 semantics of a symbolic trace yields a history with a phase for each possible vari-319

 $H\otimes H$ 

*trace* :  $\mathcal{U} \to \mathcal{L}$ 

skip

 $split: \mathcal{A} \to [\mathcal{L}]$ 

 $h \Downarrow H$ 

able assignment. Expression skip yields a single empty phase. Expression n:o[m]320 evaluates to a singleton set that contains the access value that results from eval-321 uating the thread-identifier expression n and the index expression m. Sequencing 322 histories  $h_1$ ;  $h_1$  consists of performing the product of phases (operator  $\otimes$ ), which 323 consists of merging every phase of  $H_1$  with every phase of  $H_2$ . A variable binder 324 behaves like a skip when the range of values is empty. Otherwise, we fork two his-325 tories  $H_1$  and  $H_2$ . We assign the lower bound of the set in  $H_1$ , and we recursively 326 evaluate a variable binder where we increment its lower bound in  $H_2$ . 327

*Barrier splitting* is the transformation from aligned protocols to symbolic traces, 328 performed via functions trace and split, defined in Figure 4. Function trace 329 extracts the symbolic trace of an unsynchronized program for a single thread. 330 Memory accesses are tagged with the owner thread tid, and unsynchronized loops 331 are converted into variable bindings. Function *split* returns a list of symbolic 332 traces. The case for p; q is trivial (operator  $\cdot$  stands for list concatenation). The 333 base case of *split* is for unsynchronized protocol fragment u, which produces a 334 list containing a single symbolic trace. It introduces fresh variables  $t_1$  and  $t_2$ 335 336 that represent two (distinct) symbolic thread identifiers. The rest of the trace consists of the trace of u instantiated to the first thread identifier  $t_1$  followed 337 by its instantiation to the second thread identifier  $t_2$ . The case for synchronized 338 loops simply reinterprets the loop as a variable binder. Function *split* leads to an 339 exponential blow up wrt. nesting of synchronized loops, but this has not posed 340 problems in practice, c.f., Claim 2. 341

*Example 1.* Let  $\hat{p} = wr[tid + 1]$ ; rd[tid + 2]; sync. We have that  $split(\hat{p})$  returns:

var  $t_1$  in 1.. $|\mathcal{T}|$ ; var  $t_2$  in 0.. $t_1$ ;  $t_1$ :wr $[t_1+1]$ ;  $t_1$ :rd $[t_1+2]$ ;  $t_2$ :wr $[t_2+1]$ ;  $t_2$ :rd $[t_2+2]$ 

<sup>343</sup> We show that barrier splitting preserves and reflects DRF.

Theorem 3. Let  $p \in A$ , such that  $p \downarrow H_1$ , and  $H_2 = [H \mid h \in split(p) \land h \Downarrow H]$ , states then  $safe(H_1)$  if and only if  $safe(H_2)$ .

Hence we have established that aligning (Theorem 1) and splitting (Theorem 3) preserve and reflect data-races, *i.e.*, any and all data-races are found. Thus, the only source of approximation in our analysis stems from the inference of protocols from CUDA kernels, which we discuss in the next section. Theorem 3 highlights the compositionality of our analysis, as each symbolic trace resulting from function *split* can be analyzed independently.

## 352 5 Implementation

In this section we present our tool, Faial, that implements the steps described in Figure 1. Faial takes a CUDA kernel as input and produces results that either identify the kernel as DRF or list specific data-races. In this section, we describe the implementation of the protocol inference, well-formedness checks, and transformation to SMT.

Inference This step transforms a CUDA kernel into access memory protocols 358 (one for each shared array). It uses libclang [22] to parse the kernel, a standard 359 single static assignment (SSA) transformation to simplify the analysis of indices 360 and arrays, and code slicing to only retain code related to *shared* array accesses. 361 We note that Faial supports constructs of the CUDA programming model that 362 are not directly modeled by access memory protocols, *e.g.*, unstructured loops, 363 conditionals, function calls, and multi-dimensional arrays. To support multi-364 dimensional thread identifiers, we extend the language of protocols to support 365 multiple thread identifiers, and adapt function *split* accordingly. The main chal-366 lenges are related to loops and function calls. 367

Whenever possible loops are transformed to loops with a stride of 1 following 368 ideas from loop normalization [23] and abstraction [30]. For instance, in **for**(int 369 i=lb;i<ub;i=s we change the stride from s into 1 by executing the loop body S 370 when the loop variable i is divisible by stride, *i.e.*, the loop becomes **for**(int 371 i=lb;i<ub;i++)if((i+lb)%s==0){S}. Similarly, a loop ranging over powers of n, e.g., 372 for(int i=lb;i<ub;i\*=s), becomes for(int i=lb;i<ub;i++)if(powerof(i,s)){S}, where func-373 tion powerof(i,s) tests whether i is a power of base s. We approximate whiles as 374 a structured loop with an unknown upper bound. 375

Function calls that manipulate shared memory are uncommon in GPU programming. Additionally auxiliary functions that manipulate shared memory have a compiler annotation to inline their bodies, hence we can inline such calls easily. Faial cannot handle recursive functions, but these rarely occur in practice. Function calls that do not access shared memory are simply discarded.

Well-formedness This step ensures that kernels Faial analyzes meet the well-381 formedness conditions  $(p \in \mathcal{W})$  defined in Figure 2, as well as the assumptions 382 that synchronized loops iterate at least once. First, Faial annotates loops with a 383 synchronized/unsynchronized tag according to the presence of sync in the loop 384 body, then adjusts the precedence of sequencing to group all unsynchronized code 385 preceding a sync or a synchronized loops. Synchronized loops of well-formed pro-386 tocols cannot manipulate thread-local variables (i.e., tid), an assumption shared 387 by the CUDA programming model. Hence, Faial flags such kernels as erroneous. 388 Next, Faial adds assertions before/after synchronized loops to check that the 389 loop range is nonempty, *i.e.*, loops execute at least once. Similarly to loops, 390 conditionals are tagged as synchronized or unsynchronized. Then, Faial inlines 391 synchronized conditionals, *i.e.*, when a synchronized conditional is found, two 392 copies of the input program are created and each copy is prefixed by a global 393 assertion corresponding to the condition. Faial does not support synchronized 394 conditionals that appear within synchronized loops. We have not found real-395 world kernels that include such a construction. 396

Quantification This step transforms each symbolic trace (Figure 4) into an SMT
 formula, to check for safety, c.f., Figure 2. Essentially, the generated formula
 guarantees that the indices of array accesses are distinct when there is at least
 one write. We illustrate this straightforward transformation with Example 2.

#### 14 T. Cogumbreiro, J. Lange, D. Liew Z.R., and H. Zicarelli

401 Example 2. The formula generated from the trace in Example 1 is given below:

$$\begin{aligned} \forall t_1, t_2 \colon 1 \leq t_1 < 3 \land 0 \leq t_2 < t_1 \land (\mathsf{m}_1 = \mathsf{wr} \lor \mathsf{m}_2 = \mathsf{wr}) \implies \\ \left( (\mathsf{idx}_1 = t_1 + 1 \land \mathsf{m}_1 = \mathsf{wr}) \lor (\mathsf{idx}_1 = t_1 + 2 \land \mathsf{m}_1 = \mathsf{rd}) \right) \\ \land \left( (\mathsf{idx}_2 = t_2 + 1 \land \mathsf{m}_2 = \mathsf{wr}) \lor (\mathsf{idx}_2 = t_2 + 2 \land \mathsf{m}_2 = \mathsf{rd}) \right) \land \ \mathsf{idx}_1 \neq \mathsf{idx}_2 \end{aligned}$$

where each symbolic access is translated to a conjunction representing its index (idx) and access mode (m). Observe that the formula enforces that indices  $idx_1$ and  $idx_2$  (executed by distinct threads) are different.

For multi-dimensional arrays, we generate one pair of indices per dimension, and check that at least one pair is distinct.

## 407 6 Experimental Evaluation

We evaluate Faial over several datasets and show how it fares against existing approaches. We structure this evaluation in three claims.

Claim 1: Correctness. We claim that our approach finds more bugs and raises fewer false alarms than existing tools. To evaluate this claim, we compare Faial against four state-of-the-art kernel verification tools over 10 kernels that are known to be tricky to analyze.

Claim 2: Scalability. We claim that our approach scales better to larger programs. To evaluate this claim, we compare Faial against other tools over a set
of synthetic benchmarks designed to test the limits of each tool, in terms of run
time and memory usage.

Claim 3: Real-world usability. We claim that our approach is more usable
than existing static verification tools on real-world CUDA programs. To evaluate
this claim, we use a varied dataset of real-world DRF kernels and measure the
false alarm rate, run time, and memory usage of Faial, GPUVerify, and PUG.

Benchmarking environment To make our evaluation reproducible, we developed a benchmarking framework to automate our experiments over the different tools and datasets. For Claim 1 and Claim 3, we designed a tool-agnostic file format for kernel functions and associated metadata (*e.g.*, expected result of DRF analysis, grid and block dimensions, and include directives). And for Claim 2, we created a tool that generates kernels according to given templates, *e.g.*, see Figure 7.

We evaluate Faial against the following verification tools: GPUVerify [5] v2018-03-22; PUG [23] v0.2; and, GKLEE [25] and SESA [26] v3.0. Experiments for Claim 1 use an Intel i5-6500 CPU, 7.7GiB RAM, and Fedora 33 OS, while Claim 2 and Claim 3 use an Intel i7-10510U CPU, 16GiB RAM, and Pop! OS.

*Excluded tools* We excluded ESBMC-GPU [33] and Simulee [38] from the evaluation because we were unable to get them to run satisfactorily. Both tools have rudimentary support for verifying arbitrary CUDA kernels. ESBMC-GPU did not find a single data-race in our benchmarks, while Simulee produced false alarms for every DRF-kernel given.

Table 1: Results for Claim 1. DRF indicates that a (static analysis) tool reported a test case as DRF. NRR indicates that a (symbolic execution) tool did not report any data-race. Label x/y indicates that the tool reported x data-races, y of which are actual races. Label *timeout* indicates that the tool did not terminate within 90s. A test passes if the tool returns the expected result and all reported races are valid.

| Test                   | Expected        | Faial      | GPUVerify         | PUG                            | GKLEE              | SESA               |
|------------------------|-----------------|------------|-------------------|--------------------------------|--------------------|--------------------|
| 1 transposeDiagonal    | Racy<br>DRF     | 1/1<br>DRF | $0/2 \ 0/1$       | DRF<br>DRF                     | $timeout\ timeout$ | $timeout\ timeout$ |
| 2 first-iter           | Racy<br>DRF     | 1/1<br>DRF | 0/1<br>0/1        | $\frac{1}{1}$<br>$\frac{0}{1}$ | $timeout\ timeout$ | $timeout\ timeout$ |
| 3 last-iter            | Racy<br>DRF     | 1/1<br>DRF | $\frac{1/1}{0/1}$ | 0/1<br>DRF                     | $timeout\ timeout$ | timeout<br>timeout |
| 1 last-iter-first-iter | Racy<br>DRF     | 1/1<br>DRF | 0/1<br>0/1        | 0/1<br>0/1                     | $timeout\ timeout$ | timeout<br>timeout |
| 5 read-index           | Racy<br>DRF     | 0/1<br>0/1 | 1/1<br>DRF        | 0/1<br>0/1                     | NRR<br>NRR         | NRR<br>NRR         |
| Number of tests pas    | ssed (of $5$ ): | 4          | 1                 | 0                              | 0                  | 0                  |

#### 437 Claim 1: Correctness

We have selected a set of tricky kernels to expose false alarms and missed data-438 races in Faial, GPUVerify, PUG, GKLEE, and SESA. Our results are reported 439 in Table 1. The dataset consists of 5 tests, each consisting of two variations 440 of the same kernel: one racy and one DRF. The racy version of Test 1 (c.f., 441 Listing 2.1) contains an inter-iteration data-races. The DRF version adds a sync 442 after the second inner loop. Tests 2 to 4 expose various loop-related data-races. 443 Their protocols are given in Figure 5. In the racy version of Test 2 wr[tid + 1] 444 conflicts with wr[tid] of the first iteration. Similarly, in the racy version of Test 3, 445wr[tid + 1] of the last iteration races with wr[tid]. In the racy version of Test 4 the 446 last iteration of a nested loop races with the first iteration of the following loop. 447 Test 5 exposes the abstraction gap between kernel and access memory protocols 448 (which abstract away array elements), see Figure 6. 449

Faial passes more tests than any other tool. Failed Test 5 (two false alarms) is caused by access memory protocols abstracting away from *what* data is being read from/written to arrays, *i.e.*, array elements. We report on performance trade-offs wrt. tracking array elements in Claim 2.

<sup>454</sup> GPUVerify passes Test 5 because it tracks array elements, but fails the re-<sup>455</sup> maining 4 tests. Some reported false alarms are ill-formed, *e.g.*, on the racy <sup>456</sup> component of Test 2, the report (0 : wr[tid]; 16 : wr[tid]) has disjoint indices.

<sup>457</sup> PUG obtains the worst score amongst static tools. Notably, the tool misses a <sup>458</sup> data-race in Test 1, demonstrating its unsoundness, *c.f.*, Section 2.1.

GKLEE and SESA timeout for tests that include loops, as the loop bounds are unknown. Both tools miss the data-race in Test 5. Symbolic tools may be able to report data-races when the bound is known, *e.g.*, timeouts start in Test 1 when the bound is at least 2, in Test 2 when the bound is at least 23,000.

| <pre>// first-iter wr[tid+1]; for<sup>5</sup> × in 0N {     if (x &gt; 0)</pre> | <pre>// last-iter for<sup>s</sup> × in 0N {     sync;     if (tid &lt;  T -1)</pre> | <pre>// last-iter-first-iter for<sup>s</sup> x in 1N+1 {     for<sup>s</sup> y in 1x+1 {         sync; wr[tid+x+y]}; for<sup>s</sup> z in N*2N*3 {</pre> |  |
|---------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| <mark>{</mark> wr[tid] <mark>}</mark> ;<br>sync}                                | <mark>{</mark> wr[tid+1] <mark>}</mark> ;<br>wr[tid +  T ]                          | wr[tid+z +1]; sync}                                                                                                                                      |  |

Fig. 5: Protocols for Tests 2 to 4, *c.f.*, Claim 1, where N is a free thread-global variable. Yellow shaded code only appears in the DRF version of first-iter and last-iter. Red shaded code only appears in the racy version of last-iter-first-iter.

|  | <pre>// DRF kernel A[tid] = tid; int x = A[tid]; A[x] = 0;</pre> | // Protocol A<br>wr[tid];<br>rd[tid];<br>wr[x] |
|--|------------------------------------------------------------------|------------------------------------------------|
|--|------------------------------------------------------------------|------------------------------------------------|

Fig. 6: Kernels and protocols for Test 5 (read-index), *c.f.*, Claim 1; x becomes a free thread-local variable as protocols do not model array elements.

#### 463 Claim 2: Scalability

We evaluate the scalability of our approach with a synthetic dataset that aims 464 at demonstrating how different kernel constructs affect run time and memory 465usage of Faial, GKLEE, GPUVerify, PUG, and SESA. Our dataset is divided into 466 five categories, one per syntactical construct in the language of access mem-467 ory protocols, as well as conditionals, which are supported by our inference step, 468 c.f., Section 5. Figure 7 shows the protocols of the kernel patterns we generate in 469 each category: (i) repeated accesses (read then write), (ii) repeated barrier syn-470 chronizations separated by writes, (iii) repeated conditionals, (iv) increasingly 471 nested unsynchronized loops, and (v) increasingly nested synchronized loops. In 472each category, we vary the problem size by repeating a pattern from 1 to 50 473 times. Note that all kernels generated this way are DRF. 474

Figure 8 shows the average run time and memory usage over five runs on logarithmic and linear scales, respectively. For each run, we set a timeout of 90s and we exclude any run that times out or reports a false alarm. Cutoffs in the memory plots are determined by the cutoffs in the run time plots.

Overall Faial is the most scalable tool. In 4 out of 5 categories, Faial has 479 the slowest growth for all experiments, and verifies all tests within 0.46s. In the 480 largest problem sizes, our tool is the fastest in 3 categories (access, conditional, 481 unsynchronized loop), 2<sup>nd</sup> for barriers, and 3<sup>rd</sup> for synchronized loops. Overall, 482the memory usage of Faial is competitive with other tools. Faial is the only tool 483 with a near constant time/memory for up to 50 unsynchronized loops, indicating 484 the scalability of reducing unsynchronized loops to universally quantified formu-485las. Faial only times out for kernels which consists of >17 nested synchronized 486 loops. However such kernels are uncommon, e.g., the levels of nested synchro-487 nized loops in the real-word kernels studied in Claim 3 are at most 3. 488

| $\begin{array}{l} // \; accesses \\ rd[\;tid\;+\;n_1* T\;]; \\ wr[tid\;+\;1* T\;]; \\ rd[\;tid\;+\;n_2* T\;]; \\ wr[tid\;+\;2* T\;]; \\ //\;\dots \end{array}$ | <pre>// barriers wr[tid ]; sync; wr[tid ]; sync; //</pre> | // conditionals<br>if tid==0<br>{wr[tid]};<br>if tid==1<br>{wr[tid]};<br>// | $\label{eq:constraint} \begin{array}{l} // \text{ unsynchronized loops} \\ \text{for}^{U} \; \mathbf{i_1} \; \mathbf{in} \; \mathbf{0N} \; \{ \\ \text{wr[tid]}; \\ \text{for}^{U} \; \mathbf{i_2} \; \mathbf{in} \; \mathbf{0N} \; \{ \\ \text{wr[tid]}; \\ // \; \cdots \; \} \} \end{array}$ | $\label{eq:1.1} \begin{array}{llllllllllllllllllllllllllllllllllll$ |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|-----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------|
|----------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|-----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------|

Fig. 7: Synthetic protocols generated for Claim 2. N is a free thread-global variable, and  $n_1,\,n_2\ldots$  are positive integer literals.



Fig. 8: Results for Claim 2. Run time (left plots) are given on a logarithmic scale, and memory (right plots) are given on a linear scale. Flatter and lower curve is better. Tools annotated with a triangle are excluded due to timeouts or errors.

#### 18T. Cogumbreiro, J. Lange, D. Liew Z.R., and H. Zicarelli

GPUVerify remains stable in the barrier and conditional categories but is af-489 fected negatively by loops and accesses. Loops are a known bottleneck in GPUVer-490 ify [2]. In the access category there is an exponential slowdown due to GPUVerify 491 keeping track of what data is being written to/read from array. 492

PUG tool remains stable with the number of barrier synchronizations but is 493 affected negatively by the number of conditionals and loops. PUG is the fastest 494 tool with smaller inputs, but it raises false alarms in the access category, hence 495these measurements are omitted from the corresponding plots. 496

We discuss GKLEE and SESA together since SESA processes GKLEE's NVCC 497 byte code output by concretizing variables, before passing it to GKLEE itself. 498 There are two main factors that affect negatively these symbolic execution tools: 499 (i) the number of loops, since they unroll each loop; and (ii) the amount of book-500 keeping required to keep track of what is read from/written to memory. Figure 8 501 shows clear exponential curves for the access and barrier synchronization cate-502 gories. Observe that these tools timeout immediately in the loop categories. 503

#### Claim 3: Real-World Usability 504

TCnoteR1: what is the unsupported kernel by our tool? We evaluate the usability 505 of our approach by comparing Faial with other static verification tools (GPUVerify 506 and PUG) on real-world kernels wrt. rate of false alarm and run time. We curated 507 a set of CUDA kernels from [2], which consists of 3 benchmark suites (totaling 508 227 CUDA kernels): NVIDIA GPU Computing SDK v2.0 (8 CUDA kernels); 509 NVIDIA GPU Computing SDK v5.0 (166 CUDA kernels); Microsoft C++ AMP 510Sample Projects (20 kernels); gpgpu-sim benchmarks (33 kernels). All kernels are 511 DRF and have been pre-processed by the authors of [2] to facilitate verification. 512Each kernel is in a distinct file, all dependencies are available, and kernels are 513annotated with minimal pre-conditions to allow for automatic analysis (e.q.,514thread count is given). 515

As we aim to evaluate fully automatic verification of three tools, we removed 516code annotations (pre-conditions and loop invariants) specific to GPUVerify. Ad-517 ditionally, we made minor changes to some kernels to meet the limitations of 518 the front-end of Faial and PUG. For instance we converted nested array lookups 519 to use temporary variables and inlined functions calls that operate on arrays in 520 22 kernels. Another 8 kernels were modified to simplify their control flows. Our 521curated dataset will be included in our artifact submission. 522

Figures 9a, 9b, and 9c give the correctness results of Faial, GPUVerify, and 523 PUG, respectively. Correct refers to the true-positive rate, *i.e.*, when the tool 524 correctly identifies the kernel as DRF. False Alarm refers to the false alarm rate, 525*i.e.*, when the tool incorrectly identifies the kernel as racy. A kernel is Unsupported 526if it makes the tool crash. A Timeout occurs when the tool exceeds the limit of 527 60s to verify a kernel. The values shown are an average calculated over five runs. 528Figure 9d shows the average run time and memory usage of every true-positive 529 report (we omit invalid reports) across the three tools. 530

Overall Faial has the highest rate of true-positives at 96%. Our tool is second 531 in terms of run time and memory usage, showing a good compromise w.r.t. time 532



(d) Run time (top) and memory usage (bottom) of true-positives. Time (resp. memory) is cropped at 10s (resp. 100MB) and plotted on a logarithmic (resp. linear) scale.

Fig. 9: Results for Claim 3, on a set of 227 DRF CUDA kernels.

and space. Faial verifies most kernels within 1s, and all kernels that need more time are only verified by Faial. GPUVerify shows lower memory usage at the cost of a higher verification run time. PUG verifies the lowest number of kernels (34.8%), as most kernels are unsupported (62.6%).

## 537 7 Related Work

538 SMT-based DRF analyses Li and Gopalakrishnan propose a direct encoding of 539 DRF analysis of GPU programs in SMT, with PUG [23,24]. Both PUG and Faial 540 follow a similar approach of barrier splitting: having a symbolic representation 541 of a canonical interleaving, and dividing up the analysis over barrier intervals. 542 The two major distinctions are that (1) PUG misses inter-thread data-races in

synchronized loops, e.g., Listing 2.1, and (2) the algorithms of PUG are unspeci-543 fied and lack soundness proofs. In  $[23, \S6.3]$  the authors identify the challenge of 544 detecting inter-thread data-races, but do not elaborate a solution. Ma et al. [30] 545present a similar technique to detect data-races and deadlocks in OpenMP pro-546grams (CPU-based parallelism). However, their work does not guarantee DRF, 547and they do not formalize their algorithms. In [8], Prasanth *et al.* propose a 548polyhedral encoding of DRF for OpenMP programs, which is only applicable to 549programs with affine array accesses. However the prevalence of linearized array 550 expressions in GPU kernels is known to stump polyhedral analysis [15]. 551

Hoare-logic-based DRF analyses The main drawback of Hoare-logic based tools 552is their high rate of false alarms. They also require code annotations from a 553 concurrency expert to handle loops. GPUVerify [2, 3, 5, 6, 11] can verify CUDA 554and OpenCL kernels using Boogie [4] as a backend. GPUVerify also relies on a 555two-thread abstraction (pen and paper proof) — in this paper, we present the 556first machine-checked proof of the two-thread abstraction idea. VeriCUDA [19,20] 557 focuses on reasoning about the functional correctness of GPU programs using 558 Hoare-logic. In [21] the authors extend VeriCUDA to proving DRF. In a sim-559 ilar vein, VerCors [7] uses separation logic to prove the functional correctness 560 and DRF of GPU kernels. Both VeriCUDA and VerCors expect a tool-specific 561language, hence cannot handle real-world kernels directly. 562

Data-race finders include: dynamic data-race detection, symbolic-execution, and 563 model-checking. Such techniques are better suited for highly detailed analysis 564 in smaller kernels, and typically are unable to prove DRF. Dynamic data-race 565 detection executes a kernel to find data-races on a fixed input, e.g., [13, 17, 18, 566 27, 28, 32, 39, 40]. This technique only reports real data-races, but suffers from 567 a slowdown of at least  $10\times$  and requires the kernel input data, which might 568 be unavailable or unknown. Symbolic execution and model checking have been 569 extended to detect data-races [9, 10, 25, 33, 38]. These techniques do without the 570 kernel input and can detect more data-races than dynamic data-race detection. 571

*Miscellaneous* Ferrel *et al.* introduce a machine-checked formalism to reason about the semantics of CUDA assembly [14]. Dabrowski *et al.* mechanize the DRF-analysis of multithreaded programs [12]. Muller and Hoffmann present a logic to reason about the evaluation cost of CUDA kernels [31].

Session types [35] are a family of behavioral types that codify the message exchanges that take place over a given *session*, in terms of sends and receives. Access memory protocols are similar in that they codify the interactions that take place over a given shared *array*, in terms of reads and writes. Other behavioral types have been used to verify parallel and multithreaded systems that communicate via message-passing [29, 36, 37]. However these do not capture shared memory (only message-passing), thus cannot address data-races.

#### 583 8 Conclusion

<sup>584</sup> We tackle the problem of statically checking DRF in GPU kernels, with a new family of behavioral types, *i.e.*, access memory protocols. We provide a novel compositional analysis of access memory protocols, along with fully mechanized proofs and an implementation. Our evaluation explores challenging and diverse benchmarks (229 real-world and 258 synthetic kernels) to demonstrate that our approach is more precise (false alarms and missed alarms), scalable (time/memory growth), and usable (real-world kernels correctly verified) than other tools.

#### 591 References

- 1. Ancona, D., Bono, V., Bravetti, M., Campos, J., Castagna, G., Deniélou, P.M., 592 Gay, S.J., Gesbert, N., Giachino, E., Hu, R., Johnsen, E.B., Martins, F., Mascardi, 593 V., Montesi, F., Neykova, R., Ng, N., Padovani, L., Vasconcelos, V.T., Yoshida, 594 N.: Behavioral types in programming languages. Foundations and Trends in Pro-595 gramming Languages 3(2-3), 95-230 (2016). https://doi.org/10.1561/2500000031 596 2. Bardsley, E., Betts, A., Chong, N., Collingbourne, P., Deligiannis, P., Donaldson, 597 A.F., Ketema, J., Liew, D., Qadeer, S.: Engineering a static verification tool for 598 GPU kernels. In: Proceedings of CAV. vol. 8559, pp. 226–242. Springer (2014). 599 https://doi.org/10.1007/978-3-319-08867-9 15 600 Bardsley, E., Donaldson, A.F., Wickerson, J.: KernelInterceptor: Automating GPU 601 3. kernel verification by intercepting kernels and their parameters. In: Proceedings of 602 IWOCL. pp. 1-5 (5 2014). https://doi.org/10.1145/2664666.2664673 603 Barnett, M., Chang, B.Y.E., DeLine, R., Jacobs, B., Leino, K.R.M.: Boogie: A 604 modular reusable verifier for object-oriented programs. In: Proceedings of FMCO. 605 p. 364–387. Springer (2005). https://doi.org/10.1007/11804192 17 606 Betts, A., Chong, N., Donaldson, A.F., Ketema, J., Qadeer, S., Thomson, P., Wick-5. 607 erson, J.: The design and implementation of a verification technique for GPU ker-608 nels. Transactions on Programming Languages and Systems 37(3), 1-49 (2015). 609 https://doi.org/10.1145/2743017 610 6. Betts, A., Chong, N., Donaldson, A.F., Qadeer, S., Thomson, P.: GPUVerify: a 611 verifier for GPU kernels. In: Proceedings of OOPSLA. pp. 113–132. ACM (2012). 612 https://doi.org/10.1145/2384616.2384625 613 7. Blom, S., Huisman, M., Mihelčić, M.: Specification and verification of 614 GPGPU programs. Science of Computer Programming 95(P3), 376–388 (2014). 615 https://doi.org/10.1016/j.scico.2014.03.013 616 8. Chatarasi, P., Shirako, J., Kong, M., Sarkar, V.: An extended polyhedral model 617 for SPMD programs and its use in static data race detection. In: Proceedings of 618 LCPC'16. pp. 106-120. Springer (2017). https://doi.org/10.1007/978-3-319-52709-619 3 10 620 9. Collingbourne, P., Cadar, C., Kelly, P.H.J.: Symbolic testing of OpenCL code. In: 621 Proceedings of HVC. pp. 203–218. Springer (2012). https://doi.org/10.1007/978-622 3-642-34188-5 18 623 10. Collingbourne, P., Cadar, C., Kelly, P.H.: Symbolic crosschecking of floating-624 point and SIMD code. In: Proceedings of EuroSys. pp. 315–328. ACM (2011). 625 https://doi.org/10.1145/1966445.1966475 626 11. Collingbourne, P., Donaldson, A.F., Ketema, J., Qadeer, S.: Interleaving and lock-627
- Collingbourne, P., Donaldson, A.F., Ketema, J., Qadeer, S.: Interleaving and lock step semantics for analysis and verification of GPU kernels. In: Proceedings of
   ESOP. pp. 270–289. Springer (2013). https://doi.org/10.1007/978-3-642-37036 6\_16
- 12. Dabrowski, F., Pichardie, D.: A certified data race analysis for a Javalike language. In: Proceedings of TPHOL, pp. 212–227. Springer (2009).
  https://doi.org/10.1007/978-3-642-03359-9 16

- 22 T. Cogumbreiro, J. Lange, D. Liew Z.R., and H. Zicarelli
- Eizenberg, A., Peng, Y., Pigli, T., Mansky, W., Devietti, J.: BARRACUDA:
  Binary-level Analysis of Runtime RAces in CUDA programs. In: Proceedings of
  PLDI. pp. 126–140. ACM (2017). https://doi.org/10.1145/3062341.3062342
- Ferrell, B., Duan, J., Hamlen, K.W.: CUDA au Coq: A framework for machine validating GPU assembly programs. In: Proceedings of DATE. pp. 474–479 (2019).
   https://doi.org/10.23919/DATE.2019.8715160
- Grosser, T., Ramanujam, J., Pouchet, L.N., Sadayappan, P., Pop, S.: Optimistic
   delinearization of parametrically sized arrays. In: Proceedings of ICS. pp. 351–360.
   ACM (2015). https://doi.org/10.1145/2751205.2751248
- 16. ul Hassan Khan Khan, A., Al-Mouhamed, M., Fatayer, A., Almousa, A.,
  Baqais, A., Assayony, M.: Padding free bank conflict resolution for CUDAbased matrix transpose algorithm. In: Proceedings of SNPD. pp. 1–6 (2014).
  https://doi.org/10.1109/SNPD.2014.6888709
- <sup>647</sup> 17. Holey, A., Mekkat, V., Zhai, A.: HAccRG: Hardware-accelerated data
  <sup>648</sup> race detection in GPUs. In: Proceedings of ICPP. pp. 60–69 (2013).
  <sup>649</sup> https://doi.org/10.1109/ICPP.2013.15
- Kamath, A.K., George, A.A., Basu, A.: ScoRD: A scoped race detec tor for GPUs. In: Proceedings of ISCA. pp. 1036–1049. IEEE (2020).
   https://doi.org/10.1109/ISCA45697.2020.00088
- Kojima, K., Igarashi, A.: A Hoare logic for SIMT programs. In: Proceedings of
   APLAS. vol. 8301, pp. 58–73. Springer (2013). https://doi.org/10.1007/978-3-319 03542-0 5
- Kojima, K., Igarashi, A.: A Hoare logic for GPU kernels. Transactions on Compu tational Logic 18(1), 1–43 (2017). https://doi.org/10.1145/3001834
- Kojima, K., Imanishi, A., Igarashi, A.: Automated verification of functional correct ness of race-free GPU programs. Journal of Automated Reasoning 60(3), 279–298
   (2018). https://doi.org/10.1007/s10817-017-9428-2
- Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program
   analysis & transformation. In: Proceedings of CGO. pp. 75–88. IEEE (2004).
   https://doi.org/10.1109/CGO.2004.1281665
- 4 23. Li, G., Gopalakrishnan, G.: Scalable SMT-based verification of GPU
  4 kernel functions. In: Proceedings of FSE. pp. 187–196. ACM (2010).
  4 https://doi.org/10.1145/1882291.1882320
- Li, G., Gopalakrishnan, G.: Parameterized verification of GPU kernel programs. In: Proceedings of IPDPSW. pp. 2450–2459 (2012).
  https://doi.org/10.1109/IPDPSW.2012.302
- <sup>670</sup> 25. Li, G., Li, P., Sawaya, G., Gopalakrishnan, G., Ghosh, I., Rajan, S.P.: GKLEE:
  <sup>671</sup> Concolic verification and test generation for GPUs. In: Proceedings of PPoPP.
  <sup>672</sup> vol. 47, pp. 215–224. ACM (2012). https://doi.org/10.1145/2370036.2145844
- 673 26. Li, P., Li, G., Gopalakrishnan, G.: Practical symbolic race checking
  674 of GPU programs. In: Proceedings of SC. pp. 179–190. IEEE (2014).
  675 https://doi.org/10.1109/SC.2014.20
- 27. Li, P., Ding, C., Hu, X., Soyata, T.: LDetector: A low overhead race detector for
   GPU programs. In: Proceedings of WoDet (2014), http://wodet.cs.washington.
   edu/wp-content/uploads/2014/02/wodet2014-final14.pdf
- 28. Li, P., Hu, X., Chen, D., Brock, J., Luo, H., Zhang, E.Z., Ding, C.: LD: Lowoverhead GPU race detection without access monitoring. Transactions on Architecture and Code Optimization 14(1), 1–25 (2017). https://doi.org/10.1145/3046678
- 29. López, H.A., Marques, E.R.B., Martins, F., Ng, N., Santos, C., Vasconce-
- 683 los, V.T., Yoshida, N.: Protocol-based verification of message-passing par-

| 684 |     | allel programs. In: Proceedings of OOPSLA. pp. 280–298. ACM (2015).                 |
|-----|-----|-------------------------------------------------------------------------------------|
| 685 |     | https://doi.org/10.1145/2814270.2814302                                             |
| 686 | 30. | Ma, H., Diersen, S.R., Wang, L., Liao, C., Quinlan, D., Yang, Z.: Symbolic analysis |
| 687 |     | of concurrency errors in OpenMP programs. In: Proceedings of ICPP. pp. 510–516.     |
| 688 |     | IEEE (2013). https://doi.org/10.1109/ICPP.2013.63                                   |
| 689 | 31. | Muller, S.K., Hoffmann, J.: Modeling and analyzing evaluation cost of CUDA          |
| 690 |     | kernels. Proceedings of the ACM on Programming Languages $5(POPL)$ (2021).          |
| 691 |     | https://doi.org/10.1145/3434306                                                     |
| 692 | 32. | Peng, Y., Grover, V., Devietti, J.: CURD: A dynamic CUDA                            |
| 693 |     | race detector. In: Proceedings of PLDI. pp. 390–403. ACM (2018).                    |
| 694 |     | https://doi.org/10.1145/3192366.3192368                                             |
| 695 | 33. | Pereira, P., Albuquerque, H., Marques, H., Silva, I., Carvalho, C., Cordeiro, L.,   |
| 696 |     | Santos, V., Ferreira, R.: Verifying CUDA programs using SMT-based context-          |
| 697 |     | bounded model checking. In: Proceedings of SAC. pp. 1648–1653. ACM (2016).          |
| 698 |     | https://doi.org/10.1145/2851613.2851830                                             |
| 699 | 34. | Ruetsch, G., Micikevicius, P.: Optimizing matrix transpose in CUDA. NVIDIA          |
| 700 |     | CUDA SDK Application Note 18 (2009), https://www.cs.colostate.edu/                  |
| 701 |     | ~cs675/MatrixTranspose.pdf                                                          |
| 702 | 35. | Takeuchi, K., Honda, K., Kubo, M.: An interaction-based language and its typing     |
| 703 |     | system. In: Proceedings of PARLE. LNCS, vol. 817, pp. 398–413. Springer (1994).     |
| 704 |     | https://doi.org/10.1007/3-540-58184-7_118                                           |
| 705 | 36. | Vasconcelos, V.T.: Session types for linear multithreaded functional                |
| 706 |     | programming. In: Proceedings of PPDP. pp. 1–6. ACM (2009).                          |
| 707 |     | https://doi.org/10.1145/1599410.1599411                                             |
| 708 | 37. | Vasconcelos, V.T., Ravara, A., Gay, S.: Session types for functional mul-           |
| 709 |     | tithreading. In: Proceedings of CONCUR. pp. 497–511. Springer (2004).               |
| 710 |     | https://doi.org/10.1007/978-3-540-28644-8_32                                        |
| 711 | 38. | Wu, M., Ouyang, Y., Zhou, H., Zhang, L., Liu, C., Zhang, Y.: Simulee: Detecting     |
| 712 |     | CUDA synchronization bugs via memory-access modeling. In: Proceedings of ICSE.      |
| 713 |     | pp. 937–948. ACM (2020). https://doi.org/10.1145/3377811.3380358                    |
| 714 | 39. | Zheng, M., Ravi, V.T., Qin, F., Agrawal, G.: GRace: A low-overhead mechanism        |
| 715 |     | for detecting data races in GPU programs. In: Proceedings of PPoPP. pp. 135–146.    |
| 716 |     | ACM (2011). https://doi.org/10.1145/1941553.1941574                                 |
| 717 | 40. | Zheng, M., Ravi, V.T., Qin, F., Agrawal, G.: GMRace: Detecting data races in GPU    |
| 718 |     | programs via a low-overhead scheme. Transactions on Parallel and Distributed        |
| 719 |     | Systems <b>25</b> (1), 104–115 (2014). https://doi.org/10.1109/TPDS.2013.44         |

24T. Cogumbreiro, J. Lange, D. Liew Z.R., and H. Zicarelli

#### **Demonstration of Faial** Α 720

Here we present a demonstration of Faial in three examples. 721

*Example 3.* The following is command-line usage of Faial on Listing 2.3 (l.h.s). 722 In addition to CUDA kernels, Faial allows for the evaluation of access memory 723 protocols stored in the proto file type as shown. Barrier aligning is also shown. 724

```
// Source protocol
        $ cat inter – iteration . proto
726
        shared tile ;
727
        const global N,
728
                global M,
729
                local tid,
730
                where distinct [tid] &&
731
               N > 0 \&\& M > 0;
732
733
        \begin{array}{l} \mbox{foreach (r in 0.. N) } \\ \mbox{foreach (i in 0.. M) } \\ \mbox{frw tile[tid]; } \end{array}
734
735
736
           sync:
          foreach (j in 0.. M) { ro tile [tid + j]; }
737
        }
738
739
         // Step 3: aligned protocol
740
        $ faial -A --- steps 3 inter--iteration.proto
741
742
        ; a-lang
743
          a—prog 1
         locations : tile ;
744
         globals : N, M;
745
746
         locals : tid ;
         invariant : (proj(T1, tid) != proj(T2, tid) \& N > 0) \& M > 0;
747
748
        code \{
749
750
             sync;
             foreach (i in 0 .. M) {
751
752
                  rw tile [tid];
753
             }
754
             sync;
             foreach* (r in 1 .. N) {
foreach (j in 0 .. M) {
755
756
757
                       ro tile [tid + j];
758
                  foreach (i in 0 .. M) {
759
760
                       rw tile [tid];
                  }
761
762
                  sync;
763
             foreach (j in 0 .. M) {
764
765
                  ro tile [tid + j];
766
             }
             sync;
767
768
        }
        ; end of a-lang
769
770
         // Final analysis
771
        $ faial inter - iteration . proto
*** DATA RACE ERROR ***
772
773
774
        Array: tile [1]
775
         T1 mode: R
776
        T2 mode: W
777
778
779
         Globals Value
780
781
         М
                    2
782
```

725

783 784 Ν 2 785 786 r 1 787 788 789 790 Locals T1 T2 791 i 0 0 792 793 1 0 794 j 795 tid 0 1 796 797

Example 4. To show intermediate languages within Faial, we now run Example 1
 from Section 4.2. Here we show the inlined protocol, aligned protocol, flattened
 phases, and generated booleans.

```
// Source protocol
$ cat example-1.proto
801
802
        shared A:
803
        const local tid
804
               where distinct [tid];
805
806
        rw A[tid + 1];
ro A[tid + 2];
807
808
809
        sync;
810
         // Step 1: inline assignments and replace key-values
811
        $ faial -A --- steps 1 example-1.proto
812
         locations : A;
813
814
         globals : ;
         locals : tid ;
815
         invariant : proj($T1, tid) != proj($T2, tid);
816
817
818
        code {
             rw A[1 + tid];
ro A[2 + tid];
819
820
821
             sync;
822
        }
823
824
         // Step 3: aligned protocol
825
        $ faial -A --- steps 3 example-1. proto
826
          a—lang
827
         ; a—prog 1
         locations : A;
828
         globals : ;
829
         locals : tid ;
830
         invariant : proj($T1, tid) != proj($T2, tid);
831
832
        code {
833
834
             sync;
             rw A[1 + tid ];
ro A[2 + tid ];
835
836
             sync;
837
             sync;
838
        }
839
        ; end of a-lang
840
841
        // Step 6: flatten phases
$ faial -A --steps 6 example-1.proto
842
843
        ; flatacc
844
         : acc 1
845
         location : A;
846
```

```
847
                          locals : tid ;
848
                         pre: true;
849
                         {
                                       rw[1 + tid] if proj($T1, tid) != proj($T2, tid);
ro[2 + tid] if proj($T1, tid) != proj($T2, tid);
850
851
852
                         }
                         ; end of flatacc
853
854
                          // Step 7: generate booleans
855
                         $ faial -A --- steps 7 example-1.proto
856
                         ; symbexp
857
                           ; bool 1
858
                          array: A
859
                           predicates :
860
                          decls: tid$T2, tid$T1, $T2$mode, $T2$idx$0, $T1$mode, $T1$idx$0;
861
                         goal: ((tid$T1 != tid$T2 && ($T1$mode ==1 && $T1$idx$0 ==1 +tid$T1)) || (tid$T1 != tid$T2 &&
862
                                              (($T1$mode ==0 && $T2$mode ==1) && $T1$idx$0 ==2 +tid$T1))) && (((tid$T1 != tid$T2
863
                                             \begin{cases} (11) 100 - 0 & (11) 100 - 100 & (11) 100 - 100 & (11) 100 - 100 & (11) 100 - 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & (11) 100 & 
864
865
                         ; end of symbexp
866
867
                           // Final analysis
868
                         $ faial example-1.proto
869
                          *** DATA RACE ERROR ***
870
871
                         Array: A[2]
872
                          T1 mode: Ŵ
873
                         T2 mode: R
874
875
876
                            Locals T1 T2
877
878
                            tid
                                                        1 0
879
880
```

*Example 5.* To demonstrate usage on a real-world CUDA kernel, Faial is run on a matrix transpose from [34], *c.f.*, Listing 2.1 (l.h.s.). The inferred protocol is also shown. Note that Faial handles the representation of multiple arrays in the same protocol, but each array is analyzed independently.

```
// CUDA kernel source
885
          $ cat transposeCoalesced.cu
886
          #include <cuda.h>
887
888
          #define TILE DIM 16
889
890
          #define BLOCK_ROWS 16
891
             _global____void kernel (float* odata, float* idata, int width, int height, int nreps) {
892
893
             __requires(height == 2048);
894
             ___requires(width == 2048);
895
896
             __shared __ float tile[TILE_DIM][TILE_DIM];
897
898
             int xIndex = blockIdx.x * TILE_DIM +threadIdx.x;
int yIndex = blockIdx.y * TILE_DIM +threadIdx.y;
int index_in = xIndex + (yIndex)*width;
899
900
901
902
             xIndex = blockIdx.y * TILE_DIM +threadIdx.x;
yIndex = blockIdx.x * TILE_DIM +threadIdx.y;
903
904
             int index_out = xIndex + (yIndex)*height;
905
906
            for (int r=0; r < nreps; r++) {
    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
        tile [threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
        }
</pre>
907
908
909
910
```

```
911
                               __syncthreads();
912
913
                               for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {</pre>
914
915
                                   odata[index out+i*height] = tile [threadIdx.x][threadIdx.y+i];
916
                             }
917
                       }
918
                  }
919
                    // Inferred protocol
920
                  $ faial -b 16,16 -g 64,64 -A --steps 0 transposeCoalesced.cu
921
                    arrays: tile, odata, idata;
922
                    scalars : width, nreps, height, gridDim.y, gridDim.x, blockIdx.y, blockIdx.x, blockDim.y,
923
924
                                  blockDim.x;
925
                                (((( proj(T1, threadIdx.x) != proj(T2, threadIdx.x) || proj(T1, threadIdx.y) != proj(T2, threadIdx.x) || proj(T2, threadIdx.x) || proj(T2, threadIdx.y) != proj(T2, threadIdx.x) || proj(T2, threadIdx.x) || proj(T2, threadIdx.x) || proj(T2, threadIdx.y) != proj(T2, threadIdx.x) || proj(T2, threadIdx.x) || proj(T2, threadIdx.y) || proj(T2, threadIdy.y) || proj(T2, threa
                  pre:
926
                                   threadIdx.y)) && blockIdx.x < gridDim.x) && threadIdx.x < blockDim.x) && blockIdx.y <
                                  gridDim.y) && threadIdx.y <blockDim.y;
927
928
                  code \{
929
                                         local threadIdx.y;
930
                                        local threadIdx.x;
931
                                         assert (height == 2048)
assert (width == 2048)
932
933
                                         local xindex = (16 * blockIdx.x) + threadIdx.x;
934
                                         local yIndex = (16 * blockIdx.y) + threadIdx.y;
935
                                         local index_in = xIndex + (yIndex * width);
local xIndex = (16 * blockIdx.y) + threadIdx.x;
local yIndex = (16 * blockIdx.x) + threadIdx.y;
936
937
938
                                         local index_out = xIndex + (yIndex * height);
939
                                       foreach (r in 0 .. nreps) {
    foreach (i in 0 .. 16; i + 16) {
        rw tile [threadIdx.y + i, threadIdx.x];
        ro idata [index_in + (i * width)];
    }
}
940
941
942
943
                                                              }
944
945
                                                              svnc:
                                                             foreach (i in 0 .. 16; i + 16) {
    rw odata[index_out + (i * height)];
    ro tile [threadIdx.x, threadIdx.y + i];
946
947
948
949
                                                              }
                                        }
950
951
                  }
952
                    // Final analysis
953
                  $ faial -b 16,16 -g 64,64 transposeCoalesced.cu
954
                   *** DATA RACE ERROR ***
955
956
957
                  Array: tile [15, 14]
T1 mode: W
958
959
                   T2 mode: R
960
961
962
                     Globals
                                                      Value
963
964
                     blockIdx.x 0
965
966
                     blockIdx.y 0
967
                                                      2
968
                     nreps
969
                                                     1
970
                     r
971
972
973
974
                     Locals
                                                        T1 T2
975
                    i
                                                        0 0
976
977
                     i1
978
                                                        0 0
```

979 \_\_\_\_\_

982 threadIdx.y 14 15 983 -----