Scalable FSM parallelization via path fusion and higher-order speculation

The finite-state machine (FSM) is a fundamental computation model used by many applications. However, FSM execution is known to be “embarrassingly sequential” due to the state dependences among transitions. Existing solutions leverage enumerative or speculative parallelization to break the dependences. However, the efficiency of both parallelization schemes highly depends on the properties of the FSM and its inputs. For FSMs exhibiting unfavorable properties, the former suffers from the overhead of maintaining multiple execution paths, while the latter is bottlenecked by the serial reprocessing of misspeculated chunks. Either way, the scalability of FSM parallelization is seriously compromised. This work addresses the above scalability challenges with two novel techniques. First, for enumerative parallelization, it proposes path fusion. Inspired by the classic NFA to DFA conversion, path fusion maps a vector of states in the original FSM to a new (fused) state. In this way, it can reduce multiple FSM execution paths to a single path, minimizing the overhead of path maintenance. Second, for speculative parallelization, this work introduces higher-order speculation to avoid the serial reprocessing during validations. This is a generalized speculation model that allows speculated states to be validated speculatively. Finally, this work integrates the different schemes of FSM parallelization into a framework, BoostFSM, which automatically selects the best scheme based on the relevant properties of the FSM. Evaluation using real-world FSMs with diverse characteristics shows that BoostFSM raises the average speedup of the existing speculative and enumerative parallelization schemes from 3.1× and 15.4×, respectively, to 25.8× on a 64-core machine.


INTRODUCTION
As a basic computation model, the finite-state machine (FSM) underlies many important applications, ranging from intrusion detection [6,27,53,64] and data decoding [26,52] to motif searching [11,49], rule mining [62], and textual data analytics [12,15,36]. However, the execution of an FSM is known to be “embarrassingly sequential” [5,67], due to the inherent dependences among state transitions: in each state transition, the current state always depends on the prior state. These state dependences fundamentally limit the performance of FSM-based computations on modern processors, where parallelism plays an increasingly critical role.
To address the inherent dependences in FSM computations, prior works [16,24,33,41,42,66,67] fall into two basic parallelization schemes: (i) state enumeration and (ii) state speculation. Figure 1-(b) summarizes them. Without loss of generality, assume the input to an FSM (e.g., a binary sequence) is partitioned evenly into two chunks, as shown in Figure 1-(a). Due to the dependences among state transitions, the starting state for the second chunk remains unknown until the first chunk has been processed: the ending state of the first chunk is the starting state of the second chunk. To process the two chunks in parallel, one can choose:

(1) State Enumeration. As the unknown starting state must be one of the states in the FSM, we can enumerate all of them by forking an execution path for each state [16,33], but maintaining all the execution paths may bring significant overhead. To reduce it, prior work [33] checks whether some paths transition to the same state, known as path merging, in which case only one of the merged paths needs to be kept. However, the effectiveness of this approach highly depends on the state convergence property of the FSM. When some of the execution paths exhibit slow convergence or fail to converge, the overhead of this scheme would be high.

(2) State Speculation. Instead of considering all the states, one can guess the starting state of the second chunk [41,42,66,67].
To ensure correctness, the predicted state must be validated against the ending state of the prior chunk, which serves as the ground truth. If the validation fails (i.e., misspeculation), the chunk needs to be reprocessed. However, when the input is partitioned into multiple chunks, the ending state of the prior chunk may not be the ground truth until its own speculation has been validated (with needed reprocessing). These serialized validations form a fundamental scalability bottleneck in the existing speculative FSM parallelization [42].
In addition, a hybrid scheme may choose to enumerate a subset of states [23,63], which in fact inherits both the advantages and limitations of the above two schemes. In summary, the existing FSM parallelization schemes face fundamental scalability challenges.
This work introduces two novel techniques, path fusion and higher-order speculation, to address the scalability challenges in the two basic FSM parallelization schemes, respectively. For state enumeration, we propose to fuse different execution paths into a single path. Note that, unlike path merging, path fusion is not based on state convergence. Instead, its idea stems from the classic NFA to DFA conversion [2] (NFA and DFA stand for nondeterministic and deterministic finite automaton, respectively), a way to remove the inefficiency of nondeterministic NFA execution. Rather than mapping a subset of NFA states to a DFA state, path fusion encodes a vector of states in the original FSM into a fused state, based on which it generates a fused FSM. Thus, a single execution path of the fused FSM simulates multiple execution paths of the original FSM. In principle, the fused FSM could be much larger than the original. To address this, we also propose to dynamically generate a partial fused FSM which captures the states and transitions only for the current input to minimize the memory requirement.
For state speculation, to address the bottleneck of sequential validations, we introduce the concept of speculation order and show that the existing FSM speculation solutions are in fact of the first order. By raising the speculation to higher orders, we find not only that the validations can be naturally parallelized, because chunks of higher-order speculation no longer need to wait for the ground truth, but also that the speculation accuracy may improve: the validation of a higher-order speculation may introduce a better speculated state that is more likely to be the correct starting state (see Section 4.2). Based on these findings, we propose a higher-order iterative speculation scheme which organizes the FSM computations into a series of iterations, gradually improving the speculation accuracy in a parallel fashion.
Finally, to cover FSMs exhibiting diverse properties, we integrate the different parallelization schemes into a multi-scheme parallelization framework, called BoostFSM. Based on a series of heuristics, BoostFSM automatically selects the best parallelization scheme for the given FSM and its inputs. Using a set of FSM benchmarks with various characteristics, our evaluation shows that path fusion improves the speedup of enumerative parallelization from 15.4× to 31.0× (static fusion) and 18.3× (dynamic fusion) on a machine with 64 cores; for speculative parallelization, higher-order speculation raises the speedup from 3.1× to 19.5×. With parallelization scheme selection, BoostFSM achieves 25.8× speedup on average.
In summary, this work makes a three-fold contribution.
• First, it proposes static and dynamic path fusion techniques to reduce the overhead of maintaining multiple execution paths in enumerative FSM parallelization (Section 3).
• Second, it introduces higher-order speculation in the context of FSM parallelization and designs an iterative speculation scheme to address the serial validation bottleneck in the existing speculative FSM parallelization (Section 4).
• Finally, this work offers a set of heuristics to help select the parallelization scheme for the given FSM (Section 5) and confirms the effectiveness of the proposed techniques with a systematic evaluation (Section 6).

BACKGROUND
We first provide the background of this work.

FSM and Its Dependences
As shown in Figure 2-(a), an FSM can be represented as a directed graph, where nodes represent states, edges represent transitions among states, and labels on the edges indicate the conditions for the transitions to happen. The transitions can be stored in memory as a transition table, as shown in Figure 2-(b). The size of the table is N × |Σ|, where N is the number of states and |Σ| is the number of symbols (Σ is known as the alphabet). As shown in Figure 2-(c), an FSM execution starts from the initial state (S_0) and makes transitions by consuming input symbols one by one. Once moving into an accept state (a node with double circles), the FSM may trigger some action, like emitting a code in Huffman decoding [26] or incrementing a counter in pattern matching [64]. According to this execution model, every state transition in the transition sequence depends on not only the corresponding input symbol but also the prior state. Together, they form a dependence chain, inherently preventing the FSM from running in parallel. Many prior studies [16,24,33,41,42,66,67] tried to “break” the dependence chain. Despite the differences in detail, they fall into two basic categories: state enumeration and state speculation. Next, we elaborate each of them with examples.
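To make the execution model concrete, here is a minimal Python sketch of table-driven FSM execution; the 3-state transition table and the accept set are hypothetical illustrations, not taken from the paper's figures.

```python
# Minimal sketch of table-driven FSM execution. The 3-state transition
# table and the accept set are hypothetical, not taken from the paper.
def run_fsm(trans, start, symbols, accept=frozenset()):
    """trans[state][symbol] -> next state; returns (ending state, #accepts)."""
    state, hits = start, 0
    for c in symbols:
        state = trans[state][c]   # each transition depends on the prior state
        if state in accept:
            hits += 1             # e.g., bump a counter in pattern matching
    return state, hits

# states S0..S2 over a binary alphabet {0, 1}
TRANS = [
    [1, 2],   # S0 --0--> S1, S0 --1--> S2
    [0, 2],   # S1 --0--> S0, S1 --1--> S2
    [1, 0],   # S2 --0--> S1, S2 --1--> S0
]
end, hits = run_fsm(TRANS, 0, [0, 1, 1, 0], accept={2})
```

The loop makes the dependence chain visible: each iteration reads `state` written by the previous one, which is exactly what prevents naive parallelization.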

State Enumeration
Assume we partition the input to an FSM into two chunks, as shown in Figure 3. For the first chunk, we start the FSM execution from the initial state S_0. But for the second chunk, we do not know its starting state, as it depends on the processing of the first chunk. The basic idea of state enumeration [16,33] is to fork an execution path for each state in the FSM. Certainly, one of the paths must be correct. As demonstrated in Figure 3, S_1 is later found to be the actual starting state, based on the ending state of chunk_0; hence, its execution path will be finally selected. However, maintaining all execution paths may create significant overhead, which compromises or even outweighs the benefits of parallelization. Prior work [33] observed that, after a certain number of transitions, different execution paths may merge. As shown in Figure 3, the paths starting from S_0 and S_1 both transition into S_0 after reading the first 0. Later, the path starting from S_2 also merges with the rest. After path merging, only one of the merged paths needs to be maintained, thus lowering the overhead. However, the effectiveness of path merging highly depends on the state convergence properties of the FSM. In fact, for many real-world FSMs, most states tend to converge quickly, but a few states fail to converge even after a large number of transitions [33]. As an example, Figure 4 shows an FSM slightly different from that in Figure 2, but no states in this FSM would converge for any given input.
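The enumeration-with-merging idea can be sketched as follows. Paths are grouped by their current state, so merged paths cost one transition per symbol instead of one per starting state; the transition table is hypothetical.

```python
# Sketch of enumerative processing of one chunk with path merging: paths are
# grouped by their current state, so merged paths cost one transition per
# symbol instead of one per starting state. TRANS is hypothetical.
def enumerate_chunk(trans, n_states, symbols):
    """Return a dict: starting state -> ending state after the chunk."""
    # groups: current state -> set of starting states whose paths are here
    groups = {s: {s} for s in range(n_states)}
    for c in symbols:
        nxt = {}
        for cur, starts in groups.items():       # one transition per live path
            nxt.setdefault(trans[cur][c], set()).update(starts)
        groups = nxt                             # merging paths collapse here
    return {start: end for end, starts in groups.items() for start in starts}

TRANS = [[1, 2], [0, 2], [1, 0]]
endings = enumerate_chunk(TRANS, 3, [0, 0])
# once chunk_0 finishes with ending state e, chunk_1's true ending is endings[e]
```

Note how the per-symbol work shrinks only when paths happen to converge; on an FSM with poor convergence, `groups` stays large for the whole chunk.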
Figure 4: FSM Example with Poor Convergence

One way to address the limitation of poor state convergence is to exploit SIMD parallelism [33]: using different SIMD lanes to run different FSM execution paths. However, such fine-grained hardware parallelism could otherwise be used to enable extra data-level parallelization, that is, partitioning the input into more chunks [41]. Moreover, its efficiency is restricted by the SIMD width.
Besides path merging, it is often beneficial to separate the state actions into a second pass after the state enumeration, so that they do not have to be “multi-versioned” [33]. However, this two-pass processing introduces non-negligible overhead even when the FSM shows ideal state convergence (see Section 6).

State Speculation
Instead of enumerating all the states, the other strategy is to predict the starting state. As illustrated in Figure 5, state S_2 is predicted to be the starting state, from which chunk_1 is processed. However, if the prediction turns out to be incorrect (a misspeculation), the chunk needs to be reprocessed. In the example in Figure 5, S_1 is later found to be the correct starting state; as a result, chunk_1 is reprocessed. Fortunately, path merging may be detected between the reprocessing path and the speculated processing path. Once they merge, the reprocessing can safely stop.

Figure 5: Speculative Parallelization
Figure 7: NFA and Its Execution

To predict the starting state for chunk_i, prior work [41,42] runs state enumeration on a suffix of chunk_i−1, called the lookback, and then selects the ending state that appears most frequently among the enumerated paths. More principled probabilistic reasoning can further improve the accuracy [67]. The scalability bottleneck in speculative parallelization lies in its sequential validations. When the input is partitioned into multiple chunks, the validations have to be conducted in order from the second chunk to the last, as shown in Figure 6, because, before the prior chunk is validated (and reprocessed as needed), we are not sure if its ending state is correct. This is less of a concern when the speculation accuracy is high or the reprocessing lengths are short. But when the speculation accuracy drops, or the reprocessing paths fail to converge with their speculative paths quickly, the scalability of speculative parallelization can be seriously limited.
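A simplified sketch of this first-order scheme, with lookback-based prediction and in-order validation (chunk processing is shown serially for clarity, and reprocessing rescans the whole chunk rather than stopping at a path merge; the transition table is hypothetical):

```python
from collections import Counter

# Sketch of first-order speculative parallelization. Prediction follows the
# "lookback" idea: enumerate all states over a suffix of the previous chunk
# and pick the most frequent ending state. TRANS is hypothetical; chunk
# processing is shown serially for clarity.
def step(trans, state, symbols):
    for c in symbols:
        state = trans[state][c]
    return state

def predict_start(trans, n_states, lookback_symbols):
    ends = Counter(step(trans, s, lookback_symbols) for s in range(n_states))
    return ends.most_common(1)[0][0]

def speculative_run(trans, n_states, chunks, start, lookback=4):
    # guess a starting state for every chunk but the first
    guesses = [start] + [
        predict_start(trans, n_states, chunks[i][-lookback:])
        for i in range(len(chunks) - 1)
    ]
    # process all chunks (conceptually in parallel)
    ends = [step(trans, g, ch) for g, ch in zip(guesses, chunks)]
    # sequential validation: chunk i+1's ground truth is chunk i's ending state
    for i in range(len(chunks) - 1):
        if ends[i] != guesses[i + 1]:                         # misspeculation
            ends[i + 1] = step(trans, ends[i], chunks[i + 1])  # reprocess
    return ends[-1]

TRANS = [[1, 2], [0, 2], [1, 0]]
```

The validation loop is inherently serial: chunk i+1 cannot be declared correct until chunk i has been, which is precisely the bottleneck the paper targets.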
In summary, the efficiencies of both FSM parallelization schemes depend on the properties of the FSM and its inputs. For those FSMs exhibiting unfavorable properties, they both suffer from high overhead and poor scalability. In the following, we will introduce two techniques to address the scalability issues in each of the two schemes, namely path fusion and higher-order speculation. After that, we will present a set of heuristics to facilitate the parallelization scheme selection in the presence of FSMs that carry a wide range of diverse properties.

PATH FUSION
This section presents path fusion, a technique that fuses different FSM execution paths into a single path, to boost the efficiency of enumerative parallelization. Note that, unlike path merging (see Section 2), path fusion does not rely on FSM's state convergence property. Instead, its idea is inspired by the classic NFA to DFA conversion [2]. Next, we first provide the intuition of path fusion, then present its basic algorithm, and finally discuss how to adopt it dynamically during the enumerative parallelization.

Intuition
An interesting observation we made is that enumerative FSM parallelization suffers from a similar kind of inefficiency as NFA execution. As shown in Figure 7, an NFA execution in general needs to track multiple current states (bounded by the total number of states) due to its nondeterministic behavior, which leads to poor execution efficiency in a way similar to that of enumerative parallelization. Despite the similarities, there are two key differences between the two scenarios:
• First, state enumeration maintains a vector of states (i.e., ordered), one for each FSM execution path. The ordering is essential to selecting the right execution path later during the merging phase. By contrast, an NFA only maintains a subset of states, without any ordering.
• Second, the number of current states in an NFA execution may increase or decrease (see Figure 7), but the number of current states in state enumeration may only decrease, which happens in the cases of path merging.
A well-known solution to the inefficiency of NFA execution is to convert the NFA to an equivalent DFA using the subset construction algorithm [2]. Thus, a natural question is: can we design a similar technique to address the execution inefficiency in state enumeration? Fortunately, we find that, by adopting a worklist-based strategy like the one used in the subset construction algorithm [2], we can generate a new FSM, called fused FSM, whose single execution path simulates multiple execution paths of the original FSM. Next, we explain how to statically (i.e., offline) construct the fused FSM.

Static Path Fusion
A state in a fused FSM corresponds to a vector of states in the original FSM. Like the NFA to DFA conversion [2], we can statically construct a fused FSM without any actual inputs.

Algorithm. Algorithm 1 presents a worklist-based strategy to construct the fused FSM with states {S̄_0, S̄_1, ..., S̄_M} from a given FSM with states {S_0, S_1, ..., S_N}. Initially, it maps V_0, a special state vector [S_0, S_1, ..., S_N] that corresponds to the N enumerated execution paths, to the initial state of the fused FSM, S̄_0 (Line 6). Then, it initializes a worklist with V_0. After that, the algorithm iteratively removes state vectors from the worklist, computes their next state vectors (V_next) for each symbol c_i ∈ Σ (Lines 15-16), maps new state vectors to new fused states (Line 18), and finally records the fused state transitions (Line 20). In addition, the algorithm creates a map from fused states to state vectors (Line 22), which will be used to decode a fused state back to a state vector.
By only adding new fused states to the worklist (Lines 17 and 20), the algorithm always terminates, as the number of states in the fused FSM is bounded by the size of the N-dimensional vector space, N^N. However, in practice, the algorithm usually traverses only a very small fraction of the entire vector space, as we will demonstrate shortly, after a quick example.
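The worklist construction described above can be sketched in Python as follows; it is analogous to subset construction but operates over ordered state vectors. The 3-state FSM is a hypothetical example, not the paper's Figure 4, and Algorithm 1's line numbering is not mirrored.

```python
# Sketch of static path fusion: a worklist construction analogous to subset
# construction, but over ordered state vectors. TRANS is a hypothetical
# 3-state FSM over a binary alphabet, not the paper's Figure 4.
def build_fused_fsm(trans, n_states, alphabet):
    v0 = tuple(range(n_states))      # one enumerated execution path per state
    fused_id = {v0: 0}               # state vector -> fused state id
    vectors = [v0]                   # fused state id -> state vector (decoder)
    fused_trans = {}                 # fused state id -> row of next fused ids
    worklist = [v0]
    while worklist:
        v = worklist.pop()
        row = []
        for c in alphabet:
            v_next = tuple(trans[s][c] for s in v)   # advance every path
            if v_next not in fused_id:               # a brand-new fused state
                fused_id[v_next] = len(vectors)
                vectors.append(v_next)
                worklist.append(v_next)
            row.append(fused_id[v_next])
        fused_trans[fused_id[v]] = row
    return fused_trans, vectors

TRANS = [[1, 2], [0, 2], [1, 0]]
fused, vectors = build_fused_fsm(TRANS, 3, [0, 1])
```

Running the fused FSM from fused state 0 advances all enumerated paths at once, and `vectors` decodes the ending fused state back into per-starting-state ending states. For this toy FSM the construction reaches only 8 fused states out of the 3^3 = 27 possible vectors.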
Example. Figure 8 shows the fused FSM generated for the FSM example in Figure 4, consisting of 6 states whose IDs follow the order in which the states are created. With the fused FSM, the three execution paths on chunk_1 shown in Figure 4 can be reduced to a single execution path of the fused FSM. Later, if S_2 turns out to be the actual starting state of chunk_1, we can immediately find that the actual ending state of chunk_1 is S_1, the third element in the state vector mapped to the ending fused state S̄_1. This shows the importance of using vectors instead of subsets as the states of the fused FSM: preserving the correspondence between the starting states and the ending states. Note that, though the vector space for the FSM example in Figure 4 has 3^3 = 27 elements, the statically generated fused FSM consists of only 6 states. Next, we show this is not a special case, but a prevalent property of fused FSMs.
Size in Practice. Similar to the NFA to DFA conversion [2], the sizes of the fused FSMs for real-world FSMs are often significantly smaller than the theoretical bound. To demonstrate this, 377 FSMs from the Snort library [48] are chosen such that the fused FSM for each consists of fewer than 10^6 states. Figure 9 reports the actual numbers of fused states (note that logarithmic scales are used on both axes). We find that the sizes of the fused FSMs are usually well below N^3 and even below N^2, where N is the number of states in the original FSM. These results confirm the feasibility of static fused FSM generation for many real-world FSMs. Correspondingly, the time complexity of Algorithm 1 in practice is often below O(N^4 · |Σ|) and even O(N^3 · |Σ|), because, for each fused state, there are |Σ| different transitions, and each fused state transition requires N original FSM transitions. Despite the promise of static fused FSM generation, we still found that, for many FSMs, the algorithm either fails to generate a fused FSM within 3 minutes or generates a fused FSM with over 1 million states. In general, the size of the fused FSM should fit into the given memory budget. For this reason, we next explore the possibility of dynamic fused FSM generation.

Dynamic Path Fusion
Unlike static path fusion which builds the entire fused FSM for all possible inputs, dynamic path fusion constructs a partial fused FSM that only captures the states and transitions for a single input. In this way, it may reduce the memory needs.
Algorithm. The application of dynamic path fusion resembles the just-in-time (JIT) compilation strategy used in modern compilers. It consists of two execution modes:
• Basic mode. Given a vector of current states V and an input symbol c, this mode makes an individual transition for each state in the vector to obtain the next state vector V_next. In addition, it generates the fused states S̄ and S̄_next, which correspond to V and V_next respectively, as well as the fused transition between them. Once it finds an already-visited state vector V, it maps V to its fused state S̄ and switches to the fused mode.
• Fused mode. Given the current fused state S̄, this mode tries to make a fused state transition S̄_next = Trans[S̄][c]. If the transition is unavailable, it switches back to the basic mode.
With dynamic path fusion, the state enumeration scheme starts from the basic mode, then switches between the two modes based on the availability of fused state transitions.
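The two-mode execution can be sketched as follows. Plain dicts stand in for the specialized data structures the paper uses, and the transition table is hypothetical.

```python
# Sketch of dynamic path fusion: start in basic mode (advance the whole state
# vector), memoize fused states/transitions seen on this input, and switch to
# fused mode whenever a memoized transition is available. TRANS is
# hypothetical; plain dicts stand in for the paper's data structures.
def run_dynamic_fusion(trans, n_states, symbols):
    fused_id = {}        # state vector -> fused state id
    vectors = []         # fused state id -> state vector
    fused_trans = {}     # (fused state id, symbol) -> fused state id

    def intern(v):       # map a state vector to its (possibly new) fused state
        if v not in fused_id:
            fused_id[v] = len(vectors)
            vectors.append(v)
        return fused_id[v]

    fid = intern(tuple(range(n_states)))
    fused_steps = 0
    for c in symbols:
        if (fid, c) in fused_trans:          # fused mode: one lookup per symbol
            fid = fused_trans[(fid, c)]
            fused_steps += 1
        else:                                # basic mode: one step per path
            v_next = tuple(trans[s][c] for s in vectors[fid])
            fused_trans[(fid, c)] = intern(v_next)
            fid = fused_trans[(fid, c)]
    return vectors[fid], fused_steps

TRANS = [[1, 2], [0, 2], [1, 0]]
```

On a periodic input the execution quickly cycles through already-recorded fused transitions, so most symbols are consumed in fused mode at the cost of one lookup each.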

Data Structures. A key design question in implementing dynamic path fusion is how to store the transition information Trans[S̄][c]. A straightforward solution is to use a hash map, where the key is a combination of S̄ and c, and the value is the next fused state S̄_next. While intuitive, this requires an invocation of a hash function for each fused state transition. Compared to the transition table (Figure 2-b), we found the cost of hash-map-based state transitions to be about 7× higher. Instead, we store the fused state transitions in a vector of arrays. As illustrated in Figure 10-a, each array has a fixed length once allocated, but the vector is extensible at the end: each time a new fused state is created, a “row” is added to the vector of arrays. Each “row” is indexed by the input symbol ID, while the vector is indexed by the fused state ID. Each element in a “row” stores a pointer (e.g., to S̄_1) to the target state for an input symbol. In theory, if the transitions of an execution are scattered sparsely across many fused states, this data structure may waste space, similar to the transition table. However, in practice, we found that, for a single input, the transitions are often concentrated among a few “hot” states, leading to small memory footprints.
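A minimal sketch of the vector-of-arrays store just described; class and method names are illustrative, and `None` plays the role of an unavailable transition.

```python
# Minimal sketch of the vector-of-arrays transition store: an extensible
# list of fixed-length rows, one row per fused state, indexed first by fused
# state id and then by symbol id, so the hot path needs no hashing. Names
# are illustrative; None marks an unavailable transition.
class FusedTransStore:
    def __init__(self, n_symbols):
        self.n_symbols = n_symbols
        self.rows = []                    # rows[fid][symbol] -> fid or None

    def add_state(self):
        """Append a fresh row for a newly created fused state."""
        self.rows.append([None] * self.n_symbols)
        return len(self.rows) - 1         # the new fused state's id

    def set(self, fid, symbol, next_fid):
        self.rows[fid][symbol] = next_fid

    def get(self, fid, symbol):
        return self.rows[fid][symbol]     # None => switch to basic mode
```

Because lookup is two array indexings rather than a hash-map probe, the fused-mode hot path stays close to the cost of an ordinary transition-table access.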
The FusedState shown in Figure 10-b consists of a state ID and a pointer to its corresponding state vector, used to quickly switch back to the basic mode once a fused state transition is unavailable. To find chances of switching to the fused mode, a hash map from state vectors to fused states (see Figure 10-c) is maintained. The sizes of the vector and the hash map equal the number of generated fused states.

Figure 11: Example Execution with Dynamic Path Fusion

Example. Figure 11 illustrates an execution with dynamic path fusion using the FSM from Figure 4. The thick arrows indicate the switches between the two execution modes. Initially, the execution starts from the basic mode; meanwhile, it generates fused states and transitions (in gray) as it consumes input symbols. After consuming the third symbol ('1'), the execution switches to the fused mode, as it finds that the current state vector [S_0, S_2, S_1] has been observed before (after consuming the first symbol '0'), i.e., its fused state S̄_1 already exists. However, after reading the fourth symbol ('0'), it notices that the fused transition for this symbol has not yet been established (i.e., it is unavailable), so it switches back to the basic mode and records this fused transition. A similar process repeats until all the input symbols are consumed.
Cost Analysis. In general, the longer the execution stays in the fused mode, the more efficiency benefits dynamic path fusion brings. In fact, we can capture how long the execution stays in the basic mode using the number of unique fused state transitions encountered in the execution, denoted N_uniq, because (i) each unique fused state transition has to be generated in the basic mode, and (ii) the basic mode generates each unique fused state transition only once. On the other hand, the time spent in the basic mode also depends on the cost of state enumeration for processing each symbol, which is proportional to the state vector size |V|. By default, |V| equals the number of states in the original FSM, but it can often be significantly reduced with an optimization (as we will show shortly). The product of these two factors, N_uniq × |V|, captures the total cost of execution in the basic mode.
In addition, there are also costs of generating the fused states and transitions, as well as the cost of switching between the two execution modes. However, our evaluation shows that they are usually negligible thanks to the relatively small numbers of fused states and unique state transitions.
Optimization. As mentioned earlier, path fusion is different from the path merging optimization (Section 2). In fact, we can integrate the latter into the former to further boost efficiency. To achieve this, we separate state enumeration into two phases: a path merging phase and a path fusing phase. In the first phase, most paths tend to merge quickly [33]. Once the number of execution paths drops below a threshold τ_n or remains unchanged for τ_l transitions, we move to the second phase and start dynamic path fusion. When path merging reduces the size of the state vector, the subsequent dynamic path fusion consumes even less memory and makes faster switches between the two execution modes.
So far, we have presented the path fusion for improving the scalability of state enumeration. Next, we move to the other FSM parallelization scheme, state speculation, which also suffers from a critical scalability issue related to the properties of FSMs.

HIGHER-ORDER SPECULATION
As explained in Section 2, the scalability issue in speculative FSM parallelization lies in its sequential validations. In this section, we address this issue by introducing the concept of speculation order. Though we are not aware of an existing definition for this concept, the ideas behind it have been intensively studied in the literature, especially in the context of thread-level speculation; more details on related prior work are given in Section 7. Based on this concept, we show that existing FSM speculation solutions belong to first-order speculation, and that, by raising the speculation to higher orders, it is possible to validate different input chunks in parallel while ensuring correctness.

Speculation Order
Formally, we denote the speculation at the beginning of chunk_i as Spec(i, S, C), where S is the predicted starting state and C is the corresponding correct starting state, also referred to as the correctness criterion. The speculation Spec(i, S, C) can be validated by replacing the predicted state S with the correctness criterion C. A validation makes the starting state of chunk_i non-speculative.

If we refer to the above speculation Spec(i, S, C) as first-order speculation, we can then generalize the concept of “speculation” to higher orders, recursively. A speculation is:
• first order, denoted Spec^1(i, S, C), if and only if its validation makes the starting state non-speculative, i.e., its correctness criterion C is itself non-speculative;
• (k+1)-th order, denoted Spec^{k+1}(i, S, C), if and only if its validation yields a k-th order speculation Spec^k(i, C, C′).
That is, the predicted state in Spec^k(i, C, C′) is in fact the correctness criterion from Spec^{k+1}(i, S, C). In other words, the correctness criterion C is itself speculative.
Based on the above formalization, it is not hard to see that all prior FSM speculation techniques [24,41,42,66,67] in fact belong to first-order speculation, as the correctness criteria used in their validations are always non-speculative. This is the root cause of the sequential validations: first-order speculation requires all the prior chunks to be non-speculative before it validates the current one. Next, we show that by raising the speculation orders of different input chunks, the sequential validation issue can be effectively alleviated.

Benefits of Higher-Order Speculation
In general, raising the speculation order brings benefits to speculative FSM parallelization in two aspects:
• Earlier and meaningful validation. To illustrate this benefit, let us reexamine the conventional (first-order) speculation in Figure 6, where the validation of chunk_3 has to wait for the completion of chunk_2's validation in order to obtain the non-speculative ending state of chunk_2, S_end_2. However, if we raise the speculation at chunk_3 to the second order, as shown in Figure 12, and use the speculative ending state of chunk_2, S′_end_2, as the correctness criterion, then we can immediately start its validation and reprocessing, in parallel with those of chunk_1. If S′_end_2 turns out to be the correct ending state of chunk_2, as in the figure, then the reprocessing of chunk_3 is valid. In other words, the sequential validations are optimistically parallelized.

Figure 12: Earlier and Meaningful Validation
• Improved speculation accuracy. Besides extra parallelism, the other benefit of higher-order speculation comes from improved speculation accuracy. Without loss of generality, consider chunk_2 and chunk_3 in Figure 13, whose starting states are predicted with some existing technique [24,41,67], denoted S′_start_2 and S′_start_3. Statistically, their probabilities of being the correct starting states are the same. After the speculative processing of chunk_2, assume the ending state is S′_end_2; then the probability that S′_end_2 is the correct starting state of chunk_3 may be higher than that of S′_start_3, thanks to potential state convergence during the speculative processing of chunk_2. That is, even if S′_start_2 is incorrect, its execution path may converge with the correct path, resulting in a correct ending state. If the speculation at chunk_3 is of second order (see Figure 13), where the correctness criterion is S′_end_2, then after the validation (i.e., replacing S′_start_3 with S′_end_2), the speculation accuracy can potentially be increased.
To reap the above benefits of higher-order speculation, we next present a new speculative FSM parallelization model, referred to as iterative speculation.

Iterative Speculation
Unlike existing speculative FSM parallelization [24,41,42,66,67], iterative speculation organizes the speculative FSM execution into a series of iterations. Algorithm 2 summarizes its basic ideas. First, it predicts the starting state for each chunk, just like the conventional speculation schemes do.

Next, we explain how higher-order speculation is reflected in the above algorithm and why the algorithm in fact always terminates within #chunks iterations. Figure 14 uses different grayscale levels to represent different orders of speculation, with the darkest used for non-speculative processing. Initially, chunks are assigned increasing orders of speculation: chunk_i is of i-th order speculation. Then, during each iteration, the latest speculation of each chunk is validated using the latest ending state of the prior chunk. As a result, its speculation order is reduced by at least one. Once its speculation order becomes 0 (i.e., non-speculative), a chunk stays inactive, as no ending states of its prior chunks are speculative. Obviously, the initial highest speculation order (that of the last chunk) determines the maximum number of iterations; thus the algorithm takes at most #chunks iterations.
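The iterative scheme can be sketched as follows. Validation rounds are shown serially, but within a round all stale chunks could be revalidated and reprocessed in parallel, each using the previous round's ending states as its (possibly speculative) criterion; the FSM, chunking, and initial guesses are hypothetical.

```python
# Sketch of higher-order iterative speculation. Within a round, every
# still-speculative chunk is validated against the PREVIOUS round's ending
# state of its predecessor (all such validations/reprocessings could run in
# parallel). The FSM, chunking, and guesses are hypothetical.
def step(trans, state, symbols):
    for c in symbols:
        state = trans[state][c]
    return state

def iterative_speculation(trans, chunks, start, guesses):
    starts = [start] + list(guesses)      # predicted starting states
    ends = [step(trans, s, ch) for s, ch in zip(starts, chunks)]
    rounds = 0
    while True:
        prev_ends = ends[:]               # this round's correctness criteria
        stale = [i for i in range(1, len(chunks))
                 if starts[i] != prev_ends[i - 1]]
        if not stale:                     # every chunk is non-speculative
            return ends[-1], rounds
        rounds += 1                       # bounded by the number of chunks
        for i in stale:                   # conceptually parallel
            starts[i] = prev_ends[i - 1]  # adopt the (speculative) criterion
            ends[i] = step(trans, starts[i], chunks[i])   # reprocess

TRANS = [[1, 2], [0, 2], [1, 0]]
```

Note that chunk_3's revalidation here uses chunk_2's still-speculative ending state, which is exactly the second-order validation described above; the round count never exceeds the number of chunks.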
So far, we have explained both path fusion and higher-order speculation. Still, a remaining question is which scheme works the best for a given FSM and its inputs. We address this next.

PARALLELIZATION SCHEME SELECTION
Including the two basic schemes (see Section 2), we have discussed five FSM parallelization schemes in total, denoted as follows:
• B-Enum: basic state enumeration
• B-Spec: basic state speculation
• S-Fusion: state enumeration with static path fusion
• D-Fusion: state enumeration with dynamic path fusion
• H-Spec: higher-order (iterative) speculation
We refer to the last three as augmented schemes. Which scheme works best depends on the characteristics of the FSM and its inputs. Based on their designs, we focus the scheme selection on four key properties: (i) the state convergence rate conv(l), (ii) the speculation accuracy, (iii) the feasibility of generating a static fused FSM, and (iv) the skewness factor skew(l) of the fused FSM. Note that our goal is NOT to precisely model the execution time of each scheme, which could be extremely challenging given the diverse and complex FSM transition behaviors. Instead, we intend to qualitatively reason about the conditions under which each scheme works well in general, and from that draw heuristics to guide the scheme selection.
The decision tree in Figure 15 summarizes the heuristics used for selecting the parallelization scheme. It starts from the most favorable scenario, then moves to the more challenging ones. For speculative parallelization, the most favorable scenario is when the speculation accuracy is high (according to a threshold τ_acc). In this case, B-Spec and H-Spec are the best choices for their negligible overhead (assuming the cost of starting state prediction is negligible) (1). By contrast, even under the most favorable conditions, enumerative schemes still suffer from the overhead of two-pass processing (see Section 2). If the speculation accuracy is not high enough, the next heuristic is to check the state convergence rate conv(l). Even with a low speculation accuracy, H-Spec could still work well as long as the state convergence rate is higher than a threshold (2), thanks to its capability of improving the speculation accuracy (Section 4). Further down the decision tree, if none of the above conditions are met, the next step is to check the feasibility of statically generating a fused FSM. In fact, S-Fusion works the best among the enumerative schemes for its single-path execution and offline fused FSM generation (3). When this condition does not hold either, the last resort is D-Fusion, which works well when the skewness factor skew(l) of the fused FSM is high and the state vector size is small (i.e., high conv(l)). In fact, at this point, the state convergence rate is already unfavorable (see the second heuristic). However, if the combined factor skew(l) × conv(l) is sufficiently high, D-Fusion might become the best option (4). According to their definitions, 1/(skew(l) × conv(l)) = N_uniq × |V|, which captures the major cost of execution for D-Fusion, as shown in its "Cost Analysis" (see Section 3.3).
Finally, if none of the above conditions are met (i.e., the least favorable situation), one may choose among H-Spec, B-Enum, and D-Fusion (5); the best option depends on the specific values of the relevant FSM properties. More detailed performance modeling may help break the tie, but it would come with extra complexity. Considering the cost of collecting the properties, we target the scheme selection for the given FSM and a group of its inputs, rather than a single input. That is, a few training inputs are (randomly) selected to collect the properties offline, based on which a scheme is selected and used online. In fact, we can instrument D-Fusion and B-Spec to collect these properties, so the costs of profiling are only slightly higher than their running times on the training inputs. For situations where input sensitivity is a concern, we can run the given FSM over a tiny portion (say 0.25%) of the actual input, though this incurs a proportional amount of runtime overhead.

EVALUATION
In this section, we evaluate the effectiveness of path fusion and higher-order speculation, as well as the scheme selection heuristics.

Methodology
We implemented the five FSM parallelization schemes summarized in Section 5 in C++ and used Pthreads for their parallel execution. Then, we integrated these five schemes along with the scheme selector into one multi-scheme FSM parallelization framework, called BoostFSM. The memory budget for static fused FSM generation is set to 1GB per FSM, or equivalently 10^6 fused states.
Benchmarks. Table 1 lists the FSM benchmarks used in the evaluation with their relevant properties. The 16 benchmarks are collected from the Snort library [48], a pool of signatures in PCRE format used by state-of-the-art Network Intrusion Detection Systems (NIDS). We converted the signatures into FSMs using an off-the-shelf regex-to-DFA tool [1]. The benchmarks are chosen to cover diverse FSM properties.
The inputs to the FSMs are 20 traces of real-world network traffic collected from a Linux server using tcpdump. Each trace consists of 4×10^8 symbols (i.e., 400MB). For each FSM, five traces are randomly selected and their first 10^6 symbols (i.e., 0.25%) are used to collect the properties in Table 1 offline.
Platform. All experiments were performed on a 64-core machine equipped with a Xeon Phi 7210 processor and 96GB RAM, running Linux 3.10.0. All programs were compiled by GCC 4.8.5 with the "-O3" flag. The timing results reported are the averages of three repeated runs over 20 inputs (unless specified otherwise).
Table 2 reports the speedups of different FSM parallelization schemes using 64 cores over the sequential FSM execution. Note that the sequential FSM execution times are similar across FSMs (the second column), despite their large variation in the number of states (see Table 1). This is because the inputs to different FSMs are of the same size and the frequently accessed state transitions often fit well into the CPU caches. In the following, we first compare the three augmented schemes with each of the two basic schemes, then examine the effectiveness of the scheme selection.

Performance
Static Path Fusion. First, for the benchmarks whose static fused FSMs can be generated (M1, M3-4, M8, M11), S-Fusion significantly raises the speedups compared to B-Enum, from 12.9× to 31.0×. While B-Enum pays the runtime cost of maintaining multiple execution paths, S-Fusion completely avoids such overhead. On the other hand, for those FSMs whose static fused FSMs are too large to generate (i.e., over the memory budget), static path fusion cannot help. In addition, Table 3 reports the sizes of the static fused FSMs and their construction times. More results on the sizes of fused FSMs were presented in Section 3.2 and Figure 9.
Dynamic Path Fusion. The speedups of D-Fusion vary substantially across benchmarks, ranging from 3.6× to 25.5×. For some FSMs (M2, M5-7, M12, M16), D-Fusion performs worse than B-Enum. As discussed in Section 3.3, the efficiency of an FSM execution with D-Fusion depends on the size of the state vector |V| in basic mode and the number of unique fused state transitions N_uniq encountered in the execution, shown in the second and third columns of Table 4, respectively. Note that the state vector size |V| is the number of remaining active states after the path merging phase (see "Optimization" in Section 3.3). The product N_uniq × |V|, which captures the cost in basic mode (see "Cost Analysis" in Section 3.3), roughly inversely aligns with the speedups of D-Fusion. Note that M16 is special in that its |V| drops to one during the path merging phase, so no path fusion is needed.
The 4th column of Table 4 lists the numbers of fused states dynamically generated (ranging from 4 to 1,209). For 10 out of the 16 FSMs, the number of fused states is even smaller than that of the original FSM, showing high space efficiency in practice.
The last four columns of Table 4 report the time breakdown of D-Fusion, where the first three are the time of the merging phase (t_merge), the time spent in basic mode (t_basic), and the time spent in fused mode (t_fused). For most FSMs, t_fused is significantly higher than t_basic, indicating that the FSM runs mostly in the fused mode with a single transition path. M12 is the only FSM for which t_basic is higher than t_fused, which aligns with its high N_uniq: its execution encounters many different fused state transitions, so it has to frequently switch back to the basic mode. The sum of t_merge, t_basic, and t_fused roughly equals the total time of the first pass in a two-pass enumerative scheme (see "State Enumeration" in Section 2). Note that the first pass also includes (partial) fused FSM generation and switches between the two execution modes; however, as the numbers of dynamically generated fused states and transitions are relatively small (compared to the input length), the generation cost is usually negligible, as is the cost of mode switches. The last column of Table 4 reports the time spent in the second pass, which counts the number of accept states encountered during an FSM execution. As the second pass naturally runs in parallel, it shows limited variation across FSMs.
Higher-Order Speculation. As shown in Table 2, H-Spec boosts the speedups from 3.1× (B-Spec) to 19.5× on average. Specifically, H-Spec offers better speedups across all benchmarks except M8 and M16, in which cases both H-Spec and B-Spec work very well (over 36× speedups), with B-Spec showing marginally better speedups. These results are consistent with our earlier discussion: H-Spec performs no worse than B-Spec in principle. The improvements come from two benefits of higher-order speculation: (i) earlier and meaningful validations and (ii) improved speculation accuracy (see Section 4.2). Table 5 reports the speculation accuracies of B-Spec and H-Spec. The initial speculation accuracies of H-Spec (Iteration-1) are the same as those of B-Spec (24% on average). But, over the iterations, as H-Spec introduces new speculated starting states (based on the new ending states of the prior chunks), its speculation accuracy improves quickly. By the third iteration, all benchmarks reach 100% speculation accuracy. On average, it takes 2.1 iterations for H-Spec to complete the processing.
In summary, the augmented schemes substantially boost the speedups over the basic ones. However, their benefits vary across benchmarks. As shown in Table 2, the best schemes (in bold) for the benchmarks scatter across different parallelization schemes, which confirms the need for scheme selection.
Scheme Selection. The FSM properties used for scheme selection are shown in Table 1. Following the heuristics in Section 5, the selector first checks the speculation accuracy against the threshold τ_acc (95%) and finds that only M8 and M16 meet the requirement. Thus, it selects B-Spec for these two FSMs. Then, it checks whether the state convergence rate conv(10^6) is one (i.e., a single current state is left). If so, it chooses H-Spec, which is the case for M2 and M5-7. For the remaining benchmarks, the selector further examines the feasibility of generating a static fused FSM. It obtains positive answers for M1, M3-4, and M11, and thus assigns S-Fusion to them. Finally, the scheme selector compares the combined factor of the skewness factor and the state convergence rate, skew(l) × conv(l), against the threshold (10^-4). The remaining benchmarks that satisfy the requirement (M9 and M13-15) are assigned D-Fusion. At this point, two benchmarks are left: M10 and M12. Since our selector does not reason further based on the specific property values, by default, it chooses B-Enum. The last column of Table 2 shows the results of the scheme selection. Out of 16 cases, it fails to pick the best scheme only for M10. The failure is simply due to the fact that the heuristics stop reasoning about performance at a finer granularity, which could be improved with more detailed performance modeling.

Scalability
In this section, we examine the scalability of different schemes in terms of both the number of cores and the input size.
Varying Core Count. The speedups of the five schemes under different numbers of cores are shown in Figure 16. In general, when the desired properties are present (see Section 5), all five schemes scale well. On the other hand, when the properties are not ideal, some schemes suffer from worse scalability than others. Take B-Spec as an example: when the speculation accuracy drops and the state convergence rate is low, it scales poorly and may even run slower than the serial execution (see the B-Spec curves for M1, M2, and M13), due to its serial validations. Another scheme that may not scale well is D-Fusion, as shown in the case of M2. This is because when the input is partitioned into smaller chunks, the number of unique state vectors may not decrease proportionally, so the overhead becomes relatively higher, compromising the benefits of parallelization. Note that, in some cases, the speedups at 64 cores drop slightly, which is caused by an issue specific to the tested machine.
Varying Input Size. Figure 17 reports the speedups of the five schemes under different input sizes: small (1×10^8), medium (4×10^8), and large (16×10^8). Overall, there is a clear trend across all the parallelization schemes that the speedups improve as the input size increases. This trend reflects Amdahl's law. In our context, the sequential components include thread creation (64 threads), thread synchronization (validations in the speculative schemes or correct-path selection in the enumerative schemes), and I/O operations (printing out results). In addition, with larger inputs, H-Spec may also benefit from better convergence over longer input chunks (see Figure 13). For D-Fusion, as the input chunks become longer, the number of switches between the two modes may become relatively smaller, further improving the performance (as happened for M7).

RELATED WORK
This section summarizes the related work into three categories: speculative, enumerative, and FSM-related parallelization.
By modifying the architecture, thread-level speculation (TLS) [9,29,38,43,54,68] spawns speculative threads along the dynamic execution path of a single-threaded application. For correctness, TLS must isolate the writes of "more speculative" threads from the "less speculative" threads and detect data dependence violations at runtime [38]. As speculation contexts are typically established in a nested way (a speculative thread spawns another speculative thread), such architecture-based TLS schemes are naturally instances of the higher-order speculation defined in this work.
More specifically, the idea of higher-order speculation is akin to some existing ideas, such as speculative data forwarding and eager recovery from misspeculation [37,44,54], as well as parallel ordered commits [18,19]. Given the existence of these ideas, one contribution of this work is to bring them to bear on the scalability problem of parallelizing FSM computations.
Software-based approaches [4,13,25,46,57,58] achieve similar goals with software-managed thread state isolation and runtime data dependence analysis. For example, LRPD [46] speculatively applies privatization and reduction to transform sequential loops into DOALL loops, then validates them with runtime checks. On a failed check, the loops are re-executed sequentially.
In comparison, BOP [13] allows programmers or profiling tools to suggest possibly parallel regions (PPR) and leverages virtual memory (i.e., the process mechanism) to protect the address space. Note that BOP defines a "speculation depth", a concept relevant to our definition of higher-order speculation. The key difference is that, in BOP, the k-th level speculation is checked only after the first k − 1 speculative processes commit, which makes it essentially first-order speculation. Tian and others [57] further separate the speculative state from the non-speculative state and propose a "copy or discard" model to better manage the memory state in software speculation. In its thread execution model, a speculative thread only synchronizes with the non-speculative thread (called the main thread), which makes the solution first-order speculation as well. On the other hand, the above software speculation schemes could also be augmented to support higher-order speculation.
Besides architecture and programming system supports, Prabhu and others [39] propose two new language constructs: speculative composition and speculative iteration, for programmers to express speculative parallelism in programs declaratively.
In parallel discrete event simulation (PDES), several optimistic mechanisms [17], such as time warp with lazy cancellation and lazy rollback, also resemble the basic idea of higher-order speculation.
Enumerative Parallelization. By contrast, there are only a few prior works on enumerative parallelization. One reason could be the infeasibility of enumerating all the cases in general programs. Some early works [3,60] studied the potential of enumerating the different execution paths under control branches. If FSM transitions are hardcoded, rather than stored in a transition table, enumerative FSM parallelization would be similar to branch enumeration. In comparison, the N-way programming model [10] enumerates different algorithms or implementations of the same task and selects the one that finishes earliest. In more specific application areas, Maleki and others [31] leverage the rank convergence property of dynamic programming to enable coarse-grained parallelization, which can be viewed as a form of state enumeration. Similarly, Raychev and others [47] use symbolic execution to parallelize user-defined aggregations in big data frameworks, where a symbolic value is an abstraction of all the enumerated cases. More related to FSMs, there is a series of works [21,22,35] on the enumerative parallelization of pushdown automata, which consist of an FSM and a stack, for processing semi-structured data like XML and JSON.
Other FSM-related Parallelization. In addition to the related FSM parallelization work mentioned in Section 2, there are other works on parallel FSM computations. In particular, various hardware FSM implementations, usually in the form of an NFA-like automaton, have been proposed, such as automata processors [14,55,61], cache automata [56], and FlexAmata [51]. In comparison, other works use GPUs to accelerate FSM computations [8,30,34,59,65,69]. Like the hardware FSMs, they also mostly focus on NFAs rather than DFAs for better space efficiency. In a recent work [63], Xia and others propose reduction-style validations to address the scalability limitation of speculative FSM parallelization on GPUs, which is essentially also a form of higher-order speculation. Besides conventional character-by-character FSM processing, some works leverage bitwise parallelism and/or SIMD operations to accelerate FSM-related applications [7,20,28] or model bitwise/SIMD applications as FSMs to enable (speculative) parallelization [40].
Besides parallelization, the state convergence property of FSMs also makes their executions more tolerant of errors when executed in unreliable environments [50].

CONCLUSION
This work targets the scalability issues inherent in the two basic FSM parallelization schemes: (i) the cost of maintaining multiple execution paths in enumerative parallelization and (ii) the serial chunk-by-chunk validations in speculative parallelization. For the former, we propose path fusion, which fuses different execution paths into a single one, either statically or dynamically, to lower the runtime cost of enumerative parallelization. For the latter, we introduce higher-order speculation, which allows a speculated state to be validated speculatively to enable additional parallelism and improve the speculation accuracy. Furthermore, we present a set of heuristics to help select the parallelization scheme in practice. Finally, we evaluate the proposed techniques using real-world FSMs with diverse properties. The results confirm the effectiveness of the proposed techniques, substantially raising the speedups for a spectrum of FSM benchmarks on parallel processors.

ACKNOWLEDGMENTS
We thank all anonymous reviewers for their constructive comments and our shepherd Dr. Guy Steele for his time and efforts in helping with the paper revision. This material is based upon work supported by the National Science Foundation under Grant No. 1565928 and 1751392. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

A ARTIFACT APPENDIX

A.1 Abstract
This artifact contains the source code of BoostFSM, including the five FSM parallelization schemes discussed in our paper and some benchmarks along with their inputs used for evaluation. In addition, this artifact provides bash scripts to compile the source code and reproduce the key experimental results reported in the paper.
Considering the software dependencies, a software environment with Linux CentOS 7 or another similar Linux distribution, GCC, Bash, Pthreads, CMake, and the Boost library is needed before the evaluation. Moreover, to reproduce all the results reported in the paper, especially the speedup comparison and the scalability analysis, the artifact needs to run on an Intel Xeon Phi processor (Knights Landing/KNL).

A.2 Artifact Check-List (Meta-Information)

A.3.2 Hardware Dependencies. We recommend evaluating the artifact on an Intel Xeon Phi machine (an Intel Xeon Phi 7210 at 1.3GHz in particular) to reproduce the results reported in the paper, but it can also be compiled and run on other Linux machines (yielding different results). At least 20GB of disk space is needed (mainly for decompressing the data sets).

A.3.3 Software Dependencies. We recommend running the artifact on CentOS 7, but other similar Linux distributions should also work. To compile and run the source code with the scripts, users need GCC 4.8.5, CMake 2.8, and the Boost 1.66.0 library (or later versions).
A.3.4 Data Sets. The benchmarks are collected from an open-source network intrusion detection system (Snort), which provides a pool of signatures in PCRE format. The evaluated FSMs are converted from the signatures using a regular-expression-to-DFA tool. The corresponding data sets are included in this artifact for testing. They are traces of network traffic collected from a Linux server using tcpdump and zipped into the artifact file. There are 20 inputs in total, each about 400MB in size.

A.4 Installation
Please ensure the software dependencies are met before evaluating the artifact. Users need to download the source code and scripts, which are zipped into ASPLOS21_AE.zip on Zenodo. There is a script compile.sh under the directory ASPLOS21_AE/ which can be used to compile the source code and generate the executables (please run the command bash compile.sh).

A.5 Experiment Workflow
We have provided a script run.sh to generate all the results in one step, but we also support flexible evaluation and manual testing. The total evaluation time of this artifact is about 4 hours (on the recommended KNL machine).
To generate the results in Tables 1, 2, 3, 4, and 5, and Figure 16, users can run the corresponding commands in run.sh and the scripts under ASPLOS21_AE/scripts/. For Figure 17, users can repeat the evaluation of Table 2 over inputs of different sizes.

A.6 Evaluation and Expected Result
Results will be printed to the command console after the evaluation for a table or a figure finishes. Following the experiment workflow, users first obtain the properties of the evaluated FSMs, then the speedup comparison results among the different schemes in BoostFSM. After that, the path fusion statistics and the speculation accuracies of B-Spec and H-Spec are produced. Finally, users obtain the scalability results (i.e., the speedup curves) reported in Figure 16.

A.7 Experiment Customization
Please follow commands in the compilation and execution scripts to customize the testing. For example, to test the scalability of different parallelization schemes, users can follow the commands in ASPLOS21_AE/scripts/GetFigure16.sh.