Evaluating Linear XPath Expressions by Pattern-Matching Automata

Abstract: We consider the problem of efficiently evaluating a large number of XPath expressions, especially in the case when they define subscriber profiles for filtering of XML documents. For each document in an XML document stream, the task is to determine those profiles that match the document. In this article we present a new general method for filtering with profiles expressed by linear XPath expressions with child operators (/), descendant operators (//), and wildcards (*). This new filtering algorithm is based on a backtracking deterministic finite automaton derived from the classic Aho–Corasick pattern-matching automaton. This automaton has a size linear in the sum of the sizes of the XPath filters, and the worst-case time bound of the algorithm is much less than the time bound of the simulation of linear-size nondeterministic automata. Our new algorithm has a predecessor that can handle child and descendant operators but not wildcards, and has been shown to be extremely efficient when a document-type definition (DTD) has been used to prune out all the wildcards and most of the descendant operators. But in some cases, such as when the DTD is highly recursive, it may not be possible to prune out all wildcards without producing too large a set of filters. Then it is important to have the full generality of an evaluation algorithm, as presented in this article, that can also handle wildcards.


Introduction
In a publish-subscribe system based on XML filtering, subscribers usually specify their profiles by filters written in the XPath language. The system processes a stream of published XML documents and delivers to subscribers those documents that match the corresponding profiles. The number of subscribers can be large, thousands or even millions; thus the scalability of the filtering system is critical.
The primary problem addressed in this article is the filtering problem for XML streams: given a set of XPath expressions and a stream of XML documents, determine for each document in the stream those expressions that match the document. More specifically, we study the filtering problem for linear XPath expressions, that is, XPath expressions that do not have branches in their parse tree. Linear XPath expressions without predicates are defined by the following grammar:

$$P \to /E \mid //E, \qquad E \to \mathit{label} \mid * \mid E/E \mid E//E,$$

where label denotes an XML-element label.
Several approaches to XML filtering with XPath-defined profiles use a finite automaton as a basis of the filtering algorithm [Altinel and Franklin 2000, Diao et al. 2003, Green et al. 2004, Gupta and Suciu 2003, Onizuka 2003]. Diao et al. [2003] report an evaluation method called YFilter that evaluates nondeterministic finite automata (NFAs) constructed from the filters. YFilter uses a single NFA that combines the effect of the individual NFAs and achieves considerable improvements in performance by path sharing. In other words, YFilter merges states that correspond to common prefixes in different query paths, while still retaining the linear size of the NFA with respect to the filter descriptions.
Contrary to YFilter, which uses an NFA of a size linear in the total size of the filters, the algorithm of Green et al. [2004] is based on a deterministic finite automaton (DFA). The state explosion of the DFA is avoided by constructing the DFA lazily, as needed, while input documents are being filtered: if, in processing the stream of XML documents, no next state is defined on the current input symbol, the corresponding new state is computed and the process continues at this new state. While exponential in the worst case, this approach works well in many cases, namely when the incoming XML documents obey a schema or DTD (see e.g. Kilpeläinen and Wood [2001]) that is non-recursive or contains only simple cycles (a cycle is simple if its nodes are not contained in other cycles).
The filtering methods outlined above are general in the sense that input documents do not need to obey a specific schema or DTD, but are only required to conform with the XML syntax. If it is known that the documents to be filtered are produced according to a DTD, a natural question to ask is whether or not the filtering process can be sped up by using this knowledge. This question has been answered positively both in the case of YFilter [Silvasti et al. 2009a] and in the case of the lazy DFA construction [Silvasti et al. 2009b].
The optimization method of Silvasti et al. [2009a], called filter pruning, takes as input a DTD and a set of linear XPath filters and produces a set of "pruned" linear XPath filters. In the pruned filters, wildcards (*) and descendant operators (//) are replaced (or pruned) by single symbols and symbol strings, respectively, that are allowed in place of "*" and "//".
In this article we present yet another automaton-based general algorithm for filtering with profiles expressed by linear XPath expressions. This new filtering algorithm is based on a backtracking deterministic finite automaton derived from the classic Aho-Corasick pattern-matching automaton [Aho and Corasick 1975]; it is called the pattern-matching-automaton-based (or PMA-based) filtering algorithm. The size of the PMA is linear in the sum of the sizes of the filters, and the worst-case time bound of the algorithm is much less than the time needed for simulating nondeterministic automata, which is the worst-case time bound of YFilter [Diao et al. 2003]. Thus, as far as the worst-case time bound is concerned, the PMA-based filtering is superior both to YFilter and to the lazy DFA algorithm [Green et al. 2004]; the latter has the worst-case space and time bound $\Omega(2^k)$, where $k$ denotes the number of filters.
Our new algorithm and its predecessor [Silvasti et al. 2009a] work in the same way if only child and descendant operators are present. This implies that our new PMA-based filtering algorithm will be very efficient, in the same way as its predecessor [Silvasti et al. 2009a], when wildcards and descendant operators can be pruned out.
This article is organized as follows. In Section 2 we present the basics of automata-based filtering, and prove the optimal time complexity of the PMA-based filtering algorithm when only child operators, in addition to a leading descendant operator, are allowed. Section 3 is devoted to the analysis of a new algorithm for multiple-pattern matching when wildcards are allowed in the patterns. Algorithms are presented both for pattern matching of linear text and for filtering of XML documents. In Section 4 we present and analyze our algorithm that, using the results of the previous sections, solves the XML filtering problem for full linear XPath expressions (without predicates).

Automaton-Based Filtering
A linear XPath expression (without predicates) is defined as a sequence of XML-element labels or wildcards separated by child (/) or descendant (//) operators; in other words, a sequence of the form

$$op_1\, l_1\, op_2\, l_2 \cdots op_n\, l_n \qquad (1)$$

where each $op_i$ is "/" or "//", and each $l_i$ is an XML-element label or a wildcard "*".
A filter is a union of sequences of form (1).
The filtering problem can now be expressed as a language recognition problem over alphabet $\Sigma$, the set of XML element labels that possibly occur in the input documents. Let $F = \{f_1, \ldots, f_k\}$ be a set of filters $f_i$, each of which is a union of sequences of form (1). Moreover, let $\hat f_i$ denote the union of languages $\hat S$ obtained from sequences $S$ of the form (1) as follows: (i) substitute $\Sigma^*$ for occurrences of "//"; (ii) substitute $\Sigma$ for occurrences of "*"; (iii) interpret occurrences of "/" as string-catenation operators (and hence omit them); (iv) finally append $\Sigma^*$ at the end of $\hat S$.
Then construct a DFA $D$ that accepts the language $\hat F = \hat f_1 \cup \ldots \cup \hat f_k$ in such a way that, for each $w$ in $\hat F$, $D$ also reports all those filters $f_i$ for which the language $\hat f_i$ contains $w$. Such a DFA can be constructed by simply marking final state $q$ by index $i$ whenever $q$ was set as a final state because of the acceptance of a string in $\hat f_i$. However, such a DFA containing different types of final states easily becomes exponential in size, because in the worst case there is a final state for each different subset of $\{\hat f_1, \ldots, \hat f_k\}$.
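The substitution rules (i) to (iv) can be illustrated with a small Python sketch that turns one sequence of form (1) into a regular expression over root-to-node label paths. This is only an illustration of the language $\hat S$, not the filtering method of this article; the function names `filter_to_regex` and `path_matches`, and the encoding of a path as labels joined by "/", are assumptions made for the example.

```python
import re

ANY_LABEL = r"/[^/]+"   # one arbitrary element label, playing the role of Sigma


def filter_to_regex(xpath):
    """Translate a linear XPath sequence into a regex over '/'-joined label paths."""
    parts = []
    for tok in re.findall(r"//|/|[^/]+", xpath):
        if tok == "//":
            parts.append(f"(?:{ANY_LABEL})*")    # rule (i): Sigma* for '//'
        elif tok == "/":
            continue                              # rule (iii): '/' is plain catenation
        elif tok == "*":
            parts.append(ANY_LABEL)               # rule (ii): Sigma for '*'
        else:
            parts.append("/" + re.escape(tok))    # a concrete element label
    parts.append(f"(?:{ANY_LABEL})*")             # rule (iv): trailing Sigma*
    return re.compile("".join(parts))


def path_matches(xpath, labels):
    """True if the label path (root first) belongs to the language of the filter."""
    path = "".join("/" + label for label in labels)
    return filter_to_regex(xpath).fullmatch(path) is not None
```

For instance, `path_matches("//a/*/c", ["x", "a", "b", "c", "d"])` holds, whereas the path `["a", "c"]` is rejected because the wildcard must consume exactly one element.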
When processing an XML stream the input document will first be parsed by a SAX parser, which produces a parse tree from the document. In filtering, each path from the root to a leaf in this parse tree is fed to the DFA $D$, and at the end of the path the accepted filters are reported. It is not necessary to completely construct the parse trees; instead, the evaluation can be performed in conjunction with parsing. A stack of states needs to be maintained in order to backtrack correctly when several paths have the same prefix.
SAX events are processed as follows. (See e.g. the article by Green et al. [2004] for a more detailed description.) When the beginning of an XML element is encountered, the current state $q$ is pushed onto the stack and the new current state is the one accessed from $q$ on the element label. When the end of an XML element is encountered, the state on top of the stack is popped and set as the new current state.
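The event handling just described can be sketched with Python's standard `xml.sax` module. The class name `StackFilterHandler`, the toy transition function `delta` (for two hypothetical filters //a and //b), and the helper `run_filter` are assumptions made for this illustration; any deterministic transition function could be plugged in.

```python
import xml.sax


class StackFilterHandler(xml.sax.ContentHandler):
    """Runs a deterministic automaton over every root-to-node path of a document."""

    def __init__(self, delta, initial, accepted):
        super().__init__()
        self.delta = delta          # delta(state, label) -> next state
        self.accepted = accepted    # accepted(state) -> set of filter numbers
        self.state = initial
        self.stack = []
        self.matched = set()

    def startElement(self, name, attrs):
        self.stack.append(self.state)             # remember where to backtrack to
        self.state = self.delta(self.state, name)
        self.matched |= self.accepted(self.state)

    def endElement(self, name):
        self.state = self.stack.pop()             # backtrack to the parent's state


# Toy automaton (assumption): a state is the set of filters matched on the
# current path, for the two filters 1 = //a and 2 = //b.
LABEL_TO_FILTER = {"a": 1, "b": 2}


def delta(state, label):
    hit = LABEL_TO_FILTER.get(label)
    return state if hit is None else state | {hit}


def run_filter(xml_bytes):
    handler = StackFilterHandler(delta, frozenset(), lambda state: state)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matched
```

Because every filter language ends with $\Sigma^*$, reporting at each start-element event yields the same set of filters as reporting only at the leaves.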
As an example, consider the set of three filters $f_1 = //a_1$, $f_2 = //a_2$, $f_3 = //a_3$. When considered as a language recognition problem, filtering with respect to these three filters can be done by a DFA that accepts the language $L = L_1 \cup L_2 \cup L_3$, where $L_i$ denotes the language $\Sigma^* a_i \Sigma^*$, and decides by final states which of the three filters are matched. The minimized DFA is shown in Fig. 1.
At final state $i$, $1 \le i \le 3$, only language $L_i$ is recognized; at final states 4, 5 and 6, the sets $\{L_1, L_2\}$, $\{L_1, L_3\}$, and $\{L_2, L_3\}$, respectively; and at final state 7, the set $\{L_1, L_2, L_3\}$. Note that the DFA of Fig. 1 cannot be further minimized, because the final states all accept different subsets of $\{L_1, L_2, L_3\}$.
In general, $\Omega(2^k)$ states are included in the minimal DFA that solves the filtering problem for $k$ (different) keyword sets, that is, when each filter is a union of sequences of the form $//a_1/a_2/\ldots/a_p$, where $a_1 a_2 \ldots a_p$ is called a keyword. When filters are composed of keyword sets, the filtering problem can be solved efficiently by using the pattern-matching automaton by Aho and Corasick [1975].
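The exponential growth can be made concrete with a short Python sketch that enumerates the reachable states of the subset-tracking DFA for the filters $//a_1, \ldots, //a_k$. The function name `reachable_states` and the representation of a state as the set of filters matched so far are assumptions of this illustration.

```python
def reachable_states(k):
    """States of the DFA for filters //a_1, ..., //a_k.

    A state is modelled as the frozenset of filter numbers matched so far;
    labels other than a_1..a_k leave the state unchanged and are omitted.
    """
    start = frozenset()
    seen = {start}
    frontier = [start]
    while frontier:
        q = frontier.pop()
        for i in range(1, k + 1):      # reading label a_i adds filter i
            nxt = q | {i}
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

For $k = 3$ this yields 8 states, namely the initial state and the seven final states of Fig. 1, whereas a PMA for the same three one-symbol keywords needs only $k + 1 = 4$ states.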
The use of this automaton is based on the acceptance by an output function: if in a state $q$ filter $f_i$ is recognized (that is, a string in $\hat f_i$ is accepted at $q$), then output($q$) is defined to contain $i$. An array result, indexed by filter numbers, is used to store information about matched keywords in the filters. If the pattern-matching automaton has recognized a keyword in filter $f_i$, then result[$i$] is set to 1 (initially result[$i$] = 0 for all filters $f_i$). When the input document has been processed, the result of the filtering can be read from the array result.
The basic version of the pattern-matching automaton (PMA) as defined by Aho and Corasick [1975] is composed of the goto and failure functions that dictate the next-state transitions when processing the input. For each prefix $y$ of some keyword, the PMA has a unique state, denoted state($y$), different from all state($y'$), where $y' \ne y$. The state state($\varepsilon$), where $\varepsilon$ is the empty string, is the initial state of the PMA. Clearly, the number of states in the PMA is at most $\|F\| + 1$, where $\|F\|$ denotes the size of the filter set $F$ (composed of keyword sets), that is, the sum of the lengths of all keywords in $F$.
The goto function of the PMA is defined by the equation goto(state($y$), $a$) = state($ya$), where $ya$ is a prefix of some keyword and $a$ is an element in $\Sigma$. The fail function of the PMA is defined by the equation fail(state($uv$)) = state($v$), where $uv$ is a prefix of some keyword and $v$ is the longest proper suffix of $uv$ such that $v$ is also a prefix of some keyword. For a nonnegative integer $k$, we denote by fail$^k$ the fail function applied $k$ times: fail$^0(q) = q$, and fail$^{k+1}(q)$ = fail(fail$^k(q)$).
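The two defining equations can be turned into code along the lines of the classic breadth-first construction of Aho and Corasick. The following Python sketch is a minimal illustration; the function name `build_pma`, the numbering of states by integers, and the auxiliary list `string_of` (mapping each state to string($q$)) are assumptions of the example.

```python
from collections import deque


def build_pma(keywords):
    """Build goto and fail for a set of keywords (tuples of element labels).

    State 0 is state(empty string); string_of[q] is the prefix spelled by q.
    """
    goto = [{}]
    string_of = [()]
    for kw in keywords:
        q = 0
        for a in kw:
            if a not in goto[q]:                  # create state(ya) on demand
                goto.append({})
                string_of.append(string_of[q] + (a,))
                goto[q][a] = len(goto) - 1
            q = goto[q][a]

    fail = [0] * len(goto)
    queue = deque(goto[0].values())               # depth-1 states fail to state 0
    while queue:
        q = queue.popleft()
        for a, child in goto[q].items():
            f = fail[q]
            while f != 0 and a not in goto[f]:    # follow fail arcs of the parent
                f = fail[f]
            fail[child] = goto[f].get(a, 0)       # longest proper suffix that is a prefix
            queue.append(child)
    return goto, fail, string_of
```

For the keywords abc, bc, and c, the state for abc fails to the state for bc, which in turn fails to the state for c.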
For any state $q$ we denote by string($q$) the unique string $y$ with state($y$) = $q$. In our filtering application, the output function of the PMA is defined by setting for each state $q$:

$$\mathit{output}(q) = \{\, i \mid \mathit{string}(q) = uv, \text{ where } v \text{ is a keyword in filter } f_i \,\} \qquad (2)$$

A straightforward explicit implementation of the output function would make its size quadratic. Consider, for example, the filter set whose PMA is depicted in Fig. 2: the size of the filter set is $3n$, but the size of the output function is $n^2 + 2n + 1$.
To circumvent this undesirable size growth we store the output sets as linked lists in the following way. First we define

$$\textit{direct output}(q) = \{\, i \mid \mathit{string}(q) \text{ is a keyword in filter } f_i \,\}$$

and a way to reach all nonempty direct output sets without also passing through the empty ones. The function output fail is defined by setting for state $q$: output fail($q$) = fail$^k(q)$, if $k$ is the greatest integer less than or equal to the length of string($q$) such that direct output(fail$^m(q)$) is empty for all $m = 1, \ldots, k - 1$. We have:

Lemma 1. For any state $q$, output($q$) = direct output($q$) $\cup$ output(output fail($q$)), and the representation of the output function as linked lists of direct output sets is of size $O(\|F\|)$.
Proof. A straightforward induction on the length of the path from state $q$ consisting of the output fail arcs. The representation of the output function as linked lists is obtained by linking the direct output sets together in the prescribed order.
The PMA as constructed above can be used as a filtering machine exactly as a DFA. Because of the fail arcs (of which there is at most one per state), it must be prescribed that fail arcs are applied only when goto arcs cannot be applied. For an input of size $n$, at most $2n$ transitions are performed.
When processing a stream of XML documents, the current state is pushed onto the stack if a start-element tag is encountered and a goto arc on the element can be applied, but not when a fail arc is used. When an end-element tag is encountered, it is discarded, the next element is scanned, and the state on top of the stack is taken as the new current state.
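The linked-list representation can be sketched in Python as follows. The tables below are written by hand for the hypothetical filters 1 = //a/b/c, 2 = //b/c, 3 = //c (states 0 to 6 spell the prefixes empty, a, ab, abc, b, bc, c); the function names are assumptions of the example. For brevity, `compute_output_fail` walks the fail chain separately for each state; a linear-time construction would instead reuse the already computed value of the fail state in breadth-first order.

```python
# Hand-written PMA tables (assumption): fail[q] for states 0..6.
FAIL = [0, 0, 4, 5, 0, 6, 0]
DIRECT_OUTPUT = {3: {1}, 5: {2}, 6: {3}}   # string(q) is itself a keyword of filter i


def compute_output_fail(fail, direct_output):
    """output_fail[q]: first state on the fail chain of q with a nonempty
    direct output set, or the initial state 0 if there is none."""
    output_fail = [0] * len(fail)
    for q in range(1, len(fail)):
        f = fail[q]
        while f != 0 and not direct_output.get(f):
            f = fail[f]
        output_fail[q] = f
    return output_fail


def output(q, direct_output, output_fail):
    """Collect output(q) by following the list of output fail links."""
    result = set()
    while q != 0:
        result |= direct_output.get(q, set())
        q = output_fail[q]
    return result
```

Only the three nonempty direct output sets are stored, yet the state for abc still reports all three filters by following two output fail links.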
Theorem 1. Let $f_1, f_2, \ldots, f_k$ be filters each of which is composed of a set of keywords. Then the filtering problem for filters $f_1, f_2, \ldots, f_k$ can be solved in time $O(n + m)$, where $n$ is the length of the input document (counted as the number of XML elements) and $m$ is the sum of the sizes of the filters.
Proof. We construct the PMA from the keywords in $F = f_1 \cup \ldots \cup f_k$ as described above. If a keyword of $f_i$ leads from the initial state to state $q$, the set direct output($q$) is set to include $i$. By the results of Aho and Corasick [1975] and by Lemma 1, this construction takes time linear in the size of $F$.
Whenever, during the processing of an input, a state $q$ with a nonempty output set is entered, we set result[$i$] ← 1 for all $i \in$ output($q$), after which output($q$) is set to empty. Once processed, output($q$) can be set to empty, because in the filtering application we do not want to determine all occurrences of keywords but only the first. Setting output($q$) to empty is necessary, because otherwise we would not obtain the desired time bound $O(n + m)$ but would have to add a term denoting the number of keyword occurrences.
When the process has finished, the question of whether or not the input $x$ matches filter $f_i$ (that is, $x \in \hat f_i$) is answered by checking result[$i$]: $x \in \hat f_i$ if and only if result[$i$] = 1.
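The procedure used in the proof can be sketched in Python on a document given as a nested tuple `(label, [children])`. The PMA tables are written by hand for the hypothetical filters 1 = //a/b and 2 = //b (states 0 to 3 spell the prefixes empty, a, ab, b), the output sets are stored explicitly rather than as linked lists, and the recursion stack stands in for the explicit state stack; all of these are simplifying assumptions of the illustration.

```python
GOTO = {0: {"a": 1, "b": 3}, 1: {"b": 2}, 2: {}, 3: {}}
FAIL = {0: 0, 1: 0, 2: 3, 3: 0}
OUTPUT = {2: {1, 2}, 3: {2}}     # explicit output sets (assumption, for brevity)


def filter_tree(tree, goto, fail, output):
    """Return the set of filter numbers matched by some root-to-node path."""
    output = {q: set(s) for q, s in output.items()}   # private copy, emptied as we go
    result = set()

    def visit(node, q):
        label, children = node
        while q != 0 and label not in goto[q]:   # fail arcs only when goto is undefined
            q = fail[q]
        q = goto[q].get(label, 0)
        if output.get(q):
            result.update(output[q])
            output[q].clear()                    # report each filter only once
        for child in children:
            visit(child, q)                      # returning here restores the parent's state

    visit(tree, 0)
    return result
```

In a document where b is a child of a, both filters are reported; if a and b are siblings, only filter 2 is, because the state reached on a is discarded when backtracking to their common parent.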
As an example, Fig. 3 (a) shows the PMA obtained from the keywords of the filters $f_1 = //a_1$, $f_2 = //a_2$, $f_3 = //a_3$. This PMA is augmented with the output function output($q$) that contains $i$ if $a_i$ has been read at $q$.
The failure transitions of the PMA can be eliminated by using the next-move function of a DFA in place of the goto and fail functions:

$$\textit{next move}(q, a) = \mathit{goto}(\mathit{fail}^k(q), a),$$

where $k$ is the least nonnegative integer for which goto(fail$^k(q)$, $a$) is defined, that is, the state fail$^k(q)$ has a goto transition on element $a$. Even though the DFA has exactly as many states as the PMA of Fig. 3 (a), the number of its (next-move) transitions is quadratic in the size of the keyword set.
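The elimination of fail arcs can be sketched in Python as follows. The tables are written by hand for the hypothetical keywords ab and b (states 0 to 3 spell the prefixes empty, a, ab, b). The sketch also assumes the usual Aho-Corasick convention that the initial state has a goto transition to itself on every symbol for which no other goto arc exists; the function names are illustrative.

```python
GOTO = {0: {"a": 1, "b": 3}, 1: {"b": 2}, 2: {}, 3: {}}
FAIL = {0: 0, 1: 0, 2: 3, 3: 0}


def next_move(goto, fail, q, a):
    """goto(fail^k(q), a) for the least k for which the goto transition exists."""
    while q != 0 and a not in goto[q]:
        q = fail[q]
    return goto[q].get(a, 0)      # assumption: state 0 loops on undefined symbols


def build_dfa(goto, fail, alphabet):
    """Tabulate next_move for every state and every symbol of the alphabet."""
    return {(q, a): next_move(goto, fail, q, a) for q in goto for a in alphabet}
```

Here the PMA has three goto arcs and three nontrivial fail arcs, while the tabulated DFA over the alphabet {a, b, c} has 4 × 3 = 12 transitions, which shows where the extra space goes.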
If filters are composed of sequences of several keywords, so that they may also contain non-leading descendant operators "//", it is still possible to base the filtering algorithm on matching of keywords. For filter $i$ that consists of a sequence of $l$ keywords, a possible match has been found if $j \in$ result[$i$] for all $j = 1, \ldots, l$. Such a possible match is not always an exact match, because the matched keywords may not appear on the same path, nor in the specified order.
To find out which possible matches of multi-keyword filters are true matches, the input document must be filtered through some more general filtering algorithm, such as the one that will be presented in the following sections.
Another problem to be solved is that the multi-pattern matching algorithm of Aho and Corasick [1975] cannot handle keywords with wildcards. In the next section, we will present a new algorithm for multiple-pattern string matching when wildcards can appear in the patterns.

Multiple-Pattern Matching with Wildcards
It is typical that XPath expressions describing XML filters contain wildcards (*). As our intention is to derive a backtracking version of the multiple-pattern matching algorithm of Aho and Corasick [1975] for XML filtering, we first need to extend the basic algorithm to handle patterns with wildcards.
In Section 2 we considered the problem of matching filters that are keyword sets, but now we allow the filters to be composed of sets of sequences, called patterns (with wildcards), of the form

$$\ast^{n_1} w_1 \ast^{n_2} w_2 \cdots \ast^{n_k} w_k \qquad (5)$$

where $w_1, w_2, \ldots, w_k$ are keywords in $\Sigma^+$, $k \ge 1$, $n_1$ is a nonnegative integer, and $n_2, \ldots, n_k$ are positive integers.
In our algorithms, the length $m_i$ of pattern $P_i$ is denoted by length($i$), and the number of keywords of pattern $P_i$ is denoted by #keywords($i$). For pattern $P_i$ and the $k$th keyword $w_k$ in $P_i$, length($i, k$) gives the length of the keyword, and distance($i, k$) gives the distance of the keyword from the beginning of $P_i$, that is,

$$\mathit{distance}(i, k) = \sum_{j=1}^{k} n_j + \sum_{j=1}^{k-1} \mathit{length}(i, j),$$

where $n_j$ is the length of the wildcard string preceding the $j$th keyword in $P_i$.
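The quantities length($i, k$) and distance($i, k$) can be computed in one scan of the pattern, as the following Python sketch shows. The function name `decompose` and the representation of a pattern as a list of symbols with "*" for the wildcard are assumptions of the example, as is the reading of distance as the number of symbols preceding the keyword in the pattern.

```python
def decompose(pattern):
    """Split a pattern into its keywords.

    pattern: list of symbols, where '*' is the wildcard.
    Returns a list of (keyword, length, distance) triples, where distance is
    the number of pattern symbols that precede the keyword.
    """
    keywords = []
    pos = 0
    while pos < len(pattern):
        if pattern[pos] == "*":          # skip a run of wildcards
            pos += 1
            continue
        start = pos
        while pos < len(pattern) and pattern[pos] != "*":
            pos += 1                     # extend the current keyword
        keywords.append((tuple(pattern[start:pos]), pos - start, start))
    return keywords
```

For the hypothetical pattern a b * * c the two keywords are ab (length 2, distance 0) and c (length 1, distance 4).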
For solving the multiple-pattern matching problem with wildcards we construct the PMA as defined in Section 2 from the set of all keywords that appear in the patterns. This idea was previously used to solve the single-pattern matching problem [Pinter 1985] (see also the articles by Rahman et al. [2006] and Rahman and Iliopoulos [2007]). However, the algorithms presented in these articles cannot be directly extended to yield an efficient solution to the multiple-pattern matching problem.
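The output function of the PMA is now defined as:

$$\mathit{output}(q) = \{\, (i, k) \mid \mathit{string}(q) = uv, \text{ where } v \text{ is the } k\text{th keyword of pattern } P_i \,\}.$$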
A pair $(i, k)$ in output($q$) is called an output tuple for state $q$; the tuple signals the recognition at state $q$ of the $k$th keyword of pattern $P_i$; this keyword is a suffix of string($q$). The basic idea of the algorithm is to collect, for each pattern $P_i$ having more than one keyword, partial matches of $P_i$ that represent matches of maximal prefixes of $P_i$ found thus far. When a match up to and including the last keyword of $P_i$ has been found, we have a full match of the pattern.
Partial matches of pattern $P_i$ are stored in a set partial matches($i$) that contains pairs of the form $(p, k)$, meaning that, starting at element position $p$ in the input text, an occurrence of the prefix of $P_i$ up to and including the $k$th keyword has been found. The algorithm simulates the PMA, and at state $q$, for all output tuples $(i, k)$ with #keywords($i$) > 1, the set partial matches($i$) is updated:
(i) If $k = 1$, a new possible start of $P_i$ is recorded by inserting $(p, 1)$ into partial matches($i$), where $p$ is the element position of the start of the wildcard string preceding the keyword recognized at state $q$, in other words,

$$p = \textit{element count} - \mathit{distance}(i, 1) - \mathit{length}(i, 1) + 1,$$

where element count gives the number of elements scanned thus far.
(ii) If $k > 1$ and partial matches($i$) contains $(p, k - 1)$ with

$$p = \textit{element count} - \mathit{distance}(i, k) - \mathit{length}(i, k) + 1,$$

then this partial match $(p, k - 1)$ obtained thus far can be extended to also include the $k$th keyword. This is done by replacing $(p, k - 1)$ by $(p, k)$ in the set partial matches($i$) (when $k <$ #keywords($i$)) or by deleting $(p, k - 1)$ (when $k =$ #keywords($i$)).
When k reaches #keywords(i), then, instead of recording a partial match (p, k) in partial matches(i), we record a full match of pattern P i by inserting p into the set matches(i).
In order to maintain the sets partial matches($i$) efficiently, we store them as balanced binary search trees. In each set partial matches($i$) there is at most one pair $(p, k)$ for any given $p$, and thus we may use $p$ as a unique search key for the structure. Moreover, we also keep a separate finger always pointing to the smallest element of the structure.
When searching the structure in order to replace pair $(p, k - 1)$ by pair $(p, k)$ we also check the element $(p', k')$ pointed to by the finger. If here $p' <$ element count − length($i$), we know that $(p', k')$ can never be extended to a full match and can therefore be deleted. The deletion is repeated until the finger points to an element $(p', k')$ where $p' \ge$ element count − length($i$). In this way we can keep the sets partial matches($i$) small, namely, containing at most length($i$) elements.
The operation cycle of the PMA is presented as Algorithm 1, and the procedure check output that maintains the sets partial matches(i) is presented as Algorithm 2.
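The interplay of the PMA simulation and the maintenance of partial matches can be sketched in Python as follows. This is an illustrative sketch under several simplifying assumptions, not a transcription of Algorithms 1 and 2: the sets partial matches($i$) are plain dictionaries instead of balanced search trees with a finger (so stale entries are not pruned), output sets are stored explicitly instead of as linked lists, positions are counted from 1, and a match is discarded if its leading wildcards would start before the text. All function names are hypothetical.

```python
from collections import deque


def decompose(pattern):
    """Keywords of a pattern ('*' = wildcard) as (keyword, distance) pairs."""
    out, pos = [], 0
    while pos < len(pattern):
        if pattern[pos] == "*":
            pos += 1
            continue
        start = pos
        while pos < len(pattern) and pattern[pos] != "*":
            pos += 1
        out.append((tuple(pattern[start:pos]), start))
    return out


def build(patterns):
    """PMA over all keywords; out[q] holds output tuples (i, k), k counted from 1."""
    goto, out, info = [{}], [set()], []
    for i, pat in enumerate(patterns):
        kws = decompose(pat)
        info.append([(len(w), d) for w, d in kws])     # (length, distance) per keyword
        for k, (w, _) in enumerate(kws, start=1):
            q = 0
            for a in w:
                if a not in goto[q]:
                    goto.append({})
                    out.append(set())
                    goto[q][a] = len(goto) - 1
                q = goto[q][a]
            out[q].add((i, k))
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        q = queue.popleft()
        for a, child in goto[q].items():
            f = fail[q]
            while f != 0 and a not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(a, 0)
            out[child] |= out[fail[child]]             # explicit output sets, for brevity
            queue.append(child)
    return goto, fail, out, info


def find_matches(patterns, text):
    """Start positions (1-based) of all occurrences of each pattern in text."""
    goto, fail, out, info = build(patterns)
    partial = [dict() for _ in patterns]               # partial[i][p] = k
    matches = [[] for _ in patterns]
    q = 0
    for count, a in enumerate(text, start=1):          # count plays the role of element count
        while q != 0 and a not in goto[q]:
            q = fail[q]
        q = goto[q].get(a, 0)
        for i, k in out[q]:
            length, distance = info[i][k - 1]
            p = count - distance - length + 1          # where this occurrence of P_i would start
            last = len(info[i])
            if k == 1:
                if p >= 1:                             # assumption: pattern must fit in the text
                    if last == 1:
                        matches[i].append(p)
                    else:
                        partial[i][p] = 1
            elif partial[i].get(p) == k - 1:
                if k == last:
                    del partial[i][p]
                    matches[i].append(p)               # full match of P_i
                else:
                    partial[i][p] = k                  # extend the partial match
    return matches
```

With the hypothetical patterns a b * * c and * b and the text x a b y z c a b, the first pattern occurs at position 2 and the second at positions 2 and 7.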
Theorem 2. Let $S = \{P_1, \ldots, P_r\}$ be a set of patterns with wildcards and let $T$ be a linear input text of length $n$. The set of all occurrences of the patterns of $S$ in the text $T$ can be computed in time

$$O\Bigl(n + m + \sum_{i=1}^{r} \alpha_i \log |P_i|\Bigr),$$

where $m$ is the sum of the lengths $|P_i|$ of the patterns $P_i$ in $S$, $\alpha_i$ denotes the number of occurrences of keywords of $P_i$ in $T$, and the term $O(m)$ represents the time spent on preprocessing the patterns. Here the occurrences of pattern $P_i$ are represented by element positions $p$, and the occurrences of the $k$th keyword of $P_i$ by pairs $(p, k)$, where $p$ is the position in $T$ of the first element of the occurrence.
The occurrences of the same patterns in any new text $T'$ of length $n'$ can be found in time

$$O\Bigl(n' + \sum_{i=1}^{r} \alpha'_i \log |P_i|\Bigr),$$

where $\alpha'_i$ denotes the number of occurrences of keywords of $P_i$ in $T'$.
Proof. We construct the PMA for the set of all keywords that appear in the patterns in $S$ as described above. By construction, this PMA is of size $O(m)$ and can be constructed in time $O(m)$.
We use Algorithm 1 to simulate the PMA. At each state $q$ all output tuples $(i, k)$ such that the $k$th keyword of $P_i$ is a suffix of string($q$) are checked for partial matches: if the set partial matches($i$) contains $(p, k - 1)$, meaning that the prefix of $P_i$ including the $(k - 1)$th keyword has been found, then $(p, k - 1)$ is replaced by $(p, k)$. If a whole match of $P_i$ is thereby obtained, $p$ is inserted into the set matches($i$).
Each set partial matches($i$) can contain at most $|P_i|$ elements, because for each $p$ there can only be one pair $(p, k)$ and because all elements $(p', k')$ with $p' <$ element count − $|P_i|$ are deleted in conjunction with possible updating of the set. When partial matches($i$) is organized as a balanced binary search tree indexed by the unique key $p$, we obtain the bound $O(\alpha_i \log |P_i|)$, where $\alpha_i$ denotes the number of occurrences of keywords of $P_i$, for the total cost of the checks in set partial matches($i$).
Altogether we conclude the claimed time bound $O(n + m + \sum_{i=1}^{r} \alpha_i \log |P_i|)$. The time bound $O(n' + \sum_{i=1}^{r} \alpha'_i \log |P_i|)$ for any new text $T'$ comes from the fact that the $O(m)$ time for constructing the PMA is not needed.

Assume then that instead of linear input text we are processing a stream of XML documents, and that the filters are composed of patterns with wildcards. That is, each filter $f_i$ is of the form (5) with a leading descendant operator "//". We can use the construction of Theorem 2, yielding a PMA for all keywords appearing in the filters with an output function defined by

$$\mathit{output}(q) = \{\, (i, k) \mid \mathit{string}(q) = uv, \text{ where } v \text{ is the } k\text{th keyword of filter } f_i \,\}.$$

In this case backtracking is not as simple as in the case when filters do not contain wildcards (Theorem 1). When, upon encountering a start-element tag of an element, a goto arc is applied on the element, the current state is pushed onto the stack, but changes in the sets of partial matches computed at that state must also be logged onto the stack. More specifically, depending on the statement used to change partial matches($i$), a tuple containing enough information for performing the reversal of the change is pushed onto the stack.
Statements of the forms "insert $(p, 1)$ into partial matches($i$)" and "delete $(p, k)$ from partial matches($i$)" cause tuples ⟨inserted, $i$, $p$, 1⟩ and ⟨deleted, $i$, $p$, $k$⟩, respectively, to be logged, and a statement of the form "replace $(p, k - 1)$ by $(p, k)$ in partial matches($i$)" causes tuple ⟨replaced, $i$, $p$, $k$⟩ to be logged.

[Algorithm 2: Procedure check output(state) for matching linear text with sets of patterns containing wildcards.]

Then, when an end-element tag is encountered, the topmost state in the stack becomes the new state, but before continuing processing at that state the reversal operations based on the logged tuples above the topmost state must be performed. For tuple ⟨inserted, $i$, $p$, 1⟩, the statement "delete $(p, 1)$ from partial matches($i$)" is performed. For tuple ⟨deleted, $i$, $p$, $k$⟩, the statement "insert $(p, k)$ into partial matches($i$)" is performed. For tuple ⟨replaced, $i$, $p$, $k$⟩, the statement "replace $(p, k)$ by $(p, k - 1)$ in partial matches($i$)" is performed.
There is still one important point that must be taken into account when processing paths in a tree instead of linear text: we must also correctly update the variable element count so that its value is the number of elements in the current path. This is accomplished by simply decrementing the counter by one whenever the current state is obtained from the stack.
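The logging and reversal scheme can be sketched in Python as follows. The class name `PartialMatchLog` and its methods are hypothetical, and the sets partial matches($i$) are again modelled as dictionaries from $p$ to $k$ rather than as balanced search trees; the sketch only illustrates how the logged tuples restore the sets and the element count on backtracking.

```python
class PartialMatchLog:
    """Partial-match sets with an undo log interleaved with the state stack."""

    def __init__(self):
        self.partial = {}          # partial[i] = {p: k}
        self.stack = []            # saved states and logged tuples, newest last
        self.element_count = 0

    def start_element(self, current_state):
        self.stack.append(("state", current_state))
        self.element_count += 1

    def insert(self, i, p):
        self.partial.setdefault(i, {})[p] = 1
        self.stack.append(("inserted", i, p, 1))

    def replace(self, i, p, k):
        self.partial[i][p] = k                      # (p, k-1) becomes (p, k)
        self.stack.append(("replaced", i, p, k))

    def delete(self, i, p):
        k = self.partial[i].pop(p)
        self.stack.append(("deleted", i, p, k))

    def end_element(self):
        """Undo everything logged since the last start tag; return the saved state."""
        while True:
            entry = self.stack.pop()
            if entry[0] == "state":
                self.element_count -= 1
                return entry[1]
            kind, i, p, k = entry
            if kind == "inserted":
                del self.partial[i][p]
            elif kind == "replaced":
                self.partial[i][p] = k - 1
            else:                                   # "deleted"
                self.partial[i][p] = k
```

Each reversal costs the same as the operation it undoes, and each logged tuple is undone at most once.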
It is clear that all reversal operations caused by backtracking can be performed as efficiently as the original operations, and the number of reversal operations cannot be greater than the number of original operations. Thus we have:

Theorem 3. Let $f_1, f_2, \ldots, f_r$ be filters each of which is of the form $//P_i$, where $P_i$ is a pattern composed of XML element names, child operators (/), and wildcards (*). Then the XML filtering problem for document $T$ of length $n$ (counted as the number of XML elements) can be solved in time $O(n + m + \sum_{i=1}^{r} \alpha_i \log |P_i|)$, where $m$ is the sum of the lengths of the filters and $\alpha_i$ denotes the number of occurrences of keywords of $P_i$ in $T$.
Proof. The claim follows from the result stated in Theorem 2 when we observe that each reversal operation has the same cost as the operation to be reversed. Moreover, altogether the number of reversals cannot be more than $\sum_{i=1}^{r} \alpha_i$, because each of them either inserts or deletes a keyword occurrence, and never more than once.

Matching Linear XPath Expressions
In this section we are finally ready to present our main result, which states how multiple-pattern matching for streams of XML documents can be done efficiently when the full generality of linear XPath filters (without predicates) is allowed. More specifically, we consider matching with linear XPath filters that contain descendant operators (//) also in non-leading positions, that is, filters of the form

$$//P_{i,1}//P_{i,2}// \cdots //P_{i,m_i} \qquad (6)$$

where each subsequence $P_{i,j}$ is a pattern composed of XML element names, child operators (/), and wildcards (*). The special case in which the filter begins with "/" instead of "//" is handled by requiring that the SAX parser surrounds each document by tags <#> and </#>, where # is a new element name. Every filter of the form /g is transformed into //#/g, thus being of the form (6). We call the subsequences $P_{i,j}$ segments of the filter. Each segment is partitioned into one or more keywords, as are the patterns considered in the previous section.
For example, the filter //a/b/*/*/c//*/*/d has two segments, namely a/b/*/*/c and */*/d, where the keywords of the first segment are ab and c, and the only keyword of the second segment is d.

The number of segments of filter $i$ is given by #segments($i$), and the number of keywords of segment $j$ of filter $i$ is given by #keywords($i, j$). For filter number $i$, segment number $j$ and keyword number $k$, length($i, j, k$) gives the length of the $k$th keyword of segment $j$ of filter $i$, distance($i, j, k$) gives the distance of the $k$th keyword from the beginning of the segment, and length($i, j$) gives the length of segment $j$ of filter $i$.
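If the number of the example filter //a/b/*/*/c//*/*/d is $i$, we have: #segments($i$) = 2; #keywords($i$, 1) = 2 and #keywords($i$, 2) = 1; length($i$, 1, 1) = 2, length($i$, 1, 2) = 1, and length($i$, 2, 1) = 1; distance($i$, 1, 1) = 0, distance($i$, 1, 2) = 4, and distance($i$, 2, 1) = 2; length($i$, 1) = 5 and length($i$, 2) = 3.

As in the previous sections, the PMA is constructed from the set of all keywords that appear in the filters.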
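The decomposition of a filter into segments and keywords, including the rewriting of /g into //#/g, can be sketched in Python as follows. The function names `keywords_of` and `split_filter` are hypothetical, and distances are again taken to be the number of symbols preceding the keyword within its segment, which is an assumption of this illustration.

```python
from itertools import groupby


def keywords_of(symbols):
    """Keywords of one segment as (keyword, distance) pairs ('*' = wildcard)."""
    out, pos = [], 0
    for is_wild, run in groupby(symbols, key=lambda s: s == "*"):
        run = tuple(run)
        if not is_wild:
            out.append((run, pos))       # maximal wildcard-free run = keyword
        pos += len(run)
    return out


def split_filter(xpath):
    """Split a linear XPath filter into segments, each a list of keywords."""
    if not xpath.startswith("//"):
        xpath = "//#" + xpath            # /g is rewritten to //#/g
    return [keywords_of(seg.split("/")) for seg in xpath[2:].split("//")]
```

For the filter //a/b/*/*/c//*/*/d this yields two segments: the first with keywords ab (distance 0) and c (distance 4), the second with the single keyword d (distance 2).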
Output tuples in the output sets of states now take the form (i, j, k), where i is a filter number, j is a number of a segment of filter i, and k is a number of a keyword of segment j.
Partial matches of segment $j$ of filter $i$ are recorded in the set partial matches($i, j$) containing pairs $(p, k)$, where $p$ is an element position in the input document denoting the possible start of a match of segment $j$ of filter $i$, and $k$ is the number of the keyword of segment $j$ up to which the match has been found. As with the sets partial matches($i$) in the previous section, there is at most one pair $(p, k)$ in partial matches($i, j$) for any given element position $p$. Thus we can store partial matches($i, j$) in the same way as partial matches($i$), as balanced binary trees with key $p$.
A partial match $(p, k)$ in partial matches($i, j$) is a full match of segment $j$ of filter $i$ if $k$ = #keywords($i, j$). The sets partial matches($i, j$) are maintained in the same way as the sets partial matches($i$) of the previous section, the only difference being that a new partial match of segment $j > 1$ at position $p$ can only be started if a match of segment $j - 1$ has been found at some previous position $p'$ that is far enough from $p$, in other words, $p' \le p -$ length($i, j - 1$). Because of this condition, a full match of segment $j$ actually signals full matches of all segments of the entire filter from segment 1 up to and including segment $j$.
Full matches of segment $j$ of filter $i$ are collected into the set matches($i, j$); a match of an entire filter $i$ has been found when a full match of the last segment of the filter has been found, that is, when the set matches($i, j$) with $j$ = #segments($i$) becomes nonempty. It is easy to see that the sets partial matches($i, j$) can be maintained in time $O(\alpha_{i,j} \log |P_{i,j}|)$, where $\alpha_{i,j}$ denotes the number of occurrences of keywords of $P_{i,j}$ in the input XML document being filtered (cf. Section 3). Additionally, we need to be able to test whether or not $p' \le p -$ length($i, j - 1$) for some $p'$ in matches($i, j - 1$). This can be accomplished simply by maintaining the sets matches($i, j$) as balanced binary trees, and checking the condition for the smallest element in the tree. The total time needed for each matches($i, j$) is $O(\beta_{i,j} \log \beta_{i,j})$, where $\beta_{i,j}$ denotes the number of occurrences of $P_{i,1}//P_{i,2}// \ldots //P_{i,j}$.
We have:

Theorem 4. Let $f_1, f_2, \ldots, f_r$ be filters each of which is of the form $f_i = //P_{i,1}//P_{i,2}// \ldots //P_{i,m_i}$. Then the XML filtering problem for document $T$ of length $n$ (counted as the number of XML elements) can be solved in time […], where $m$ is the sum of the lengths of the filters and $\alpha$ denotes the number of all occurrences in $T$ of any keywords of the filters.

Algorithm 6. Procedure backtrack().

    s ← stack.pop()
    while s is not a state do
        if s = ⟨inserted, i, j, p, 1⟩ for some i, j, p then
            delete (p, 1) from partial matches(i, j)
        else if s = ⟨replaced, i, j, p, k⟩ for some i, j, p, k then
            replace (p, k) by (p, k − 1) in partial matches(i, j)
        else if s = ⟨deleted, i, j, p, k⟩ for some i, j, p, k then
            insert (p, k) into partial matches(i, j)
        end if
        s ← stack.pop()
    end while
    return(s)

Conclusion
In this article we have presented a new algorithm for matching linear XPath expressions (without predicates) with XML documents. The application we had in mind was filtering with filters expressed as linear XPath expressions; that is, for each document in an XML document stream, the task is to determine those filters that match the document. The basic building block in this algorithm is the Aho-Corasick multiple-pattern-matching algorithm, which we have extended in two novel ways. First, we showed how it can be efficiently applied when wildcards are present in the patterns, and second, we showed how tree patterns can be matched with tree-like text. Specifically, we used linear XPath expressions as patterns and XML documents as tree-like text.
Our main results (Theorem 4 and Theorem 5) are contributions in the sense that they state new worst-case bounds for XML filtering. Our future work aims at improvements by dynamically using the information of matched prefixes of patterns, without explicitly trying to match portions of patterns for which no corresponding matched prefix exists.
