Automatic Index Selection for Large-Scale Datalog Computation


Datalog has been applied to several use cases that require very high performance on large rulesets and factsets. It is common to create indexes for relations to improve search performance. However, existing indexing schemes either require manual index selection or deliver insufficient performance on very large tasks. In this paper, we propose an automatic scheme for selecting indexes. We automatically create the minimum number of indexes to speed up all the searches in a given Datalog program. We have integrated our indexing scheme into the open-source Datalog engine SOUFFLÉ.
We obtain performance on a par with what users have accepted from hand-optimized Datalog programs running on state-of-the-art Datalog engines, without requiring the effort of manual index selection. Extensive experiments on large real-world Datalog programs demonstrate that our indexing scheme yields considerable speedups (up to 2x) and significantly lower memory usage (up to 6x) compared with other automated index selection techniques.



INTRODUCTION
There has been a resurgence in the use of Datalog in several computer science communities [18], including program analysis where it is used as a domain specific language for succinctly specifying various classes of static analyses. In this setup, an input program to be analyzed is converted into an extensional database (EDB), while the analysis specification is encoded as a set of Datalog rules that compute the analysis result as an intensional database (IDB). Figures 1a and 1b depict a simplified taint analysis encoded as a Datalog program, used for detecting the vulnerabilities of a web-based hospital management system. The source code of the management system is converted into EDB relations, e.g., Src, Sink, Role, Access, Zone, and Priv, where relations Role and Access for access policy are shown in Figure 1a.

Example 1.
Datalog rules are constructed for the security analysis, enumerating all possible vulnerability cases. This part of the ruleset is shown in Figure 1b; we omit the Datalog rules for computing the IDB relation Path that defines the control flow of the source code. For example, the first rule adds an error involving source code locations s and e to the appropriate IDB relation whenever s is a user (uid) input (Src) for which there exists a program path to e, a database connection (Sink) location that has no Role for uid. That is, the user has connected to the database without a role. Note that, as is conventional, an underscore is used for anonymous variables whose values are not important. The results of the analysis, determining the set of error paths, are stored in the IDB relation Err.
Such use cases of Datalog in program analysis typically consist of hundreds of rules and result in giga-tuple-sized IDB relations, as shown in [21,37]. Several specialized high-performance Datalog engines [19,21,24] have been employed for performing such computations. These engines use Datalog as a computational notation and exploit bottom-up evaluation techniques that usually involve some degree of compilation. To reduce lookup times, relations are stored as in-memory, index-organized tables [24,36]. Selecting the appropriate indexes in this setting requires novel techniques compared to those standard for physical design in relational platforms.
The theory of the index selection problem (ISP) for relational database management systems [12,20,22,31] uses variants of the 0-1 knapsack problem, which has been shown to be NP-hard [26]. Deployed approaches such as [11] use heuristics and integrate with what-if query optimization calculations. These techniques are surveyed in Bruno [8], but they are too computationally expensive for large Datalog analyses. Essential differences include (i) indexes are needed for both EDB and IDB relations, (ii) the Datalog relations are often wide (not normalized), and thus they offer a very large number of possible indexes, and (iii) the Datalog programs typically consist of hundreds of relations and hundreds of deeply nested rules (see Table 2 in Section 7). As a result, the specialized Datalog engines often require users to provide annotations to guide the choice of indexes; for example, the DOOP framework [37] uses a code-rewriting technique that manually chooses an index for each relation and introduces "Opt" relations for building multiple indexes on a relation. To allow widespread use of program analysis, we must move beyond approaches that put the optimization burden on the user, which requires painstaking trial and error over hundreds of rules and annotations.
[Figure 1: Example Datalog analysis for vulnerability detection. (a) access-policy relations Role and Access; (b) Datalog rules for vulnerability detection; (c) nested loop joins for Datalog rule (r1): for all t1 ∈ Src, for all t2 ∈ σ_{x=t1(y)}(Path), for all t3 ∈ σ_{x=t2(y),z="Con"}(Sink), if σ_{x=t1(x)}(Role) = ∅ and (t1(y), t2(y)) ∉ Err, then add (t1(y), t2(y)) to Err.]
We have developed our techniques in the context of the open-source Datalog engine SOUFFLÉ [21]. We found inadequate performance until we introduced our new technique into SOUFFLÉ; however, the ideas should apply more broadly to any engine that computes a Datalog program in successive phases. That is, initially there are analysis phases that consider only the rules and produce code to perform a query evaluation plan resembling a nested loop join; these are followed by an evaluation phase that executes the compiled query on the facts (i.e., the EDB), producing a materialized IDB. Our auto-indexing is conducted in one of the analysis phases, and it chooses indexes that improve the performance of the compiled code.
The key insights of our work are as follows. We identify that the compiled evaluation is built from frequently repeated calls to simple selections, each on a single relation (which might be in the EDB or in the IDB). We call these primitive searches; a primitive search returns the tuples in a relation that satisfy a predicate testing some of the attributes for equality to given values. For example, Figure 1c depicts the evaluation logic that is compiled for the Datalog rule (r1) in Figure 1b, where the first, second, and third attributes of a relation are assumed to be accessed by x, y, and z, respectively. There are three primitive searches, σ_{x=t1(y)}(Path), σ_{x=t2(y),z="Con"}(Sink), and σ_{x=t1(x)}(Role), where the first one looks up all tuples in relation Path whose first attribute value is equal to t1(y), the second attribute value of a tuple t1 from relation Src. Note that each primitive search is a very restricted kind of range query: for each attribute, we are either checking equality to a value, or else we accept any value in that attribute.
Our next insight is that the evaluation of a primitive search can be greatly sped up if the relation has a clustered B-tree index that covers the search predicate. This means that the set of attributes where equality is checked forms a prefix of the sequence of attributes used to lexicographically define the index. For example, the primitive search σ_{x=v1,z=v3} is covered by the index ℓ = x ≺ z (that is, an index using x followed by z as its key) but not by ℓ = x ≺ y ≺ z. When a search is covered by an index, the tuples that match the search form a contiguous part of the scan of the index leaves. Accessing these can be much faster than a full table scan, which is what an engine would use in the absence of an index. Because the relations are so large, we find that queries are typically infeasible in practice unless there is some index to cover every primitive search among the rules. On the other hand, each index uses considerable space, and so we are driven to minimize the number of indexes constructed. Thus we define an abstract task, the Minimum Index Selection Problem (MISP), aiming to select the minimum number of indexes to cover all primitive searches used in the ruleset. We notice that this can be significantly fewer than one index for each primitive search on the relation. For example, the index ℓ = x ≺ y ≺ z covers three primitive searches: S1 = σ_{x=v1}, S2 = σ_{x=v1,y=v2}, and S3 = σ_{x=v1,y=v2,z=v3}.
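The coverage condition can be stated as a small predicate. The sketch below is illustrative, not SOUFFLÉ's implementation; the helper name `covers` is ours, an index is a list of attribute names, and a search is its set of equality-checked attributes:

```python
# Sketch: an index (attribute sequence) covers a primitive search exactly
# when the search's equality attributes form a prefix of that sequence.

def covers(index_seq, search_attrs):
    """index_seq: list of attribute names; search_attrs: set of attribute names."""
    k = len(search_attrs)
    return k <= len(index_seq) and set(index_seq[:k]) == set(search_attrs)

# x < z covers sigma_{x=v1,z=v3}; x < y < z does not.
assert covers(["x", "z"], {"x", "z"})
assert not covers(["x", "y", "z"], {"x", "z"})
assert covers(["x", "y", "z"], {"x"})
```

Note that the check compares prefix and search as sets: within the covered prefix, the order of the search's attributes is irrelevant.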
Finally, we are able to solve MISP efficiently, using a relationship between the search space of indexes and the search space of search chains among lexicographic orders. To do so, we abstract each primitive search as its set of search attributes, referred to as a search; for example, S1 = {x}, S2 = {x, y}, and S3 = {x, y, z} are the searches corresponding to the above three primitive searches. A sequence of k searches S1, . . . , Sk forms a search chain if each search Si is a proper subset of its immediate successor Si+1. As a result, all searches in the same search chain can be covered by a single index. We prove that the optimal MISP solution can be constructed from the optimal set (i.e., the one with minimum cardinality) of search chains that cover all primitive searches. Then we apply the combinatorial result of Dilworth's theorem [15] to compute the minimum number of search chains, and thus the minimum number of indexes, in O(|S|^2.5 + |S|^2 · m) time, for a set S of primitive searches on a relation with m attributes. This is much faster than a brute-force examination of all possible sets of indexes on this relation, which would have a time complexity of O(2^(m^m)). We have implemented our index selection approach as the default indexing technique of the SOUFFLÉ Datalog engine. We found that the computation overhead for our index selection is negligible, i.e., no slowdowns were observed during compilation. Using our technique, SOUFFLÉ has managed to efficiently compute program analyses typically deemed too large for Datalog engines, and moreover, the performance exhibited by SOUFFLÉ has been on a par with recent state-of-the-art hand-crafted analyzers [14].
Contributions. Our contributions are summarized as follows.
• We formally define the minimum index selection problem (MISP) of finding the minimum number of indexes to cover all primitive searches.
• We present a polynomial-time algorithm to solve MISP optimally via computing search chains.
• We formulate an automatic indexing scheme for large-scale Datalog computation based on this theory.
• We demonstrate the effectiveness of our indexing scheme in an open-source Datalog engine, SOUFFLÉ, with large, real-world rulesets and factsets.
Note that this paper builds on prior work [33,21] by some of the same authors. In [33] we give an overview of SOUFFLÉ as a compilation framework for Datalog, and [21] is a tool paper focusing on the synthesis of C++ via Futamura projections. Both papers mention index selection; however, neither the theory nor the implementation details of index selection are covered by those papers. This work introduces the formal problem definition of MISP, precise algorithms for solving MISP, brute-force estimates, proof sketches, and an evaluation of MISP in comparison to other index selection techniques.
We remark that our scheme is based on clustered B-tree index structures kept in-memory. If multiple indexes are needed, we materialize replicas of the relation so that each index can be clustered.
Organization. The paper is organized as follows. We highlight related work in Section 2 and present preliminary definitions in Section 3. In Section 4, we introduce an automatic indexing scheme and formally define the minimum index selection problem (MISP). In Section 5, we present a polynomial-time algorithm to solve MISP optimally. We evaluate our automatic indexing scheme in an open-source Datalog engine in Section 7. We discuss other extensions of our techniques in Section 8 and draw relevant conclusions in Section 9.

RELATED WORK
Datalog Engines. Datalog has been actively researched in several computer science communities [9,28,29,30]; a comprehensive introduction to Datalog can be found in [1]. Driven by applications in data integration, networking, and program analysis, Datalog has recently regained considerable interest, e.g., see [18] for a survey of these developments. LogicBlox [3] is a commercial proprietary system that focuses on encoding business logic. The latest version 4 of LogicBlox uses single-threaded execution and is less amenable to recursive queries. Hence, LogicBlox cannot be directly employed for the highly recursive workloads occurring in static program analysis. BigDatalog [36] is a Datalog system that executes queries on the unified analytics engine Apache Spark. The system is designed for recursive aggregate queries and applications typically found in social networks and other data-analytics applications with large data. The aim of BigDatalog is to exploit coarse-grained parallelism in Datalog programs. Static program analysis and security analysis have different workload characteristics, requiring fine-grained parallelism owing to a large number of mutually recursive relations with several hundred rules. Datalog-MC [39] uses an in-memory parallel evaluation of Datalog programs on shared-memory multi-core machines. Datalog-MC hash-partitions tables and executes the partitions on the cores of a shared-memory multi-core system using a variant of hash-join. To evaluate Datalog in parallel, Datalog rules are represented as and-or trees that are compiled to Java. Flix [25] is a new Datalog-inspired domain-specific language for static program analysis, extending the expressiveness of Datalog with arbitrary lattice structures. Flix's research objective is expressiveness rather than performance.
Other Datalog platforms. Note that the use cases of static program analysis require particular capabilities in Datalog engines, including fast fixed-point calculations for highly mutually recursive relations with very deep joins, and domain-specific extensions of Datalog including complex element types, components, and widening techniques. Various engines have been used for static program analysis, including LogicBlox version 3 [23], µZ [19], bddbddb [38], and SOUFFLÉ [21], which is currently the state-of-the-art Datalog engine used in Java points-to analysis [6], Amazon's AWS cloud, and smart-contract analysis [17].
Index Selection in Datalog Engines. Consider PA-Datalog, a variant of LogicBlox version 3 that has been used in DOOP for program analysis. This engine stores each relation (whether EDB or IDB) in an index whose structure is based on the order of attributes as listed in the relation. As shown in [7], the execution efficiency of DOOP can be greatly improved by a manual code-rewriting technique [2], which replicates a relation multiple times (corresponding to attributes listed in different orders) and thus creates a distinct index for each replica. This manual index creation, although resulting in an enormous speedup [7], requires end-users to be familiar with the underlying indexing mechanism of the Datalog engine. The manual code-rewriting technique is error-prone and consumes much human time and effort on programs with hundreds of rules. Also, the hand-optimized Datalog rulesets become obfuscated, and maintainability and readability are hampered. In contrast, we seek an automated approach to identifying appropriate index structures. In bddbddb, the system chooses a global variable order and indexes each relation once, according to the restriction of the global order to the attributes of the relation. As the number of rules and relations grows, as in standard static program analysis workloads, many searches cannot be supported by this single index.
Index Selection in Relational Databases. A recent monograph on optimizing the performance of SQL queries is given by Bruno [8]. One aspect is physical design, including the choice of index structures. In the context of relational databases, the problem of automatically selecting indexes for a set of database queries, referred to in the literature as the index selection problem (ISP) [12,20,22,31], is well studied and has been shown to be NP-hard [26]. It is typically formulated as a variant of the 0-1 knapsack problem, which balances the overall execution time of queries for an index configuration (i.e., a subset of indexes that influence the performance of a query) against the cost of index maintenance. Our index selection problem differs from the classic ISP literature and, to the best of our knowledge, is the first formulation for Datalog. Firstly, in our case, we only need to support primitive searches, which occur in equi-joins and simple value queries. Secondly, the nature of Datalog restricts the search predicate of each primitive search to be an equality predicate over the attributes of the relation. We further assume that each primitive search benefits from being indexed. Thus, we formulate our problem as automatically selecting the minimum number of indexes to cover all searches, and we show that, unlike the relational problem, we can solve it optimally in polynomial time. In contrast to semi-automatic techniques such as the WFIT index tuning algorithm [32] that are aimed at relational databases, our approach is designed to be fully automatic and computed on the fly at compilation time. Offline index selection approaches such as AutoAdmin [10] relay recommendations to a DBA by performing a cost-based analysis of a workload; the DBA then makes the final selection based on the feedback. Our approach is designed to automatically select the optimal index set on very large rulesets and, hence, is designed to scale to large Datalog programs with minimal overheads.
Our algorithm, however, can be used in conjunction with a manual or automatic join selection algorithm to provide an additional optimal-index-set cost metric to aid in general query optimization.

PRELIMINARIES
Like database queries, Datalog programs also work on relations. A relation R is a subset of an m-ary Cartesian product D = D1 × · · · × Dm (i.e., R ⊆ D), where Di (1 ≤ i ≤ m) are the domains of the relation. Elements of a relation R are referred to as tuples. Each tuple t = ⟨e1, e2, . . . , em⟩ ∈ R has a fixed length m, and ei is an element of the domain Di for 1 ≤ i ≤ m.
Given a relation R, attributes are used to refer to specific element positions of tuples of R. The set of attributes of R, denoted by AR = {x1, . . . , xm}, consists of m distinct symbols, and we write R(x1, . . . , xm) to associate symbol xi with the i-th position in the tuples. The elements of a tuple t = ⟨e1, . . . , em⟩ can be accessed by the access function t(xi) that maps tuple t to element ei. For example, given a relation R(x, y, z) and a tuple t = ⟨e1, e2, e3⟩ ∈ R, the access function is {t(x) → e1, t(y) → e2, t(z) → e3}.

Datalog Program Computation
A Datalog program P consists of a finite set of Datalog rules {r1, r2, . . .}, each of the form:

R0(X0) :- R1(X1), . . . , Rn(Xn).

Each Rj(Xj) is called an atom, where Rj is a relation name and Xj is a sequence of constants, variables, and the symbol "_" indicating irrelevance; for example, R(u, _, 1) where u is a variable. R0(X0) is called the head of the rule, and the other atoms form the body of the rule. The semantics of a Datalog rule is that, given a binding of all variables to constants, the head of the rule holds if each atom in the body of the rule holds. In this paper, we allow negated atoms in the body, but we limit their usage to the semantics of stratified Datalog (see [1] for the details of stratified Datalog).
The set of relations that appear in the heads of P's rules is referred to as the intensional database (IDB), while the set of other relations is referred to as the extensional database (EDB). In a Datalog program, tuples of the EDB are given, and the system computes tuples of the IDB. This is typically achieved by a bottom-up evaluation of the set of rules [1]. In brief, the process starts from an instance I of P that consists only of EDB tuples (also called facts). Then, an immediate consequence operator ΓP is repeatedly applied to I to generate new IDB tuples to be included in I. The process completes when a fixed point is reached, i.e., no more IDB tuples can be generated.
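The fixed-point process above can be sketched in a few lines. This is a minimal naive-evaluation illustration, not how a compiled engine works (real engines use semi-naive evaluation and indexed storage); the transitive-closure rules Path(x, y) :- Edge(x, y) and Path(x, z) :- Edge(x, y), Path(y, z) are our illustrative stand-in for a ruleset:

```python
# Naive bottom-up evaluation: repeatedly apply the immediate consequence
# operator Gamma_P until no new IDB tuples are derived.

def immediate_consequence(edge, path):
    """One application of Gamma_P for the two transitive-closure rules."""
    new = set(edge)                                   # Path(x,y) :- Edge(x,y).
    new |= {(x, z) for (x, y) in edge                 # Path(x,z) :- Edge(x,y),
                   for (y2, z) in path if y2 == y}    #              Path(y,z).
    return new

def fixpoint(edge):
    path = set()
    while True:
        path_next = path | immediate_consequence(edge, path)
        if path_next == path:                         # fixed point reached
            return path
        path = path_next

assert fixpoint({(1, 2), (2, 3)}) == {(1, 2), (2, 3), (1, 3)}
```

Each iteration only ever adds tuples, so on finite domains the loop is guaranteed to terminate.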
Primitive Search for Datalog Rule Evaluation. In the bottom-up evaluation process, a Datalog rule is typically evaluated via nested loop joins. For simplicity of presentation, we partition the sequence of body atoms of a Datalog rule into positive (referred to as R+i) and negative (referred to as R−j) occurrences (i.e., negative if negated in the body), and restate the above Datalog rule as

R0(X0) :- R+1(X1), . . . , R+h(Xh), ¬R−h+1(Xh+1), . . . , ¬R−n(Xn),

where h is the number of positive atoms. This Datalog rule is then evaluated via nested loop joins, as shown in Figure 2. Note that the ordering may change due to leveling, i.e., negative predicates may be hoisted to outer loops for performance reasons. In the nested loop joins, we iterate over tuples that are obtained from a primitive search (defined shortly) on a positive relation. Then, negatively occurring atoms are tested for emptiness with respect to primitive searches. Finally, the appropriate attributes of the tuples involved in the current iteration are projected, and a new tuple is inserted into the IDB relation of the head atom of the rule, if that tuple is not already in the relation.
A vital benefit of the nested loop implementation is its memory efficiency. At any time, the system stores the current tuples of the primitive searches on h relations only; there is no need to fully materialize the intermediate results of joining a prefix of the set of relations. The size of intermediate results could easily exceed the sizes of the eventual IDB tables.
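The nested loop joins of Figure 1c can be sketched as follows. This is an illustrative stand-in, not SOUFFLÉ's generated C++; the relation schemas Src(uid, s), Path(s, e), Sink(e, m, kind), and Role(uid, r) are our assumptions for the example, and each primitive search is written here as a naive linear filter (index support comes later in the paper):

```python
# Nested loop joins for a rule like (r1):
#   Err(s, e) holds when a Src location s reaches a "Con" Sink e
#   and the user uid has no Role.

def search(rel, pred):
    """Primitive search: tuples of rel matching position->constant pred."""
    return [t for t in rel if all(t[i] == v for i, v in pred.items())]

def eval_rule_r1(src, path, sink, role):
    err = set()
    for t1 in src:                                       # for all t1 in Src
        for t2 in search(path, {0: t1[1]}):              # sigma_{x=t1(y)}(Path)
            for t3 in search(sink, {0: t2[1], 2: "Con"}):  # sigma_{x=t2(y),z="Con"}(Sink)
                if not search(role, {0: t1[0]}):         # sigma_{x=t1(x)}(Role) empty
                    err.add((t1[1], t2[1]))              # add (t1(y), t2(y)) to Err
    return err

src = {(7, "s1")}
path = {("s1", "e1")}
sink = {("e1", 0, "Con")}
assert eval_rule_r1(src, path, sink, []) == {("s1", "e1")}
```

Only the loop variables t1, t2, t3 are live at any moment, which mirrors the memory-efficiency argument above.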

Definition 1 (PRIMITIVE SEARCH).
A primitive search has the following form:

σ_{x1=v1, . . . , xk=vk}(Ri).

Here, Ri is a relation and x1 = v1, . . . , xk = vk is a search predicate, where x1, . . . , xk are attributes and v1, . . . , vk are constants.
A primitive search extracts all tuples from a relation that adhere to the search predicate. In this paper, we limit the search predicate to equalities between left-hand-side attributes and right-hand-side constants, as this holds for all the real Datalog programs we tested in Section 7. Note that in our notation {x1, . . . , xk} does not necessarily consist of the first k attributes of the relation Ri, and the constants v1, . . . , vk are obtained either from Xi or from other tuples in relations further up the nested loop joins (i.e., t1, . . . , ti−1 in Figure 2).

Speeding Up Primitive Searches via Indexing Relations. After constructing the nested loop joins for all rules in a Datalog program, the most critical factor in the performance of evaluating the Datalog program is how the primitive searches are conducted. Obviously, a primitive search can be realized by a linear scan over all tuples of the relation, checking the search predicate against each tuple. However, the time complexity of a linear scan over a relation with n tuples is O(n), which is too costly for large relations, considering that each primitive search is invoked many times. In this paper, we aim at creating indexes for relations to speed up the primitive searches, and we study the index selection problem whose formal definition will be given in Section 4.

INDEXING RELATIONS
In this section, we first introduce indexes to speed up primitive searches, and then formally define our problem of minimum index selection.

From Primitive Search to Lex Search
To enable indexes on a relation, we introduce an order among the tuples of the relation to make them comparable. Since a tuple may have several elements, an order on tuples is imposed by element-wise comparison using a sequence over all attributes of the relation; this comparison is known as a lexicographical order. We denote an attribute sequence by ℓ = x1 ≺ x2 ≺ · · · ≺ xm, where ≺ denotes the chaining of elements to form a sequence. Then, given that ℓ is formed by all attributes of a relation, a lexicographical order ⊑ℓ ⊆ D × D is a total order (i.e., reflexive, antisymmetric, transitive, and total) defined over the domain D of the relation with respect to ℓ. For two tuples a, b ∈ D, when (a, b) ∈ ⊑ℓ, we write a ⊑ℓ b and say that a is smaller than b with respect to ℓ. Note that a ⊑ℓ a, and for any two different tuples a, b ∈ D, we have either a ⊑ℓ b or b ⊑ℓ a but not both. Given an ordered set of tuples, tuple lookups can be performed efficiently using some notion of a balanced search tree, called an index, in which tuples can be found in logarithmic rather than linear time. In this paper, we abstract away the underlying implementation details of an index with an attribute sequence, and we use ℓ to denote both an index and the attribute sequence on which the index is built. It is worth mentioning that different attribute sequences usually result in different lexicographical orders, and thus different indexes. That is, for tuples a, b ∈ D and attribute sequences ℓ and ℓ', it is possible that a ⊑ℓ b and b ⊑ℓ' a. Given an index ℓ, we define a lex search as follows.
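The dependence of the order on the attribute sequence can be seen with ordinary tuple comparison: permuting a tuple's elements according to ℓ and comparing the permuted tuples realizes ⊑ℓ. A small sketch (the position mapping of x, y, z to 0, 1, 2 is our assumption for illustration):

```python
# Sketch: the lexicographical order induced by an attribute sequence ell
# is ordinary tuple comparison after permuting elements into ell's order.

def key(ell, t, pos={"x": 0, "y": 1, "z": 2}):
    """Project tuple t into the attribute order given by sequence ell."""
    return tuple(t[pos[a]] for a in ell)

a, b = (1, 2, 9), (2, 1, 0)
# Under ell = x < y < z, a precedes b; under ell' = z < y < x, b precedes a.
assert key(["x", "y", "z"], a) < key(["x", "y", "z"], b)
assert key(["z", "y", "x"], b) < key(["z", "y", "x"], a)
```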
Definition 2 (LEX SEARCH). A lex search σ_{ρ(ℓ,a,b)} is defined for a relation R ⊆ D, and its semantics is given by

σ_{ρ(ℓ,a,b)}(R) = {t ∈ R | a ⊑ℓ t ⊑ℓ b},

where ℓ is an index on R, and the lower bound a and the upper bound b are tuples in D.
Constructing Lex Searches from Primitive Searches. As lex searches can be conducted efficiently using an index, we would like to transform each primitive search σ_{x1=v1,...,xk=vk}(R) into an equivalent lex search σ_{ρ(ℓ,a,b)}(R). A lex search contains two symbolic bounds a and b, as well as an index ℓ, in its search predicate; thus we need to construct a, b, and ℓ, which we discuss in the following. We assume that the relation R has m attributes in total. Firstly, we describe how to construct the lower bound a and the upper bound b. If k = m, then all attributes of R are in the search predicate, a = b, and both are trivially defined by the search predicate. Otherwise, the primitive search does not specify all attributes of R in its search predicate, and the unspecified values need to be padded with infima and suprema values for the lower and upper bounds, respectively. We represent an unspecified element for the bound construction by an artificial constant ⋆, and let v_{k+1} = ⋆; we assume that ⋆ is not an element of any of the domains Dj. We define a surjective index mapping function ι : {1, . . . , m} → {1, . . . , k + 1} that maps the specified element positions to their corresponding constant values and the unspecified positions to k + 1 (i.e., to v_{k+1} = ⋆). The lower and upper bounds are constructed by the functions lb and ub, respectively, which replace each unspecified value in position j with the infimum ⊥j or the supremum ⊤j of the domain Dj. Formally, the functions are defined as

lb(v1, . . . , vk) = ⟨w1, . . . , wm⟩ with wj = v_{ι(j)} if ι(j) ≤ k, and wj = ⊥j otherwise;
ub(v1, . . . , vk) = ⟨w1, . . . , wm⟩ with wj = v_{ι(j)} if ι(j) ≤ k, and wj = ⊤j otherwise.

Secondly, we show in Lemma 1 that, given a = lb(v1, . . . , vk) and b = ub(v1, . . . , vk), we have σ_{x1=v1,...,xk=vk}(R) = σ_{ρ(ℓ,a,b)}(R) whenever x1, . . . , xk form a prefix of ℓ. Before that, we first define the notion of a prefix set.
Definition 3 (PREFIX SET). Given an attribute sequence (i.e., an index) ℓ = x1 ≺ x2 ≺ · · · ≺ xm, its k-th prefix set (1 ≤ k ≤ m) is the set {x1, . . . , xk}.

Lemma 1. Given a primitive search σ_{x1=v1,...,xk=vk}(R) and an index ℓ whose k-th prefix set is {x1, . . . , xk}, then

σ_{x1=v1,...,xk=vk}(R) = σ_{ρ(ℓ, lb(v1,...,vk), ub(v1,...,vk))}(R)

holds for any R ⊆ D.
From Lemma 1, to transform a primitive search σ_{x1=v1,...,xk=vk}(R) into an equivalent lex search, the index ℓ for the lex search can be any sequence of all attributes of R such that the first k attributes are x1, . . . , xk in an arbitrary order. Thus, we also use ℓ = x1 ≺ · · · ≺ xk, which is only a subsequence of the attributes of R, to denote an index, since the chaining order of the remaining attributes is irrelevant for the lex search.
Example 2. Consider the primitive searches in the second column of Table 1; their corresponding lex searches are illustrated in the third to fifth columns, where the third column shows the lower bound a, the fourth column shows the upper bound b, and the fifth column shows the index ℓ. Here, given a primitive search σ_{x1=v1,...,xk=vk}(R), the index is selected as ℓ = x1 ≺ · · · ≺ xk. Thus, each lex search uses a distinct index.
Remarks. The lex searches σ_{ρ(ℓ,a,b)}(R) constructed from primitive searches σ_{x1=v1,...,xk=vk}(R), as discussed above, are of a special form. That is, for any attribute xi ∈ {x1, . . . , xk} we have a(xi) = b(xi) = vi, and for any attribute xi ∈ AR \ {x1, . . . , xk} we have a(xi) = ⊥i and b(xi) = ⊤i. Thus, the results of a lex search form a consecutive interval in the lexicographical order of all tuples of R with respect to ℓ. As a result, any one-dimensional order-based index (e.g., a B-tree) can be used to implement ℓ, and a lex search can be executed in linear-log time in the size of the output in the worst case, i.e., O(|σ_{ρ(ℓ,a,b)}(R)| · log n), where n is the number of tuples in the relation R. It is worth mentioning that for general range searches, one would need a multi-dimensional index (e.g., an R-tree), which has a higher time complexity and runs slower than a one-dimensional index such as a B-tree. Thus, in this paper we only consider the special range searches, which we refer to as lex searches; lex searches can be supported by one-dimensional indexes.
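The bound construction and the contiguous-interval property can be sketched together. Here a sorted Python list stands in for the clustered B-tree, with -inf/+inf padding unspecified attributes in the lower and upper bounds; this is an illustration of the idea, not SOUFFLÉ's data structure:

```python
# Sketch of a lex search over a clustered, ordered copy of a relation.
import bisect

LO, HI = float("-inf"), float("inf")

def lex_search(sorted_rel, ell_positions, bound_vals):
    """sorted_rel: tuples sorted by the key (t[p] for p in ell_positions).
    bound_vals: equality constants for a prefix of ell_positions."""
    k = len(bound_vals)
    pad = len(ell_positions) - k
    lb = tuple(bound_vals) + (LO,) * pad      # lower bound a: pad with infima
    ub = tuple(bound_vals) + (HI,) * pad      # upper bound b: pad with suprema
    keys = [tuple(t[p] for p in ell_positions) for t in sorted_rel]
    i = bisect.bisect_left(keys, lb)          # binary search for the interval
    j = bisect.bisect_right(keys, ub)
    return sorted_rel[i:j]                    # contiguous run of matches

# Relation R(x, y) clustered on ell = x < y; search sigma_{x=1}.
R = sorted([(1, 1), (2, 1), (1, 2)])
assert lex_search(R, (0, 1), (1,)) == [(1, 1), (1, 2)]
```

A real B-tree keeps the keys ordered persistently; here the key list is rebuilt per call purely to keep the sketch short.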
It is easy to construct an example where a particular primitive search cannot be transformed into a lex search using a particular index ℓ, and thus this search cannot be sped up by ℓ. For example, for R(x, y) = {⟨1, 1⟩, ⟨1, 2⟩, ⟨2, 1⟩} and ℓ = x ≺ y, we have σy=1(R) = {⟨1, 1⟩, ⟨2, 1⟩}, which consists of the first and third tuples in the lexicographical order ⊑ℓ and thus does not form a consecutive interval. In view of this, we say that an index covers a primitive search if it can be used to speed up the primitive search via a lex search. We have the following corollary.

Corollary 1. An index ℓ covers a primitive search σ_{x1=v1,...,xk=vk}(R) if and only if {x1, . . . , xk} is the k-th prefix set of ℓ.

As the lex search transformed from a primitive search is uniquely determined by the index and the primitive search, we focus our discussion on indexes rather than lex searches in the remainder of the paper.

Minimum Index Selection
Due to the lower look-up time complexity of lex searches compared with that of a linear scan, indexes are essential for efficient Datalog program computation. However, when constructing indexes, the question remains: what is the best set of indexes needed to cover all primitive searches on a given relation? In this section we define the minimum index selection problem.
Before formally defining our problem, we first establish some additional notation. Firstly, we abstract a primitive search σ_{x1=v1,...,xk=vk} as its set of search attributes, which we refer to as a search, denoted by S = {x1, . . . , xk}; this is because the constants v1, . . . , vk are irrelevant to index creation. Secondly, given a set S of searches and a set L of indexes on a relation R, we would like to know whether L can cover S. Note that, since all primitive searches with the same set of attributes (i.e., the same search) can be covered by the same index, in the following we use set semantics when referring to a search set. We formalize this via the l-cover predicate.
Definition 4 (L-COVER). Given a set S of searches and a set L of indexes on a relation R, we define a predicate l-cover_S(L) which is true if for every search S ∈ S, there exists an index ℓ ∈ L that covers S.
Then, based on the definition of l-cover, we would like to find the smallest set of indexes that covers a search set S. The rationale for minimizing the number of indexes is as follows. Firstly, following Corollary 1, an index represented by an attribute sequence may cover a multitude of searches, provided the elements of its prefixes coincide with the attributes of the searches. For example, the two searches S1 = {x} and S2 = {x, y} on a relation can be covered by the same index ℓ = x ≺ y. Secondly, for a search that can be covered by multiple indexes, the benefits of the different indexes are the same, i.e., they will result in the same running time. Thirdly, the fewer the indexes, the lower the creation and maintenance costs of these indexes.
As indexes and searches on different relations are independent, we consider each relation separately. We formulate our problem as follows.
Problem 2 (Minimum Index Selection Problem (MISP)). Given a set S of searches on a relation R, the minimum index selection problem is to find a set of indexes with the minimum cardinality such that all searches of S are covered by the index set, i.e.,

f_S = argmin_{L : l-cover_S(L)} |L|.
Example 3. Continuing Example 2, the set of searches in Table 1 is S = { {x}, {x, y}, {x, z}, {x, y, z} }. It can be covered by two indexes ℓ1 = x ≺ y ≺ z and ℓ2 = x ≺ z, as shown in the sixth column of Table 1; this is smaller than the four indexes used in Example 2. Indeed, two is the smallest number of indexes that covers S, since it is easy to see that {x, y} and {x, z} cannot be covered by the same index.

COMPUTING THE OPTIMAL MISP
In this section, we propose an algorithm that solves MISP optimally in polynomial time. We begin by discussing the inviability of a brute-force approach.

Inviability of a Brute-force Approach
Before presenting our algorithm, we discuss the size of the search space of MISP. If it is very large, then a brute-force algorithm is not viable, especially for high-performance engines.
Given a set S of searches on a relation R, let A be the set of attributes of R that are relevant for the searches, i.e., A = ∪_{S∈S} S. We use L_A to represent the set of all possible permutations/sequences that may be formed by the elements of A, i.e., L_A = ∪_{X⊆A, X≠∅} Pm(X). Here, Pm(X) denotes the set of permutations of a set X, i.e., the set of all possible sequences formed by the elements of X such that each element occurs exactly once. Now, we bound |L_A|. Although constructing a closed form is hard, |L_A| can be bounded by the following lemma.
Lemma 2. Let m = |A|. Then

|L_A| = Σ_{i=1}^{m} (m choose i) · i! = m! · Σ_{i=0}^{m−1} 1/i! ≤ e · m!,

where the second equality follows from the fact that, for any X ⊆ A with |X| = i, Pm(X) contains i! sequences. Note that the absolute error of the over-approximation of |L_A| is small, i.e., e · m! − |L_A| = m! · Σ_{i≥m} 1/i! < 2.

Recall that MISP searches for the smallest subset of L_A that covers all primitive searches on a relation. Thus, a brute-force approach would have to iterate through all subsets of L_A. The search space of a brute-force approach is therefore 2^{L_A} = {L | L ⊆ L_A}, and its size is |2^{L_A}| = 2^{|L_A|}. Using the approximation of |L_A| in Lemma 2, we obtain a complexity of O(2^{e·m!}).

Theorem 1. A brute-force approach for MISP exhibits a worst-case time complexity of O(2^{m^m}).
Proof. As discussed above, the time complexity of a brute-force approach for MISP is O(2^{e·m!}). Then, this theorem follows from Stirling's approximation of m!. Note that the approximation becomes more precise for large m.
As a result, a brute-force approach becomes intractable very quickly. For example, for a relation with 4 attributes, a brute-force MISP algorithm has to test 2^64 ≈ 1.8 × 10^19 different subsets of L_A for coverage and minimality.
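The counting argument can be checked numerically (a small sketch; the name num_indexes is ours):

```python
from math import comb, factorial, e

def num_indexes(m):
    # |L_A| = sum over non-empty X ⊆ A of |Pm(X)| = Σ_{i=1..m} (m choose i) · i!
    return sum(comb(m, i) * factorial(i) for i in range(1, m + 1))

print(num_indexes(4))                       # 64, so a brute force tests 2^64 subsets
print(num_indexes(4) <= e * factorial(4))   # True: the e·m! over-approximation holds
```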

Computing MISP via Chain Cover
In view of the inviability of a brute-force approach, we propose to solve MISP via computing a chain cover of the searches. In the following, we first formulate the minimum chain cover problem (MCCP) and prove that an optimal MISP solution can be obtained from an optimal MCCP solution. Then, we propose a polynomial-time algorithm MinIndex that solves MISP optimally.

Minimum Chain Cover Problem
We define a search chain C as a set of searches {S1, . . . , Sk} that are totally ordered by subset inclusion, i.e., C ≡ S1 ⊂ S2 ⊂ · · · ⊂ Sk. A search chain is related to an index as follows.
Lemma 3. Given a search chain C = S1 ⊂ S2 ⊂ · · · ⊂ Sk, we can construct an index ℓ to cover all searches of C.
Proof. We prove this lemma by constructing an index ℓ that covers all searches of C. Let Si − Si−1 denote the set of attributes of Si that are not in Si−1. Then, it is easy to see that any index conforming with S1 ≺ (S2 − S1) ≺ · · · ≺ (Sk − Sk−1) is such an index, i.e., the attributes of Si+1 − Si appear later than the attributes of Si − Si−1. Note that the attributes within S1, and within each Si − Si−1, can be ordered arbitrarily.
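A minimal sketch of this construction (the helper name chain_to_index is ours; any order within each difference block is valid, so we simply sort):

```python
def chain_to_index(chain):
    # chain: searches S1 ⊂ S2 ⊂ ... ⊂ Sk, given as a list of attribute sets.
    # Emit the attributes of S1 first, then S2 − S1, then S3 − S2, and so on.
    index, seen = [], set()
    for s in chain:
        index.extend(sorted(s - seen))  # arbitrary (here: sorted) order within a block
        seen |= s
    return tuple(index)

print(chain_to_index([{"x"}, {"x", "y"}, {"x", "y", "z"}]))  # ('x', 'y', 'z')
print(chain_to_index([{"x", "z"}]))                          # ('x', 'z')
```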
Following Lemma 3, we say that a search chain C covers all its searches, i.e., C covers S for every search S ∈ C. Then, we would like to know whether a set C of search chains can cover all searches in a search set S. We formalize this via the c-cover predicate.
Definition 5 (C-COVER). Given a set S of searches and a set C of search chains on a relation R, we define a predicate c-cover_S(C) which is true if for every search S ∈ S, there is a search chain C ∈ C that covers S, i.e., c-cover_S(C) = ∀S ∈ S : ∃C ∈ C : S ∈ C.

Now, we are ready to define our minimum chain cover problem, which aims to find the smallest set of search chains to cover all searches in a given set of searches.
Problem 3 (Minimum Chain Cover Problem (MCCP)). Given a set S of searches on a relation R, the minimum chain cover problem is to find the minimum set g_S of search chains to cover S, i.e.,

g_S = arg min_{C : c-cover_S(C)} |C|.
The rationale for defining MCCP is that, given a set C of search chains covering all searches in a search set S, we can construct a set of indexes of cardinality |C| to cover S by following Lemma 3. Thus, the smaller the cardinality of C, the better.
Moreover, there is a one-to-one correspondence between solutions of MISP and solutions of MCCP, as proved by the following lemma.
Lemma 4. Given any search set S on a relation R, there is a one-to-one correspondence between sets C of search chains that cover S and sets L of indexes that cover S, such that |C| = |L|.
Proof. Following from Lemma 3, we know that given any set C of search chains that cover S, we can construct an index set of cardinality |C| to cover S. Thus, what remains to be proved is that, given any index ℓ, we can construct a search chain C to cover all searches that are covered by ℓ.
Given an index ℓ and a set S of searches, we let S_ℓ denote the subset of S covered by ℓ. We will show that S_ℓ is a search chain. Firstly, it is easy to see that for any two distinct S′, S″ ∈ S_ℓ, we have |S′| ≠ |S″|, since each coincides with a prefix of ℓ of a different length. Secondly, following Corollary 1, we know that for any S′, S″ ∈ S_ℓ, we have either S′ ⊂ S″ or S″ ⊂ S′, since the k-th prefix of ℓ is a subset of the (k + 1)-th prefix of ℓ. Thus, the lemma holds.
Following from Lemma 4, we have the following corollary, which states that we can obtain an optimal MISP solution from an optimal MCCP solution.
Corollary 2. Given any search set S on a relation R, an optimal MISP solution can be obtained from an optimal MCCP solution.

A Polynomial-time MISP Algorithm
We have shown in Corollary 2 that we can obtain an optimal MISP solution from an optimal MCCP solution. The good news is that MCCP can be solved optimally in polynomial time via Dilworth's Theorem [15], which states that in a finite partial order, the size of a maximum anti-chain is equal to the minimum number of chains needed to cover its elements. An anti-chain is a subset of a partially ordered set in which any two elements are unrelated, and a chain is a totally ordered subset of a partially ordered set. Although Dilworth's Theorem is non-constructive, there exist constructive versions that solve the minimum chain cover problem either via the maximum matching problem in a bipartite graph [16] or via a max-flow problem [27]. Both problems are optimally solvable in polynomial time.
The general idea of computing a minimum chain cover for a search set S is as follows. Firstly, a bipartite graph G_S = (U, V, E) is constructed such that there is a vertex in both U and V for each search S ∈ S, and there is an edge between S ∈ U and S′ ∈ V if S is a proper subset of S′ (i.e., S ⊂ S′). For example, Figure 3a depicts the bipartite graph constructed for the searches of Table 1. Secondly, a maximum matching M of G_S is computed; see Figure 3c.
Alternatively, we can view the edges of the bipartite graph G_S as directed edges in a unipartite graph with S as the set of vertices, e.g., see Figure 3c with both solid and dotted lines. Then, the edges of a matching M of G_S form |S| − |M| directed paths in the unipartite graph (each corresponding to one search chain), since each vertex has at most one incoming edge and at most one outgoing edge by the definition of matching. Specifically, each chain starts from a search that does not have any predecessors (i.e., incoming edges) in the matching M. Moreover, the set of search chains constructed from a maximum matching of G_S has the smallest cardinality. This is because, given any set C of non-overlapping search chains of S, a matching of |S| − |C| edges can be constructed for G_S, since each search chain C ∈ C adds |C| − 1 edges to the matching; two search chains are non-overlapping if their sets of searches are disjoint. Note that, for any search set S, there is a minimum chain cover whose search chains are non-overlapping, since a search chain remains valid after removing any search from it. As a result, a minimum set of search chains is constructed from a maximum matching of G_S; the pseudocode of the computation is shown in Algorithm 1. Finally, given the set C of search chains computed by Algorithm 1 for covering the search set S, a set L of |C| indexes can be constructed to cover S by following the proof of Lemma 3. For example, the search chain {x} ⊂ {x, y} ⊂ {x, y, z} is converted to the index {x} ≺ ({x, y} − {x}) ≺ ({x, y, z} − {x, y}), which is x ≺ y ≺ z, and the search chain {x, z} can be converted to either the index x ≺ z or the index z ≺ x. The pseudocode for this conversion, denoted MinIndex, is shown in Algorithm 2.
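The matching-based construction above can be sketched in a self-contained way (we use a simple Kuhn-style augmenting-path maximum matching rather than the faster Hopcroft–Karp algorithm; function names are ours):

```python
def min_chain_cover(searches):
    # searches: list of distinct attribute sets (set-based semantics).
    # Bipartite graph: edge u -> v iff searches[u] is a proper subset of searches[v].
    n = len(searches)
    adj = [[v for v in range(n) if searches[u] < searches[v]] for u in range(n)]
    match = [None] * n  # match[v]: left vertex matched to the right copy of v

    def augment(u, visited):
        # Kuhn's augmenting-path step for left vertex u.
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                if match[v] is None or augment(match[v], visited):
                    match[v] = u
                    return True
        return False

    for u in range(n):
        augment(u, set())

    # A matched edge u -> v makes v the successor of u in a chain.
    succ = {u: v for v, u in enumerate(match) if u is not None}
    chains = []
    for u in range(n):
        if match[u] is None:  # u has no predecessor: it starts a chain
            chain = [searches[u]]
            while u in succ:
                u = succ[u]
                chain.append(searches[u])
            chains.append(chain)
    return chains

# The searches of Table 1 yield |S| − |M| = 4 − 2 = 2 chains:
chains = min_chain_cover([{"x"}, {"x", "y"}, {"x", "z"}, {"x", "y", "z"}])
print(len(chains))  # 2
```

Each of the resulting chains is then converted to one index as in the proof of Lemma 3.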

Algorithm 2: MinIndex(S)
Input: A set S of searches
Output: A minimum set L of indexes to cover S
1 C ← MinChainCover(S);
2 Initialize L to be the empty set;
3 for all S1 ⊂ S2 ⊂ · · · ⊂ Sk−1 ⊂ Sk ∈ C do
4     Add to L an arbitrary index conforming with S1 ≺ (S2 − S1) ≺ · · · ≺ (Sk − Sk−1);
5 return L;

The correctness of MinIndex (Algorithm 1 and Algorithm 2) follows from the above discussions. Let m be the number of distinct attributes in S; note that m is at most the number of attributes of the relation. Then, the time complexity of MinIndex is bounded by the following theorem.

Theorem 2. MinIndex runs in O(|S|^2.5 + |S|^2 · m) time.

Proof. The time complexity follows from the facts that constructing the bipartite graph G takes O(|S|^2 · m) time, computing the maximum matching in G takes O(|S|^2.5) time, and both constructing the chain cover from the matching M and constructing the indexes from the chain cover take O(|S| · m) time.
Note that, as both |S| and m are small in practice (e.g., at most in the hundreds), the running time of MinIndex is usually negligible compared with the total running time of a Datalog program.

INTEGRATING INTO SOUFFLÉ
We have implemented our index selection approach as the default indexing technique of the open-source Datalog engine SOUFFLÉ, which works as follows. It first translates a given Datalog ruleset (also called a program) into C++ code during the code generation phase, then it compiles the C++ code into binary executable code at the code compilation phase, and finally the code execution phase executes the binary code on the EDB (i.e., input facts) to compute the IDBs. For more details of SOUFFLÉ, please refer to [21,33].
Index selection occurs in the code generation phase, which also performs several rewrite transformations. In the first step, a query translator converts each rule of an input Datalog program to a nested loop join. It selects the best loop order, minimizing the iteration space of the nested loop join with the aid of a query planner [1] or user hints. In the second step, primitive searches (see Definition 1) are identified from the nested loop joins. In the last step, indexes are selected by our algorithm MinIndex to cover the primitive searches, and the primitive searches are replaced by index operations on relations based on the selected indexes.
The code execution phase of SOUFFLÉ is also divided into several steps. It ingests the whole factset (i.e., EDB) into main memory and stores it in EDB index structures. The binary code runs on the in-memory structures, repeatedly adding rows to the various IDB index structures, which are initially empty. Finally, the computed IDB relations are output to disk. Note that the first two steps are interleaved.

EXPERIMENTS
In this section, we evaluate our auto-indexing scheme by measuring an implementation of it, along with some alternative schemes, in the production-strength Datalog engine SOUFFLÉ [21]. Our evaluations validate the following claims.

Claim-I: Negligible Index Selection Overhead During Analysis. The time taken for selecting indexes using our auto-indexing scheme does not substantially slow down the code generation and compilation phases compared to alternative indexing schemes.

Claim-II: Significant Performance Benefit During Execution. Our auto-indexing scheme provides a good combination of fast runtime evaluation and low memory footprint.

Claim-III: Good Enough, Without Hand Optimizations. Our auto-indexing scheme delivers runtime evaluation speed and memory usage that compare well with what users have accepted as worth the effort of hand optimization.

Experimental Setup
Our experiments were performed on an Intel(R) Core(TM) i7-7700K CPU at 4.20GHz with 64GB of physical RAM, running Ubuntu 16.04.3 LTS on bare metal. The experiments were conducted in isolation without virtualization so that runtime results are stable.

Compared Indexing Schemes
We compare the following three indexing schemes, all implemented by us in SOUFFLÉ.
• Auto: our auto-indexing scheme presented in Algorithm 2.
• Maximal: one index for each distinct search on a relation.
• Single: only one index for each relation. To choose the best index for a relation R under a given workload, we first count the frequency of each individual search S on R, obtained by instrumenting the search patterns while executing the Datalog program once. Then, the best single index is selected as the one whose set of covered searches has the maximum total frequency. This can be computed in time quadratic in the number of searches by dynamic programming (cf. Chapter 12, [13]); we omit the details.
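Omitting the dynamic program, the selection of the best single index can be sketched by brute force over all candidate attribute orders (feasible only for small relations; all names and the example frequencies are our own illustration, not the paper's implementation):

```python
from itertools import permutations

def best_single_index(attrs, search_freq):
    # search_freq: {frozenset of attributes: observed frequency of that search}.
    # Pick the attribute order whose covered searches have maximum total frequency.
    def covered(index, s):
        # An index covers a search iff the search equals a prefix of the index.
        return set(index[:len(s)]) == set(s)
    return max(permutations(attrs),
               key=lambda idx: sum(f for s, f in search_freq.items()
                                   if covered(idx, s)))

# Hypothetical workload: {x} seen 5 times, {x, y} 3 times, {x, z} 4 times.
freqs = {frozenset({"x"}): 5, frozenset({"x", "y"}): 3, frozenset({"x", "z"}): 4}
print(best_single_index(["x", "y", "z"], freqs))  # ('x', 'z', 'y'): covers frequency 9
```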
Intuitively, these two alternative indexing schemes, Maximal and Single, should be especially good for execution speed and memory efficiency, respectively. However, Maximal uses much more memory for its numerous indexes, and Single does not cover every search and thus can be very slow in evaluating the program. Our experiments in Section 7.2.2 validate these expectations and show that Auto offers an excellent compromise, with runtime similar to Maximal and much lower than Single, and memory usage similar to Single and substantially less than Maximal.
To aid in understanding the implications and overheads of indexing in SOUFFLÉ, we also include measurements for two radically different approaches. In the code analysis steps, we consider a scheme we call None, in which no work is done to choose indexes based on the searches to be performed; instead, the system stores each relation with the single index determined by the lexicographic order of the relation's attributes. This establishes a baseline for the overhead of any approach that examines the set of searches in order to create suitable indexes. For the execution phase, we have implemented what we call Hash, where the relations (both EDB and IDB) are stored using the STL hash map. As a hash map cannot be shared between two different searches, Hash builds one hash map for each distinct search, in the same way as Maximal. We discuss the implications of this below.
In addition, for the workloads of one program-analysis use case, we also compare our auto-indexing scheme in SOUFFLÉ to another Datalog system, PA-Datalog, a Logicblox Ver. 3 engine optimized for program analysis. The ruleset used with PA-Datalog has been heavily hand-optimized through months of expert work, specifically for this use case. Because these are different engines, the comparison of speed and memory is not truly apples-to-apples. Nevertheless, we will illustrate in Section 7.2.3 that our auto-indexing scheme in SOUFFLÉ results in better performance without human optimization effort, compared to what users have accepted as sufficiently good to justify the effort of hand optimization.

Case Studies
We perform our evaluations using two real-world case studies: a cloud security use case and a program analysis use case. These use cases are of very large scale: the Datalog programs contain hundreds of rules and relations and produce giga-tuple output relations.
Use Case-I: Cloud Security Analysis. The first use case analyzes the security of Amazon networks. In this industrial use case, a Domain Specific Language (DSL) is used to describe security properties of networks and to pose queries about them. A translator automatically converts the security specifications and queries in the DSL to a Datalog program, where the properties of the given networks are encoded as EDBs (i.e., input relations). The generated Datalog programs are unoptimized, since the DSL offers no annotations for hand-crafted optimizations such as enforcing good indexing schemes. It is worth pointing out that this use case has resource constraints, including a low memory footprint and runtime limitations, imposed by running the security analysis as a service on Amazon Lambda [35].
For this use case, we consider three security analysis workloads (i.e., three Datalog programs), each encoding specific security properties and security queries. We name these three programs as sec1, sec2, and sec3, where the numbers of rules and relations of these programs are shown in Table 2. At execution time, the programs run on five network factsets that vary in complexity: networks N1075 and N2340 have less complexity whereas networks N3500, N3511, and N9087 are more complex in terms of their network connectivity. The EDB sizes (i.e., total number of tuples in all input relations) of the five network datasets are summarized in Table 3.
Use Case-II: Program Analysis. The second use case is the DOOP program analysis framework, which performs points-to analyses for Java programs; DOOP is publicly available and open source [37]. Specifically, a Java program is encoded as an EDB (i.e., input relations) and the points-to analysis is expressed as a Datalog program. DOOP's points-to analysis has been used to analyze very large libraries such as the Oracle JDK [21]; as a result, it requires very fast execution and a low memory footprint in order to complete in feasible time with feasible resources. The DOOP analysis workloads have different parameterizable precisions, which depend on (1) how concrete Java objects are abstracted to a finite set of objects in a sound fashion and (2) how much context is stored for each variable. For example, a context could be a trace over the last few call-sites or the receiver object of a method call.
In our testing, we use three representative precision settings: 1-object-sensitive+1-heap (1o1h), 2-object-sensitive+2-heap (2o2h), and 3-object-sensitive+3-heap (3o3h). Each of these precision settings corresponds to a Datalog program containing 496 relations and 469 rules; however, increased precision leads to larger relations due to added rule complexity. Each analysis program is applied at execution time to 8 factsets from the DaCapo06 benchmark suite [4]; the sizes of these factsets are summarized in Table 4.

Experimental Results
We present experimental results to validate our claims in the following three subsubsections.

Code Analysis Performance
Index Selection Overhead. To quantify the overhead of index selection in our Auto indexing scheme, we also implemented an indexing scheme (None) that trivially builds the index on a relation's attributes in the order they appear. The code generation time (gen) and the code compilation time (compile) for all four indexing schemes, Auto, None, Single, and Maximal, on both use cases are shown in Figure 4. We observe that the code generation times for the four indexing schemes are almost the same. Recall that index selection occurs during code generation; thus, the different index selection methods have little impact on the code generation time. This is because index selection takes a negligible portion of the code generation time, e.g., less than 1% for Auto; the time of index selection by MinIndex is shown in the second column of the table in Figure 4. Note that we did not include in the measurement for Single the extra preliminary activities that collect statistics such as search frequencies.
On the other hand, different index choices may lead to different work in the code compilation phase too, and the more indexes whose construction needs to be compiled, the longer the compilation time. The main reason is that each index requires additional templatized comparator functions that the C++ compiler needs to unroll at template instantiation time. Thus, in this phase, None and Single are slightly faster than Auto and Maximal, as shown in Figure 4 and Figure 5. Nevertheless, the differences are not significant.
Overall, the time for code generation, which also performs index selection, is negligible compared with the code compilation time, and our Auto indexing scheme does not substantially slow down code generation or compilation. It is also worth mentioning that the binary code is independent of the dataset: once generated, it can be run on any input dataset (i.e., factset).
Distribution of Index Reduction. We analyze the number of indexes constructed for the various Datalog programs. Recall that, given a set S of searches, we compute the smallest set L of indexes to cover (and thus speed up) all searches of S, while the Maximal indexing scheme constructs one index for each search in S, and Single constructs one index for each relation. Thus, the reduction ratio for the number of indexes of Auto over Maximal is upper bounded by |S| for a relation (which is the reduction ratio of Single over Maximal). The distributions of |S| among all relations that have at least two searches for the three cloud security analyses, sec1, sec2, and sec3, are shown as blue squares, measured against the left-hand scale, in Figures 6b, 6c, and 6d, respectively. We can see that more than 50 percent of the relations have only two searches, and more than 80 percent of the relations have at most three searches; this means that for 80 percent of the relations, |L|/|S| is at least 1/3. To quantify the reduction of Auto over Maximal, we define the reduction ratio as 1 − |L|/|S|. The distributions of the reduction ratio are shown as a black line in Figures 6b, 6c, and 6d, measured against the right-hand scale. We can see that for 25 percent of the relations there is no reduction (i.e., |S| = |L|), for another 25 percent of the relations the reduction is around 50%, and for the remaining 50 percent of the relations the reduction is between 50% and 70%. Finally, the distributions of the actual number |L| of indexes constructed for the relations are shown as red diamonds in Figures 6b, 6c, and 6d, measured against the left-hand scale. For cloud security analyses sec1 and sec3, the largest number of indexes constructed for a relation is only two, while the largest number of searches on a relation is 9. For cloud security analysis sec2, the largest number of indexes constructed for a relation is three, while the largest number of searches on a relation is 11.
As shown in Figure 6a, similar results are also observed for the DOOP program analysis.

Evaluation-time Performance
To justify our choice of a linear-order-based index (i.e., a B-tree index), we also implemented the hash-based indexing scheme Hash introduced above. Specifically, Hash uses C++'s STL hash map to index relations; as a hash map cannot be shared between two different searches, Hash builds one hash map for each distinct search, in the same way as Maximal.
Running Time. The execution-phase times for the four indexing schemes, Auto, Maximal, Hash, and Single, running the three cloud security analyses on the networks are shown in Figure 7. The time is divided into loading time, which loads the factset/EDB from disk into main memory and builds the indexes for the EDB, and executing time, which repeatedly computes and adds rows to the IDB index structures. As some searches cannot exploit an index under Single, it takes an excessively long time (i.e., more than 24 hours, denoted by TLE) on each of the three large networks N3500, N3511, and N9087. Thus, Single is not suitable for processing large-scale Datalog programs. The running time of Auto is almost the same as that of Maximal, and in fact Auto is often slightly faster. This is because execution involves both constructing the indexes for the IDB as facts are computed on the fly and performing the primitive searches; the latter should in principle cost the same for Auto and Maximal, but Maximal constructs more indexes than Auto. Hash is often much slower than Auto and Maximal, due to the practical inefficiency of hash maps as index structures here. The ratios of the running times of Maximal, Hash, and Single with respect to Auto for sec1 are shown in Figure 8. The execution-phase running times of Auto, Maximal, Hash, and Single on the three DOOP program analyses are illustrated in Figure 11. The general trend is similar to that for the cloud security analyses. Although Single can complete all the DOOP program analyses, it takes significantly more time than Auto and Maximal.
Overall, we find that Auto runs even a bit faster than Maximal, and significantly faster than Single and Hash. This validates our decision to construct enough indexes so that every search is sped up, and to use linear-order-based index structures.
Memory Usage. We evaluate the memory usage of Auto compared to Single, Maximal, and Hash. We define the memory usage improvement of an indexing scheme A over another scheme B as the ratio of the memory usage of B to that of A.
The memory usages of Auto, Single, Maximal, and Hash for the cloud security analyses sec1, sec2, and sec3 are shown in Figure 9. We see that Single always consumes the least memory, and Hash always consumes the most. The memory usage improvement of Auto over Maximal can be up to 6x, e.g., see Figure 10. The memory usage penalty of Auto compared to Single is at most a factor of two. Figure 12 shows the memory usage of Auto, Single, Maximal, and Hash for the DOOP program analyses, where the memory usage improvement of Auto over Maximal is around 2x, and Auto consumes only around 20% more memory than Single.
Overall, the memory usage of Auto is close to that of Single, and much better than that of Maximal and Hash.

Against PA-Datalog
We also measured the heavily hand-optimized PA-Datalog system on the DOOP program analyses, on the same hardware. The results are shown in Figures 11 and 12. We see that Auto is faster and consumes less memory than PA-Datalog: the running time improvement ranges from 3x to 5x, and the memory usage improvement from 2x to 5x. This demonstrates that our Auto indexing scheme works well enough, automatically attaining performance that previously required extensive hand optimization.

Hashing
As mentioned, the SOUFFLÉ engine uses a linear-order B-tree index structure for each relation. In light of the value of hash techniques in SQL database engines [5], we explain why those ideas are not used in SOUFFLÉ. Note first that the traditional hash-join, where each input is partitioned by a hash on the join attribute and the output is produced in hash buckets, is not suitable for the multi-way joins involved in many Datalog rules, as we cannot afford the space for materializing the whole intermediate relation (which becomes the input to the next stage of the join, and thus needs to be hashed itself). So we are left with a nested loop join, where each relation is stored as a hash index. This is what we measured above as Hash, and found it not competitive. We reported experiments with the STL hash map, and we also explored other hash implementations such as Google's sparse hash map, but none were notably better in both memory consumption and runtime at the same time. For example, Google's sparse hash-set implementation does not permit multi-sets, which are essential for storing different tuples with the same key in the index. Independent of this issue, micro-benchmarks for Google's sparse hash map indicate that memory consumption is reduced by at most a factor of two, while the runtime of query execution increases substantially; their dense version has the reverse effect. Hence, the STL unordered multiset is a good compromise in terms of runtime and memory consumption. It is also the case that hash implementations typically do not parallelize well.

Summary.
Overall, the experimental results demonstrate the value of our Auto indexing scheme for large-scale Datalog computation. During analysis, there is little extra work; during execution, Auto runs with speed similar to (or even faster than) Maximal, while using only slightly more memory than Single. As Single is often too slow and Maximal uses much memory, Auto is a good approach for processing large Datalog programs such as those from program analysis, without the effort of hand optimization.

EXTENSIONS
Single Inequality. Although we limited the search predicates in our primitive searches to equalities between left-hand-side attributes and right-hand-side constants, our techniques can be extended to an inequality constraint on one attribute. First, the bounds of the lexicographic search predicate are adapted for the attribute of the inequality. Second, that attribute has to be the last one among the attributes of the search with respect to the lexicographical order. The ordering restriction is encoded in the bipartite graph G = (U, V, E) by omitting edges from the standard construction. Specifically, there is an edge between S ∈ U and S′ ∈ V if (1) S is a proper subset of S′, (2) S has no inequality, and (3) if S′ has an inequality on attribute x, then S does not contain x. However, if there are multiple inequalities in a search, other techniques are needed.
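Under the assumption that each search is represented as a pair of its attribute set and its optional inequality attribute (a representation of our own choosing, not SOUFFLÉ's), the restricted edge rule reads:

```python
def edge_allowed(s, t):
    # s, t: pairs (attribute set, inequality attribute or None).
    # Edge s -> t is kept iff: (1) attrs(s) is a proper subset of attrs(t),
    # (2) s has no inequality, and (3) if t has an inequality on x,
    # then x does not occur in s (so x can still be placed last in the index).
    (s_attrs, s_ineq), (t_attrs, t_ineq) = s, t
    return (s_attrs < t_attrs and s_ineq is None
            and (t_ineq is None or t_ineq not in s_attrs))

# {x} may precede {x, y} with an inequality on y (y stays last in the index) ...
print(edge_allowed(({"x"}, None), ({"x", "y"}, "y")))   # True
# ... but not {x, y} with an inequality on x, since x already occurs in {x}.
print(edge_allowed(({"x"}, None), ({"x", "y"}, "x")))   # False
# A search carrying an inequality cannot be the smaller endpoint of an edge.
print(edge_allowed(({"x"}, "x"), ({"x", "y"}, None)))   # False
```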
Loop Scheduling. Some Datalog engines, such as Logicblox version 4 [24], use a leapfrog join that, while requiring users to specify indexes manually, relieves users of specifying a join order. Integrating our technique into such an engine is not obvious, as we assume a fixed literal order before our technique is applied. Typically, this order can be identified manually using a profiler, or automatically using heuristic techniques [34]. During performance tuning of large Datalog programs, we have observed that only a few rules require manual loop scheduling. Therefore, our preference is to fix loop orders rather than indexes, for a better user experience.
Nevertheless, it will be an interesting future work to integrate automatic loop scheduling and automatic indexing selection.

CONCLUSION
We presented an automatic indexing scheme for large-scale Datalog applications, which typically consist of hundreds of Datalog rules and millions of relation tuples. Such use cases could not previously be computed by state-of-the-art Datalog engines without considerable hand-crafted optimization. We formally defined the minimum index selection problem, aiming for a low memory footprint while still allowing every search to be sped up, and we proposed an algorithm that solves this problem optimally in polynomial time. Our technique has been implemented in the SOUFFLÉ Datalog engine and measured to deliver high speed and low memory usage. Our automatic indexing scheme releases end-users from the daunting obligation to carefully annotate Datalog programs, and it delivers performance comparable to, and even better than, what they have accepted from making such efforts.