Diophantine inferences from statistical aggregates on few-valued attributes

Research on protection of statistical databases from revelation of private or sensitive information [Denning, 1982, ch. 6] has rarely examined situations where domain-dependent structure exists for a data attribute such that only a very few independent variables can characterize it. Such circumstances can lead to Diophantine (that is, integer-solution) equations whose solution can lead to surprising or compromising inferences on quite large data populations. In many cases the Diophantine equations are linear, allowing efficient algorithmic solution. Probabilistic models can also be used to rank solutions by reasonability, further pruning the search space. Unfortunately, it is difficult to protect against this form of data compromise, and all countermeasures have disadvantages.

Consider a relation with attributes (name, age, salary) supporting statistical queries of the form "give me the sum of salaries of all individuals whose age x satisfies condition C(x)," where C is an arbitrary predicate on the domain of age, such as 30 ≤ x ≤ 40. Assume further that the projection (name, age) is publicly available, but the attribute salary is confidential. What measures suffice to protect the confidentiality of the salary information?
This is the classical statistical database security problem, studied extensively since the 1970s; see [1] for a survey. The main approaches to this problem are to perturb the data so as to maintain their statistical characteristics but prevent their compromise [13,11,16], to perturb the responses for the same purpose [2,8], to restrict the size or overlap of the statistical queries [10,9], or, finally (and closer to our concerns here), to audit the statistical queries in order to determine when enough information has been given out so that compromise becomes possible [3,4,5,12].
Most of the work in this area assumes that the confidential data are real-valued and essentially unbounded. In certain important applications, however, data may attain discrete values, or have maximum or minimum values that are fixed a priori and frequently attainable. In these cases, traditional methods for maintaining security are inadequate. For example, if a statistical query only samples minimum values (e.g., if it so happens that all individuals whose age satisfies C(x) are paid the minimum legal salary), then individual values are obviously compromised. Discreteness of values has even more subtle effects. Boolean attributes, of course, combine the problems of discrete and bounded variables; for example, consider a relation (name, age, hivpos), where the last attribute has values restricted to 0 or 1. Sum queries are again allowed.
The mathematical roots of the problem lie in the fact that linear Diophantine equations are more restrictive, and have greater complexity, than linear equations. For example, a system of linear equations can have a unique solution x = y = z = 1/2, but no 0-1 (or integer) solution. Similarly, a system can be secure if the variables are real (because in this case it has a one-dimensional continuum of solutions), but not if they are Boolean, because in the latter case the values of all variables are determined. Evidently, Boolean attributes make the auditing problem much more tricky. This added complexity of integer variables had been identified in the literature [14], albeit with no analytical exploration of the issue.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. POD 2000, Dallas, TX USA © ACM 2000 1-58113-218-x/00/05 . . .$5.00
The present work. In this paper we explore the novel mathematical and algorithmic problems arising when one tries to audit statistical queries on Boolean attributes. (We also study a "dual" situation, in which the data is continuous but the query discrete; see below.) We consider a setting in which we have a collection of (secret) Boolean variables, and the results of some statistical queries to this set. Each such query simply specifies a subset S of the variables; the value returned in response to this query is the sum of the values of all variables in S. We want to decide whether the value of any of the Boolean variables is determined by the results of these queries. In other words, the collection of responses to the queries defines a system of equations as above, and we want to know whether there is any variable x i so that x i has the same value in every solution to this system of equations. One can view the value of this variable as having been compromised by the results of the queries. We call this the auditing problem. A natural variant of this problem is to place a stronger auditing requirement on a set of queries: for some number k ≥ 1, there is no set T consisting of at most k of the variables for which the sum of the values has been determined. Our basic auditing problem is then simply the case k = 1.
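The definition of the auditing problem can be made concrete with a minimal brute-force sketch. The function name `determined_variables` and the 0-based indexing are assumptions made for illustration; the enumeration is exponential in the number of variables and is meant only for tiny instances.

```python
from itertools import product


def determined_variables(n, queries, answers):
    """Return {index: value} for variables fixed in every 0-1 solution.

    n       -- number of Boolean variables, indexed 0..n-1
    queries -- list of index sets S, one per statistical query
    answers -- list of reported sums, one per query
    Raises ValueError if no 0-1 assignment matches the answers.
    """
    # Enumerate every 0-1 assignment and keep those consistent with all sums.
    solutions = [
        x for x in product((0, 1), repeat=n)
        if all(sum(x[i] for i in S) == b for S, b in zip(queries, answers))
    ]
    if not solutions:
        raise ValueError("no 0-1 assignment matches the answers")
    first = solutions[0]
    # A variable is compromised when it agrees across all solutions.
    return {
        i: first[i]
        for i in range(n)
        if all(sol[i] == first[i] for sol in solutions)
    }


# x0 + x1 = 1 and x1 + x2 = 1 admit (1,0,1) and (0,1,0): nothing is fixed.
print(determined_variables(3, [{0, 1}, {1, 2}], [1, 1]))
# x0 + x1 = 2 forces x0 = x1 = 1, and then x1 + x2 = 1 forces x2 = 0.
print(determined_variables(3, [{0, 1}, {1, 2}], [2, 1]))
```

A set of queries passes the audit exactly when the returned dictionary is empty.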
Remark: In more traditional work on auditing, there is a more subtle and generic variant of the auditing problem, in which one asks not whether the given set of queries compromises security for the present values of the variables, but for any values of the variables. In the case of Boolean-valued variables -or, indeed, variables over any bounded domain -this kind of generic auditing is impossible: For any set S of Boolean variables, there is a set of values in which all these variables are 1. Thus, if this query S is asked to that database, an attacker would compromise all the values in S. (This is not unlike the minimum wage example above). Therefore, an auditing system for Boolean attributes should, with small probability, refuse to answer any query submitted.
Our first result on the auditing problem for Boolean values is that it is coNP-complete. It follows from our proof that it is even NP-hard to distinguish between a case in which no variable is determined, and a case in which all variables are. The generalized problem in which no sum of up to k variables is to be compromised is also NP-hard, and likewise for the variant in which we ask whether a specific variable x_p is compromised.
It is natural to ask whether these hardness results hold only for "pathological" sets of queries. If we consider a collection of individuals specified by tuples of attribute values (all of the attributes public with the exception of one or more secret Boolean attributes), then selecting subsets of individuals via range queries on their public attributes is a well-defined and natural class of "reasonable" sets of queries. In other words, we are interested in instances of the auditing problem in which the variables correspond to points in R^d, and the query sets are the intersections of these points with d-dimensional boxes. We will refer to such instances as the special case of d-dimensional range queries.
Our next result is a simple, polynomial-time combinatorial algorithm for auditing one-dimensional range queries, using techniques from combinatorial optimization [15]. We also show the auditing problem is coNP-complete even for two-dimensional range queries (and hence for any d ≥ 2).
For the general Boolean case, we also describe a simple and efficient method that approximates the auditing problem, in that it successfully preserves the security of individual values -even though it may refuse to answer queries that could be answered without compromising security (by the remark above, this last point is inherent to the problem). Our technique is akin to the partitioning approach to data security [6].

MAX Queries.
We also consider a "dual" variant, in which the data is continuous, but the aggregate function is combinatorial in nature. In this variant the sensitive data is real-valued, and the aggregate function is max rather than summation. That is, we are given a set of real-valued variables, and each query returns the maximum value over a designated subset of the variables. Again, we ask: Is the value of any variable determined by the responses to these queries? We provide a simple and efficiently implementable characterization of the auditing condition in this case.
Recall the generic auditing condition discussed above -given a collection of query sets, does there exist a set of values for which some variable would be determined? In contrast to the Boolean case, this question becomes non-trivial in the case of max queries over real-valued data, and raises issues of a technically distinct flavor from the main auditing problem we study. We provide a characterization of query sets that are secure, in this generic sense, when the aggregate function is max.

Complexity
Define the Boolean auditing problem to be the following: given a family of query sets over a collection of Boolean variables, together with the sum returned for each set, decide whether some variable takes the same value in every 0-1 solution of the resulting system of equations.

Theorem 2.1 The Boolean auditing problem is coNP-complete.

Proof: It is well-known that determining whether a system of linear equations has a 0-1 solution is NP-hard even if all coefficients are 1, the right-hand side of each equation is 1, and there are at most three variables per equation. We start from this problem. Given such a system of equations, we first replace each variable x by the expression x_1 + x_2 + x_3 − 1, and add the equations x_1 + x_2 + x_3 + x_4 + x_5 = 2, x 1 +x 1 = 1, The meaning of these equations is that x_1 + x_2 + x_3 is either 1 or 2, and thus x_1 + x_2 + x_3 − 1 is either 0 or 1, and therefore the latter expression can safely replace the Boolean variable x, but the new variables are never determined as there are always several ways to achieve the same value. Once these replacements have been made, the right-hand side of each equation is an integer no larger than 4. We now introduce 4 new variables a, b, c, d bound to be equal by the equations a + a' = 1, a' + b = 1, b + b' = 1, b' + c = 1, c + c' = 1, c' + d = 1. We finally add to the left-hand side of each equation (except for these last six involving a, b, c, d) a number of the a, b, c, d variables equal to the right-hand side of the equation. This completes the construction.
Notice now that the system always has a 0-1 solution, one obtained by setting a = b = c = d = 1 and all other variables 0. If this is the only solution, then the system is insecure, because the values of all variables are determined. It is easy to see that the only way for another solution to exist is for the original system to have a solution; in that case, it is not hard to prove that no variable is determined.
Notice, incidentally, that this proof also establishes that it is coNP-hard to distinguish between the case in which all variables are determined and the case in which none is; as a consequence, telling whether a specific variable is determined is also coNP-complete.
Let us call a family of finite sets d-dimensional if the elements can be identified with points in R^d so that the minimum bounding box of each set in the family contains no other element besides those in the set. An instance of the Boolean auditing problem is d-dimensional if the family of sets in it is. For example, any instance resulting from the (name, age, hivpos) example described in the introduction, with conditions on the age of the form C(x) = ℓ ≤ x ≤ u, is one-dimensional. The following result shows that the auditing problem remains intractable even in its 2-dimensional special case.

Theorem 2.2 The Boolean auditing problem is coNP-complete even if the system is restricted to be 2-dimensional.
It is clear that d-dimensional queries, with d > 2, can be no easier.

Proof:
We reduce the general case to the two-dimensional one as follows: Let S be a family of sets defining an instance of the Boolean auditing problem. Arrange the sets in S in some order, and for each consider the occurrences of each variable in it, also in some arbitrary order. Each one of these occurrences will be a separate variable in the new instance, with the i-th occurrence of x denoted x_i.
Assign now to these new variables a point in the 2-dimensional plane, by assigning to the k-th such variable the point (k, k), for k = 1, ..., N, where N is the sum of |S_i| over all S_i ∈ S. Notice that, this way, the equations of the original system indeed involve a set of points whose minimum bounding rectangle contains no other point.
All we need now is to enforce the additional constraints stating that all new variables corresponding to the same variable in the original problem take the same value. We achieve this as follows: Suppose that (k, k) and (ℓ, ℓ) are two points corresponding to two consecutive occurrences of the same variable, say x_i and x_{i+1}, respectively. We then introduce a new variable y_i, with point (k, ℓ), and equations x_i + y_i = 1, x_{i+1} + y_i = 1. Obviously, these two equations force x_i and x_{i+1}, and by extension all occurrences of x, to have the same value. Furthermore, the minimum bounding rectangles involving these two equations are the line segments [(k, k), (k, ℓ)] and [(ℓ, ℓ), (k, ℓ)], which indeed contain no other points corresponding to variables besides the two endpoints.
It follows that the resulting system is 2-dimensional, and equivalent, vis-à-vis auditing, to the original one.
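The embedding used in this proof can be sketched in a few lines. The names `embed_in_plane` and `box_is_clean`, and the tuple labels given to the new variables, are illustrative assumptions; the sketch also assumes no variable is repeated inside a single set. Each "link" equation stands for a constraint of the form x + y = 1.

```python
def embed_in_plane(sets):
    """Place one point per variable occurrence, plus one link point per
    pair of consecutive occurrences of the same original variable.

    sets -- list of lists of hashable variable names
    Returns (points, equations): points maps a new variable to (x, y);
    equations is a list of (new_variable_names, kind), kind being
    "query" (copy of an original set) or "link" (an x + y = 1 gadget).
    """
    points, equations = {}, []
    last_seen = {}  # original variable -> (occurrence index, new name)
    k = 0
    for members in sets:
        copy = []
        for v in members:
            k += 1
            name = ("x", v, k)
            points[name] = (k, k)          # k-th occurrence sits on the diagonal
            copy.append(name)
            if v in last_seen:
                prev_k, prev_name = last_seen[v]
                y = ("y", v, prev_k)
                points[y] = (prev_k, k)    # link point at (k_prev, k_now)
                equations.append(([prev_name, y], "link"))
                equations.append(([name, y], "link"))
            last_seen[v] = (k, name)
        equations.append((copy, "query"))
    return points, equations


def box_is_clean(points, names):
    """True if the bounding box of `names` contains no other point."""
    xs = [points[n][0] for n in names]
    ys = [points[n][1] for n in names]
    inside = {
        n for n, (px, py) in points.items()
        if min(xs) <= px <= max(xs) and min(ys) <= py <= max(ys)
    }
    return inside == set(names)


points, equations = embed_in_plane([["a", "b", "c"], ["b", "c", "d"], ["a", "d"]])
print(all(box_is_clean(points, names) for names, _ in equations))
```

Checking every bounding box, as in the last line, is a direct way to confirm that an embedded instance is 2-dimensional in the sense defined above.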

The One-Dimensional Case
We can, however, prove the following: The Boolean auditing problem for one-dimensional queries can be solved in polynomial time.

Proof:
We have variables x_1, ..., x_n corresponding to points arranged in this order on a line, while the sets in S correspond to intervals of the same line. Consider the characteristic vector a_i of the set S_i ∈ S, let A denote the matrix whose rows are equal to the vectors a_1, ..., a_m, and let b denote the vector (b_1, ..., b_m). Note that A has the consecutive-ones property, in that the ones in each row are all consecutive. We let P denote the polytope {x ∈ R^n : Ax = b, 0 ≤ x ≤ 1}. It is well-known [15] that matrices with the consecutive-ones property are totally unimodular, that is, all their square submatrices have determinants +1, −1, or 0. Consequently, each vertex of the polytope P has 0-1 coordinates. Now, suppose we let P_{i,c}, for 1 ≤ i ≤ n and c ∈ {0, 1}, denote the polytope obtained by intersecting P with the hyperplane x_i = c. Each polytope P_{i,c} also has the property that all its vertices have integer coordinates. Thus, the value of the variable x_i is determined by the results of the queries if and only if exactly one of the two polytopes P_{i,0} and P_{i,1} is non-empty; indeed, each such polytope that is non-empty will have at least one vertex, and this vertex will constitute a set of Boolean values consistent with the results of all queries. Thus our problem reduces to determining integer solutions to the system of equations and inequalities Ax = b, 0 ≤ x ≤ 1, where A has the consecutive-ones property; and the arguments above show that this can be solved by any polynomial-time algorithm for linear programming (see e.g. [15]).
However, there is a much more direct and efficient combinatorial algorithm to determine the solvability of this system of equations and inequalities in integers. First, we define a directed graph G as follows.
• The nodes of G are 0, 1, ..., n. For each i = 1, ..., n, we add arcs a_i = (i − 1, i) and a_i' = (i, i − 1).
• Also, suppose the block of 1's in the j-th row of A runs from column p to column q > p; then we add arcs e_j = (p − 1, q) and e_j' = (q, p − 1).
• We assign cost 1 to each arc a_i, cost 0 to each arc a_i', cost b_j to arc e_j, and cost −b_j to arc e_j'.
Now we claim that our initial system is solvable in integers if and only if the graph G has no negative-cost cycle; this latter condition can be tested in polynomial time via well-known combinatorial algorithms [7]. First, suppose the system is feasible, and let (x_1, ..., x_n) be a 0-1-valued vector that satisfies Ax = b. We define s_0 = 0 and s_i = x_1 + ... + x_i for i = 1, 2, ..., n. Now observe that for every arc (u, v) of cost c(u, v) we have s_v − s_u ≤ c(u, v). Summing this inequality around any cycle shows that the cycle has non-negative cost; the numbers {s_i} thus provide a certificate that G has no negative cycle.
Conversely, suppose that G has no negative cycle. Then we can compute a well-defined shortest path length s_i from node 0 to each node i. Now, for i = 1, 2, ..., n, define x_i = s_i − s_{i−1}. We claim that the vector x = (x_1, ..., x_n) satisfies the system Ax = b, 0 ≤ x ≤ 1. First, observe that each x_i is an integer; and the arcs a_i and a_i' imply that 0 ≤ x_i ≤ 1. Second, the existence of the arcs e_j and e_j' implies that s_q − s_{p−1} = b_j, so that the j-th equation of Ax = b holds.
Example: The following simple example illustrates the algorithm. Suppose the original matrix A had only one row, consisting of n 1's. Thus the vector b consists of a single number. We know the original system Ax = b, 0 ≤ x ≤ 1 to be feasible if and only if 0 ≤ b ≤ n, and we want the construction above to capture this. The graph we construct in this case is a bi-directed cycle; it consists of n bi-directed edges with cost 1 clockwise and 0 counter-clockwise, and a single bi-directed edge of cost −b clockwise and b counter-clockwise. Indeed, there is now a negative cycle in G if and only if b < 0 or b > n.
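The negative-cycle test can be sketched with a plain Bellman-Ford pass. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, the arcs a_i = (i − 1, i) of cost 1 and a_i' = (i, i − 1) of cost 0 are taken as the encoding of 0 ≤ x_i ≤ 1, and single-point intervals (p = q) are allowed as an extra assumption so that x_i = c can be imposed when auditing.

```python
def interval_system_feasible(n, intervals):
    """Decide whether sum(x_p..x_q) = b for all (p, q, b) has a 0-1 solution.

    Variables are x_1..x_n; each interval is (p, q, b) with 1 <= p <= q <= n.
    Feasible iff the graph on nodes 0..n has no negative-cost cycle.
    """
    arcs = []
    for i in range(1, n + 1):
        arcs.append((i - 1, i, 1))   # a_i : s_i - s_{i-1} <= 1
        arcs.append((i, i - 1, 0))   # a_i': s_{i-1} - s_i <= 0
    for p, q, b in intervals:
        arcs.append((p - 1, q, b))   # e_j : s_q - s_{p-1} <= b_j
        arcs.append((q, p - 1, -b))  # e_j': s_{p-1} - s_q <= -b_j
    dist = [0] + [float("inf")] * n
    for _ in range(n):               # n + 1 nodes, so n relaxation rounds
        for u, v, c in arcs:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
    # If any arc can still be relaxed, a negative cycle exists.
    return all(dist[u] + c >= dist[v] for u, v, c in arcs)


def determined_by_intervals(n, intervals):
    """Return {i: value} for each x_i forced by the interval answers."""
    intervals = list(intervals)
    if not interval_system_feasible(n, intervals):
        raise ValueError("answers are inconsistent")
    fixed = {}
    for i in range(1, n + 1):
        # x_i is determined iff exactly one of x_i = 0, x_i = 1 is feasible.
        can_be = [interval_system_feasible(n, intervals + [(i, i, c)])
                  for c in (0, 1)]
        if can_be[0] != can_be[1]:
            fixed[i] = 0 if can_be[0] else 1
    return fixed


print(determined_by_intervals(3, [(1, 3, 2), (1, 2, 2)]))
```

Each audit runs 2n + 1 feasibility checks, so the whole procedure stays polynomial; a faster shortest-path routine could be substituted without changing the logic.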

Approximate Auditing
We have seen that it can be computationally infeasible to determine the safety of a collection of arbitrary statistical queries put to a database containing secret Boolean variables. Given arbitrary query sets arriving incrementally over time, then, how should we proceed? How long is it safe to continue providing answers?
A promising approach is to consider relaxed versions of our basic safety predicate. Rather than deciding precisely whether any variable's value has been determined, we can compute a conservative approximation to this predicate: For any collection of query sets, we only answer a query when it is safe to do so, but we may refuse to answer a query even when the answer would not in fact compromise safety.
We now describe a particular conservative approximation which can be implemented very efficiently. Given a collection of Boolean variables x_1, ..., x_n, and a sequence of query sets S_1, ..., S_m, we define the trace τ(x_i) of x_i to be the set {p : x_i ∈ S_p} ⊆ {1, 2, ..., m}. In terms of traces, one can express a relaxed safety condition as follows.

Theorem 4.1 If for every variable x_i there is a variable x_j with the same trace as x_i but the opposite value, then no variable is determined.

To implement the safety test implied by the Theorem, we maintain a bipartite graph G whose vertex set is the collection of variables. We join variables x_i and x_j by an edge if they have opposite values but the same trace. Theorem 4.1 implies that as long as this graph has no isolated nodes, no variable has been determined. Before any query sets have been presented, we have the complete bipartite graph; with each query set S_p, we delete the edge (x_i, x_j) if exactly one of x_i and x_j belongs to S_p.
Note the connections between this methodology and the partitioning approach to maintaining security [6]. Under the method we discuss here, one is essentially monitoring an adaptive partition on the variables that is successively refined by each query response; as long as the atomic sub-populations in this partition maintain a particular property, one can continue responding to queries.
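One way to realize the trace-based test is to track the partition of variables into equal-trace blocks rather than the graph itself: the graph has no isolated node exactly when every block contains both a 0 and a 1. The class below is a minimal sketch under that reading; the name `TraceAuditor`, the choice to return `None` on refusal, and the decision to leave refused queries out of the traces are all assumptions.

```python
class TraceAuditor:
    """Conservative auditor for sum queries over secret Boolean values."""

    def __init__(self, values):
        self.values = list(values)           # secret 0/1 values
        self.block = [0] * len(self.values)  # block id: same trace so far
        if not self._safe(self.block):
            raise ValueError("need at least one 0 and one 1 to start")

    def _safe(self, block):
        # Safe when every equal-trace block holds both a 0 and a 1,
        # i.e. every variable has an opposite-valued same-trace partner.
        seen = {}
        for b, v in zip(block, self.values):
            seen.setdefault(b, set()).add(v)
        return all(vals == {0, 1} for vals in seen.values())

    def query(self, S):
        """Return the sum over index set S, or None if the query is refused."""
        S = set(S)
        # Split every block into its part inside S and its part outside S.
        relabel, refined = {}, []
        for i, b in enumerate(self.block):
            key = (b, i in S)
            if key not in relabel:
                relabel[key] = len(relabel)
            refined.append(relabel[key])
        if not self._safe(refined):
            return None                      # refuse; traces stay unchanged
        self.block = refined
        return sum(self.values[i] for i in S)


auditor = TraceAuditor([1, 0, 1, 0])
print(auditor.query({0, 1}))  # blocks {0,1} and {2,3} both stay mixed
print(auditor.query({0, 2}))  # would isolate every variable, so refused
```

Each query costs time linear in the number of variables, which is the efficiency the approximation trades against occasionally refusing a harmless query.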
It would be interesting to consider generalizing our approach to a hierarchy of successively stronger approximations to the true safety predicate; we leave this as a direction for future work.

Auditing Max Queries
We now turn to the problem of auditing max queries over real-valued data.
We have a set of variables with labels U = {1, . . ., n}; and each variable i ∈ U has a value y i ∈ R. We are also given a collection of sets S 1 , . . . , S m ⊆ U ; with each S p , we are given the maximum value over the variables in S p . We will denote this maximum by f(S p ).
The first problem we address is that of auditing: is the value of any of the variables in U determined by the information {(S p , f(S p ))}? We provide an efficient algorithm to compute the set of all variables i ∈ U that are determined.
For a variable i, let µ_i = min{f(S_p) : i ∈ S_p}. We will say that S_p is i-extreme if i ∈ S_p and µ_i = f(S_p). Note that for every variable i, there is at least one i-extreme set. Conversely, for every set S_p, there is an i for which S_p is i-extreme: specifically, consider a variable i ∈ S_p for which y_i is maximum. Now we claim

Theorem 5.1 The value of variable i is determined if and only if there is a set S_p that is i-extreme but not ℓ-extreme for any ℓ ≠ i.

Proof: For every variable ℓ and every set S_q containing ℓ, we have y_ℓ ≤ µ_ℓ ≤ f(S_q). Moreover, if S_q is not ℓ-extreme, then the last inequality is strict. Thus, if S_p is a set that is i-extreme but not ℓ-extreme for any ℓ ≠ i, we have y_ℓ ≤ µ_ℓ < f(S_p) for every ℓ ∈ S_p \ {i}. It follows that y_i = f(S_p), and hence the value of variable i is determined.
We now consider the converse direction. To begin with, note that the setting y_ℓ = µ_ℓ for each ℓ is consistent with all the values {f(S_p)}. Indeed, µ_ℓ ≤ f(S_p) holds for every ℓ ∈ S_p; and the variable ℓ for which y_ℓ attains the maximum value in S_p has µ_ℓ = f(S_p). Now, suppose variable i has the property that for every i-extreme set S_p, there is a variable ℓ(p) ≠ i for which S_p is also ℓ(p)-extreme. Suppose we take the setting of variables {y_ℓ = µ_ℓ} and then arbitrarily decrease the value of the variable i. We claim that all the values {f(S_p)} remain the same. Certainly, no set S_p with f(S_p) > µ_i or f(S_p) < µ_i will change in value; and for any set S_p with f(S_p) = µ_i, we have ℓ(p) ∈ S_p with µ_{ℓ(p)} = µ_i, whence f(S_p) will retain the value µ_i.
Notice that, by the above result, there is a polynomial algorithm for auditing max queries.
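The extreme-set criterion translates into a short procedure. The sketch below takes µ_i to be the smallest reported maximum among the sets containing i, and reports variable i as determined when some set is i-extreme and extreme for no other variable. The function name `audit_max_queries` and the 0-based indexing are assumptions, and the answers are assumed to come from some genuine assignment of values.

```python
def audit_max_queries(n, sets, answers):
    """Return {i: value} for variables pinned down by max-query answers.

    Variables are 0..n-1; sets[p] is a set of indices and answers[p] is
    the reported maximum f(S_p) over that set.
    """
    # mu[i]: the tightest upper bound on y_i implied by the answers.
    mu = [float("inf")] * n
    for S, f in zip(sets, answers):
        for i in S:
            mu[i] = min(mu[i], f)
    determined = {}
    for S, f in zip(sets, answers):
        # S is i-extreme when i is in S and mu[i] equals the answer for S.
        extreme = [i for i in S if mu[i] == f]
        if len(extreme) == 1:
            # Every other member is strictly below f, so this one equals f.
            determined[extreme[0]] = f
    return determined


# y0, y1 <= 3 from the second query, so the maximum 5 must come from y2.
print(audit_max_queries(3, [{0, 1, 2}, {0, 1}], [5, 3]))
```

The running time is linear in the total size of the query sets, consistent with the remark that max queries can be audited in polynomial time.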
Generic safety. We now consider the generic notion of auditing discussed in the introduction. Given a collection of sets S = {S 1 , . . . , S m } as before, we will call this collection of sets safe if for every setting y 1 , . . . , y n of the variables in U , there is no y i whose value is determined by the results of queries to S.
We now provide a characterization of safe set systems. First, consider the following property of a set system S = {S_1, ..., S_m}: (*) There exist S_p ∈ S, and S_{q_1}, ..., S_{q_t} ∈ S so that |S_p \ ∪_r S_{q_r}| = 1.
We claim

Theorem 5.2 A set system is safe if and only if it does not have property ( * ).
Proof: To prove the easier direction, suppose that the set system S has property (*), and choose S_p, S_{q_1}, ..., S_{q_t} ∈ S and j ∈ U so that S_p \ (∪_r S_{q_r}) = {j}. Suppose we define values for the variables so that y_j = 1 and y_i = 0 for all i ≠ j. Then f(S_p) = 1 and f(S_{q_r}) = 0 for r = 1, ..., t; this implies that for any (y_1, ..., y_n) ∈ R^n consistent with these values, y_i ≤ 0 for i ∈ S_p \ {j}, whence y_j = 1.
Conversely, suppose that S does not have property (*), and consider a setting y = (y_1, ..., y_n) ∈ R^n of the variables that determines the value of some variable j. We now define a setting y' = (y'_1, ..., y'_n) ∈ R^n that is consistent with the answers to all queries, but for which y'_j ≠ y_j; this will be a contradiction. Let W_j denote the union of all sets in S that do not contain j. First, we set y'_j to an arbitrary value strictly less than y_j. Now, consider each set S_p ∈ S on which the value of f has changed; these are precisely the sets in which y_j was the unique maximum. Since S does not have property (*), and S_p \ W_j is non-empty, it must have cardinality at least two, and hence contains an element i_p ≠ j. We choose such an i_p and define y'_{i_p} = y_j. Finally, for any element k which is not equal to i_p for any p in the above construction, we set y'_k = y_k.
We introduce the following additional piece of notation: for a set S_r, we use f(y|S_r) to denote the value f(S_r) under the setting y ∈ R^n, and we use f(y'|S_r) to denote the value f(S_r) under the setting y' ∈ R^n. We claim that f(y|S_r) = f(y'|S_r) for r = 1, 2, ..., m; this will conclude the proof. The crucial observation here is that if y'_i ≠ y_i for some i ≠ j, then j appears in every set in which i appears. For if y'_i ≠ y_i, it must be that i = i_p for some p in the above construction; since i ∈ S_p \ W_j, we conclude that i does not appear in any set which omits j. Note also that for such an i, y'_i > y_i, since y_j was the unique maximum in this set S_p. Now, suppose by way of contradiction that f(y|S_r) ≠ f(y'|S_r) for some r. It follows that j ∈ S_r, and hence f(y|S_r) ≥ y_j. If f(y|S_r) > y_j or f(y'|S_r) > y_j, then the maximum in S_r must be attained by an element whose value did not change, and so f(y|S_r) = f(y'|S_r). Thus, suppose f(y|S_r) = y_j > f(y'|S_r). But in this case, y_j must have been the unique maximum in S_r, and so there is an element i_r ∈ S_r \ {j} for which we set y'_{i_r} = y_j; this contradicts our supposition that f(y'|S_r) < y_j.
We note that property (*), and therefore safety, can be tested in polynomial time, as follows: For each variable j, we form the set W_j consisting of the union of all sets in S that do not contain j. Then, for each set S_p, and each j ∈ S_p, we determine whether S_p \ W_j = {j}.
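The polynomial-time test just described is short enough to write out directly. The function names `has_property_star` and `is_safe` and the 0-based variable labels are illustrative assumptions.

```python
def has_property_star(n, sets):
    """True if some S_p minus a union of other sets is a single element."""
    sets = [set(S) for S in sets]
    for j in range(n):
        # W_j: union of all sets that do not contain j.
        w_j = set().union(*(S for S in sets if j not in S))
        for S in sets:
            if j in S and S - w_j == {j}:
                return True
    return False


def is_safe(n, sets):
    """A set system is safe for max queries iff it lacks property (*)."""
    return not has_property_star(n, sets)


print(is_safe(3, [{0, 1, 2}, {1, 2}]))  # {0,1,2} \ {1,2} = {0}: not safe
print(is_safe(4, [{0, 1}, {2, 3}]))     # no set can be cut down to one element
```

Using W_j suffices because any sets subtracted from S_p to leave exactly {j} cannot contain j, so their union is contained in W_j.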