STACK AND DEAL: AN EFFICIENT ALGORITHM FOR PRIVACY PRESERVING DATA PUBLISHING

Although k-Anonymity is a good way to publish microdata for research purposes, it still suffers from various attacks. Hence, many refinements of k-Anonymity have been proposed, such as l-Diversity and t-Closeness, with t-Closeness being one of the strictest privacy models. Satisfying t-Closeness for a low value of t may yield equivalence classes with a large number of records, which results in greater information loss. For a high value of t, equivalence classes are still prone to homogeneity, skewness, and similarity attacks, because equivalence classes can be formed with few distinct sensitive attribute (SA) values and still satisfy the constraint t. In this paper, we introduce a new algorithm that overcomes the limitations of k-Anonymity and l-Diversity and yields equivalence classes of size k with greater diversity, in which the frequency of each SA value differs by at most one across all the equivalence classes.


INTRODUCTION
Various organizations such as government agencies and hospitals release microdata for medical research, trend analysis, and other purposes. Typically, microdata is stored in a table in which each row corresponds to an individual's record and each record consists of a number of attributes. These attributes can be categorized as follows. a) Explicit identifier attributes: attributes such as name and social security number that explicitly identify individuals. b) Quasi identifier (QI) attributes: attributes such as zip code, age, and sex that cannot uniquely identify individuals on their own, but whose combinations can give away the record holder. Sweeney [1] has shown that even though neither sex, date of birth, nor zip code uniquely identifies an individual, the combination of all three is sufficient to identify 87% of individuals in the United States. c) Sensitive attributes (SAs): attributes that contain sensitive information about individuals. d) Non-sensitive attributes: attributes that do not reveal any sensitive information about the record holder.
Privacy preserving data publishing (PPDP) means releasing microdata in such a way that the released data remains useful while the privacy of each individual in it is maintained. Prior to data release, the explicit identifier attributes are first removed, since they uniquely identify an individual. The records are then horizontally partitioned into groups called equivalence classes, and the quasi identifier attributes are generalized so that the quasi identifier values of all records within an equivalence class become identical, while the sensitive attributes are left unaltered.
Based on this approach, various privacy models have been proposed. For example, k-anonymity (Sweeney [1]) requires that each record in an equivalence class be indistinguishable from at least k-1 other records in terms of quasi identifier attribute values, so that each equivalence class has at least k records. l-diversity (Machanavajjhala et al. [2]) requires that each equivalence class contain at least l "well-represented" values of the sensitive attribute. To address the limitations of k-anonymity and l-diversity, Li et al. [3] introduced the concept of t-closeness [9], which requires that the distance between the distribution of the sensitive attribute in the entire table and its distribution in any equivalence class be no more than a threshold t.
The l-diversity and t-closeness privacy models are extensions of the k-anonymity model that address its limitations. Since these extensions have limitations of their own, this paper shows that the limitations can instead be addressed by an algorithm. With just one input parameter k, the algorithm outputs equivalence classes with a high degree of diversity among the sensitive attribute values, whose distribution is very close to the distribution of sensitive attribute values in the overall table. The algorithm can be implemented with simple data structures such as a queue or a stack.

Contributions and Organization
In this paper, we introduce an algorithm that produces equivalence classes whose sensitive attribute distribution is close to the sensitive attribute distribution in the overall table, and that overcomes the limitations of k-Anonymity and l-Diversity. The rest of the paper is organized as follows. In Section 2, we review some background concepts used throughout the paper. Section 3 describes our proposed method, which works in stages, and provides the algorithm for obtaining equivalence classes of size k with greater diversity, in which the frequency of each SA value differs by at most one across all the ECs. In Section 4, we analyse the algorithm and show, with experimental results, how it defends against homogeneity, skewness, and similarity attacks. Section 5 presents the conclusion and future work.

BACKGROUND
Consider raw data that needs to be published, as shown in Table 1. Explicit identifiers such as name and SSN are removed since they directly identify the record holder. Quasi identifiers such as zip code and age cannot uniquely identify individuals on their own, but combinations of these attributes can give away the record holder. Sweeney [1] has shown that even though neither sex, date of birth, nor zip code uniquely identifies an individual, the combination of all three is sufficient to identify 87% of individuals in the United States. An attribute such as disease, which is closely guarded by the record holder, is considered a sensitive attribute. The goal of PPDP is to protect the sensitive attribute of the record holder while still publishing enough information to maintain data utility. k-anonymity by Sweeney [1] is a well-known model for anonymizing data. Here the explicit identifiers of each record are removed, and the quasi identifiers along with the sensitive attribute are grouped. Each group is called an equivalence class, in which the quasi identifiers are generalized and the sensitive attribute is unaltered.
Definition 1: (Equivalence Class) An equivalence class is a set of anonymized records that have the same values for all quasi identifier attributes, i.e., all records in each equivalence class are indistinguishable in terms of their quasi identifier attributes.
Definition 2: (k-Anonymity) An equivalence class is said to satisfy k-anonymity if every record in it is indistinguishable from at least k-1 other records with respect to the quasi identifier attributes. A table is said to satisfy k-anonymity if every equivalence class of the table satisfies k-anonymity.
In other words, it is like hiding something in a crowd: an individual item is difficult to identify because almost everything looks alike when the entire crowd is seen. Table 2 gives a 3-anonymous version of the raw table. The data is divided into three equivalence classes of three records each, whose quasi identifiers (zip code and age) are generalized and whose sensitive attribute (disease) is unaltered.
Attack on k-Anonymity: Suppose that Alex and Bob are neighbours and Alex discovers the published data shown in Table 2. Alex knows that Bob is a 29-year-old male living in zip code 47677, so Alex can easily place Bob in the first equivalence class. Since all the record holders in the first equivalence class of Table 2 have the same disease, i.e., flu, Alex concludes that Bob has flu. This is known as a homogeneity attack.
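Definition 2 and the homogeneity attack can both be checked mechanically on a released table. The following is a minimal sketch, assuming records are stored as dictionaries; the function names, the column names, and the sample rows are hypothetical and are not taken from Table 2.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, qi_names):
    """Group records into equivalence classes keyed by their QI values."""
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[name] for name in qi_names)
        classes[key].append(rec)
    return dict(classes)


def is_k_anonymous(records, qi_names, k):
    """True if every equivalence class holds at least k records."""
    classes = group_by_quasi_identifiers(records, qi_names)
    return all(len(members) >= k for members in classes.values())


def homogeneous_classes(records, qi_names, sa_name):
    """Return the QI keys of classes whose SA values are all identical."""
    classes = group_by_quasi_identifiers(records, qi_names)
    return [
        key
        for key, members in classes.items()
        if len({m[sa_name] for m in members}) == 1
    ]


# Hypothetical generalized table (illustrative only).
released = [
    {"zip": "476**", "age": "2*", "disease": "flu"},
    {"zip": "476**", "age": "2*", "disease": "flu"},
    {"zip": "476**", "age": "2*", "disease": "flu"},
    {"zip": "479**", "age": "3*", "disease": "flu"},
    {"zip": "479**", "age": "3*", "disease": "cancer"},
    {"zip": "479**", "age": "3*", "disease": "gastritis"},
]

print(is_k_anonymous(released, ["zip", "age"], 3))
print(homogeneous_classes(released, ["zip", "age"], "disease"))
```

In this sample the table is 3-anonymous, yet the first class is reported as homogeneous, which is exactly the situation the attack exploits.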

Limitations of k-Anonymity:
1. Does not provide protection against the homogeneity attack.
2. Does not include randomization, so an attacker can still make inferences about data sets that may harm individuals.
3. Not good for high dimensional data.
4. Concerned only with quasi identifiers and not with the sensitive attribute.
Machanavajjhala et al. [2] introduced l-diversity as a stronger notion of privacy to overcome the limitations of k-anonymity.
Definition 3: (l-Diversity) An equivalence class is said to satisfy l-diversity if there are at least l "well represented" values for the sensitive attribute. A table is said to satisfy l-diversity if every equivalence class of the table satisfies l-diversity.
Attack on l-Diversity: Suppose that Alex and Bob are neighbours and Alex discovers the published data shown in Table 4. Alex knows that Bob is a 37-year-old male living in zip code 67220, so Alex can easily place Bob in the first equivalence class. Looking at the SA values, Alex concludes that Bob is suffering from some sort of stomach related disease. This is known as a similarity attack. l-diversity also fails to protect against attacks arising from an adversary's unavoidable knowledge of the overall distribution of SA values in a released table. A skewness attack may occur when the distribution of the sensitive attribute in an equivalence class varies significantly from that in the released table.
Limitations of l-Diversity:
1. Does not provide protection against similarity and skewness attacks.
2. l-diversity may be difficult and unnecessary to achieve.
3. It is concerned only with well represented sensitive attribute values and not with the distribution of the sensitive attribute.
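To make Definition 3 concrete, the sketch below measures diversity as the number of distinct SA values in each equivalence class. Reading "well represented" as "distinct" is an assumption made here for simplicity, and the function names and disease values are hypothetical.

```python
def class_diversity(sa_values):
    """Number of distinct SA values in one equivalence class."""
    return len(set(sa_values))


def table_diversity(classes):
    """The largest l for which every class has at least l distinct values."""
    return min(class_diversity(c) for c in classes)


def satisfies_l_diversity(classes, l):
    """True if every equivalence class has at least l distinct SA values."""
    return table_diversity(classes) >= l


# Hypothetical SA columns of two equivalence classes (illustrative only).
classes = [
    ["gastric ulcer", "gastritis", "stomach cancer"],
    ["flu", "bronchitis", "gastritis"],
]

print(table_diversity(classes))
```

The first class counts as 3-diverse even though all of its values are stomach related, which illustrates why a distinct-value count alone does not prevent the similarity attack.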

ALGORITHM FRAMEWORK
In this section, we present a framework for the Stack and Deal algorithm. The input is a microdata table M consisting of r records and n attributes ((n-1) quasi identifier attributes and one sensitive attribute), together with the parameter k. Let A denote the set of all attributes {A1, A2, …, An}. Without loss of generality, let the attribute An be the sensitive attribute and {A1, A2, …, An-1} be the quasi identifier attributes.

Stage 1: Frequency and Distribution of SA in the entire table M
A frequency table, as shown in Table 5, is created that contains the s sensitive attribute values (S1, S2, S3, …, Ss) and their frequencies F = (f1, f2, f3, …, fs) in the entire table.
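Stage 1 can be sketched with a standard counter. The snippet below is only an illustration: the function name and sample values are hypothetical, and ordering the table by descending frequency is an assumption, since the text does not state how S1, …, Ss are ordered.

```python
from collections import Counter


def frequency_table(sa_values):
    """Return (SA value, frequency) pairs, most frequent first.

    Sorting by descending frequency is an assumption of this sketch.
    """
    return Counter(sa_values).most_common()


# Hypothetical SA column of a small table.
sa_column = ["flu", "flu", "cancer", "flu", "cancer", "gastritis"]
table = frequency_table(sa_column)
print(table)
```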

Stage 2: Stack and Deal the records
In this stage, the records are stacked in a queue according to the frequency distribution table, as shown in Table 6; i.e., all records having sensitive attribute value S1 appear at the top of the queue and records having sensitive attribute value Ss appear at the bottom. For the dealing part, each record is popped off the stack and dealt into e equivalence classes (e = r/k) in cyclic order. For example, if there are ten equivalence classes, the first record goes into the first equivalence class, the second record into the second equivalence class, and so on. When we reach the last (tenth) equivalence class, the next record goes into the first equivalence class, and the cycle continues until the stack is empty.
Observation: By following the cyclic order while populating the equivalence classes, we get equi-sized equivalence classes in which every equivalence class receives an equal portion, fj/e, of the records with SA value Sj, so the frequency of an SA value differs by at most one across all the equivalence classes.
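The two stages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: records are dictionaries, e is taken as the integer part of r/k, SA values are stacked in order of descending frequency, and the generalization of quasi identifiers within each class is left out. The function name and sample data are hypothetical.

```python
from collections import Counter, deque


def stack_and_deal(records, sa_name, k):
    """Partition records into e = r // k equivalence classes.

    Assumptions of this sketch: e is floor(r / k) and SA values are
    stacked by descending frequency. QI generalization is not performed.
    """
    if k < 1 or len(records) < k:
        raise ValueError("k must satisfy 1 <= k <= number of records")
    e = len(records) // k

    # Stage 1: frequency table of the sensitive attribute.
    freq = Counter(rec[sa_name] for rec in records)
    rank = {value: i for i, (value, _) in enumerate(freq.most_common())}

    # Stage 2a (stack): records with the same SA value sit next to each other.
    queue = deque(sorted(records, key=lambda rec: rank[rec[sa_name]]))

    # Stage 2b (deal): hand records out to the classes in cyclic order.
    classes = [[] for _ in range(e)]
    position = 0
    while queue:
        classes[position % e].append(queue.popleft())
        position += 1
    return classes


# Hypothetical table: 4 flu, 3 cancer, 2 gastritis records.
records = (
    [{"id": i, "disease": "flu"} for i in range(4)]
    + [{"id": i, "disease": "cancer"} for i in range(4, 7)]
    + [{"id": i, "disease": "gastritis"} for i in range(7, 9)]
)
classes = stack_and_deal(records, "disease", k=3)
for ec in classes:
    print([rec["disease"] for rec in ec])
```

When r is not a multiple of k, the leftover records are dealt as well, so some classes end up with more than k records; class sizes still differ by at most one.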

The distribution table of one equivalence class is shown in Table 7. The earth mover's distance [8] between P (the SA distribution in the overall table) and Q (the SA distribution in an equivalence class) measures how close the two distributions are.

ANALYSIS OF ALGORITHM FOR VARIOUS ATTACKS
In this section, we show how the Stack and Deal algorithm protects against various attacks:

Protection against Homogeneity attack:
A homogeneity attack occurs when all the SA values in an EC are the same, so that an attacker learns the sensitive information of a record holder without any additional effort. The way to combat this is to ensure that the SA values in every EC are diverse. Our algorithm ensures that all the ECs produced are diverse in terms of their SA values.
Let F = (167, 153, 127, 103, 91, 89) and r = 730. When we vary the value of k, we observe that we attain maximum diversity for k = 9. If an EC satisfies 9-anonymity, it also satisfies 2-, 3-, ..., 8-anonymity. Since there is a trade-off between privacy and data utility, we can compromise on data utility to achieve maximum diversity. Figure 1 shows the relationship between k and l.
We run the same experiment on the Adult data set from the UC Irvine Machine Learning Repository, varying k from 2 to 21 (Figure 2). We observe relatively similar behaviour on this data set too.
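The experiment on F = (167, 153, 127, 103, 91, 89) can be approximated by simulating the deal on the frequency vector alone. The sketch below assumes that e is the integer part of r/k, that values are stacked in the order given by F, and that l is measured as the smallest number of distinct SA values found in any EC; the function name is hypothetical.

```python
def min_diversity(frequencies, k):
    """Smallest number of distinct SA values in any EC after dealing.

    Assumes e = r // k and that values are stacked in the given order.
    """
    r = sum(frequencies)
    if k < 1 or k > r:
        raise ValueError("k must satisfy 1 <= k <= r")
    e = r // k
    seen = [set() for _ in range(e)]
    position = 0
    for sa_index, f in enumerate(frequencies):
        for _ in range(f):
            seen[position % e].add(sa_index)
            position += 1
    return min(len(s) for s in seen)


F = (167, 153, 127, 103, 91, 89)
for k in range(2, 13):
    print(k, min_diversity(F, k))
```

Under these assumptions every EC contains all six SA values once e is no larger than the smallest frequency (89), which first happens at k = 9 since 730 // 9 = 81.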

Protection against Skewness and Similarity attacks:
Privacy is measured by the information gain of an observer/attacker. The observer has some prior belief (G0) about the sensitive information of a record holder and some posterior belief (G2) after seeing the released table. Information gain is the difference between these two beliefs. Assume that the observer is first given a completely generalized form of the data, so that his prior belief (G0) changes to (G1) on seeing the distribution P of SA values in the overall table (P is considered public information because, as long as a version of the data is released, P will be known). Next, the observer is given the released data. Knowing the quasi identifier values of a record holder, the observer is able to identify the EC to which the record holder belongs and learns the distribution of SA values in that EC, represented as Q. This is the observer's posterior belief (G2).
The l-diversity requirement is motivated by restricting the difference between the observer's prior belief and posterior belief, but whenever the distribution of SA values within an EC varies significantly from their overall distribution in the released table, l-diversity fails to guarantee privacy, allowing skewness and similarity attacks. In our method, we choose to limit the difference between (G1) and (G2). We can do this by ensuring that the frequencies of SA values in all the ECs are similar and by keeping their difference as low as possible. We want ECs of equal size so as to limit the information loss, and if the difference in frequencies increases, Q moves further away from P. Thus, by limiting the difference in the frequencies of SA values across the ECs, we can limit the difference between P and Q and thereby limit the gain from (G1) to (G2). The distance between these two distributions is calculated using the earth mover's distance [8].
The earth mover's distance can be thought of as the sum total of the portions of the pi values that need to be moved to other indices in P, each portion scaled by the normalized distance of its movement within the m-tuple, in order to turn P into Q.
As an example, consider probability distributions,

To study the result, we plot k against the EMD between P and Q for ECs generated using our algorithm and for randomly generated ECs. We observe that the difference between P and Q reduces as we increase k and that our algorithm gives the minimum difference. Figure 3 shows the plot for F = (167, 153, 127, 103, 91, 89) and r = 730 with varying k. For k = 2, some randomly generated ECs have a smaller difference between P and Q than those of our algorithm; this is because the sizes of such ECs vary widely, which increases the information loss. Next, let us study the effect of increasing the difference between the frequencies of SA values in the ECs. For this purpose, we use the Blood Transfusion data set and Haberman's Survival data set from the UC Irvine Machine Learning Repository and vary k from 2 to 20. Rand1 and Rand2 are the sets of ECs in which the frequencies of SA values differ by 2 and 3, respectively. From Figure 4 and Figure 5 we observe that by limiting the difference in the frequencies of SA values in the ECs we can limit the difference between P and Q and thereby limit the gain from G1 to G2.
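The description above matches the ordered-distance form of the earth mover's distance, in which moving mass between adjacent positions costs 1/(m-1). The sketch below implements that form; treating the SA values as ordered is an assumption of this example, and the function name is hypothetical.

```python
def ordered_emd(p, q):
    """Earth mover's distance between two distributions over m ordered values.

    Assumes the ground distance between positions i and j is |i - j| / (m - 1),
    so the result lies between 0 and 1.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    m = len(p)
    if m < 2:
        return 0.0
    running = 0.0  # mass that still has to be carried past position i
    total = 0.0
    for p_i, q_i in zip(p, q):
        running += p_i - q_i
        total += abs(running)
    return total / (m - 1)


# Hypothetical distributions over nine ordered values.
p = [1 / 9] * 9
q = [1 / 3, 1 / 3, 1 / 3, 0, 0, 0, 0, 0, 0]
print(ordered_emd(p, q))
```

For SA values with no natural order, a different ground distance would be needed; the running-sum shortcut used here is valid only for the ordered case.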
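The effect of k on the distance between P and Q can be explored with a small simulation on the frequency vector. The sketch below is illustrative only and does not reproduce the figures: it assumes e is the integer part of r/k, uses the ordered-distance form of the EMD, and reports the largest EMD over all dealt ECs. All function names are hypothetical.

```python
def deal_counts(frequencies, k):
    """Per-EC counts of each SA value after cyclic dealing (e = r // k assumed)."""
    r = sum(frequencies)
    if k < 1 or k > r:
        raise ValueError("k must satisfy 1 <= k <= r")
    e = r // k
    counts = [[0] * len(frequencies) for _ in range(e)]
    position = 0
    for sa_index, f in enumerate(frequencies):
        for _ in range(f):
            counts[position % e][sa_index] += 1
            position += 1
    return counts


def ordered_emd(p, q):
    """EMD with ground distance |i - j| / (m - 1); an assumption of this sketch."""
    m = len(p)
    if m < 2:
        return 0.0
    running, total = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        running += p_i - q_i
        total += abs(running)
    return total / (m - 1)


def max_emd(frequencies, k):
    """Largest EMD between the overall distribution P and any EC distribution Q."""
    r = sum(frequencies)
    p = [f / r for f in frequencies]
    worst = 0.0
    for ec in deal_counts(frequencies, k):
        size = sum(ec)
        q = [c / size for c in ec]
        worst = max(worst, ordered_emd(p, q))
    return worst


F = (167, 153, 127, 103, 91, 89)
for k in (2, 5, 9, 20, 73):
    print(k, round(max_emd(F, k), 4))
```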

CONCLUSION AND FUTURE WORK
While k-Anonymity protects against identity disclosure, it does not provide sufficient protection against attribute disclosure. l-Diversity seeks to solve this problem by adding the condition that each equivalence class must have l distinct SA values. We have seen the limitations of l-Diversity and how our algorithm combats them without requiring the parameter t of t-Closeness. We have introduced a new algorithm that takes the input parameter k along with the microdata and produces equivalence classes of size k with greater diversity, in which the frequency of each SA value differs by at most one across all the ECs, which helps keep data loss minimal.
The first direction of future work is to design an algorithm that exchanges records to reduce information loss until it reaches an optimal value, making use of the parameter t. As a second direction, this algorithm can be generalized to multiple sensitive attributes.