A Baseline for Attribute Disclosure Risk in Synthetic Data

The generation of synthetic data is widely considered a viable method for alleviating privacy concerns and for reducing identification and attribute disclosure risk in micro-data. The records in a synthetic dataset are artificially created and thus do not directly relate to individuals in the original data in terms of a 1-to-1 correspondence. As a result, inferences about said individuals appear to be infeasible and, simultaneously, the utility of the data may be kept at a high level. In this paper, we challenge this belief by interpreting the standard attacker model for attribute disclosure as a classification problem. We show how disclosure risk measures presented in recent publications may be compared to, or even be reformulated as, machine learning classification models. Our overall goal is to empirically analyze attribute disclosure risk in synthetic data and to discuss its close relationship to data utility. Moreover, we improve the baseline for attribute disclosure risk from the attacker's perspective by applying variants of the RadiusNearestNeighbor and the EnsembleVote classifier.


INTRODUCTION
The technological advances of recent years have led to an increase in the collection and storage of large amounts of data. Micro-data, i.e. data that contains information about, e.g., individuals, is collected in domains such as health care, employment or social media. Similarly, there has been an increase in the capability and the interest to analyse data. Its release and distribution, however, bears the risk of compromising the confidentiality of sensitive information and the privacy of affected individuals. To comply with ethical and legal standards such as the EU's General Data Protection Regulation (GDPR), data holders and data providers have to take measures to prevent attackers from learning sensitive information from the released data, a task often referred to as statistical disclosure control (SDC).
In the case of micro-data, two possibilities of disclosure of sensitive information are widely considered. Identification disclosure happens when an adversary is able to conclude that a certain record in the dataset belongs to a certain individual. Attribute disclosure happens whenever the dataset allows the attacker to learn new information about a specific individual in question, e.g. the value of a certain attribute. Identification disclosure often leads to attribute disclosure, as every attacker's ultimate goal is to gain information on their victim. However, attribute disclosure can also happen without the attacker uniquely identifying the record of their victim in the dataset, e.g. by the matching techniques discussed in this paper.
In most cases, it does not suffice to remove directly identifying attributes (primary identifiers), such as names or social security numbers, from the data. To minimize disclosure risks, approaches like Differential Privacy [1] and k-Anonymity [14] have been developed. The reader may consult the survey [3] for a general overview of traditional privacy-preserving data publishing methods.
In this paper, we will consider an alternative disclosure control measure, namely the generation of synthetic data. One of the first applications is described by Rubin in [11], where multiple imputation is used to synthetically generate certain columns of datasets. An overview of more than 20 different scenarios is given in [12]. An evaluation of the utility of synthetic data, generated by various tools, for supervised machine learning tasks, specifically classification tasks, is given in [4].
In our experiment in Section 4, we use three recently published synthetic data generation tools. The Synthetic Data Vault was developed in 2016 by N. Patki et al. at MIT and is implemented in Python. It builds a model based on estimates for the distributions of each column. In order to preserve the correlation between attributes, the synthesizer applies a multivariate version of the Gaussian copula and, subsequently, computes the covariance matrix. For more details and a utility evaluation conducted by the developers, the reader may consult the original publication [7].
The second tool we use is the DataSynthesizer, proposed in 2017 by H. Ping et al. and also implemented in Python. The user is able to specify one of three modes, namely 'random mode', 'independent attribute mode', or 'correlated attribute mode'. If the tool should preserve dependencies between the attributes, the last mode should be chosen. The tool then generates synthetic data based on a Bayesian network model learned from the original data. For extended SDC, DataSynthesizer uses the framework of Differential Privacy and offers the possibility to determine the amount of injected noise. More information on this method can be found in [8].
Finally, we use the synthpop [6] package for R, which has been created by B. Nowok et al. at the University of Edinburgh. Here, the default synthesis method is a CART (Classification And Regression Tree) algorithm. However, the user is able to specify a large number of parameters. Synthpop also contains a function for SDC (footnote 1), which may be applied to the resulting synthetic dataset.
Usually, a distinction is made between fully and partially synthetic data. Fully synthetic data means that the whole dataset is synthesized, whereas partially synthetic data contains a mixture of synthesized values for sensitive and original values for non-sensitive attributes. In 2009, Reiter and Mitra [9] proposed identification disclosure risk estimations for partially synthetic data. In this paper, we consider attribute disclosure risks on fully synthetic data. The notion of identification disclosure is not in our focus, since fully synthetic records do not relate to original records in terms of a 1-to-1 correspondence. However, this does not exclude the possibility of attribute disclosure, for which it is supposed that the attacker knows the values of certain attributes of their victim (called the key variables) and wants to learn the value of some sensitive attribute (called the target variable). Approaches for measuring the related risk have been proposed by Reiter et al. [10] and by Taub et al. [15]. The methods differ in the amount of background knowledge $B = \{A, S\}$ that the attacker is assumed to have. $A$ denotes the attacker's knowledge about records in the original (unsynthesized) dataset, and $S$ comprises available information about the process of generating the synthetic data, like code for the synthesizer or a description of the used tools. Reiter et al.'s approach assumes a worst-case attacker scenario, in which the adversary knows all entries in the original dataset except the target attribute value they want to learn. While the authors admit that this assumption may be viewed as overly conservative and unrealistic, they suggest that their measures offer a type of upper bound on the disclosure risks. Taub et al.'s approach, on the other hand, assumes an attacker's behavior that does not rely on $B$ at all, and is therefore feasible for $A = S = \emptyset$. The related research question asks for a baseline, that is, for a lower bound on the attribute disclosure risk: given only the synthetic dataset and the values of certain key attributes, which procedures are always available to the attacker that may help them to learn the value of a certain target attribute? This question is of great importance for analyzing the general usefulness of data synthesis as a privacy-preserving method.
Our main contribution is the generalization of Taub et al.'s approach, which is based on the concept of Correct Attribution Probability. The technique finds those records in the synthetic dataset which match a certain combination of key variables. For example, the attacker may know that this set of values belongs to a certain individual in the original data. For the found synthetic records, the distribution of the value of the target attribute is computed, which allows us to assign a risk probability for the exposure of the real value of the corresponding individual in the original dataset. However, it may happen that the distinct combination of key attribute values of some row in the original data does not occur in the synthetic data. While the original approach either ignores such non-matches or assigns probability 0, our generalization allows us to extend the risk analysis to these records. In our evaluation, we demonstrate the merit of this approach and compare it to machine learning classifiers which the attacker might use to extract information from the data and obtain a prediction for the target variable of their victim.

1 https://rdrr.io/cran/synthpop/man/sdc.html
The mentioned approaches exploit global, not local properties of the dataset. While arguments have been brought forward that for an attacker there is little additional knowledge to be gained from synthetic data that describes publicly well-known correlations in data, we want to stress that the task of estimating attribute disclosure on fully synthetic data (or on corresponding models) is particularly relevant whenever the comprised information and the correlations in the original data are not publicly known. This is often the case for data about sub-populations and for business data. In general, our evaluation shows that the attacker is able to gain knowledge from the synthetic data that increases the accuracy of their predictions.
The remainder of this paper is structured as follows: In Section 2, we discuss related work and, on this basis, the relation between data utility and attribute disclosure risks by considering the attacker's situation as a classification problem. In Section 3, we improve the baseline for attribute disclosure risk by generalizing the approach established in [15]. In Section 4, we evaluate our approach and compare the performance of several machine learning models on the attacker's classification problem. Finally, in Section 5, we draw our conclusions and describe ideas for future work.

ATTACKER'S CLASSIFICATION PROBLEM
It has already been mentioned that, for fully synthetic data, the notion of identification disclosure is not clear-cut. From an attacker's perspective, the approach to gain information by linking certain synthetic records to individuals is not promising, as such links generally do not exist. Attribute disclosure, on the other hand, does not necessarily depend on such linkages. There are other ways to use data and prior knowledge for learning about a sensitive target attribute value, one of which will be analysed in the next section. Still, it seems highly unlikely that synthetic data can ever be used by the adversary to infer information with absolute certainty. In order to see this, assume that one of the records in the Adult Census Income dataset (footnote 2) belongs to our neighbor. Our prior knowledge consists of the values of the key attributes 'age', 'gender', 'race', 'occupation', 'marital-status' and 'native-country'. We are nosy and want to know if she earns more or less than $50K a year. Consequently, 'income' is our target attribute. We simply search for her combination of key attributes and find only one record in the whole dataset that matches all these values. We can be certain that this is the record of our neighbor, and may obtain the respective target value. Studies such as [13] have shown that with similar (actually fewer) attributes, a large majority of 87% of US residents can be identified, so this is a likely scenario. Even if we find more than one record with this combination, we may still be able to draw certain conclusions in the case where all of them have the same 'income' value (a situation that the concept of l-diversity [5] would address). If, however, we do not have the original dataset at hand, but just a synthesized version, the situation is quite different. It now can happen that no record comprises the values of our known key attributes. Even if there is a single record matching our prior knowledge, we cannot be certain about the entry of the 'income' attribute. As a result of the data synthesis, this target value (as well as the values of the other attributes not known by us) might deviate from our neighbor's real entries. We again face the same difficulties we already discussed in the context of identification disclosure: the record in question is not our neighbor's, nor is it our neighbor's synthesized record. In most cases, it is the product of randomized draws from a model described by global, not local, properties of the dataset.
In accordance with these considerations, attribute disclosure risk in synthetic data is measured by providing probabilities for the exposure of the real target value of records in the original data. We give an example by discussing Reiter et al.'s [10] already mentioned approach. Let $D = \{(x_i, y_i) : i = 1, \ldots, n\}$ be the matrix comprising the original database, where $x_i$ is the vector of the $i$-th record's values of non-sensitive attributes, and $y_i$ is the vector of the $i$-th record's values of sensitive attributes which are subject to synthesis. Note that, for fully synthetic data, $X = (x_i : i = 1, \ldots, n)$ is empty. By $Z = (Z^{(1)}, \ldots, Z^{(m)})$, we denote the $m$ synthetic datasets generated by the data provider. Assume that an attacker wants to learn the vector $y_i$ for some record $i$ in $D$. Let $B = \{A, S\}$ be the background knowledge of the attacker. We recall that $A$ consists of information about the original data and, for Reiter et al.'s approach, is set to $A = \{(x_j, y_j) : j \neq i\} \cup \{x_i\}$. Hence, it is assumed that the attacker knows the complete original dataset except the target value(s) of interest. Furthermore, we recall that $S$ comprises knowledge about the synthesizer. Finally, let $Y_i$ denote the random variable representing the attacker's uncertain knowledge of $y_i$. The sample space of $Y_i$ is given by all possible values of $y_i$ in the population. For evaluating a guess $y^*$ for $y_i$, Reiter et al. assume that the attacker seeks the Bayesian posterior distribution

$$P(Y_i = y^* \mid Z, X, A, S) = \frac{P(Z \mid Y_i = y^*, X, A, S)\, P(Y_i = y^* \mid X, A, S)}{\sum_{y} P(Z \mid Y_i = y, X, A, S)\, P(Y_i = y \mid X, A, S)},$$

where the sum in the denominator is taken over all possible values $y$ of $y_i$ in the population. Depending on the circumstances, a variety of techniques are proposed for estimating the prior distribution $P(Y_i = y^* \mid X, A, S)$ and the probability $P(Z \mid Y_i = y^*, X, A, S)$ of generating $Z$. For the first, one may either use a discrete uniform distribution or assume an adversary that already uses $A$ to form prior beliefs. For the latter, importance sampling techniques are adopted and coupled with Monte Carlo simulation.
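The posterior computation described above can be sketched for a finite set of candidate target values. This is a minimal illustration of Bayes' rule only; the function name `posterior` and all numbers are hypothetical, and the likelihood values are simply given here rather than estimated by importance sampling as in [10].

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a finite set of candidate target values.

    prior[y]      plays the role of P(Y_i = y | X, A, S)
    likelihood[y] plays the role of P(Z | Y_i = y, X, A, S)
    """
    joint = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(joint.values())  # denominator: sum over all candidate values y
    if evidence == 0:
        raise ValueError("all candidate values have zero joint probability")
    return {y: joint[y] / evidence for y in joint}


# Hypothetical numbers: a discrete uniform prior over two income classes.
prior = {"<=50K": 0.5, ">50K": 0.5}
likelihood = {"<=50K": 0.02, ">50K": 0.06}
post = posterior(prior, likelihood)
print(post)
```

With a uniform prior, the posterior is just the normalized likelihood, so the released data alone drives the attacker's belief.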
Based on the resulting value $P(Y_i = y^* \mid Z, X, A, S)$, the data provider is able to compute several risk measures for the released synthetic dataset(s). One option mentioned by Reiter et al. is to compute

$$R_i = \left[\, y_i = \arg\max_{y} P(Y_i = y \mid Z, X, A, S) \,\right], \tag{2.1}$$

where $[\,\cdot\,]$ is the so-called Iverson bracket, which equals 1 if the enclosed statement is true and 0 otherwise. Subsequently, one may want to evaluate the disclosure risk of the complete dataset by deciding whether $R = \sum_{i=1}^{n} R_i / n$ is acceptably low. Another option would be to compare $P(Y_i = y^* \mid Z, X, A, S)$ to the prior belief, e.g. by considering the multiplicative increase.
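As a small illustration of this risk measure, the following sketch assumes that the per-record risk is the Iverson bracket of the statement "the most probable value under the posterior equals the true target value", so that the dataset-level risk R is the fraction of records the attacker would guess correctly. The function names and the posterior values are hypothetical.

```python
def record_risk(post, true_value):
    """Iverson bracket: 1 if the most probable value equals the true value, else 0.

    Ties are broken in favour of the first value in the dictionary (an assumption).
    """
    guess = max(post, key=post.get)
    return 1 if guess == true_value else 0


def dataset_risk(posteriors, true_values):
    """Average the per-record risks R_i over all n records."""
    risks = [record_risk(p, t) for p, t in zip(posteriors, true_values)]
    return sum(risks) / len(risks)


# Hypothetical posteriors for three records and their true target values.
posteriors = [
    {"<=50K": 0.25, ">50K": 0.75},
    {"<=50K": 0.60, ">50K": 0.40},
    {"<=50K": 0.10, ">50K": 0.90},
]
true_values = [">50K", ">50K", ">50K"]
R = dataset_risk(posteriors, true_values)
print(R)
```

Under this reading, R is exactly the accuracy of an attacker who always guesses the most probable value.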
Correct Attribution Probability [15], an idea discussed in Section 3, is rather different from the Bayesian Estimate described above. As mentioned in the introduction, no background knowledge $B$ of the attacker is assumed. Clearly, this also restricts their possibilities of privacy violations. As a result of computing the disclosure risk measure, however, the data provider also obtains a distribution of all possible values of $y_i$ and may use the related percentage scores in the same way as $P(Y_i = y^* \mid Z, X, A, S)$ to evaluate the overall disclosure risk. We may conclude that, from the attacker's perspective, both approaches provide means to solve the following task.
Attacker's Classification Problem: Given some background knowledge B, the synthetic dataset(s) Z and the values of key attributes of some record in the original dataset, obtain a prediction on the target value of said record.
The goal of this paper is to discuss the possibilities of the attacker to approach even the most restricted scenario of this problem, that is, when they have no background knowledge and only a single published synthetic dataset at hand. The purpose of this viewpoint is to establish a baseline, a set of tools for privacy invasion that is available to the adversary under all circumstances and that, on the flip side, should always be taken into consideration by data holders and data providers.
Clearly, machine learning models for classification are part of the attacker's toolkit, as these are directly applicable to the discussed situation. The question is: how well do models that are trained on the synthetic data perform if they are applied back to the original data? Notably, a very similar question is often discussed in the context of the utility of the synthetic dataset. The answer depends on two factors: (1) How strong is the correlation between the key attributes and the target attribute? (2) To which degree does the synthesizer preserve the global properties of the original data, that is, the distributions of attributes and the dependencies between them?
If the correlation between sensitive variables and typical quasi-identifiers is strong in the original data, the only way to reduce disclosure risk is to conceal these dependencies in the synthetic data, e.g. by adding more noise in the process of synthesis. This will result in the loss of information and, hence, in a reduction of the utility of the synthetic dataset. For examples and simulations, we refer the reader to Section 4.
At first glance, the proposed viewpoint might appear counterintuitive. The information about the sensitive target attribute of the individual in question is not disclosed to the attacker by identifying the corresponding record or using some other local vulnerability of the data, but by considering and exploiting its global properties. However, if any tool available to the attacker results in a high probability of exposure of the true target value of certain records, the privacy of affected individuals is clearly violated. Furthermore, for reasons already discussed, the focus on global properties lies in the nature of synthetic data disclosure risk assessment. In order to corroborate our statements, we will now discuss the relation between Correct Attribution Probability scores and one of the less well-known machine learning classifiers, namely the Fixed-Radius Nearest Neighbor search.

CORRECT ATTRIBUTION PROBABILITY
The concept of Correct Attribution Probability (CAP) has been introduced in [2] and elaborated on in [15] by J. Taub et al. In the first publication, M. Elliot used CAP to estimate disclosure risks of datasets generated by the synthpop [6] package in R, which was developed by the SYLLS Team at the University of Edinburgh. For assessing attribute disclosure risk, CAP assumes that the attacker knows the values of a set of key attributes for an individual in the original dataset, and wants to learn the respective value of some target attribute. CAP measures the disclosure risk of the individual's real target value in the case where the adversary has access to the synthetic dataset. In [15], the method is presented for a situation where the attributes in the key as well as the target attribute are all categorical. For the remainder of this section, we will keep this assumption. The reader is referred to [2] for a variant handling continuous target variables.
Consider a dataset consisting of micro-data with $n$ records representing individuals and an unspecified number of attributes in the columns. For $j \in \{1, \ldots, n\}$, let $K_{o,j}$ be the vector representing the values of the key attributes of the $j$-th record in the original dataset, and let $T_{o,j}$ be the corresponding value of the target attribute. Similarly, we define $K_{s,j}$ and $T_{s,j}$ for the synthetic dataset. The CAP score for record $j$ in the original dataset is the empirical probability of its target value given its key attribute values, that is

$$CAP_{o,j} = P_o(T_{o,j} \mid K_{o,j}) = \frac{\sum_{i=1}^{n} [\,T_{o,i} = T_{o,j},\ K_{o,i} = K_{o,j}\,]}{\sum_{i=1}^{n} [\,K_{o,i} = K_{o,j}\,]}.$$

By indexing the probability $P_o(\cdot)$, we indicate that our sample space is the original dataset. Additionally, we define the CAP score for the synthetic dataset, that is

$$CAP_{s,j} = P_s(T_{o,j} \mid K_{o,j}) = \frac{\sum_{i=1}^{n} [\,T_{s,i} = T_{o,j},\ K_{s,i} = K_{o,j}\,]}{\sum_{i=1}^{n} [\,K_{s,i} = K_{o,j}\,]}.$$
The basic idea is that the attacker is supposed to search for all records in the synthetic dataset that match the key attribute values known by them. This subset of data points is often referred to as the equivalence class of $K_{o,j}$. Inside this class, they then calculate the distribution of the occurring values of the target attribute. Clearly, $CAP_{s,j}$ corresponds to the proportion of the actual target value $T_{o,j}$ in this equivalence class. In this sense, $CAP_{s,j}$ measures the risk of disclosure of this information about the individual represented by the $j$-th record in the original data. In order to evaluate $CAP_{s,j}$, the authors of [15] computed the mean value over all the records. Finally, they compared the result to the mean of $CAP_{o,j}$ as well as to the mean marginal probabilities of $T_{o,j}$ in the original dataset. The authors also noted that $CAP_{s,j}$ is undefined if the vector $K_{o,j}$ does not occur in the synthetic dataset. In their evaluation and in the calculation of the mean CAP score, they dealt with this scenario in two different ways:
(1) Coding the corresponding CAP scores as 0
(2) Treating the corresponding CAP scores as undefined
We will discuss both options and their justifications in our subsequent analysis of the approach.
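The CAP computation and the two ways of handling non-matches can be sketched as follows. This is a minimal illustration under the assumption that each record is stored as a pair of a key tuple and a target value; the function names and the toy records are hypothetical.

```python
def cap_score(key, target, synthetic):
    """Proportion of `target` among synthetic records whose key equals `key`.

    Returns None if the key does not occur in the synthetic data (undefined CAP).
    """
    matches = [t for k, t in synthetic if k == key]
    if not matches:
        return None
    return matches.count(target) / len(matches)


def mean_cap(original, synthetic, non_match="zero"):
    """Mean CAP over the original records; non-matches coded as 0 or ignored."""
    scores = [cap_score(k, t, synthetic) for k, t in original]
    if non_match == "zero":
        scores = [0.0 if s is None else s for s in scores]
    elif non_match == "ignore":
        scores = [s for s in scores if s is not None]
    else:
        raise ValueError("non_match must be 'zero' or 'ignore'")
    return sum(scores) / len(scores) if scores else None


# Hypothetical toy data: (key attribute values, target value).
original = [
    (("39", "Female"), ">50K"),
    (("39", "Female"), "<=50K"),
    (("52", "Male"), ">50K"),   # this key has no match in the synthetic data
]
synthetic = [
    (("39", "Female"), ">50K"),
    (("39", "Female"), ">50K"),
    (("39", "Female"), "<=50K"),
    (("41", "Male"), ">50K"),
]
print(mean_cap(original, synthetic, "zero"), mean_cap(original, synthetic, "ignore"))
```

In this toy example, the third original record is a non-match, so the two options yield different means: it pulls the mean down when coded as 0 and simply drops out when ignored.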
It is worth mentioning that there is a close relation between CAP scores and the well-known concepts of k-anonymity and l-diversity (see [5,13]).
k-Anonymity: A dataset has the k-anonymity property if, for every combination of attributes occurring in the data, the corresponding equivalence class consists of at least k elements.
(Distinct) l-Diversity: A dataset has the l-diversity property if, in every equivalence class, the sensitive variable (e.g., the target attribute T) takes on at least l distinct values.
If we restrict our attention to the original dataset and assume that the k-anonymity property is not satisfied for at least $k = 2$, there are records that, for some key, are the only elements in their equivalence class, and hence there are $j \in \{1, \ldots, n\}$ with $CAP_{o,j} = 1$. If l-diversity is not satisfied for at least $l = 2$ and, therefore, there are not at least $l$ distinct target values in each equivalence class, the same is true. In general, if datasets satisfy l-diversity for higher values of $l$, the CAP scores of the records are bound to be lower, and vice versa.
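The two properties can be checked directly on the equivalence classes of a dataset. The following sketch assumes records stored as pairs of a key tuple and a target value; all names and the toy records are hypothetical. It also lists the equivalence classes with a single distinct target value, that is, those whose records have an original CAP score of 1.

```python
from collections import defaultdict


def equivalence_classes(records):
    """Group target values by key: one equivalence class per distinct key."""
    classes = defaultdict(list)
    for key, target in records:
        classes[key].append(target)
    return classes


def is_k_anonymous(records, k):
    """Every equivalence class contains at least k records."""
    return all(len(ts) >= k for ts in equivalence_classes(records).values())


def is_l_diverse(records, l):
    """Every equivalence class contains at least l distinct target values."""
    return all(len(set(ts)) >= l for ts in equivalence_classes(records).values())


# Hypothetical toy data: (key attribute values, target value).
records = [
    (("39", "Female"), ">50K"),
    (("39", "Female"), "<=50K"),
    (("52", "Male"), ">50K"),
]
# Classes with one distinct target value: all of their records have CAP = 1.
fully_exposed = [k for k, ts in equivalence_classes(records).items() if len(set(ts)) == 1]
print(is_k_anonymous(records, 2), is_l_diverse(records, 2), fully_exposed)
```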
In the remainder of this section, we translate the CAP score approach into a solution for the attacker's classification problem. Moreover, we improve the approach from the attacker's perspective. We start by discussing the Fixed-Radius Nearest Neighbor classifier (FR-NN), which is implemented in the Python machine learning package scikit-learn (footnote 3).
Fixed-Radius Nearest Neighbor: Based on a metric $m$ and a radius $r$ specified by the user, this algorithm classifies data points by implementing a majority vote among neighbors within $r$. This variant of the better-known k-Nearest Neighbor classifier is based on an efficient search for neighboring data points, which, depending on the circumstances, may be realized by the BallTree or the KDTree algorithm. In scikit-learn's implementation, the user can also specify a label for outlier samples which do not have neighboring data points within $r$.
We now reconsider the attacker's approach that is assumed by the CAP disclosure risk measure. For a certain attribute key $K$, the attacker knows $K_{o,j}$ for some record $j$ in the original data, and has access to the synthetic dataset. The adversary then computes the equivalence class of $K_{o,j}$ in the synthetic dataset and, subsequently, the distribution of the target attribute $T$ of interest. Now let $S$ be the synthetic dataset and $S_{K,T}$ the dataset that results from omitting all attributes but the target $T$ and those in the key $K$. Then the attacker's approach is equivalent to conducting a FR-NN classification for $K_{o,j}$ on $S_{K,T}$. Note that we may choose a variety of metrics without affecting the result of the classification, since the attacker only considers neighbors within $r = 0$ (that is, equal data points). However, given that the approach is based on matches for the attributes in the key $K$, it makes sense to choose the Hamming Distance for $m$. For two data points (records) $a = (a_1, \ldots, a_d)$ and $b = (b_1, \ldots, b_d)$, the Hamming Distance $\Delta(a, b)$ is the number of positions $i$ with $a_i \neq b_i$. As a result of applying the FR-NN classifier, the attacker obtains percentages for the possible values of the target $T$ according to their occurrence in the equivalence class, that is, the $r = 0$ neighborhood. The percentage of the real target value $T_{o,j}$ is equal to $CAP_{s,j}$. It makes sense to assume that the attacker is interested in both the target value with the highest percentage, that is, the result of classification via FR-NN, as well as in all occurring values together with their percentages.
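The equivalence between the equivalence-class search and a radius-0 neighborhood under the Hamming Distance can be illustrated in plain Python. This sketch does not use scikit-learn; the function names and the toy records are hypothetical, and the records are assumed to be pairs of a key tuple and a target value.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions in which the two key vectors differ."""
    if len(a) != len(b):
        raise ValueError("key vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def target_distribution(key, synthetic, radius=0):
    """Relative frequencies of target values among records within `radius` of `key`.

    Returns an empty dict if no synthetic record lies within the radius.
    """
    neighbours = [t for k, t in synthetic if hamming(key, k) <= radius]
    counts = Counter(neighbours)
    return {t: c / len(neighbours) for t, c in counts.items()}


# Hypothetical toy data: (key attribute values, target value).
synthetic = [
    (("39", "Female"), ">50K"),
    (("39", "Female"), ">50K"),
    (("39", "Female"), "<=50K"),
    (("41", "Male"), ">50K"),
]
dist = target_distribution(("39", "Female"), synthetic)  # radius 0: exact matches only
label = max(dist, key=dist.get)                           # majority vote
print(dist, label)
```

The entry of `dist` for the record's real target value is its CAP score on the synthetic data, while `label` is the prediction a radius-0 majority vote would return.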
For several reasons, the discussed approach assumed by the CAP measure is not optimal for solving the attacker's classification problem. For example, it may happen that $K_{o,j}$ does not occur in the synthetic dataset, hence does not have any neighbors within $r = 0$. $CAP_{s,j}$ is then undefined and, similarly, the FR-NN classifier is not able to assign a label. It has already been mentioned that the authors of [15] dealt with this scenario in two different ways, namely by either coding the corresponding CAP scores as 0 in the calculation of the mean CAP score, or treating them as undefined, which means that the respective record does not count towards $n$. In Section 3.3 of [15], justifications for both options are given. According to these, the basis for assigning a 0 is that a non-match is considered to have zero probability of yielding a correct attribution, whereas the logic behind recording non-matches as undefined is that an adversary is more likely to stop their attempt with a non-match.
Both options of handling the CAP scores correspond, in some way, to the inability of the attacker's FR-NN classifier to provide a label. However, we now propose an alternative method for the attacker to handle a non-match, which will lead to an improvement of the approach from their perspective. Consider the example of the Adult Census dataset from Section 2. We want to learn if our neighbor earns more than $50K a year. We know that she is in the dataset and we gained access to a synthesized version. Furthermore, we know her age, gender, race, occupation, marital-status and native-country, all of which are attributes in the dataset. A quick search reveals that no record in the synthetic data is a complete match for these attribute key values. Instead of giving up, we can now search for records that match at least 5 of the 6 attributes in the key. If we do find such records, we proceed by calculating the distribution of their target attribute values. If not, we try for records that match at least 4 attributes, and so on. The resulting algorithm may be implemented as follows. Its central step, carried out for increasing $r$, is

$N \leftarrow \{a \in S_{K,T} : \Delta(K_{o,j}, a_K) = r\}$, where $a_K$ omits the value of $T$.

Similarly, we may describe this algorithm as repeated application of the FR-NN classifier for $r = 0, 1, 2, \ldots$ and the Hamming Distance. We stop as soon as neighbors are found and a label can be assigned. The algorithm may easily be adapted to not only return a prediction $T^*$, but also the percentages of the possible values for the target attribute $T$. We believe that this procedure is superior to the approach assumed by the CAP disclosure measure, for the following reasons:
(1) The methods yield the same result for all $K_{o,j}$ that appear in the synthetic data. Only non-matches are handled differently.
(2) Non-matches are more likely to occur for longer attribute keys. However, the attacker is unlikely to stop their attempt to learn sensitive information from the synthetic data because their prior knowledge about the victim is, in this sense, "too detailed". Obtaining a prediction based on smaller attribute keys would be considered better than having no prediction at all. Moreover, the attacker is still able to use all of their prior knowledge by not considering one fixed smaller attribute key, but searching for all records within a certain radius of the vector of known attribute values.
(3) Synthetic data with high utility preserves certain dependencies between attributes and is therefore also likely to yield high accuracy scores for our variant of the FR-NN classifier.
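The repeated radius search described above can be sketched in plain Python. This is one possible reading of the procedure, not the authors' implementation: the radius is increased from 0 until at least one synthetic record lies at exactly that Hamming Distance, and a majority vote is taken among those records. The function names and the toy records are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions in which the two key vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_expanding_radius(key, synthetic):
    """Return (prediction, radius, distribution) for the smallest non-empty radius."""
    for r in range(len(key) + 1):
        # N: target values of all synthetic records at Hamming distance exactly r
        neighbours = [t for k, t in synthetic if hamming(key, k) == r]
        if neighbours:
            counts = Counter(neighbours)
            dist = {t: c / len(neighbours) for t, c in counts.items()}
            # Majority vote; ties go to the value encountered first (an assumption)
            prediction = counts.most_common(1)[0][0]
            return prediction, r, dist
    raise ValueError("synthetic dataset is empty")


# Hypothetical toy data: keys are (age, gender, occupation), target is income.
synthetic = [
    (("52", "Male", "Tech"), ">50K"),
    (("52", "Female", "Sales"), ">50K"),
    (("52", "Male", "Clerical"), "<=50K"),
    (("30", "Female", "Tech"), "<=50K"),
]
prediction, radius, dist = predict_expanding_radius(("52", "Male", "Sales"), synthetic)
print(prediction, radius, dist)
```

Because the search stops at the first non-empty radius, collecting records at distance exactly r gives the same neighborhood as collecting all records within distance r.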
The third reason actually applies to all kinds of machine learning classification models. There is nothing special about FR-NN or the general approach to search for matches of the known attribute values. The attacker's classification task may, like any other classification problem, be solved by a variety of different algorithms. We compared the original CAP score approach and our procedure presented above to several algorithms like NaiveBayes, RandomForest and LogisticRegression. For the results of our experiments, we refer the reader to Section 4.
We have now discussed the improved approach from the attacker's perspective. Additionally, we may define a corresponding generalized CAP disclosure risk measure that may be used by the data provider. We therefore conclude this section by extending $CAP_{s,j}$ to $GCAP_{s,j}$:

$$GCAP_{s,j} = \frac{\sum_{i=1}^{n} [\,T_{s,i} = T_{o,j},\ \Delta(K_{s,i}, K_{o,j}) = \rho\,]}{\sum_{i=1}^{n} [\,\Delta(K_{s,i}, K_{o,j}) = \rho\,]},$$

where $\rho := \min\{r \mid \exists i \in \{1, \ldots, n\} : \Delta(K_{s,i}, K_{o,j}) = r\}$. Both notions coincide for $\rho = 0$, but $GCAP_{s,j}$ is also defined when $CAP_{s,j}$ is not.
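A minimal sketch of the generalized score, assuming that it is the proportion of the real target value among the synthetic records at the minimal Hamming Distance ρ from the known key. The function names and the toy records are hypothetical.

```python
def hamming(a, b):
    """Number of positions in which the two key vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def gcap_score(key, target, synthetic):
    """Return (GCAP score, rho) for one original record.

    rho is the smallest Hamming distance between `key` and any synthetic key.
    Raises ValueError if the synthetic dataset is empty.
    """
    distances = [hamming(key, k) for k, _ in synthetic]
    rho = min(distances)
    nearest = [t for (_, t), d in zip(synthetic, distances) if d == rho]
    return nearest.count(target) / len(nearest), rho


# Hypothetical toy data: keys are (age, gender, occupation), target is income.
synthetic = [
    (("52", "Male", "Tech"), ">50K"),
    (("52", "Female", "Sales"), ">50K"),
    (("52", "Male", "Clerical"), "<=50K"),
    (("30", "Female", "Tech"), "<=50K"),
]
non_match = gcap_score(("52", "Male", "Sales"), ">50K", synthetic)   # CAP undefined here
exact = gcap_score(("30", "Female", "Tech"), "<=50K", synthetic)     # rho = 0: equals CAP
print(non_match, exact)
```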

EVALUATION
In this section, we compare GCAP to CAP and also apply other machine learning algorithms to the attacker's classification problem. We use the Contraceptive Method Choice dataset (footnote 4) for our experiment, which is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The table consists of 1,473 samples of married women and 10 attributes. It appeared suitable for our purposes because the attributes are an interesting composition of quasi-identifiers and potentially sensitive attributes. Furthermore, all attributes are either categorical or (in the case of 'age') may be treated as such, which has been assumed in the presentation of the approach in the last section.
In our experiment, we analyze the following two scenarios. We consider two subsets of the dataset's attributes as quasi-identifiers, and two different target variables:
(1) QI = {'age', 'education', 'education of husband', 'number of children', 'religion', 'now working?', 'occupation of husband'}, Target = 'contraceptive method'
(2) QI = {'age', 'education', 'number of children', 'religion', 'now working?'}, Target = 'education of husband'
Scenario (1) is based on the fact that the preferred contraceptive method, as well as whether contraception is used at all, is potentially sensitive information for the individuals in the original dataset. The idea behind Scenario (2) is to investigate the possibility of gaining information about the husbands based solely on knowledge about the wives. Note that the attribute 'contraceptive method' has three distinct values in its domain, whereas 'education of husband' has four. Figure 1 shows the distribution of the target attributes in the original dataset.
In order to demonstrate the main difference between GCAP and CAP, we use a mixture of smaller and larger key sizes. In Scenario (1), we use attribute keys of length three and six. To avoid limiting the analysis to certain subsets of the quasi-identifiers, we considered all subsets of QI with three and six elements. As a result, we investigated a large number of situations. For Scenario (2), we did the same for all attribute keys of length two and four. We will use the same attacker scenarios to discuss the capabilities of other machine learning classifiers, as well as the boundaries of our baseline approach. Let $D$ be the table of the Contraceptive Method Choice dataset. For both Scenarios (1) and (2) and each key length $k$, we performed the following procedure. Note that non-matches do not occur on the original dataset, hence the notions of GCAP and CAP are equivalent, and the scores match. As part of our experiment, we also want to compare the disclosure risks on the synthetic datasets generated by the different tools. Therefore, we applied all synthesizers with default parameters to avoid any bias or unintended optimization. One exception is the Differential Privacy parameter ε of the DataSynthesizer, which we set explicitly in order to demonstrate its effect and the user's possibilities to influence the risk. In our summary of the experiment's results, we included ε = 0.1. Lower values of ε lead to more distortions in the data, whereas setting ε = 0 turns Differential Privacy off. Putting ε ≫ 0.1, one injects less noise and therefore observes results that are much closer to those on datasets produced by the DataSynthesizer without Differential Privacy. Since these observations agree with the definition of ε-Differential Privacy, we focused on presenting the results of the choice ε = 0.1, which is also used in the tool's documentation.
In Step (3), the scores are computed for each entry in D. As discussed in Section 2, there are different ways to summarize the related disclosure risk. For example, one may compute the mean score over all records, which is done in [15]. For our purposes, it seems more appropriate to focus on the measure discussed in the context of Reiter et al.'s approach in Section 2. Let j be an arbitrary record in D. In Step (3), we additionally compute the attribution probability of all occurring target values y, that is, $AP_{s,j,y}$ for the synthetic datasets. $AP_{o,j,y}$ is defined similarly. Analogous to Equation 2.1 in Section 2, we define and compute $R = \frac{1}{m} \sum_{j=1}^{m} R_j$, where m = 1,473 is the total number of records. Note that R corresponds to the accuracy of the related FR-NN classifier used by the attacker, and is therefore more interesting to us than the mean of the scores. However, we want to stress that the following comparison between CAP and GCAP does not depend on focusing on accuracy, and we are able to draw similar conclusions by considering mean scores.
We now consider Table 1, which presents the scores for Scenario (1) and attribute key length three. The table summarizes the results of all possible keys amongst the variables in QI, that is, C(7, 3) = 35 different situations. Each cell contains the average of the respective risks R over these 35 attribute keys, as well as the standard deviation. The table consists of three columns: 0CAP comprises the risk if non-matches are coded as 0, and ICAP shows the result for ignored non-matches. In the third column, we have the disclosure risk based on the GCAP measure. Table 2 presents the results for Scenario (1) and key length six, whereas Tables 3 and 4 concern Scenario (2) with key lengths two and four.
We start by making general observations. GCAP results in a higher disclosure risk than 0CAP. Since $GCAP_{s,j} \geq CAP_{s,j}$ holds for all records j, this is no surprise. The difference is significant for the larger keys in Tables 2 and 4, which is also plausible since larger keys lead to an increasing number of non-matches. We point out that, in all tables, the risks entailed by GCAP are close to the risks that result from ICAP. Note again that ICAP just ignores non-matches and is only taken over matches. The large differences between 0CAP and GCAP already indicate that the number of ignored samples is significant in Tables 2 and 4.
Table 5 shows the average number of samples ignored by ICAP in each situation. We recall that the original dataset consists of 1,473 samples. Since ICAP and GCAP coincide on matches, the differences between them result from the varying scores of GCAP on the ignored samples. Since these differences are small, this experiment corroborates our claim that GCAP is a useful extension of the CAP disclosure risk measure. Whenever $CAP_{s,j}$ is undefined, the computation of $GCAP_{s,j}$ allows the data provider to give an adequate estimate for the risk of the respective record. Furthermore, we see that ignoring a large number of samples or assigning them risk 0 leads to an underestimation of the dataset's total risk.
We now focus on the differences between the synthesizers. For the DataSynthesizer with disabled Differential Privacy (DS 0) and the synthetic dataset generated by synthpop (SP), the risk entailed by GCAP is generally higher than for the Synthetic Data Vault (DV) and the data generated by the DataSynthesizer with Differential Privacy (DS 0.1). This result was to be expected, as the latter tools tend to produce a larger distortion of the data and, therefore, lead to lower disclosure risks.
More interesting is the comparison between smaller and larger key sizes. Compared to Table 3, the risk entailed by GCAP decreases in Table 4 for all tools except for synthpop. This development for larger key sizes is unexpected, since the attacker's situation should improve with an increase in prior knowledge. On the original dataset, for example, we observe a substantial increase in disclosure risk. In Scenario (1), the GCAP score rises from 54.9 to 84.0, which is a consequence of the fact that the equivalence class for large key sizes often contains only one element, namely exactly the record of the respective victim. From the attacker's perspective, the intuition is that there might be better ways to exploit longer keys on synthetic datasets than using the classifier related to the GCAP measure.
We therefore continued to study this problem by comparing the performance of several algorithms suitable for solving the attacker's classification problem related to Scenarios (1) and (2). In Tables 6-9, we show the results for Naive Bayes (NB), Support Vector Machine (SVM), K-NearestNeighbors (KNN), RandomForest (RF), Logistic Regression (LR) and the variant of the RadiusNearestNeighbor (FR-NN) classifier described by Algorithm 3.1. As explained earlier, the accuracy scores of the latter coincide with the GCAP disclosure risk measure. We utilised the scikit-learn package⁵ in Python and employed all algorithms with the standard parameter settings, to avoid unintended optimization.
[...] algorithms to their problem and then picks a prediction by implementing a majority vote on the results of the classifiers. Indeed, we observe that ENS generally scores above average. On the synthetic datasets of Scenario (1), ENS even exceeds all other classifiers in five out of eight cases.
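The majority vote behind ENS can be illustrated with a minimal sketch in plain Python. The prediction lists below are invented for illustration and are not results from the experiment; the function names and the tie-breaking rule (the label seen first wins) are assumptions, since the text does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_predictions):
    """Combine one prediction list per classifier into one list per record."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


def accuracy(predictions, truth):
    """Share of records whose predicted target value equals the true one."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Hypothetical predictions of three classifiers for four victim records.
nb_pred = ["none", "short", "none", "long"]
knn_pred = ["none", "short", "short", "long"]
rf_pred = ["short", "short", "none", "none"]
true_values = ["none", "short", "long", "long"]

ens_pred = ensemble_predict([nb_pred, knn_pred, rf_pred])
print(ens_pred, accuracy(ens_pred, true_values))
```

In scikit-learn, `VotingClassifier` with `voting="hard"` provides a comparable majority vote over fitted estimators; whether the experiment used that class or a custom vote is not stated here.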
The results indicate that, in addition to GCAP, the accuracy score of the ensemble classifier is also worth considering as a possible disclosure risk measure by the data provider. In Scenario (2), the accuracy of the ensemble on the synthetic data is relatively close to the accuracy on the original data. Clearly, the performance on the real data is an important reference point, as it usually constitutes an upper bound for the performance on the synthetic data. For the evaluation of the utility of the ensemble for the attacker, we should also consider lower bounds in terms of the accuracy of dummy classifiers. A first baseline is given by generating predictions uniformly at random. If we suppose that the attacker already uses the synthetic dataset, the predictions can be generated based on the target attribute's distribution. For example, we may consider a dummy classifier that always predicts the most frequent value of the target attribute in the synthetic data (sometimes called the zero-rule classifier).
All four synthetic data generators preserve the distribution of attributes to some extent. Therefore, the most frequent value of the target attribute is the same for all datasets, which leads to the same constant prediction and, therefore, constant accuracy scores of the dummy classifier. For Scenario (1), this score is 42.7%; for Scenario (2), it is 61.0%. One might come to the conclusion that the accuracy scores of the ensemble, and hence the disclosure risk, are still "small enough". On the synthetic datasets, 66% is never exceeded. However, to evaluate the general usefulness of data synthesis as a privacy-preserving method, we have to consider not the absolute risk, but the decrease of disclosure risk relative to the original data. In this sense, the ensemble scores of DS 0 and SP in Tables 6 to 9 exceed the respective dummy classifier baselines by a substantial margin, which may become more obvious by taking a look at the scores on a number line in Figure 2.
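The two dummy baselines can be made concrete with a small sketch. The target columns below are toy data, not the Contraceptive Method Choice dataset, and the function names are hypothetical. The zero-rule prediction is read off the synthetic data and then scored against the original records, as in the attacker model.

```python
from collections import Counter


def zero_rule_accuracy(synthetic_targets, original_targets):
    """Predict the most frequent synthetic value for every original record."""
    prediction = Counter(synthetic_targets).most_common(1)[0][0]
    hits = sum(1 for value in original_targets if value == prediction)
    return prediction, hits / len(original_targets)


def uniform_guess_accuracy(n_classes):
    """Expected accuracy when guessing uniformly at random among n_classes."""
    return 1 / n_classes


# Toy target columns with three distinct values (illustrative only).
synthetic = ["none"] * 5 + ["short"] * 3 + ["long"] * 2
original = ["none"] * 4 + ["short"] * 4 + ["long"] * 2

print(zero_rule_accuracy(synthetic, original))
print(uniform_guess_accuracy(3), uniform_guess_accuracy(4))
```

For a target with three values, uniform guessing is correct one third of the time on average; for four values, one quarter. The zero-rule baseline is at least as high whenever the most frequent value is preserved by the synthesizer. scikit-learn's `DummyClassifier` offers both strategies (`"most_frequent"` and `"uniform"`).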
On the other hand, all synthetic datasets prevent the attacker from exploiting larger attribute key sizes for re-identification, which is the most important reason for the high accuracy of FR-NN on the real data in Tables 7 and 9. Furthermore, the DataSynthesizer can be used with Differential Privacy to lower the disclosure risk, although the results for varying values of ε are rather unstable. The synthpop package also comes with many possibilities for achieving more privacy, such as removing replicated statistical uniques from the generated dataset. All these options, however, will affect the quality and the utility of the synthetic data, which should also be considered when assessing the results of the Synthetic Data Vault. The relation between the utility and the privacy of synthetic data is best described as a trade-off.
It has to be stressed that further experiments on other datasets are necessary to establish more empirical evidence. We therefore complemented our detailed experiment on the Contraceptive Method Choice dataset by considering two attacker scenarios for the Fertility dataset⁷. This dataset consists of 100 records of volunteers who provided semen samples. In ten attributes, it comprises a variety of sensitive health information, such as whether the patient had childhood diseases, accidents, serious trauma, or surgeries. Further features are the frequency of alcohol consumption, smoking habits and, of course, the diagnosis of the semen sample. Since only a few variables seemed to be adequate candidates for the set of quasi-identifiers, we focused on the following two scenarios:
(1) QI = {'age', 'alcohol', 'smoking habit'} Target = 'accident'
(2) QI = {'age', 'alcohol', 'smoking habit'} Target = 'surgery'
For both situations, we considered the average over the three attribute keys of length 2. Knowing only two of the three attributes in QI, the goal of the attacker is to infer whether their victim had an accident or surgery in the past. Tables 10 and 11 show the results. The dummy classifier baseline for the target 'accident' in Scenario (1) is 56%. Again, the ensemble exceeds this value substantially on DS 0 and SP, and the performance on DS 0 is actually close to that on the original data. For Scenario (2), the dummy baseline is 49% for SP and 51% for DS 0, DS 0.1 and DV. We may draw similar conclusions, although this is the first situation in which not only the use of DS 0 and SP, but also of DS 0.1 and DV may lead to privacy breaches and considerable disclosure risk for certain records. In Figure 3, we consider the scores of ENS in Table 11 on a number line.
⁷ https://archive.ics.uci.edu/ml/datasets/Fertility

CONCLUSION AND FUTURE WORK
In this paper, we considered the problem of establishing a baseline for attribute disclosure risk on synthetic data. Given some prior knowledge in the form of the values of several key attributes of a record of the original dataset and at least one synthesized dataset, what may the attacker infer about the record's entry for some sensitive target attribute? First of all, they may employ a zero-rule classifier, which considers the distribution of the target attribute in the synthetic data and forms the prediction by choosing the most frequent entry. This straightforward approach establishes a first baseline, but is superseded by other methods. We discussed Correct Attribution Probability, a recently published risk measure based on a matching mechanism, and generalized it to the GCAP measure, which also handles non-matches. In the evaluation, we saw that our approach improves the estimation of the disclosure risk, since it better reflects the ability of the adversary. Additional refinement of the accuracy scores is achieved by implementing several machine learning classifiers and employing an ensemble classifier that applies a majority vote to the predictions of the individual classifiers. We conducted our experiment by averaging over all possible attribute keys of a certain length for a predefined set of quasi-identifier variables, to provide an estimate of the average attack risk over all scenarios.
In Section 4, we saw that some of the evaluated synthetic datasets revealed sensitive information about the individuals in the original data. This can be prevented by using the disclosure control measures available to the user of the discussed tools. The influence on the quality and utility of the resulting synthetic data is certainly interesting and worth investigating further. However, we point out that there are conceptual limits to the pursuit of keeping data utility and simultaneously decreasing disclosure risk. In the long-key scenario of Table 7, the Synthetic Data Vault decreased the initially considerable disclosure risk of the original dataset down to the dummy classifier baseline. Obviously, this was not possible without also decreasing the utility of the synthetic dataset for training machine learning classifiers to predict the choice of contraceptive methods. Note that we just described one fact from two different perspectives. On an abstract level, the same property of the dataset has been altered by the synthesizer. This strong conflict between utility and disclosure prevention occurs whenever the target attribute in the applied classification task is a sensitive attribute. If the sensitive attribute is among the predictors, the problem is less drastic. In future work, we will therefore study the optimization problem of keeping data utility high and decreasing disclosure risk of sensitive predictor variables. Besides the mentioned experiments on other datasets, our future research will also concern the attacker's possibilities to make better use of prior knowledge and larger attribute keys. Finally, GCAP and all other concepts in this paper are only considered for categorical attributes. A generalization to continuous variables appears feasible.

Algorithm 3.1.
Input: A synthetic dataset S, a target attribute T in S, and an attribute key K together with the value vector $K_{o,j}$ of a record of the original data
Output: A prediction $T^*$ for $T_{o,j}$
1: Set $N = \emptyset$ and r = 0.
2: while $N = \emptyset$ do
3:     [...]
4:     r ← r + 1
5: Choose $T^*$ via majority vote among the values of T for the elements of N
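A possible reading of Algorithm 3.1 is sketched below in Python. Step 3 is not legible in the listing above, so the neighbourhood rule used here is an assumption: N collects all synthetic records whose key values differ from $K_{o,j}$ in at most r attributes (Hamming distance). With r = 0 this reduces to exact matching. The record layout (dictionaries), the function names, and the tie-breaking rule are likewise hypothetical.

```python
from collections import Counter


def hamming(record, key, key_values):
    """Number of key attributes on which the record differs from key_values."""
    return sum(1 for attr, value in zip(key, key_values) if record[attr] != value)


def fr_nn_predict(synthetic, target, key, key_values):
    """Grow the radius until neighbours exist, then take a majority vote.

    Returns the predicted target value and the radius at which it was found.
    """
    if not synthetic:
        raise ValueError("the synthetic dataset must not be empty")
    radius = 0
    while True:
        # Assumed step 3: all synthetic records within the current radius.
        neighbours = [r for r in synthetic if hamming(r, key, key_values) <= radius]
        if neighbours:
            break
        radius += 1  # no neighbours yet: widen the search
    votes = Counter(record[target] for record in neighbours)
    return votes.most_common(1)[0][0], radius


# Toy synthetic dataset (illustrative values only).
S = [
    {"age": "young", "education": "high", "method": "short-term"},
    {"age": "young", "education": "low", "method": "none"},
    {"age": "old", "education": "low", "method": "none"},
    {"age": "old", "education": "high", "method": "long-term"},
]
key = ("age", "education")

print(fr_nn_predict(S, "method", key, ("young", "high")))   # exact match exists
print(fr_nn_predict(S, "method", key, ("middle", "low")))   # non-match, radius grows
```

The loop terminates for any non-empty S, because at radius len(key) every synthetic record is a neighbour under this distance.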

Figure 1: Distribution of target variables
⁴ https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice
(1) Generate four synthesized versions of D of equal length:
• The DataSynthesizer without Differential Privacy
• The DataSynthesizer with Differential Privacy (ε = 0.1)
• The Synthetic Data Vault
• The synthpop package in R
(2) Compute all k-element subsets of the quasi-identifiers QI of the respective scenario. Each subset corresponds to an attribute key used in the following step.
(3) For each dataset, for each attribute key and the target of the scenario:
• Compute the CAP scores of all records in D, where non-matches get CAP score 0.
• Compute the CAP scores of all records in D, and ignore non-matches.
• Compute the GCAP scores of all records in D.
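The two CAP variants of Step (3) can be illustrated with a short sketch. It assumes the usual match-based reading of CAP: among the synthetic records that agree with an original record on the attribute key, the share that also agrees on the target. Section 2 remains the authoritative definition. The data, attribute names, and function names are hypothetical, and the summary uses mean scores rather than the accuracy-based measure R for brevity.

```python
def cap_score(synthetic, key, target, original_record):
    """CAP of one original record, or None if no synthetic record matches the key."""
    matches = [
        rec for rec in synthetic
        if all(rec[attr] == original_record[attr] for attr in key)
    ]
    if not matches:
        return None  # non-match: CAP is undefined
    correct = sum(1 for rec in matches if rec[target] == original_record[target])
    return correct / len(matches)


def summarize(scores):
    """Return (mean with non-matches as 0, mean over matches only, number ignored)."""
    matched = [s for s in scores if s is not None]
    zero_cap = sum(matched) / len(scores)
    ignore_cap = sum(matched) / len(matched) if matched else None
    return zero_cap, ignore_cap, len(scores) - len(matched)


# Toy data (illustrative values only).
synthetic = [
    {"age": "young", "religion": "yes", "method": "none"},
    {"age": "young", "religion": "yes", "method": "short-term"},
    {"age": "old", "religion": "no", "method": "none"},
    {"age": "old", "religion": "no", "method": "none"},
]
original = [
    {"age": "young", "religion": "yes", "method": "none"},
    {"age": "old", "religion": "no", "method": "none"},
    {"age": "old", "religion": "yes", "method": "long-term"},  # no match in synthetic
]

scores = [cap_score(synthetic, ("age", "religion"), "method", rec) for rec in original]
print(scores, summarize(scores))
```

In this toy case the third record is a non-match, so coding it as 0 pulls the mean down to 0.5, while ignoring it gives 0.75.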

Table 5: Average number of samples ignored by ICAP

Table caption: Machine Learning Algorithms Accuracy for the Attacker's Classification Problem (Fertility Dataset)