Time-Aware Anonymization of Knowledge Graphs

Knowledge graphs (KGs) play an essential role in data sharing because they can model both users' attributes and their relationships. KGs support many data analyses, such as classification, where a sensitive attribute is selected and the analyst studies the associations between users and the sensitive attribute's values (aka sensitive values). Data providers anonymize their KGs and share the anonymized versions to protect users' privacy. Unfortunately, an adversary can exploit these attributes and relationships to infer sensitive information by monitoring either one or many snapshots of a KG. To cope with this issue, in this paper we introduce (k, l)-Sequential Attribute Degree ((k, l)-sad), an extension of the k^w-tad principle [10], which ensures that the sensitive values of re-identified users are diverse enough to prevent them from being inferred with a confidence higher than \(\frac{1}{l}\), even when adversaries monitor all published KGs. In addition, we develop the Time-Aware Knowledge Graph Anonymization Algorithm (TAKGA) to anonymize KGs such that all published anonymized versions of a KG satisfy the (k, l)-sad principle while, at the same time, preserving the utility of the anonymized data. We conduct experiments on four real-life datasets to show the effectiveness of our proposal and compare it with k^w-tad.


[...] fraud attribute; or the disease attribute of the KGs in Figure 1). Classification tasks analyze KGs to extract associations between users' data and their sensitive attribute's values. Since these analyses require a huge amount of information, data providers share their KGs to improve the quality of the final results. Moreover, data providers regularly publish/share new KGs when they have new data about their users. However, this continuous sharing can potentially reveal the values of users' sensitive attribute. To avoid this leakage, data providers can apply a naive anonymization technique that hides the associations between users' real identities and their sensitive attribute's values by removing users' explicit identifiers (e.g., name, email) from the shared KGs. However, adversaries can still re-identify users in the anonymized KGs and infer their sensitive values [11]. Therefore, stronger anonymization techniques must be developed to protect users in anonymized KGs.
If we consider graph anonymization, then k-anonymity and differential privacy (DP) [13] are the two main approaches available in the literature. For example, k-degree, k-neighborhood, and k-automorphism [13] are k-anonymity techniques that ensure that adversaries cannot re-identify users with a confidence higher than 1/k by exploiting their degrees and neighborhood structure in the anonymized graphs. However, all k-anonymity techniques rely on precise assumptions about the information the adversaries can exploit to re-identify their target users in the anonymized data (e.g., node degree). On the other hand, DP techniques [5, 6, 12, 18] create differentially private analysis algorithms that extract statistics from original graphs such that adversaries cannot infer the existence of users (i.e., nodes) or their sensitive relationships (i.e., edges) by looking at the statistics. For instance, References [5, 6] extract the users' degree distribution and the distribution of attribute values of communities of users while hiding the presence of users in the graph. Although DP overcomes k-anonymity's limitation of relying on assumptions about adversaries' information, data providers must design a differentially private analysis algorithm for each type of statistic that data recipients (e.g., data analysts) require. This greatly diminishes flexibility, since all analyses must be known in advance. As a result, in this article, we follow the k-anonymity approach to design an anonymization technique that does not restrict current KG analyses while protecting users' privacy against the most widely exploited adversary background knowledge.
Recently, two k-anonymity proposals tailored to KGs have been presented [11, 22] (cf. Section 7 for more details). Reference [22] removes the associations between users and their sensitive values by creating groups of at least k users and sharing these groups and the frequency of their users' sensitive values separately. Unfortunately, the extracted frequencies only represent general information about the original KGs, and thus the generated KGs are not compatible with most KG applications (e.g., drug discovery, fraud detection). k-Attribute Degree (k-ad) [11] addresses this issue by generalizing users' attributes and relationships such that their attribute values and relationship out-/in-degrees are identical to those of at least k − 1 other users in the anonymized KGs. However, both proposals have two main drawbacks. The first is that they target a scenario where a KG is anonymized only once; in contrast, in many application scenarios, data providers may update their KGs and publish new anonymized versions. Second, adversaries can still infer the sensitive value of the target victim if all users in the same anonymized group (i.e., with the same attribute values and relationship out-/in-degrees) have the same sensitive value (the so-called homogeneity attack).
To cope with the first issue, in Reference [10] we introduced k^w-Time-Varying Attribute Degree (k^w-tad), which protects users from being re-identified when adversaries monitor attribute values and relationship out-/in-degrees in w continuous anonymized KGs. However, k^w-tad does not protect users from homogeneity attacks.

Example 1.1. Figure 2 illustrates anonymized versions of the KGs in Figure 1 satisfying 3^2-tad. The attribute values and relationship out-/in-degrees of Ken (u_1) in G_1 (Figure 2(a)) are identical to those of two other users: Ahmed (u_3) and Frank (u_5). The same goes for Lydia (u_2), Simon (u_4), and Deniz (u_6). By looking at G_1, adversaries cannot tell which of u_1, u_3, and u_5 is Ken. However, because these users have the same sensitive value (i.e., flu), adversaries can still infer that Ken's disease is flu.
In this article, we propose the (k, l)-Sequential Attribute Degree principle ((k, l)-sad), an extension of k^w-tad able to protect users from homogeneity attacks performed on different snapshots of an anonymized KG, where k, l are two integers such that 1 ≤ l ≤ k and k > 1. (k, l)-sad leverages the distinct l-diversity principle [13] defined for relational data, which ensures that users' sensitive values cannot be inferred with a confidence higher than 1/l. (k, l)-sad allows data providers to specify a sensitive attribute (e.g., disease) to be protected against homogeneity attacks, and it protects all users' attributes even though adversaries monitor all published anonymized KGs.
Moreover, we develop the Time-Aware Knowledge Graph Anonymization Algorithm (TAKGA) to generate KGs satisfying the (k, l)-sad principle. TAKGA exploits clustering techniques and generates anonymized KGs through two main steps: Clusters Generation and Knowledge Graph Generalization. The Clusters Generation step creates clusters whose users' generalized attributes/relationships satisfy the (k, l)-sad principle, called valid clusters. In this step, TAKGA allows data providers to adopt any preferred clustering algorithm (e.g., k-Medoids [7], HDBSCAN [3]). However, as discussed in Section 3, such algorithms cannot always generate valid clusters. To cope with this, we propose a strategy that modifies the generated clusters to make them valid, if possible, and we present our own clustering algorithm for generating valid clusters (see Section 7.3). Finally, the Knowledge Graph Generalization step takes as input the valid clusters returned by the Clusters Generation step and uses the Knowledge Graph Generalization Algorithm (KGG), proposed by us in Reference [11], to generalize users' attributes and relationships such that the attribute values and relationship out-/in-degrees of users in the same cluster are identical.

Symbol | Meaning
G_t, Ḡ_t | A KG and its anonymized version at time t
д_t, д̄_t | A sequence of continuous KGs and their anonymized versions at times 1, 2, . . . , t
V_t, V^U_t, V^A_t | The sets of nodes, users, and attributes' values in G_t and Ḡ_t
R_t, R^U_t, R^A_t | The sets of relationship types, user-to-user relationship types, and user-to-attribute relationship types in G_t and Ḡ_t
E_t, E^U_t, E^A_t | The sets of edges, user-to-user edges, and user-to-attribute edges in G_t and Ḡ_t
T | The set of timestamps 1, 2, . . . , t in д_t
U_t, Ū_t | The sets of users appearing at least once in д_t and д̄_t
r_a, r_r | A user-to-attribute and a user-to-user relationship type
v_a, v_r | An attribute value and a user
– | The relationships' in-degrees of user u in G_t and Ḡ_t
I_t(u), Ī_t(u) | The attributes' values and relationships' out-/in-degrees of user u in G_t and Ḡ_t
𝐈_t(u), 𝐈̄_t(u) | The sequences of attributes' values and relationships' out-/in-degrees of user u in д_t and д̄_t
s(u) | The sensitive value of user u
Siд_t(u) | The signature of user u at time t
C_t, C̄_t | The sets of clusters and valid clusters generated at time t
C̄_t(u) | The set of users whose sequence of attribute values and relationship out-/in-degrees in д̄_t is identical to that of user u

Example 2.2. The Historical Users Set Ū_3 of the sequence of anonymized snapshots д̄_3 in Figure 2 contains every user whose data appear in at least one of Ḡ_1, Ḡ_2, and Ḡ_3.

We assume that an adversary re-identifies his/her target user u by exploiting the background knowledge that he/she knows about u together with the attribute values and relationship out-/in-degrees that he/she can extract from the anonymized snapshots he/she accesses. In particular, we denote with Ī_t(u) the anonymized attribute values and relationship out-/in-degrees of u in Ḡ_t. By monitoring w continuous anonymized snapshots of a KG, denoted with д̄^w_t = (Ḡ_{t−w+1}, Ḡ_{t−w+2}, . . . , Ḡ_t), the adversary can obtain the sequence of extracted anonymized attributes' values and relationships' out-/in-degrees, denoted as 𝐈̄^w_t(u) = {Ī_{t−w+1}(u), Ī_{t−w+2}(u), . . . , Ī_t(u)} [10]. Then, let Ū^w_t ⊆ Ū_t be the set of users whose data are published in д̄^w_t, and let C̄^w_t(u) = {v ∈ Ū^w_t | 𝐈̄^w_t(u) = 𝐈̄^w_t(v)} be the set of users whose anonymized sequences in д̄^w_t are identical to u's. The confidence of re-identifying the target user u is thus 1/|C̄^w_t(u)|. We presented k^w-tad [10] to protect users' identities when an adversary exploits w continuous anonymized snapshots д̄^w_t: k^w-tad prevents any user u in Ū^w_t from being re-identified with a confidence higher than 1/k when an adversary exploits the above knowledge. We formally define k^w-tad as follows.
Definition 2.3 (k^w-tad [10]). Let д̄^w_t = (Ḡ_{t−w+1}, . . . , Ḡ_t) be a sequence of w continuous published snapshots of a KG and Ū^w_t be the set of users whose data are published in д̄^w_t. д̄^w_t satisfies k^w-tad, if and only if, for every user u ∈ Ū^w_t, there exists a subset of at least k − 1 other users in Ū^w_t whose sequences of anonymized attribute values and relationship out-/in-degrees in д̄^w_t are identical to u's.

k^w-tad has two main disadvantages. The first is that k^w-tad is designed to anonymize KGs in the general case in which the data provider does not specify a sensitive attribute. Thus, when all users in C̄^w_t(u) have the same value for the sensitive attribute, an adversary can infer the real value of the target u's sensitive attribute. Second, it is hard for the provider to specify an appropriate value for w. In what follows, we describe the knowledge that the adversary exploits to infer the value of his/her target user u's sensitive attribute and our principle addressing these disadvantages.

Adversary Knowledge
Given a sequence д_t of snapshots of a KG, to protect users' identities the data provider publishes д_t's anonymized versions, i.e., д̄_t, satisfying k^w-tad [10]. By monitoring all published anonymized snapshots of the KG, i.e., д̄_t, an adversary can extract the sequence of attribute values and relationship out-/in-degrees of his/her target user u from д̄_t: 𝐈̄_t(u) = {Ī_1(u), Ī_2(u), . . . , Ī_t(u)}. This allows the adversary to find the subset of users in Ū_t whose sequences are identical to u's, i.e., C̄_t(u) = {v ∈ Ū_t | 𝐈̄_t(u) = 𝐈̄_t(v)}.
Example 2.4. Let us consider the sequence of anonymized KGs in Figure 2. At time 1, the anonymized attribute values and relationship out-/in-degrees of Ken (u_1) in G_1 (Figure 2(a)) are equal to those of u_3 and u_5. Similarly, u_2, u_4, and u_6 also have the same attribute values and relationship out-/in-degrees in G_1.
At time 2, data about u_1 are published again in G_2 (Figure 2(b)). Since u_1, u_3, and u_5 have the same attribute values and relationship out-/in-degrees in G_1, k^w-tad ensures that these users also have the same attribute values and relationship out-/in-degrees in G_2. As u_2 is removed by the provider, k^w-tad removes u_4 and u_6, whose attribute values and relationship out-/in-degrees in G_1 are equal to those of u_2. Moreover, k^w-tad adds two fake users, fu_1 and fu_2, whose attribute values and relationship out-/in-degrees in G_2 are identical to those of u_7.
At time 3, since Frank (u_5) and Deniz (u_6) are removed, k^w-tad removes u_1, u_2, u_3, u_4 from G_3 (Figure 2(c)). Moreover, data about Bob (u_8) are published for the first time, and k^w-tad adds two fake users, fu_3 and fu_4, whose attribute values and relationship out-/in-degrees in G_3 are identical to those of u_8. Similarly, it keeps those of u_7, fu_1, and fu_2 identical. By exploiting д̄_3, the adversary can determine that the sequence of attribute values and relationship out-/in-degrees in д̄_3 of u_1 is identical to those of u_3 and u_5; the same holds within {u_7, fu_1, fu_2} and within {u_8, fu_3, fu_4}. Therefore, the adversary cannot re-identify any user in Ū_3 with a confidence higher than 1/3.

An adversary can exploit C̄_t(u) to infer u's sensitive value. More precisely, let C_t(u) be the set of users whose anonymized data in Ḡ_t are identical to those of u. Even if the adversary cannot identify which user in C_t(u) is u, he/she can infer that u's sensitive value is among the sensitive values of users in C_t(u). The set containing these sensitive values is called user u's signature at time t, i.e., Siд_t(u). We formally define the signature of user u at time t as follows.
Definition 2.5 (Signature). Let C_t(u) = {v ∈ Ū_t | Ī_t(u) = Ī_t(v)} be the set of users whose attribute values and relationship out-/in-degrees in Ḡ_t are identical to those of u. The signature of user u at time t, denoted by Siд_t(u), is the set of sensitive values of C_t(u)'s users: Siд_t(u) = {s(v) | v ∈ C_t(u)}.

Example 2.6. Referring to Example 2.4, at time 1, u_2's attribute values and relationship out-/in-degrees in G_1 (Figure 2(a)) are identical to those of u_4 and u_6; in other words, C_1(u_2) = {u_2, u_4, u_6}. If the adversary has access to only G_1, then he/she can infer u_2's sensitive value with a confidence of at most 1/3. However, since all users in C_1(u_1) = {u_1, u_3, u_5} have the same sensitive value, the adversary can identify that u_1's sensitive value is flu. By exploiting G_2 (Figure 2(b)), the adversary knows that C_2(u_1) = {u_1, u_3, u_5}; thus, Siд_1(u_1) = Siд_2(u_1) = {flu}. Since k^w-tad does not consider users' sensitive values and C_2(u_7) = {u_7, fu_1, fu_2}, it randomly assigns sensitive values to fu_1 and fu_2. Therefore, Siд_2(u_7) = {s(u_7), s(fu_1), s(fu_2)} = {bronchitis, flu, dyspepsia}, and the confidence of inferring u_7's sensitive value is 1/3. In G_3 (Figure 2(c)), Siд_3(u_7) = {bronchitis, flu, gastritis}, so by exploiting G_3 alone the adversary cannot infer u_7's sensitive value with a confidence higher than 1/3. However, since the adversary knows that u_7's sensitive value is in both Siд_2(u_7) and Siд_3(u_7), he/she can compute the intersection of these signatures: Siд_2(u_7) ∩ Siд_3(u_7) = {bronchitis, flu, dyspepsia} ∩ {bronchitis, flu, gastritis} = {bronchitis, flu}. Therefore, he/she can increase the confidence of inferring u_7's sensitive value to 1/2 by exploiting G_2 and G_3. We formally define the adversary knowledge containing all the above-mentioned extracted attributes' values, relationships' out-/in-degrees, and signatures as follows.

Definition 2.7 (Adversary Knowledge). Let д̄_t = (Ḡ_1, Ḡ_2, . . . , Ḡ_t) be a sequence of anonymized KGs published by a data provider and u ∈ Ū_t be a target user. The knowledge that an adversary can exploit to infer the sensitive value of u contains: (i) the sequence of u's anonymized attribute values and relationship out-/in-degrees in д̄_t, i.e., 𝐈̄_t(u); and (ii) u's signatures at times 1, 2, . . . , t, i.e., Siд_1(u), Siд_2(u), . . . , Siд_t(u).
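To see the attack mechanically, the following minimal sketch (with toy data taken from Example 2.6; the helper name is ours, not the paper's) intersects a target user's signatures across the monitored snapshots:

```python
from functools import reduce

# Toy signatures from Example 2.6; in a real attack these would be read
# off the published anonymized snapshots.
signatures = {
    # user: {time: signature, i.e., the set of sensitive values of the
    #        users who are indistinguishable from the target at that time}
    "u7": {2: {"bronchitis", "flu", "dyspepsia"},
           3: {"bronchitis", "flu", "gastritis"}},
}

def inferred_confidence(sigs: dict) -> float:
    """Intersect a user's signatures over all monitored snapshots; the
    inference confidence is 1 / |intersection|."""
    common = reduce(set.intersection, sigs.values())
    return 1 / len(common)

print(inferred_confidence(signatures["u7"]))  # 0.5, i.e., confidence 1/2
```

Each snapshot alone only gives confidence 1/3, but the intersection shrinks the candidate set to two values, raising the confidence to 1/2.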

(k,l)-Sequential Attribute Degree
To give data providers flexibility in protecting their users' identities and sensitive values from any adversary with the background knowledge given by Definition 2.7, we present the (k, l)-sad principle, where k, l are two integers and l ≤ k. In particular, given a sequence of anonymized KGs д̄_t, if the provider assigns 1 to l, then this principle prevents any user appearing at least once in д̄_t from being re-identified with a confidence higher than 1/k. Otherwise, if l is higher than 1, then this principle protects not only users' identities but also their sensitive values from being inferred with a confidence higher than 1/l. The principle is formally defined as follows.

Definition 2.8 ((k, l)-Sequential Attribute Degree). Let д̄_t be a sequence of continuous anonymized KGs that a data provider has published. д̄_t satisfies (k, l)-Sequential Attribute Degree, if and only if, for every user u ∈ Ū_t:

• there exists a set of at least k − 1 other users in Ū_t whose sequences of anonymized attribute values and relationship out-/in-degrees in д̄_t are identical to u's; and
• u's signature contains at least l distinct sensitive values and does not change over time, i.e., |Siд_t(u)| ≥ l and Siд_1(u) = Siд_2(u) = · · · = Siд_t(u).

If a sequence of anonymized KGs д̄_t satisfies (k, l)-sad, then both the identities and the sensitive values of its users are protected even though its provider inserts new users, removes some users, re-inserts users, and updates users' attributes and relationships (we refer the reader to Section 5 for more details).
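For illustration, the two conditions can be checked as in the following sketch, where the per-user data layout is our own simplification: each user is summarized by the sequence of his/her anonymized attribute values and relationship out-/in-degrees over all published snapshots, so signature checking collapses to group sizes and distinct-value counts.

```python
def satisfies_kl_sad(users, k: int, l: int) -> bool:
    """users: user id -> (sequence, sensitive_value). `sequence` is a
    hashable summary of the user's anonymized attribute values and
    relationship out-/in-degrees over all published snapshots.
    Simplification: the signature of a user is taken as the sensitive
    values of the users sharing his/her whole sequence."""
    groups: dict = {}
    for seq, sval in users.values():
        groups.setdefault(seq, []).append(sval)
    # |group| >= k bounds re-identification confidence by 1/k;
    # >= l distinct sensitive values bounds inference confidence by 1/l.
    return all(len(g) >= k and len(set(g)) >= l for g in groups.values())

toy = {"u1": (("flu-like", 2, 1), "flu"),
       "u3": (("flu-like", 2, 1), "gastritis"),
       "u5": (("flu-like", 2, 1), "bronchitis")}
print(satisfies_kl_sad(toy, k=3, l=2))  # True
```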

TIME-AWARE KNOWLEDGE GRAPH ANONYMIZATION ALGORITHM
This section introduces the key idea of our anonymization algorithm, namely TAKGA. The algorithm takes as input G_t, the KG at time t; д_{t−1}, i.e., the sequence of t − 1 already published anonymized versions of the KG; A, a clustering algorithm (e.g., k-Medoids [7], HDBSCAN [3]); k and l, two integers given by the data provider; and τ, a threshold. It generates Ḡ_t, the anonymized version of G_t, such that the resulting sequence of snapshots д̄_t satisfies the (k, l)-sad principle. TAKGA uses τ, ranging from 0 to 1, to prevent users from being inserted into clusters that are too far away (see the discussion of Function 1). TAKGA leverages clustering techniques to anonymize users' data in G_t (i.e., attributes' values and relationships' out-/in-degrees) such that users end up in valid clusters.
We formally define valid clusters as follows.

Definition 3.1 (Valid Cluster).
Let G_t be the KG at time t, д̄_{t−1} be the sequence of published anonymized snapshots of the KG, and k, l be two integers, where l ≤ k. A cluster c of users in V^U_t that is generated at time t is valid if and only if: (1) c contains at least k users; (2) the sequences of anonymized attribute values and relationship out-/in-degrees in д̄_{t−1} of c's users are identical; (3) the signature of c's users, i.e., Siд_c = {s(u) | u ∈ c}, has at least l values; and (4) ∀u ∈ c, Siд_1(u) = Siд_2(u) = · · · = Siд_t(u).
The first and second conditions of Definition 3.1 ensure that, for each user, there are at least k − 1 other users whose anonymized data in д̄_{t−1} and G_t are identical, i.e., that the first condition of (k, l)-sad is satisfied. This is achieved by generalizing the data of users in the same valid cluster so that they become identical. The third and fourth conditions of Definition 3.1, in turn, ensure that users' signatures have at least l sensitive values and that the signatures do not change across the anonymized versions of the considered KG, i.e., that the second condition of (k, l)-sad is satisfied. In what follows, we detail the two steps of our algorithm: Clusters Generation and Knowledge Graph Generalization.
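A direct transcription of the four conditions into code might look as follows (a sketch; the user record layout is our assumption, not the paper's):

```python
def is_valid_cluster(cluster, k: int, l: int) -> bool:
    """cluster: list of user records, each a dict with
    'hist' -- the user's anonymized data in the published sequence
              (None for new users, who have no published history),
    'sig'  -- the tuple of the user's signatures Sig_1(u), ..., Sig_t(u),
    's'    -- the user's sensitive value."""
    if len(cluster) < k:                                   # condition (1)
        return False
    hists = {u["hist"] for u in cluster if u["hist"] is not None}
    if len(hists) > 1:                                     # condition (2)
        return False
    if len({u["s"] for u in cluster}) < l:                 # condition (3)
        return False
    # Condition (4): each user's signature must be unchanged over time.
    return all(len(set(u["sig"])) <= 1 for u in cluster)

# A toy cluster of three users with two distinct sensitive values:
c = [{"hist": "h", "sig": ("ab",), "s": "flu"},
     {"hist": "h", "sig": ("ab",), "s": "flu"},
     {"hist": "h", "sig": ("ab",), "s": "gastritis"}]
print(is_valid_cluster(c, k=3, l=2))  # True
```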
Clusters Generation. Several clustering algorithms (e.g., k-Medoids [7] and HDBSCAN [3]) have been applied to anonymize relational data, graphs [13], and KGs [10, 11], but none of them considers users' signatures. Thus, these algorithms can generate invalid clusters. Moreover, clusters generation must handle three different scenarios: (1) insertion of new users, whose data are not published in previous anonymized KGs; (2) deletion of existing users, whose data are published in previous anonymized KGs; and (3) re-insertion or update of deleted/existing users. To this end, we design three strategies: New Users Handling (NUH), Deleted Users Handling (DUH), and Updated/Re-Inserted Users Handling (URUH). These strategies handle the above-mentioned scenarios such that new/re-inserted/updated users are in valid clusters, whereas deleted users are still protected according to (k, l)-sad.
New Users Handling Strategy: Since new users' data are not published in previous KGs, NUH only needs to ensure that new users are in clusters that have at least k users sharing at least l distinct sensitive values (the first and third conditions of Definition 3.1). More precisely, NUH first uses the provided clustering algorithm A to assign users to clusters such that users in the same cluster have similar attributes and relationships. The similarity between two users' data is measured by the Attribute and Degree Information Loss Metric (ADM) [11], namely d_adm. According to this metric, the distance between two users is estimated as the information we lose by making their attributes' values and relationships' out-/in-degrees identical. The measure will be detailed in Section 4.1. NUH then modifies the clusters returned by A to make all of them valid, using two modification approaches: adding fake users to invalid clusters and assigning invalid clusters' users to other valid clusters. The first approach keeps adding fake users to invalid clusters until these clusters are valid.

Example 3.3. Let c_1 = {u_1, u_2}, c_2 = {u_5}, and c_3 = {u_3, u_4, u_6} be the clusters generated by the provided clustering algorithm A, and suppose k, l are 3, 2, respectively. According to Definition 3.1, c_1 and c_2 are invalid, whereas c_3 is a valid cluster. The first approach makes c_1 valid by adding a fake user fu_1 to c_1, which therefore consists of three users, u_1, u_2, fu_1, who have two distinct sensitive values (i.e., flu and gastritis). Then, at time 3 (cf. Figure 1(c)), the approach adds a fake user fu_2 to the invalid cluster containing Bob (u_8) and Gavin (u_7) to create a valid one.
The second approach avoids invalid clusters by removing them and assigning their users to other valid clusters. To prevent a user from being assigned to a cluster that is too far away (according to the metric given by Definition 4.1), NUH only performs the assignment if the distance is less than or equal to the maximum distance defined by the threshold τ, specified by data providers. Users whose distances to all valid clusters' users are higher than the maximum distance are removed. The lower the threshold, the more users NUH removes.
Example 3.4. Referring to Example 3.3, at time 1, the invalid cluster c_2 = {u_5} is removed. Suppose the maximum distance, determined by τ, is higher than the distance between u_5 and the valid cluster c_3 = {u_3, u_4, u_6}. Then u_5 is assigned to c_3, and the valid cluster {u_3, u_4, u_5, u_6} is created.
NUH combines both approaches to modify invalid clusters so that they become valid. In particular, before modifying an invalid cluster, it finds the cluster's users who cannot be assigned to any valid cluster and the number of fake users that must be added to make the cluster valid. If the number of non-assignable users is higher than the number of required fake users, then it adds fake users to the cluster; otherwise, it assigns the cluster's users to other valid clusters, as shown in the sketch below. For instance, referring to Example 3.3, to make c_1 valid, this strategy must either add a fake user or assign c_1's users to the valid cluster c_3. Suppose that the specified maximum distance is too small to allow assigning u_1 and u_2 to c_3. Since the number of non-assignable users (i.e., 2) is higher than the number of required fake users (i.e., 1), the strategy adds a fake user fu_1 to c_1. Similarly, instead of adding two fake users to c_2, it assigns u_5 to c_3 and removes c_2.
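The decision rule combining the two approaches can be sketched as follows (the helpers and data layout are ours; `dist` stands for the ADM-based user-to-cluster distance of Definition 4.1, and τ_d for the derived threshold distance):

```python
def fakes_needed(cluster, k: int, l: int) -> int:
    """Fake users needed so the cluster reaches k users and l sensitive values."""
    distinct = len({u["s"] for u in cluster})
    return max(k - len(cluster), l - distinct, 0)

def repair_action(cluster, valid_clusters, dist, tau_d, k, l) -> str:
    """Decide how to repair an invalid cluster, following NUH's rule:
    add fakes only when more users would otherwise be lost than fakes added."""
    nonassignable = [u for u in cluster
                     if all(dist(u, c) > tau_d for c in valid_clusters)]
    return ("add_fakes" if len(nonassignable) > fakes_needed(cluster, k, l)
            else "reassign_and_remove")

# Example 3.3 revisited: c1 = {u1, u2} with k = 3, l = 2. If neither user
# can reach a valid cluster, 2 non-assignable users > 1 required fake user,
# so NUH adds one fake user instead of dropping both.
dist = lambda u, c: 1.0  # toy constant distance above the threshold
c1 = [{"s": "flu"}, {"s": "gastritis"}]
print(repair_action(c1, [[{"s": "flu"}]], dist, tau_d=0.5, k=3, l=2))  # add_fakes
```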
Deleted Users Handling Strategy: DUH manages user deletion by ensuring that, for each deleted user u, there are at least k − 1 other deleted users whose anonymized data in д̄_{t−1} are identical to those of u. These deletions leave the signatures of these users unchanged at time t and make the set of exploitable anonymized attribute values and relationship out-/in-degrees at time t (i.e., Ī_t(u)) empty. Therefore, both properties of (k, l)-sad are satisfied.
Example 3.6. At time 2, Lydia (u_2) is deleted. Since in G_1 (Figure 3(a)) u_2 has the same anonymized attribute values and relationship out-/in-degrees as u_1 and fu_1, TAKGA deletes u_1 and fu_1 to protect the deleted user u_2. Similarly, at time 3, since Frank (u_5) and Deniz (u_6) are removed from G_3 (Figure 1(c)), TAKGA removes u_3 and u_4.
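A minimal sketch of this bookkeeping (the group representation is our assumption; it follows the behavior of Example 3.6, where a deletion triggers the deletion of the whole indistinguishability group):

```python
def users_to_delete(prev_groups, deleted):
    """DUH sketch: for each group of users whose anonymized data in the
    published sequence are identical, if any member is deleted by the
    provider, delete the remaining members too, so every deleted user
    keeps at least k - 1 indistinguishable deleted companions."""
    extra = set()
    for group in prev_groups:            # groups from the previous releases
        if any(u in deleted for u in group):
            extra |= {u for u in group if u not in deleted}
    return extra

# Example 3.6: Lydia (u2) is deleted; her group {u1, u2, fu1} forces
# the deletion of u1 and fu1 as well.
print(users_to_delete([{"u1", "u2", "fu1"}], {"u2"}))  # {'u1', 'fu1'}
```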
Updated/Re-Inserted Users Handling Strategy: URUH generates clusters for updated/re-inserted users whose anonymized data are published in д̄_{t−1}. This strategy ensures that these users are in clusters of at least k users whose anonymized data in д̄_{t−1} are identical and whose sensitive values are identical to their previous signatures. It also splits clusters of updated/re-inserted users having at least 2 × k users. NUH already ensures that these users' signatures have at least l sensitive values the first time they are inserted; thus, by keeping these signatures unchanged at time t, URUH makes these clusters satisfy conditions (3) and (4) of Definition 3.1. Since their clusters also have at least k users who share the same sequence of anonymized data in д̄_{t−1} (conditions (1) and (2)), updated/re-inserted users are in valid clusters.
Example 3.7. In G_2 (Figure 3(b)), URUH keeps u_3, u_4, u_5, u_6 in the same cluster, because these users have the same attribute values and relationship out-/in-degrees in G_1 (Figure 3(a)) and their signatures in G_2 are equal to those in G_1.
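The grouping-and-splitting step can be sketched as follows (record layout assumed; the naive size-based split below is a simplification, since the paper's split additionally preserves each sub-cluster's signature via Function Create_Minimal_Signature_Clusters(), Section 4):

```python
def uruh_clusters(updated_users, k: int):
    """URUH sketch: group updated/re-inserted users by their published
    anonymized history (so conditions (1)-(2) of Definition 3.1 carry over
    from the previous releases), then split groups of >= 2*k users."""
    groups: dict = {}
    for u in updated_users:              # u: {"id", "hist", "s"}
        groups.setdefault(u["hist"], []).append(u)
    clusters = []
    for g in groups.values():
        while len(g) >= 2 * k:           # keep clusters below 2*k users
            g, half = g[:-k], g[-k:]
            clusters.append(half)
        clusters.append(g)
    return clusters

users = [{"id": i, "hist": "h", "s": "flu"} for i in range(5)]
print([len(c) for c in uruh_clusters(users, k=2)])  # [2, 3]
```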
Knowledge Graph Generalization. Given the set of clusters generated in the previous step, this step generalizes the attributes and relationships of users in the same cluster such that these users are protected according to the (k, l)-sad principle (Definition 2.8). To this end, for each new/re-inserted/updated user in the current snapshot, this step must generalize users' data such that the attributes' values and relationships' out-/in-degrees of users in the same cluster are identical.
To generalize users' data, this step leverages the KGG proposed in Reference [11], which generates G_t's anonymized version Ḡ_t by adding/removing edges in G_t such that the attributes' values and relationships' out-/in-degrees of users in the same cluster are identical. KGG generalizes the attributes' values of users in a cluster by finding the union of their attributes' values and adding user-to-attribute edges to make the attributes' values of these users identical to those contained in the union. The generalization of the relationships of users in a cluster is done by finding the maximum out-/in-degree of each relationship type among these users and adding user-to-user edges to make the out-/in-degrees of these users identical to the maximum one. If KGG cannot add user-to-user edges to increase these users' out-/in-degrees, then it removes edges to reduce the maximum out-/in-degrees and continues adding user-to-user edges.
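The core of this generalization (attribute union plus degree leveling) can be sketched as follows; the edge-removal fallback of KGG [11] is omitted, and the user representation is our assumption:

```python
def generalize_cluster(cluster):
    """Sketch of KGG's target profile for one cluster. Each user is a dict
    with 'attrs' (attribute -> set of values) and 'out'/'in' (relationship
    type -> degree). Every user must then be brought up to the returned
    profile by adding user-to-attribute and user-to-user edges."""
    attrs, out_deg, in_deg = {}, {}, {}
    for u in cluster:
        for a, vals in u["attrs"].items():       # union of attribute values
            attrs.setdefault(a, set()).update(vals)
        for r, d in u["out"].items():            # maximum out-degrees
            out_deg[r] = max(out_deg.get(r, 0), d)
        for r, d in u["in"].items():             # maximum in-degrees
            in_deg[r] = max(in_deg.get(r, 0), d)
    return attrs, out_deg, in_deg

u1 = {"attrs": {"nationality": {"FR"}}, "out": {"knows": 2}, "in": {"knows": 1}}
u2 = {"attrs": {"nationality": {"IT"}}, "out": {"knows": 1}, "in": {"knows": 3}}
print(generalize_cluster([u1, u2]))
# ({'nationality': {'FR', 'IT'}}, {'knows': 2}, {'knows': 3})
```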
In the next section, we will detail the Clusters Generation step while interested readers can refer to Reference [11] for more details about KGG.

CLUSTERS GENERATION
In this section, we introduce the algorithms underlying the Clusters Generation step, which takes as input the snapshot of a KG at time t, i.e., G_t, a sequence of already published anonymized snapshots of the considered KG, д̄_{t−1}, a clustering algorithm A, two integers k, l, and a threshold τ, and generates valid clusters such that all involved users are protected by (k, l)-sad.
Before this, we introduce the adopted similarity measure.

Attribute and Degree Information Loss Metric
The similarity between two users' data is measured by using the ADM [11], namely d_adm. According to this metric, the distance between two users is estimated as the information we lose by making their attributes' values and relationships' out-/in-degrees identical. As such, the higher the distance between two users, the less similar they are. d_adm estimates the information loss of u and v by finding the differences between their generalized attribute values and relationship out-/in-degrees and their original ones. The generalized values of an attribute r_a for users u and v, i.e., GV_{r_a}(u, v), are the union of u's and v's values of attribute r_a in G_t. The generalized out-/in-degree of a relationship type r_r for users u and v is the maximum out-/in-degree of relationship type r_r of u and v in G_t. The Attribute and Degree Information Loss Metric is then defined as follows.

Definition 4.1 (Attribute and Degree Information Loss [11]). Let u, v be two users in a KG G_t. The ADM of making u and v have the same values on all attributes and the same out- and in-degrees on all types of relationships, denoted d_adm(u, v), combines d_am, d_odm, and d_idm, i.e., the attribute, out-degree, and in-degree distances between u and v; each component measures the difference between the generalized attribute values and relationship out-/in-degrees of u and v and their original ones (we refer the reader to Reference [11] for the complete formulas).

Clusters Generation Algorithm (Algorithm 1). Input: G_t, the version of the input KG at time t; д̄_{t−1}, the sequence of previous anonymized snapshots of the considered KG; A, the selected clustering algorithm; k, l, two positive integers; τ, a positive float number. Output: the set of valid clusters C̄_t.

The algorithm first generates clusters for the set of new users, U^new_t, by calling the New Users Handling Algorithm (Algorithm 2), which exploits the provided clustering algorithm A (line 3). Next, it handles deleted users, whose data are published in д̄_{t−1} (i.e., in U_{t−1}) but not in V^U_t, and re-inserted/updated ones, whose data are in both U_{t−1} and V^U_t (lines 4–11). In particular, it first categorizes users in U_{t−1} into clusters (i.e., C_{t−1}) such that each cluster contains users whose anonymized data in д̄_{t−1} are identical (line 4). For each set of users U^existed_t in C_{t−1} whose anonymized data in д̄_{t−1} are identical, the algorithm computes U^deleted_t, that is, the subset of U^existed_t denoting users who are not in the current snapshot G_t (line 6). Then, it calls the Deleted Users Handling Algorithm (Algorithm 3) with U^deleted_t, U^existed_t, and k as parameters to find the set of users U^deleting_t in V^U_t who must be deleted to protect users in U^deleted_t (line 7). Then, the algorithm computes U^updated_t, that is, the set of updated/re-inserted users whose anonymized data have been published in one of the graphs in д̄_{t−1} and will be published at time t (line 8). Next, it calls the Updated/Re-Inserted Users Handling Algorithm (Algorithm 4) with U^updated_t, D_t, and k as parameters, to generate the clusters of updated/re-inserted users, C^updated_t (line 9). Each cluster in C^updated_t contains users whose anonymized data in д̄_{t−1} are identical, its users' sensitive values are the same as their previous signatures, and it has fewer than 2 × k users. The returned set of clusters C^updated_t is added to C̄_t (line 10). Finally, the algorithm returns C̄_t (line 12). The following sections describe in more detail the three above-mentioned algorithms for handling new, deleted, and re-inserted/updated users.
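To give a feel for the metric, the sketch below counts the edges one would add to equalize two users, one component per part of the definition; the exact normalization and weighting of d_am, d_odm, and d_idm in Reference [11] may differ, so the unweighted sum here is an assumption:

```python
def d_adm(u, v) -> float:
    """Illustrative ADM-style distance. Users are dicts with 'attrs'
    (attribute -> set of values) and 'out'/'in' (relationship type ->
    degree). Each component counts what must be added to make u and v
    identical; the unweighted sum is our simplification."""
    d_am = 0.0
    for a in set(u["attrs"]) | set(v["attrs"]):
        union = u["attrs"].get(a, set()) | v["attrs"].get(a, set())
        d_am += (len(union) - len(u["attrs"].get(a, set()))
                 + len(union) - len(v["attrs"].get(a, set())))
    def deg_gap(x, y):
        # Edges to add so both users reach the maximum degree per type.
        return sum(abs(x.get(r, 0) - y.get(r, 0)) for r in set(x) | set(y))
    d_odm = deg_gap(u["out"], v["out"])
    d_idm = deg_gap(u["in"], v["in"])
    return d_am + d_odm + d_idm

u = {"attrs": {"nationality": {"FR"}}, "out": {"knows": 2}, "in": {"knows": 1}}
v = {"attrs": {"nationality": {"IT"}}, "out": {"knows": 1}, "in": {"knows": 3}}
print(d_adm(u, v))  # 2 attribute additions + 1 + 2 degree increments = 5.0
```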

New Users Handling Strategy. Given a set of new users U^new_t, the provided clustering algorithm A, the distance matrix D_t storing the ADM distances between users in G_t, two integers k, l, and a threshold τ, Algorithm 2 generates a set of valid clusters containing the new users, C^new_t. Algorithm 2 first checks whether U^new_t has at least 2 × k users and, if this is the case, uses the provided clustering algorithm A (e.g., k-Medoids [7] and HDBSCAN [3]) to cluster the users in U^new_t. It then computes the threshold distance τ_d from τ and the maximum and minimum distances d_max, d_min in D_t (lines 11 and 12). Next, it modifies the invalid clusters returned by A, either by assigning their users to valid clusters, if their distances are less than or equal to the threshold distance τ_d, or by adding fake users to make these clusters valid (line 13). Finally, to minimize the number of big clusters, whose size is higher than or equal to 2 × k, in C^new_t, Algorithm 2 splits them into smaller valid clusters.

Procedure Assign_Valid_Clusters_Or_Add_Fake_Users(): This procedure takes as input the set of invalid clusters, the set of valid clusters C^new_t, the distance matrix D_t, two integers k, l, and the threshold distance τ_d. For every invalid cluster c̃, the procedure finds the signature of the users in c̃, Siд_c̃ (line 2). It then calculates n_fake, the number of fake users that must be added to make c̃ a valid cluster, i.e., one having at least k users who share at least l distinct sensitive values (line 3). Next, it calls Function Assign_Clusters() to assign the users in c̃ to their nearest cluster in C^new_t such that the distance between a user and his/her nearest cluster is less than or equal to τ_d (line 4). The assigned clusters are stored in C^c̃_t, whereas the users in c̃ who cannot be added to any cluster in C^new_t are stored in U_r. If n_fake is higher than the number of users in U_r, then the procedure accepts the assignment, i.e., it keeps the clusters in C^c̃_t and removes the users in U_r; otherwise, it adds n_fake fake users to c̃ by calling Function Add_Fake_Users().

Function Assign_Clusters(): This function takes as input an invalid cluster c̃, the set of valid clusters C^new_t, the distance matrix D_t, and the threshold distance τ_d. It first initializes the set of new valid clusters C^c̃_t as C^new_t and the set of removing users U_r as empty (lines 1 and 2). Next, for every user u in c̃, it finds u's closest cluster c_min whose distance to u is less than or equal to τ_d, by calling Function Find_Closest_Cluster() (line 4). If c_min is found, then it adds u to c_min (line 6); otherwise, u is added to U_r (line 8). Finally, it returns C^c̃_t and U_r (line 11).

Function Add_Fake_Users(): This function takes as input a cluster c̃, its users' signature Siд_c̃, and two integers k, l. At the beginning, it initializes the resulting cluster c as c̃ and its signature Siд_c as Siд_c̃ (lines 1 and 2). While c has fewer than k users or Siд_c has fewer than l sensitive values, it keeps adding fake users to c (lines 3–12). In particular, if Siд_c has fewer than l sensitive values, then it lets s be a random sensitive value that is not in Siд_c (lines 4 and 5); otherwise, it lets s be a random sensitive value (lines 6–8). Then, it creates a fake user u_fake whose sensitive value is s (line 9) and adds u_fake to c and s to Siд_c (lines 10 and 11). Finally, it returns c (line 13).
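A direct Python rendering of this loop (a sketch; the fake-user representation and the helper name are ours):

```python
import random

def add_fake_users(cluster, sig, k, l, all_values):
    """Pad a cluster with fake users until it has >= k users whose
    sensitive values cover >= l distinct values (cf. Function
    Add_Fake_Users()). Assumes at least l distinct sensitive values
    exist in `all_values`."""
    c, sig_c = list(cluster), set(sig)
    while len(c) < k or len(sig_c) < l:
        # Prefer a sensitive value not yet in the signature, if any remain.
        candidates = sorted(set(all_values) - sig_c) or sorted(all_values)
        s = random.choice(candidates)
        c.append({"id": f"fu{len(c)}", "s": s, "fake": True})
        sig_c.add(s)
    return c, sig_c

c, sig = add_fake_users([{"id": "u1", "s": "flu"}, {"id": "u2", "s": "flu"}],
                        {"flu"}, k=3, l=2,
                        all_values={"flu", "gastritis", "bronchitis"})
print(len(c), sig)  # 3 users; signature now holds 2 distinct values
```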

Deleted Users Handling Strategy.
Given the set U^existed_t of users whose anonymized data in д̄_{t−1} are identical, the set U^deleted_t ⊆ U^existed_t of U^existed_t's users deleted at time t, and an integer k, this strategy, implemented in Algorithm 3, ensures that for every deleted user at time t there are at least k − 1 other deleted users whose attribute values and relationship out-/in-degrees in д̄_{t−1} are identical to his/hers. Moreover, it returns the set of valid clusters C^existed_t, each of which contains users that have the same anonymized attribute values and relationship out-/in-degrees in д̄_{t−1}.
First, Algorithm 3 initializes the set of users U^deleting_t and the set of clusters C^existed_t as empty (lines 1 and 2). Then, it finds the set of clusters C_{t−1} containing users whose signatures in the previous anonymized KG Ḡ_{t−1} are identical (line 3). Even though the clusters in C_{t−1} were valid when generated, they can become invalid in G_t, because some of their users are deleted. Thus, the algorithm deletes these invalid clusters (lines 4–13). In particular, for every cluster c in C_{t−1}, it finds the set of users U^c_t in c who are not deleted; if c contains deleted users, then the remaining users in U^c_t are added to U^deleting_t as well, so that every deleted user keeps at least k − 1 indistinguishable deleted companions; otherwise, c is added to C^existed_t.

Updated/Re-Inserted Users Handling Strategy.
Given a set U^updated_t of updated/re-inserted users whose anonymized data in д̄_{t−1} are identical, the distance matrix D_t of users in G_t, and an integer k, this strategy, implemented by Algorithm 4, generates valid clusters of U^updated_t's users and prevents such clusters from becoming too big (i.e., having at least 2 × k users).
Algorithm 4 first finds the clusters whose users' anonymized data in д̄_{t−1} are identical and then splits every cluster having at least 2 × k users into smaller valid ones by calling Function Create_Minimal_Signature_Clusters(). This function takes as input a cluster c, the distance matrix D_t of users in G_t, the signature Siд_c of c's users, and a sensitive attribute's value s_min, which belongs to a user in c. First, it finds the set U_min of users whose sensitive value is s_min (line 1) and initializes C^c_t with the empty set (line 3). For each user u in U_min, it creates a cluster c_u containing only u (line 5). Then, for each sensitive value s that is in Siд_c and different from s_min, the function finds U_s, the set of U_r's users whose sensitive value is s (line 7); it then finds c_u's closest user u_min in U_s (line 8) and adds u_min to c_u (line 9). c_u is added to C^c_t (line 11), and its users are removed from U_r (line 12). Finally, it returns C^c_t and the remaining users in U_r (line 14). Section 5 analyzes how TAKGA ensures that all involved users are protected according to (k, l)-sad.

PRIVACY ANALYSIS
In this section, we prove that (k, l)-sad and TAKGA always protect users even though data providers insert/delete/update/re-insert users in the target KG. If a sequence of anonymized versions of a KG д̄_t satisfies (k, l)-sad, then both users' identities and sensitive values are protected, as the following theorem states.
Theorem 5.1. Let д_t be a sequence of snapshots of a KG held by a data provider, where each snapshot is created by modifying the previous one, i.e., by inserting new users, deleting existing users, re-inserting users, or updating existing users' attributes and relationships. If its anonymized version д̄_t satisfies (k, l)-sad, then an adversary cannot exploit his/her background knowledge (cf. Definition 2.7) to infer the identities and sensitive values of users in Ū_t with a confidence higher than 1/k and 1/l, respectively.
Proof. Suppose that д̄_t satisfies (k, l)-sad and that an adversary can infer u's identity with a confidence higher than 1/k by using the background knowledge specified by Definition 2.7. The adversary can extract u's anonymized attributes' values and relationships' out-/in-degrees in д̄_t (i.e., 𝐈̄_t(u)) and compute the set of users C̄_t(u) = {v ∈ Ū_t | 𝐈̄_t(u) = 𝐈̄_t(v)}; for the inference to succeed, it must hold that |C̄_t(u)| < k. However, because д̄_t satisfies (k, l)-sad, |C̄_t(u)| ≥ k even if u is a new user, a re-inserted user, a deleted user, or a user whose attributes/relationships are updated. This contradiction shows that the adversary cannot infer u's identity with a confidence higher than 1/k. Similarly, the adversary can extract u's signatures in д̄_t, i.e., Siд_1(u), Siд_2(u), . . . , Siд_t(u). Since u's sensitive value is in all these signatures, the adversary can compute their intersection Siд(u) = Siд_1(u) ∩ Siд_2(u) ∩ · · · ∩ Siд_t(u); for the adversary to infer u's sensitive value with a confidence higher than 1/l, the intersection must have fewer than l distinct sensitive values. However, since д̄_t satisfies (k, l)-sad, Siд_1(u) = Siд_2(u) = · · · = Siд_t(u) and |Siд_t(u)| ≥ l. Therefore, for each KG Ḡ_i ∈ д̄_t, Siд(u) = Siд_i(u) and |Siд(u)| ≥ l, which again yields a contradiction. We can thus conclude that if д̄_t satisfies (k, l)-sad, then the adversary cannot exploit his/her background knowledge (Definition 2.7) to infer the identities and sensitive values of users in Ū_t with a confidence higher than 1/k and 1/l, respectively.
We are now left to prove that TAKGA always ensures that the sequence of generated anonymized KGs д̄_t satisfies (k, l)-sad. To this end, we first prove some lemmas. First, we prove that the Clusters Generation Algorithm (Algorithm 1) returns clusters of new users such that all of these clusters have at least k users and their users share at least l distinct sensitive values (Lemma 5.2). Then, for each user u whose data are published in д̄_{t−1} but not in G_t, we prove that there are at least k − 1 other deleted users whose anonymized data in д̄_{t−1} are identical to those of u (Lemma 5.3). Furthermore, we prove that each updated/re-inserted user u whose data are published in both д̄_{t−1} and G_t is in a cluster that has at least k users whose anonymized data in д̄_{t−1} are identical to those of u, and that the sensitive values of these users are identical to u's previous signature in Ḡ_{t−1} (Lemma 5.4). Finally, we use Lemmas 5.2, 5.3, and 5.4 to prove Theorem 5.5, i.e., that TAKGA always generates sequences of anonymized KGs д̄_t satisfying (k, l)-sad even though data providers insert/delete/update/re-insert some users. For Lemma 5.2, suppose that a new user u_new is in an invalid cluster c_{u_new}. If c_{u_new} has from k to 2 × k − 1 users, then it was created either by Function Merge_Split() when its users share at least l distinct sensitive values (line 6, Function 1) or by Procedure Assign_Valid_Clusters_Or_Add_Fake_Users() when the procedure adds fake users to it (line 8, Procedure 1). Otherwise, c_{u_new} is the result of splitting a cluster having at least 2 × k users; in this case, c_{u_new} has at least k users and its users also share at least l distinct sensitive values (line 13, Function 4). As this contradicts our assumption, we can conclude that for each new user u_new, his/her cluster has at least k users and the cluster's users share at least l distinct sensitive values. Analogously, Lemma 5.4 follows because each updated/re-inserted user u_updated is kept in a cluster c_{u_updated} whose users have identical attributes' values and relationships' out-/in-degrees in д̄_{t−1} and whose sensitive values are identical to u_updated's previous signature in Ḡ_{t−1}.
Theorem 5.5. Let д̄_t be a sequence of anonymized versions of a KG generated by TAKGA from time 1 to time t. д̄_t satisfies (k, l)-sad.
Proof. We prove this theorem by induction on t.

Base case: At time t = 1, Algorithm 1 returns a set of new clusters C_1 for all the users in д_1, as no user's data have been published yet. By using the KGG [11], TAKGA generates an anonymized KG Ḡ_1. Suppose that д̄_1 = (Ḡ_1) does not satisfy (k, l)-sad; then there is a user u_new ∈ Ū_1 such that |C̄_1(u_new)| < k or |Siд_1(u_new)| < l. Let c_{u_new} be u_new's cluster in C_1. c_{u_new} has at least k users and its users share at least l distinct sensitive values (according to Lemma 5.2). As KGG ensures that c_{u_new}'s users have the same attributes' values and relationships' out-/in-degrees in Ḡ_1, c_{u_new} ⊆ C_1(u_new) [11]. Thus, C_1(u_new)'s users also share at least l distinct sensitive values, which results in |Siд_1(u_new)| ≥ l. Moreover, as new users' data are not published in previous KGs, for every user v ∈ C_1(u_new), 𝐈̄_1(u_new) = 𝐈̄_1(v). As a result, C̄_1(u_new) = C_1(u_new) and |C̄_1(u_new)| ≥ k. This contradicts our assumption; therefore, at time t = 1, д̄_1 satisfies (k, l)-sad.
Induction step: Suppose that д̄_{t−1} = (Ḡ_1, . . . , Ḡ_{t−1}) satisfies (k, l)-sad; we need to prove that д̄_t satisfies (k, l)-sad even though data providers insert new users, delete users, update users, or re-insert deleted users. Analogously to the base case, we can easily prove that all new users are protected according to (k, l)-sad at time t.
Suppose that there is an updated user u_updated ∈ V^U_t such that |C̄_t(u_updated)| < k, his/her signature at time t (i.e., Siд_t(u_updated)) has fewer than l values, or it is different from his/her signature at time t − 1. Let c_{u_updated} be u_updated's cluster in C_t returned by Algorithm 1. Since д̄_{t−1} satisfies (k, l)-sad, for every user v ∈ c_{u_updated}, 𝐈̄_{t−1}(u_updated) = 𝐈̄_{t−1}(v), |c_{u_updated}| ≥ k, and Siд_t(u_updated) = Siд_{t−1}(u_updated) (according to Lemma 5.4). Since KGG makes the attributes' values and relationships' out-/in-degrees of users in the same cluster identical, for every user v ∈ c_{u_updated}, Ī_t(u_updated) = Ī_t(v). Thus, c_{u_updated} ⊆ C̄_t(u_updated), |C̄_t(u_updated)| ≥ k, and the signature at time t of C̄_t(u_updated)'s users is identical to that at time t − 1. As д̄_{t−1} satisfies (k, l)-sad, the signature at time t − 1 has at least l sensitive values. Thus, the users in C̄_t(u_updated) also share at least l distinct sensitive values, i.e., |Siд_t(u_updated)| ≥ l. This contradicts our assumption, so such a user u_updated cannot exist.
Finally, suppose there is a deleted user u_deleted at time t such that |C̄_t(u_deleted)| < k. According to Lemma 5.3, as д̄_{t−1} satisfies (k, l)-sad, there is a set c_{u_deleted} of at least k deleted users whose anonymized data in д̄_{t−1} are identical to those of u_deleted. As no user in c_{u_deleted} is published in Ḡ_t, Ī_t(u_deleted) = Ī_t(v) = ∅ for every v ∈ c_{u_deleted}, i.e., 𝐈̄_t(u_deleted) = 𝐈̄_t(v), and thus |C̄_t(u_deleted)| ≥ k, which contradicts our assumption. Therefore, we can conclude that at time t, д̄_t satisfies (k, l)-sad even though data providers insert/delete/update/re-insert users.

EVALUATION
In this section, we first present an experiment evaluating the impact of TAKGA's parameters (i.e., A, k, l, τ) on the quality of anonymized KGs. Then, we assess the effects of the number of monitored anonymized versions of a KG on their quality. Finally, since Reference [10] only illustrates a theoretical algorithm without reporting evaluation results, the last experiment compares the effectiveness of TAKGA with that of CKGA [11]. However, CKGA addresses static KGs; therefore, we compare TAKGA and CKGA in anonymizing a single KG snapshot. Before illustrating the experimental results, we describe the datasets, the metrics we use to evaluate the anonymized KGs' quality, and the clustering settings.

Datasets
As KGs can model many types of graphs, we use four real-life datasets to evaluate the effectiveness of our algorithm, namely Email-temp [15], Yago [9], Email-Eu-core [15], and Freebase [2]. Email-temp is a directed graph that contains timestamped emails sent between members of a research institution. Yago is a KG containing timestamped information derived from Wikipedia, WordNet, and other data sources. Email-temp and Yago are organized in snapshots with different numbers of edges and nodes; each snapshot is generated by adding and removing nodes/edges to/from its previous snapshot. Email-Eu-core is a static directed graph modelling the emails sent between members of a research institution, whereas Freebase is a static KG consisting of attributes' values (e.g., nationality, location) and relationships (e.g., spouse, parent) of famous people (e.g., the film director Anthony Asquith). We choose department as the sensitive attribute for both Email-temp and Email-Eu-core, since previous anonymization approaches [13] used it as the sensitive attribute to be protected. The sensitive attributes of Yago and Freebase are chosen as isCitizenOf and location, respectively, because location is a popular sensitive attribute in location-based anonymization approaches and all users in Yago and Freebase have values for these attributes. We use Email-temp and Yago to tune the parameters of TAKGA, whereas Email-Eu-core and Freebase are used to compare our algorithm with CKGA [11]. Table 2 illustrates the properties of the considered datasets.
In practice, data providers often publish a snapshot of their KGs when the snapshot has enough edges. However, some snapshots of Email-temp and Yago have very few edges, and it is unrealistic that a data provider would publish them directly. Therefore, we assume that the data provider publishes a snapshot of their dataset every time the number of edges in the dataset reaches a specific threshold, and the threshold is the same for all generated snapshots. We generate 20 snapshots for each dataset by merging consecutive raw snapshots such that the standard deviation of the number of edges of the generated snapshots is minimized. All generated snapshots have a similar number of edges.
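A greedy sketch of this preprocessing is shown below (the paper's exact optimization minimizes the standard deviation of snapshot sizes; the threshold-based heuristic here is our simplification, with raw snapshots represented by their edge counts):

```python
import statistics

def merge_snapshots(edge_counts, n_out: int = 20):
    """Greedily merge consecutive raw snapshots so that each published
    snapshot has roughly total_edges / n_out edges, keeping the standard
    deviation of the published snapshots' sizes low."""
    target = sum(edge_counts) / n_out
    merged, current = [], 0
    for e in edge_counts:
        current += e
        if current >= target and len(merged) < n_out - 1:
            merged.append(current)
            current = 0
    merged.append(current)       # remainder forms the last snapshot
    return merged

raw = [5, 40, 3, 60, 52, 7, 49, 44, 61, 2, 38]
out = merge_snapshots(raw, n_out=4)
print(out, statistics.pstdev(out))  # e.g., [108, 108, 105, 40] 29.0...
```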
To illustrate the changes of nodes/edges across the generated snapshots, we calculate the ratio of added/removed nodes and edges. The ratio of added/removed nodes (edges, respectively) of a snapshot is calculated by summing up its added and removed nodes (edges, respectively) and dividing the sum by its number of nodes (edges, respectively). Figure 4 illustrates the ratio of added/removed edges (Figure 4(a)) and nodes (Figure 4(b)) for the generated snapshots. Email-temp's snapshots change more edges and nodes than those of Yago: the average ratio of added/removed edges of Email-temp's snapshots is 0.89, while that of Yago is 0.03, and the average ratio of added/removed users of Email-temp's snapshots is 0.14, while that of Yago is 0.0001.

To measure the quality of anonymized KGs, we use two information-loss metrics, AIL and AAIL: the lower their values, the higher the quality of the anonymized KGs.

To evaluate the accuracy of classification tasks on anonymized KGs, we exploit the Relational Graph Convolutional Network (RGCN) [19], the state-of-the-art deep learning model for KGs' classification tasks. RGCN learns users' features by combining the attributes of users and their neighbors; the learned features are used to classify these users. Following the approach described in Reference [8], to train RGCN on an anonymized KG, we use the sensitive attribute's values as the classification labels. Then, we randomly split the KG into training, validation, and test sets. We use the training set to train RGCN and monitor the cross-entropy loss on the validation set; the training stops when the loss has not decreased for 10 consecutive epochs. Finally, we measure the accuracy of the trained model on the test set. Since each user in an anonymized KG can be associated with many labels, we exploit Reference [14] to calculate the average accuracy of the prediction over all labels.

Clustering Settings
Since TAKGA leverages clustering, we choose two state-of-the-art clustering algorithms from two different clustering approaches (i.e., the centroid-based and the density-based approach) to run our experiments. The first algorithm is k-Medoids [7], a traditional centroid-based clustering algorithm that chooses a medoid user for each cluster and minimizes the distances between the medoid and the remaining users in the same cluster; as a result, the maximum distances within the generated clusters are also minimized. The second one is HDBSCAN [3], the state-of-the-art density-based clustering algorithm, which gathers users who are close to each other into the same clusters: a user can be added to a cluster if his/her distance to one of the cluster's users is less than a density threshold.
However, these algorithms require specifying some parameters. Although we can keep the values that the algorithms' authors recommend for some parameters, others are specifically required for TAKGA's anonymization. k-Medoids takes as input the number of clusters to be generated, whereas HDBSCAN receives the minimum size of these clusters. As TAKGA receives k as the minimum size of its generated clusters, we assign k to the minimum cluster size parameter of HDBSCAN. The number of clusters generated by k-Medoids is calculated by dividing the number of users in the KGs by k.
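In code, this parameter wiring looks roughly as follows (assuming the scikit-learn-extra and hdbscan packages, which provide common k-Medoids and HDBSCAN implementations; the paper does not name specific libraries):

```python
# pip install scikit-learn-extra hdbscan
from sklearn_extra.cluster import KMedoids
import hdbscan

def make_clusterer(name: str, n_users: int, k: int):
    if name == "kmedoids":
        # One cluster per k users, on a precomputed ADM distance matrix.
        return KMedoids(n_clusters=max(n_users // k, 1), metric="precomputed")
    if name == "hdbscan":
        # TAKGA's k becomes HDBSCAN's minimum cluster size.
        return hdbscan.HDBSCAN(min_cluster_size=k, metric="precomputed")
    raise ValueError(name)

# Both accept the precomputed distance matrix D_t via fit(D_t) / fit_predict(D_t).
```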
Moreover, to show the impact of these algorithms on the quality of anonymized KGs, we implement a baseline approach, hereafter called Invalid Removal, that naively removes invalid clusters instead of modifying them by calling Function Merge_Split() (line 6, Algorithm 2). We use this approach to evaluate the quality of anonymized KGs generated with HDBSCAN and k-Medoids. Then, we use the second approach, namely Merge and Split, implemented by Algorithm 2, which modifies the input clusters by using Function Merge_Split() (Function 1). By comparing the Invalid Removal and Merge and Split approaches, our experiments show the impact not only of these clustering algorithms but also of TAKGA's cluster modification.

Tuning TAKGA
In this experiment, we evaluate the effects of TAKGA's parameters, i.e., the clustering algorithm A, k, l, and τ, on the quality of anonymized KGs, on two datasets: Email-temp and Yago. Due to space limitations, Figures 5–8 illustrate the experimental results of all snapshots on Email-temp, while the experimental results on the Yago dataset are summarized in Table 3.

Effects of the provided clustering algorithm A. TAKGA allows data providers to specify their own clustering algorithm A. To study the impact of different clustering algorithms on the quality of anonymized KGs, we need to reduce the impact of the other parameters (i.e., k, l, and τ). To this end, we fix k = 2, l = 1, and τ = 1, so that k, l are at their minimum values and τ does not impact the information loss of anonymized KGs. Figure 5 illustrates the quality of the obtained anonymized KGs for the Email-temp dataset. Since the quality of anonymized KGs generated from Yago does not change much from time 2 to 20, we only show the quality, calculated by AIL and AAIL, of anonymized KGs generated at time 1 and the average quality of those generated from time 2 to 20; the results are reported in Table 3(a).
The Merge and Split approach generates higher quality anonymized KGs than the Invalid Removal approach on both datasets. By using the Invalid Removal approach, at t = 1, the AIL of the anonymized KG generated by executing TAKGA with k-Medoids (i.e., 0.2368 on Email-temp and 0.4799 on Yago) is higher than that of the one generated with HDBSCAN (i.e., 0.0017 on Email-temp and 0.1610 on Yago), because HDBSCAN's clusters have at least k users, who are not removed by the Invalid Removal approach. In contrast, by using the Merge and Split approach, the AIL and AAIL of the anonymized KG generated with k-Medoids decrease to 0.0017 and 0.0001, respectively, on Email-temp (0.0003 on Yago), while those of the one generated with HDBSCAN decrease to 0.0001 on Email-temp (0.0008 and 0.0003, respectively, on Yago). From time 2 to 20, the average AIL and AAIL of anonymized KGs generated by running TAKGA with k-Medoids decrease from 0.2810 and 0.0022 to 0.1286 and 0.0010, respectively, on Email-temp (from 0.9192 and 0.0038 to 0.7599 and 0.0003, respectively, on Yago). The same trend is observed for those generated with HDBSCAN.
Overall, the experiments show that running TAKGA with k-Medoids and the Merge and Split approach generates the highest-quality anonymized KGs; therefore, we use this setting in the remaining experiments.
Effects of k. TAKGA uses k to decide the minimum size of the generated clusters. To evaluate the impact of k, we fix l = 1 and τ = 1, and we evaluate the quality of anonymized KGs generated by running TAKGA with k-Medoids for varying values of k, namely 2, 4, 6, 8, 10, which are commonly used by various graph anonymization approaches [4, 11, 13]. Since τ = 1, TAKGA neither removes users nor adds fake ones. Figure 6 illustrates AIL and AAIL on the Email-temp dataset; since the results from time 2 to 20 on the Yago dataset are very similar, we show them in Table 3(b).
The higher k's value, the more information users lose in both datasets. At t = 1, increasing k from 2 to 10 increases AIL (AAIL) from 0.0017 to 0.0261 (from 0.0001 to 0.0010, respectively) on Email-temp.

Effects of l. Increasing l does not always decrease the quality of anonymized KGs. When we increase l from 1 to 2, the AIL of anonymized KGs decreases from 0.0261 to 0.0010 on Email-temp and from 0.0025 to 0.0024 on Yago; the AAIL of these KGs is 0.0010 on Email-temp and decreases from 0.0025 to 0.0024 on Yago. The decrements of AIL and AAIL indicate that TAKGA can improve the quality of anonymized KGs by assigning users in invalid clusters to valid ones. When l is increased from 2 to 4, the AIL (AAIL) of anonymized KGs increases from 0.0010 to 0.0015 (from 0.0010 to 0.0015, respectively) on Email-temp, whereas the AIL and AAIL of those generated on Yago remain 0.0024. The increments of AIL and AAIL on Email-temp occur because this dataset contains outliers whose data are too different from those of the remaining users; assigning these outliers to valid clusters increases the information loss of other users.
The variation of AAIL on Email-temp is higher than on Yago, because Email-temp has only 5 sensitive values, whereas Yago has 191. These changes are also small since, at k = 10, all generated clusters have at least 10 users, who can share from 2 to 4 sensitive values; TAKGA does not modify these clusters, and increasing l does not change the information loss of their users. Therefore, when using TAKGA, data providers should focus more on k when they want to generate high-quality anonymized KGs and can freely choose any value of l less than or equal to k.
Effects of τ. TAKGA uses τ to decide whether an invalid cluster should be removed, with its users assigned to other valid clusters, or whether fake users should be added to make it valid. A user can be assigned to a cluster if the maximum distance between the user and the cluster's users is less than or equal to the maximum distance determined by τ. If the number of fake users is higher than the number of non-assignable users, then TAKGA removes the cluster and assigns its users according to τ's condition. We assess the impact of τ by fixing k = 10 and l = 4 and evaluating the anonymized KGs' quality for varying values of τ: 0, 0.25, 0.5, 0.75, 1. Figure 8 and Table 3(d) show AIL and AAIL on Email-temp and Yago.
The lower τ's value, the more users are removed, resulting in higher values of AIL and AAIL. At t = 1, by decreasing τ from 1 to 0, AIL increases from 0.0015 to 0.1611 on Email-temp and from 0.0024 to 0.1101 on Yago, and AAIL increases from 0.0015 to 0.0419 on Email-temp and from 0.0024 to 0.0095 on Yago. In particular, with τ = 1, all users belonging to invalid clusters can be assigned to other valid ones, so TAKGA always removes invalid clusters and performs the assignments; at t = 1, the AIL of anonymized KGs generated with τ = 1 is 0.0015 on Email-temp and 0.0024 on Yago, the smallest among all generated KGs when varying τ's values. With τ = 0, no user of an invalid cluster can be assigned to any valid cluster; in this case, if the number of users in invalid clusters is less than the number of required fake users, then TAKGA adds fake users, and otherwise these users are removed. These fake and removed users make the AIL and AAIL of anonymized KGs generated at τ = 0 the highest among those generated with other τ's values on both datasets: at t = 1, the AIL of anonymized KGs generated with τ = 0 is 0.1611 on Email-temp and 0.1101 on Yago. Similar trends can be seen in anonymized KGs generated from time 2 to 20. Therefore, τ allows data providers to control the number of fake/removed users while keeping the information loss of users in anonymized KGs low.

Impact of the Length of the Sequence of Monitored Published KGs
The quality of anonymized KGs keeps decreasing over time when TAKGA monitors all published anonymized versions of a KG. In practice, however, it is hard for adversaries to monitor all published anonymized versions of a target KG. Therefore, data providers can improve the quality of their anonymized KGs by specifying the number of contiguous anonymized versions to be monitored, i.e., what we call the window size, denoted in what follows as w. Then, after publishing w contiguous anonymized versions of a KG, the providers refresh their anonymization and start monitoring from the current anonymized KG. In this experiment, we fix k = 10, l = 4, and τ = 1, and evaluate the quality of anonymized KGs generated with varying values of the window size w. Figure 9 and Table 3(e) show the information loss of anonymized KGs generated with window sizes No_Reset, 1, 2, 3, and 4. With No_Reset, TAKGA monitors all published anonymized versions of a KG.
Resetting the anonymization window improves the quality of anonymized KGs. When TAKGA tracks all published anonymized versions of a KG, AIL (AAIL, respectively) keeps increasing from 0.0015 to 0.6062 (from 0.0015 to 0.0083, respectively) on Email-temp, and from 0.0024 to 0.9763 (from 0.0024 to 0.0028, respectively) on Yago. However, when we reset the window every four anonymized KGs, the maximum AIL (AAIL) decreases from 0.7538 (0.1291, respectively) to 0.4986 (0.0075, respectively) on Email-temp and from 0.9953 (0.1585, respectively) to 0.9941 (0.0192, respectively) on Yago. The minimum AIL and AAIL obtained with windows of size 2 and 4 are similar to those of the KGs generated when TAKGA does not monitor published KGs. When AIL and AAIL are too high, data providers can reset the monitoring window to improve the quality of their anonymized KGs. Therefore, TAKGA allows data providers to trade off between privacy protection and the quality of anonymized KGs.
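The reset policy can be summarized with the following minimal sketch, assuming a hypothetical takga_anonymize function that takes the monitored history into account; w = None denotes No_Reset.

```python
# A minimal sketch of the window-reset scheme; takga_anonymize is a
# hypothetical stand-in for the full anonymization algorithm.
def publish_snapshots(snapshots, w, k, l, tau):
    history = []  # anonymized versions assumed to be monitored by adversaries
    for t, kg in enumerate(snapshots, start=1):
        if w is not None and len(history) == w:
            history = []  # refresh: start monitoring from the current KG
        anonymized = takga_anonymize(kg, history, k, l, tau)
        history.append(anonymized)
        yield t, anonymized
```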

Impact on the Classification Accuracy
This experiment measures the impact of the privacy parameters (i.e., k and l) on the accuracy of RGCN models trained on anonymized KGs. To this end, we set τ = 1 and A = k-Medoids and select different values of k (i.e., 2, 4, 6, 8, 10) and l (i.e., 1, 2, 3, 4). For each pair of k and l, we generate anonymized versions of 20 snapshots. Table 4 reports the average accuracy of RGCN models trained on these versions, as well as the average accuracy of models trained on the original data. In particular, with l = 1, each user is associated with a single label (the sensitive attribute's value); thus, increasing k makes users having different features (attribute values and those of users' neighbors) share the same label and decreases the average accuracy of RGCN models trained on the anonymized KGs. Increasing k from 2 to 10 decreases the average accuracy of models from 0.73 to 0.55 on Email-temp and from 0.91 to 0.79 on Yago (Table 4(a)). This is because TAKGA adds fake edges to make the attribute values and relationship degrees of every user identical to those of k − 1 other users. The fake edges increase the number of users with the same features but different sensitive attributes. Thus, the diversity of users' features in some labels is decreased, which degrades RGCN classification.
Indeed, RGCN predicts the classification label (the sensitive attribute's value) based on a user's attribute values and those of his/her neighbors. The models trained on anonymized KGs do not have enough data to classify users in the test set. Therefore, the average accuracy of predicting these labels decreases when we increase k.
Increasing l improves the quality of the trained models. At k = 10, increasing l from 1 to 4 increases the accuracy of RGCN models from 0.55 to 0.82 on Email-temp and from 0.79 to 0.90 on Yago (Table 4(b)). When l is greater than 1, each user belongs to at least l labels. Thus, each label has more users. Therefore, the average accuracy of predicting all labels increases. The experimental results indicate that even with high values of k and l, the trained RGCN models are accurate enough to be used in practice. With k = 10 and l = 4, the average accuracy of RGCN models is 0.82 on Email-temp and 0.90 on Yago, while that of the models trained on the original snapshots is 0.80 on Email-temp and 0.90 on Yago. Since the difference between the average accuracy at the highest protection setting (i.e., k = 10, l = 4) and that on the original snapshots is low in all datasets, we believe that the quality of the anonymized KGs is acceptable in practice.
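For reference, a two-layer RGCN node classifier of the kind used in this experiment can be sketched as follows; this is a minimal illustration assuming PyTorch Geometric, and the hidden size and depth are illustrative rather than our exact training configuration.

```python
# A minimal RGCN node-classification sketch (PyTorch Geometric assumed);
# the hidden size and depth are illustrative.
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

class RGCN(torch.nn.Module):
    def __init__(self, num_features, num_relations, num_classes):
        super().__init__()
        # Each layer aggregates neighbors' features per relation type, so the
        # predicted label depends on a user's attributes and his/her neighbors'.
        self.conv1 = RGCNConv(num_features, 16, num_relations)
        self.conv2 = RGCNConv(16, num_classes, num_relations)

    def forward(self, x, edge_index, edge_type):
        h = F.relu(self.conv1(x, edge_index, edge_type))
        return self.conv2(h, edge_index, edge_type)  # logits over labels
```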

Comparative Evaluation
We are aware of only one anonymization algorithm for KGs (i.e., CKGA [11]), and it can only be used with static KGs. In this experiment, we compare the quality of the anonymized KGs generated by TAKGA and CKGA when anonymizing a single snapshot of a KG. We run TAKGA with τ = 1, A = k-Medoids, and l = 1, as CKGA does not consider users' sensitive values. Since CKGA can only anonymize static KGs, we can only use Email-Eu-core and Freebase to conduct the comparative evaluation. Figure 10 shows AIL of the anonymized KGs generated by TAKGA and CKGA on Email-Eu-core and Freebase with varying k's values. CKGA's results are taken from [11].
TAKGA generates higher-quality anonymized KGs than CKGA on both datasets. When k = 10, AIL of the anonymized KGs generated by TAKGA (0.0135 on Email-Eu-core and 0.0065 on Freebase) is 87.85% (40.37%) lower than that of the ones generated by CKGA (0.0535 on Email-Eu-core and 0.0109 on Freebase) on Email-Eu-core (Freebase). This is because CKGA always removes invalid clusters and assigns their users to other valid ones, whereas TAKGA measures the costs of the two approaches that make invalid clusters valid, i.e., assigning these clusters' users to other valid clusters or adding fake users, and then performs the approach with the smallest cost. On Email-Eu-core, executing k-Medoids with k = 7 requires TAKGA to remove more users than executing it with k = 8 or 9. As a result, the AIL at k = 7 (0.0108) is higher than that of the KGs generated with k = 8 and 9 (0.0043 and 0.0046, respectively).

RELATED WORK
This section discusses anonymization approaches related to this proposal (i.e., anonymization for static data, for sensitive values, for dynamic data).

Protecting Users' Privacy in Anonymizing Static Data
There are two main anonymization principles: k-anonymity and DP [13]. k-anonymity was first introduced to protect users in relational data. It modifies the original relational data such that, for every user, his/her quasi-identifier values (i.e., attributes that can be used to re-identify a user) in the anonymized version are identical to those of at least k − 1 other users. k-degree, k-neighborhood, and k-automorphism [13] apply the k-anonymity principle to undirected graphs, preventing users from being re-identified when adversaries exploit users' degrees and neighbor structures. To this end, they modify users' degrees and neighbor structures in the anonymized undirected graph such that they are identical to those of at least k − 1 other users in the graph. In contrast, Paired k-degree [4] targets directed graphs and extends k-degree to protect users in anonymized directed graphs when adversaries exploit the out-/in-degrees of their target users. k-Attribute Degree (k-ad) [11] extends Paired k-degree to protect KGs. It adds/removes users' edges such that users' attribute values and relationship out-/in-degrees in the anonymized KG are indistinguishable from those of at least k − 1 other users.
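As a simple illustration of the k-degree condition (a check only; the cited techniques also modify the graph to enforce it), consider the following minimal sketch:

```python
# A minimal sketch checking k-degree anonymity: every degree value in the
# anonymized graph must be shared by at least k nodes.
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    return all(count >= k for count in Counter(degrees).values())

assert is_k_degree_anonymous([2, 2, 3, 3], k=2)   # each degree appears twice
assert not is_k_degree_anonymous([2, 2, 3], k=2)  # degree 3 appears only once
```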
DP [13] takes as input the statistics that data recipients want to extract from the original data and designs differentially private algorithms extracting those statistics such that adversaries cannot infer the existence of users by monitoring them. In particular, DP adds noise such that users' statistics are similar. The similarity is estimated based on a privacy threshold ϵ and the statistic types: the higher the similarity, the stronger the protection, and the more noise is added. DP was first used to extract statistics, such as the number of users or the histogram of an attribute, from relational data [13]. It has then been applied to undirected graphs. For instance, References [5,6,12] extract graphs' statistics (i.e., degree distribution, attributes of communities' users, and number of triangles) such that they are similar to those of the graph after removing one of its users. Reference [18] exploits DP to extract a deep learning model's parameters from KGs. The parameters can be shared among data providers to improve the quality of their models without leaking the existence of their users. Although DP gives stronger privacy protection than k-anonymity, data recipients must rely on privacy experts to design differentially private algorithms for their required statistics. Moreover, many popular KG analyses (e.g., drug discovery, tax calculation) are not supported by DP, and most data providers share KGs without any anonymization (e.g., Yago [9] and Google Knowledge Graph).
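As an illustration of how DP perturbs a released statistic, the following sketch applies the standard Laplace mechanism to a node count; the sensitivity and ϵ values are illustrative.

```python
# A minimal sketch of the Laplace mechanism: noise calibrated to the query's
# sensitivity and the privacy budget epsilon is added before release.
import numpy as np

def laplace_release(true_count, sensitivity=1.0, epsilon=0.1):
    # Smaller epsilon -> larger noise scale -> stronger protection.
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

print(laplace_release(1024))  # e.g., 1017.3: neighboring graphs look alike
```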

Protecting Users' Sensitive Values in Anonymized Data
Several principles have been introduced to protect users' sensitive attribute's values in anonymized relational data, undirected graphs [13], and KGs [22]. Distinct l-diversity [13] is the first principle that extends k-anonymity to protect users' sensitive values in anonymized relational data. To this end, it ensures that users whose anonymized attribute values are identical share at least l sensitive values. Therefore, adversaries cannot infer the sensitive value of their victim with a confidence higher than \(\frac{1}{l}\), even if they exploit the victim's anonymized attribute values. However, if the frequency of the victim's sensitive value is too high, then adversaries may infer that his/her sensitive value is likely to be the high-frequency value. (α, k)-anonymity [13] remedies this issue by ensuring that the frequency of each value among users having the same anonymized data is less than or equal to α. Similar protections have been presented in Entropy l-diversity and Recursive l-diversity [13]. Nevertheless, if the sensitive values of users whose anonymized data are identical belong to the same sensitive category, then adversaries can still infer the category. For instance, even though adversaries cannot infer the sensitive value of their victim among HIV-1 and HIV-2, they know that their victim's disease is HIV. To address this issue, (l, d)-semantic diversity [17] categorizes these values into a hierarchy and ensures that the minimum distance between two sensitive values of users having the same anonymized data is higher than or equal to d.
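The distinct l-diversity condition can be checked with a few lines of Python; the records below are illustrative.

```python
# A minimal sketch checking distinct l-diversity: every group of users with
# identical anonymized quasi-identifier values must cover >= l sensitive values.
from collections import defaultdict

def satisfies_distinct_l_diversity(records, l):
    groups = defaultdict(set)
    for quasi_identifier, sensitive_value in records:
        groups[quasi_identifier].add(sensitive_value)
    return all(len(values) >= l for values in groups.values())

# Two users share the anonymized tuple but only one sensitive value,
# so the table fails distinct 2-diversity.
assert not satisfies_distinct_l_diversity(
    [(('30-40', 'NY'), 'flu'), (('30-40', 'NY'), 'flu')], l=2)
```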
We are not aware of proposals supporting the l-diversity principle for KGs. For undirected graphs, in contrast, k-degree-l-diversity [24] extends k-degree and l-diversity [13] to protect users' sensitive values by ensuring that, for every user in the graph, his/her degree is identical to that of at least k − 1 other users and these users share at least l sensitive values.
This article follows the k-degree-l-diversity principle in allowing data providers to specify both k and l to protect their users. To prevent the sensitive values of users from being inferred when these values are too popular, the providers can keep increasing l until it reaches k. When l is equal to k, all sensitive values have the same frequency among users having the same attribute values and relationship out-/in-degrees. In addition, data providers can easily address the issue of similar-meaning sensitive values discussed above by putting these values into the same group and replacing users' sensitive values with their group name, or by exploiting a solution such as the one proposed in Reference [17]. For example, HIV-1 and HIV-2 are two similar-meaning values of the sensitive attribute disease. We can put these values into a group, namely HIV, update the KGs by replacing the sensitive values of users associated with HIV-1 or HIV-2 with HIV, and execute our algorithm on the updated KGs to generate their anonymized versions (see the sketch below). In this case, our algorithm considers HIV-1 and HIV-2 as a single value HIV. If l is higher than 1, then users having the same attribute values and relationship out-/in-degrees cannot share only HIV-1 and HIV-2. Therefore, adversaries cannot infer users' sensitive values with a confidence higher than \(\frac{1}{l}\).
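A minimal sketch of this pre-processing step (the group hierarchy and user data are illustrative):

```python
# Replace similar-meaning sensitive values with their group name before
# anonymization, so HIV-1 and HIV-2 are treated as the single value HIV.
SENSITIVE_GROUPS = {'HIV-1': 'HIV', 'HIV-2': 'HIV'}  # illustrative hierarchy

def generalize_sensitive_values(user_values):
    return {user: SENSITIVE_GROUPS.get(value, value)
            for user, value in user_values.items()}

print(generalize_sensitive_values(
    {'u1': 'HIV-1', 'u2': 'HIV-2', 'u3': 'diabetes'}))
# {'u1': 'HIV', 'u2': 'HIV', 'u3': 'diabetes'}
```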

Protecting Users' Privacy in Anonymizing Dynamic Data
Anonymization principles have also been proposed to publish dynamic relational data, in which data providers sequentially publish new anonymized versions after they update their data. m-invariance [13] extends l-diversity to protect users' sensitive values in dynamic relational data while allowing data providers to insert/update/delete their users. m-invariance has then been extended [1] to allow data providers to re-insert users in their anonymized relational data without losing the protection of these users' sensitive values. Instead of protecting users' sensitive values, References [16,21] consider protecting their identities in dynamic undirected graphs. kw-SDA [21] protects users from being re-identified with a confidence higher than \(\frac{1}{k}\) when adversaries monitor w contiguous anonymized undirected graphs satisfying the k-degree principle. However, kw-SDA restricts these graphs to having only one attribute. This restriction is removed in [16] by creating a label concatenating users' attributes' values. Recently, kw-tad [10] has been proposed; it extends kw-SDA [21] and k-ad [11] to ensure that users cannot be re-identified with a confidence higher than \(\frac{1}{k}\) when adversaries monitor the attribute values and relationship out-/in-degrees of these users in w contiguous anonymized snapshots satisfying k-ad [11]. kw-tad also allows data providers to insert/update/delete/re-insert their users. Unfortunately, adversaries can still infer the sensitive values of their target users if the sensitive values of the users having the same monitored attribute values and relationship out-/in-degrees as the target users are identical.
In this article, we extensively extend kw-tad [10] to also protect users' sensitive values when adversaries can monitor all published anonymized snapshots. This extension requires our anonymization algorithm (i.e., TAKGA) to ensure the k and l properties for inserted, deleted, and updated/re-inserted users. First, for every new user who is first inserted at time t, we must ensure that there are at least k − 1 other users who have the same anonymized data in the anonymized version at time t and that these users share at least l sensitive values. Second, we must delete more users to ensure that, for every deleted user at time t, there are at least k − 1 other deleted users who have the same anonymized data in previous anonymized versions. While kw-tad [10] randomly deletes these users, we have to carefully select the deleted users such that the remaining ones