Refining Large Integrated Identity Graphs using the Unique Name Assumption



Introduction
The question "What is an entity?" and the related question "When are two entities equal?" are not only longstanding philosophical questions but also longstanding technical issues in information systems [7]. The Semantic Web and, in its wake, Linked Open Data have operationalised the notion of an "entity" as an Internationalized Resource Identifier (IRI): each entity is represented as an IRI, and using the same IRI implies referring to the same entity. Entities are connected by identity links (e.g. owl:sameAs) to form identity graphs. Many existing approaches for detecting errors in identity graphs require information such as vocabulary alignments and textual descriptions [17,8], or the presence of a large number of ontology axioms together with aligned vocabularies [11,14]. However, such information is often restricted to certain languages or simply not available [17,8], which makes these approaches unsuitable for refinement tasks at web scale. Identity graphs on the web exhibit special properties that must be considered: they are integrated from multiple sources, sources can be multilingual, many suffer from a lack of maintenance, and some use multiple encoding schemes.
Since owl:sameAs is a symmetric relation, we reduce the directed graph to a simple, undirected graph. In an undirected graph G, a Connected Component (CC) is a maximal subgraph in which any two vertices are connected by a path (Figure 1a). A gold standard is the ground truth that maps each node (IRI) to a real-world entity and can be used for evaluation (Figure 1b). An equivalence class (EC) is a set of vertices corresponding to the same real-world entity (its members may or may not be connected by a path). In an identity graph, a CC is an EC if and only if all its nodes refer to the same real-world entity. The Unique Name Assumption (UNA) supposes that two terms with distinct IRIs do not refer to the same real-world entity. Although the UNA does not always hold, because redundant IRIs capture various encodings, languages, namespaces, versions and letter cases, it can still be useful for identifying erroneous links. We design a refinement algorithm that removes a minimal number of edges with good precision (Figure 1f). We compare the results against the Louvain algorithm (Figure 1c and 1d) and the Leiden algorithm (Figure 1e). This paper focuses on four research questions:
RQ1 How can we define a UNA for large integrated knowledge graphs?
RQ2 How do we validate various definitions of the UNA?
RQ3 Can the UNA give a reliable indication of errors in practice?
RQ4 Can we develop an efficient UNA-based algorithm for refinement?
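The definitions of connected component and equivalence class given above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the IRIs, the edges and the gold-standard mapping below are hypothetical toy data, and the traversal is a plain depth-first search.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = defaultdict(set)
    for s, t in edges:
        adjacency[s].add(t)
        adjacency[t].add(s)
    seen, components = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def is_equivalence_class(component, gold):
    """A CC is an EC iff all of its nodes map to the same real-world entity."""
    return len({gold[node] for node in component}) == 1


# Hypothetical identity links (treated as undirected) and gold standard.
edges = [
    ("ex:HollandTexas", "ex:Holland"),
    ("ex:Holland", "ex-fr:Pays-Bas"),
    ("ex-fr:Pays-Bas", "ex-nl:Nederland"),
    ("ex:Amsterdam", "ex-nl:Amsterdam"),
]
gold = {
    "ex:HollandTexas": "Holland (Texas)",
    "ex:Holland": "Holland (Texas)",
    "ex-fr:Pays-Bas": "Netherlands",
    "ex-nl:Nederland": "Netherlands",
    "ex:Amsterdam": "Amsterdam",
    "ex-nl:Amsterdam": "Amsterdam",
}

for cc in connected_components(edges):
    print(sorted(cc), is_equivalence_class(cc, gold))
```

In this toy graph the four-node component mixes two real-world entities and is therefore not an EC, while the two-node component is.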
We present existing definitions of the UNA and related work in Section 2. In Section 3, we propose a new definition of the UNA, and in Section 4 we test the different UNA definitions and examine their reliability for error detection by validating them over data of the LOD cloud. In Section 5, we present our refinement algorithm, and we evaluate it in Section 6. Finally, discussion and future work are presented in Section 7. Our main contributions are as follows:
1. We propose a new definition of the UNA, namely the iUNA, and check it against a large integrated knowledge graph together with other definitions.
2. We design an inconsistency-based refinement algorithm that evaluates definitions of the UNA by employing an SMT solver.
3. We publish a gold standard of over 8K manually annotated entities (200K owl:sameAs links) together with some additional information such as redirection and equivalence under different encoding schemes.
4. We introduce new evaluation metrics and provide a benchmark using our gold standard and algorithm.

Related Work
Estimates of the proportion of erroneous identity links in the semantic web range from around 3% [11,15] to 20% [10]. Existing approaches for detecting errors in identity graphs fall into three categories [17]. Content-based approaches exploit the descriptions associated with each resource to evaluate the correctness of an identity link. They typically rely on additional information such as vocabulary alignments and textual descriptions for each entity. However, such information is not always available [17,8]. [...] The error degree is based on the density of the community in which an identity link occurs and on the weight of the owl:sameAs link (i.e. reciprocally asserted owl:sameAs links have a lower error degree, hence a higher chance of correctness). These error degrees are published online as part of the MetaLink dataset [3]. However, the accuracy of these methods is limited due to a lack of understanding of the underlying semantics. Finally, inconsistency-based approaches [11,14] hypothesise that owl:sameAs links that lead to logical inconsistencies have a higher chance of being incorrect. They typically require the presence of a large number of ontology axioms and alignment of the vocabularies.
The use of the UNA to detect errors in identity graphs is an inconsistency-based approach. This idea has been explored in [12,19]. Although the UNA is well defined in relational database theory (where it is also known as the Unique Name Axiom) [18], the lack of an agreed-upon definition of the UNA in the semantic web leads to different conclusions. The primitive adaptation of the UNA to the semantic web postulates that any two ground terms with distinct names are non-identical [12]. In the scope of integrated knowledge graphs, Valdestilhas et al. [19] formalise this as follows: any two URIs in the same knowledge base cannot refer to the same thing in the real world. We name this definition the naive UNA, or nUNA for short. In practice, an integrated knowledge graph violates the nUNA if at least one of its connected components (from the identity graph) has two entities from the same source. Figure 2 is a fictional example of six entities from two knowledge bases (corresponding to nodes in light grey and dark grey, respectively). The six entities connected by the black edges form a connected component. The two equivalence classes are about the Netherlands (the three nodes on the right) and a city in Texas named Holland (the two nodes on the left). The node ex:Holland can be confusing (it could be annotated as "unknown"). The blue arrow is an example of how encoding schemes can lead to redundancy. Due to transitivity, the mistake between ex:Holland, Texas, ex:Holland and ex-fr:Pays-Bas was carried over to other entities such as ex-nl:Nederland. This example shows how entities in various languages can be confusing. This connected component violates the nUNA: the light grey knowledge base has three entities in the connected component. This helps the detection of spurious links. Note that removing the links between ex:Holland, Texas and ex:Holland and ex-fr:Pays-Bas results in three connected components, which are correct but still violate the nUNA.
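The nUNA check described above (a connected component violates the nUNA when one source contributes two or more of its entities) can be sketched as follows. This is only an illustration: the node names and their assignment to the two knowledge bases are hypothetical, since the figure itself is not reproduced here.

```python
from collections import Counter


def nuna_violations(component, source_of):
    """Return the sources that contribute more than one entity to a CC."""
    counts = Counter(source_of[node] for node in component)
    return {source: n for source, n in counts.items() if n > 1}


# Hypothetical connected component of six entities from two knowledge bases.
component = ["n1", "n2", "n3", "n4", "n5", "n6"]
source_of = {
    "n1": "kb_light", "n2": "kb_light", "n3": "kb_light",
    "n4": "kb_dark", "n5": "kb_dark", "n6": "kb_dark",
}

print(nuna_violations(component, source_of))

# A component with at most one entity per source does not violate the nUNA.
print(nuna_violations(["n1", "n4"], source_of))
```

An empty result means the component satisfies the nUNA; any entry names a source whose entities are candidates for having been linked by a spurious path.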
De Melo [12] points out that the Semantic Web is very different from traditional closed scenarios because multiple parties can publish data about the same entity using different identifiers. Thus, they propose to use a quasi-unique name constraint (quasi UNA, or qUNA) for entities: they use the namespace of an IRI as its source of provenance, with a focus on 6 major hubs including DBLP, DBpedia, FreeBase, GeoNames, MusicBrainz, and UniProt. This definition also takes into account some exceptions: two DBpedia entities from the same dataset/source do not violate the UNA if one redirects to the other, or either is a dead node (those that can no longer be resolved).
These definitions have several drawbacks in practice. First, both the nUNA and the qUNA lack a clear definition of provenance, i.e. the source of entities. The algorithm using the nUNA relies on LinkLion for computing the provenance of entities [19]. That of the qUNA takes an entity's namespace as the source by default. As for DBpedia, the paper studied only the namespace http://dbpedia.org/resource/ for violation and redirection. The algorithm developed based on the nUNA outputs only partitions of the identity graph rather than the edges to remove [19]. Although the paper proposed handling DBpedia cases as exceptions, the qUNA is only aware of redirects within DBpedia [12]. In fact, recent work estimates that between 45% and 83% of redirection links can be taken as identity links [13]. Furthermore, the work in [12] does not specify how redirection and dead nodes were obtained. In addition, we believe that there are other forms of exceptions that must be considered. For example, the IRIs wikidata.dbpedia.org/resource/Q6453410, www.wikidata.org/entity/Q6453410 and wikidata.org/entity/Q6453410 are about the same entity but in different versions of Wikidata. Besides these issues with the definitions, the refinement algorithms using these two UNA definitions take violations as hard constraints: entities are considered different as long as the UNA is violated. Due to the lack of a gold standard, neither definition was validated on real-world data or compared with other existing baselines. In this work, we propose a new definition of the UNA that is suited for large integrated graphs on the Web and compare it with the existing UNA variations previously proposed by [12,19].

The iUNA
When examining the data in the LOD Cloud, we note that identity links are often used to connect the same entity in different languages, versions or encodings. Therefore, we propose our own definition of the UNA, which we call the internal UNA (iUNA), to take these differences into account. Our iUNA definition assumes that two different IRIs e1 and e2 within the same namespace should refer to distinct real-world entities only when: a) they are in the same knowledge base according to certain provenance information, and b) they do not satisfy any of the following exceptions:
1. [...]
2. e1 redirects to e2 (or vice versa), or both redirect to the same location,
3. at least one of e1 and e2 is a dead node, is not found, is unresolvable, redirects until reaching some error, or has a timeout error while resolving.
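A pairwise check along the lines of this definition can be sketched as below. It is a simplified illustration under stated assumptions, not the implementation used in this paper: it covers only condition (a) and the redirect and dead-node exceptions (2 and 3), and the data structures `sources_of`, `redirects_to` and `dead` are hypothetical inputs that would have to be computed beforehand.

```python
def violates_iuna(e1, e2, sources_of, redirects_to, dead):
    """Return True if the pair (e1, e2) violates the simplified iUNA.

    sources_of:   dict mapping an IRI to the set of its sources (provenance)
    redirects_to: dict mapping an IRI to the location it finally redirects to
    dead:         set of IRIs that could not be resolved
    """
    if e1 == e2:
        return False
    # Condition (a): the two IRIs must share at least one source.
    if not (sources_of.get(e1, set()) & sources_of.get(e2, set())):
        return False
    # Exception 3: at least one of the two is a dead node.
    if e1 in dead or e2 in dead:
        return False
    # Exception 2: one redirects to the other, or both reach the same location.
    target1 = redirects_to.get(e1, e1)
    target2 = redirects_to.get(e2, e2)
    if target1 == e2 or target2 == e1 or target1 == target2:
        return False
    return True


# Hypothetical toy data.
sources_of = {
    "kb:a": {"file1"}, "kb:b": {"file1"}, "kb:c": {"file1"},
    "kb:d": {"file2"}, "kb:e": {"file1"},
}
redirects_to = {"kb:c": "kb:a"}
dead = {"kb:e"}

print(violates_iuna("kb:a", "kb:b", sources_of, redirects_to, dead))  # same source, no exception
print(violates_iuna("kb:a", "kb:c", sources_of, redirects_to, dead))  # kb:c redirects to kb:a
print(violates_iuna("kb:a", "kb:d", sources_of, redirects_to, dead))  # different sources
print(violates_iuna("kb:a", "kb:e", sources_of, redirects_to, dead))  # kb:e is a dead node
```

Only the first pair is reported as a violation; the other three are covered by an exception or fail condition (a).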
To check whether two entities violate the iUNA, condition (a) requires us to check whether they are from the same knowledge base. This requires some form of provenance to determine where an entity is defined. The nUNA relies on the provenance information of LinkLion, which consists of multiple linksets. It is questionable if linksets can in fact be taken as the knowledge base where the entities are defined, not to mention that LinkLion is no longer available. As for the qUNA, it takes the namespace of an entity to define its knowledge base (regardless of the actual knowledge bases where the corresponding identity links are). This can be problematic for popular namespaces: an entity in DBpedia can be defined in one knowledge base but used in other knowledge bases. Authors can specify where an entity is defined using rdfs:isDefinedBy, but an adhoc examination shows that this information is rare. We therefore propose two additional means for the estimation of provenance of an entity e. Table 1 provides a comparison of the three UNA definitions.
Explicit sources: an explicit source of e is the object in any triple with subject e and predicate rdfs:isDefinedBy (or any of its equivalent or sub-properties).
Implicit label-like sources: an implicit label-like source of e is the RDF file containing triples where e is the subject and rdfs:label (or any of its equivalent or sub-properties) is the predicate.
Implicit comment-like sources: an implicit comment-like source of e is the RDF file containing triples where e is the subject and rdfs:comment (or any of its equivalent or sub-properties) is the predicate.

Testing the UNA

Dataset & Gold standard
We use the http://sameas.cc dataset [4], which provides the transitive closure of 558 million distinct owl:sameAs statements. These identity statements were extracted from the 2015 LOD Laundromat crawl [2], which provides more than 38 billion triples from over 650K RDF files. The identity links are distributed over 49 million connected components (CCs), with each CC being associated with a unique ID. We manually annotated all IRIs from 28 CCs with fewer than 1K nodes each. Our gold standard consists of 8,394 manually annotated entities covering a total of 232,311 owl:sameAs links. There are 987 entities (11.75%) annotated as 'unknown'. A total of 209,160 edges (90.02%) are between nodes with the same annotation, while 3,678 edges (1.58%) link entities with different manual annotations. The remaining edges involve at least one node annotated as 'unknown'. Based on this manual examination, we estimate the error rate to be between 1.58% and 9.98%. We divide our gold standard randomly into two parts of 14 files each, for training and evaluation respectively. To better understand the gold standard, we show the sizes of its ECs and their distribution in Figure 3. The plot shows that redundancy is common in the LOD cloud. The majority of ECs contain fewer than 200 nodes, while at the right end of the spectrum there can be as many as 358 identifiers referring to the same real-world entity. This gives a reference for the setting of parameters in our algorithms in Section 5.

Validating the UNA
Using the gold standard, we validate our definitions (RQ2). For this, we use the sources of the entities in our gold standard, also retrieved from LOD Laundromat. Our examination shows that only 0.71% of the entities have an explicit source. In contrast, 61.97% of the entities have at least one implicit label-like source and 40.71% have a comment-like source. Explicit sources are thus too rare to be useful, and we only use two variants of the iUNA in this work: iUNA-label and iUNA-comment, corresponding to label-like sources and comment-like sources respectively. For each source, we analyze the number of entities in each EC. Although the original work that examines the qUNA was restricted to the namespaces of only 6 major hubs as provenance, it can easily be adapted to any namespace. Thus, we generalize its definition of provenance in the experiments below. Since the nUNA lacks a proper definition of provenance, we use the label-/comment-like sources defined for the iUNA for the sake of comparison. Table 2 provides the proportion of sources by the number of entities that each implicit label-/comment-like source has in the equivalence classes. A source follows the UNA if it has only one entity in each EC. An estimated 1,351 out of 1,737 label-like sources follow the nUNA. On the other hand, 14.40% of the sources violate the nUNA by having two entities in at least one equivalence class in the gold standard, and an additional 7.82% of the sources violate the nUNA by having more than two entities. Table 2 shows that the iUNA is better than the nUNA and the qUNA at capturing how the community implements the UNA in its knowledge bases. This also shows that taking encoding equivalence and redirection into account can indeed align the UNA with its use in practice. Thus, the algorithm should not remove all edges that violate the UNA when refining the identity graphs.

Detecting Errors Using UNA
In this section, we focus on RQ3: can the UNA give a reliable indication of identity errors in practice? Our analysis shows that the errors can be classified into two types. The first type consists of erroneous edges between entities that refer to two different real-world entities. The second consists of edges involving nodes annotated as 'unknown'. Thus, we provide upper and lower bounds of the error rate depending on how these edges are treated. First, we study how likely two random entities in a connected component are to be identical. For this, in each connected component G in the gold standard, we sample |V| (i.e. the number of nodes) different pairs of entities at random. The estimated error (proportion of non-identical pairs) is between 47.0% and 68.1%, depending on the interpretation of the nodes labeled "unknown" in the gold standard. We use this as our baseline for the analysis below (see the first row of Table 3).
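The sampling procedure above can be sketched as follows. This is an illustrative reading rather than the script used for Table 3: it assumes that a pair counts towards the lower bound only when both nodes are annotated with different entities, and towards the upper bound also when at least one node is annotated "unknown". The toy components and annotations are hypothetical.

```python
import random


def sampled_pair_error_bounds(components, gold, seed=0, unknown="unknown"):
    """Sample |V| distinct node pairs per CC and return (lower, upper) error bounds."""
    rng = random.Random(seed)
    different = with_unknown = total = 0
    for component in components:
        nodes = sorted(component)
        if len(nodes) < 2:
            continue
        max_pairs = len(nodes) * (len(nodes) - 1) // 2
        wanted = min(len(nodes), max_pairs)  # small CCs have fewer than |V| pairs
        pairs = set()
        while len(pairs) < wanted:
            s, t = rng.sample(nodes, 2)
            pairs.add((min(s, t), max(s, t)))
        for s, t in pairs:
            total += 1
            if gold[s] == unknown or gold[t] == unknown:
                with_unknown += 1
            elif gold[s] != gold[t]:
                different += 1
    if total == 0:
        return 0.0, 0.0
    return different / total, (different + with_unknown) / total


# Hypothetical toy component: a three-node CC has exactly three pairs, so all are sampled.
components = [{"a", "b", "c"}]
print(sampled_pair_error_bounds(components, {"a": "X", "b": "X", "c": "Y"}))
print(sampled_pair_error_bounds(components, {"a": "X", "b": "X", "c": "unknown"}))
```

With the first annotation two of the three pairs are non-identical, so both bounds coincide; with the second, the two pairs touching the "unknown" node only raise the upper bound.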
For these same sampled pairs, we test the error rate and the UNA violation percentage for the three UNA definitions. The second row in Table 3 shows that when using label-like sources, 61.9% of the sampled pairs violate the nUNA [...] 11.75% of the nodes were annotated "unknown". This analysis also indicates that such nodes are heavily involved in pairs violating the UNA. More pairs violate the UNA when using label-like sources than when using comment-like sources. In all cases, the lower bound of the error rate is smaller than that of randomly sampled pairs. Using the iUNA with comment-like sources reaches the lowest error rate for the lower bound. These selected pairs are then used in the algorithm to identify erroneous edges in the paths that connect them. Next, we study the impact of redirection. There are in total 13,922 nodes in the graphs that capture redirect relations. We find that 3,072 out of 8,394 entities were redirected. Of the 13,922 nodes, 5,528 correspond to new IRIs that are in the extended graph but not in the original graphs. There are in total 6,991 edges in the redirect graphs. Among them, 546 are between entities in the original graph, with 504 correct ones and 8 erroneous ones. That is, the error rate is between 1.47% and 7.69%. In addition, we have 12,531 pairs of entities that redirect to the same entity in the extended graph. The error rate is between 4.29% and 6.32%.
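The error-rate bounds for the redirect edges follow directly from the counts given above. As a small worked example (the function name is ours), the lower bound counts only edges known to be erroneous, and the upper bound counts every edge not known to be correct:

```python
def error_rate_bounds(total, correct, erroneous):
    """Lower bound: known errors only. Upper bound: everything not known correct."""
    lower = erroneous / total
    upper = (total - correct) / total
    return lower, upper


# Redirect edges between entities in the original graph: 546 in total,
# 504 correct and 8 erroneous (the rest involve 'unknown' nodes).
lower, upper = error_rate_bounds(total=546, correct=504, erroneous=8)
print(f"{lower:.2%} to {upper:.2%}")  # prints "1.47% to 7.69%"
```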
Next, we study equivalent entities that differ only in their encodings (recall the example given in Figure 2). We have 1,818 such pairs of entities in the gold standard. Among them, there are edges between 1,130 pairs in the original identity graphs, with an error rate between 2.21% and 8.50%. We discovered 688 new pairs that differ only by encoding, with an error rate between 1.16% and 14.83%. Finally, there is one pair of entities whose IRIs are the same in an alternative encoding but which actually refer to different real-world entities. We conclude that although the exceptions do not always hold, they are often useful.

Algorithm Design
We limit the scope of refinement algorithms in this paper to removing erroneous identity links and forego identifying erroneous entities or adjoining additional links. The intuition is that for two inter-connected clusters, if there is more force pushing them apart than holding them together, then some edge(s) should be removed to split the clusters apart. The "force" that pushes the clusters apart comes from pairs of entities violating the UNA. These pairs might not be directly connected, but they can be connected through multiple paths. The edges removed by the algorithm form a cut of the graph. Computing an optimal cut whose removal makes the graph consistent within each CC is APX-hard (i.e. at least as hard as every problem that admits a polynomial-time constant-factor approximation algorithm) [12]. We encode this problem, as soft and hard clauses, into an optimization problem and solve it with an SMT solver [5]. The goal is to maximise the sum of weights over all soft clauses while satisfying all the hard clauses. We choose this approach because it enables fast reasoning over weighted constraints of relations of equality and inequality, and it returns a sub-optimal answer in case of timeout.

Algorithm 1: partition
1 Input: an identity graph G, a weighting scheme w, a graph of redirect G^R, a graph of equivalence under various encodings G^E
  Result: status s, a set of edges removed A, the graph of partitions G_P
2 initiate A as an empty set (to store removed edges);
3 initiate H_ccs as a set of the connected components of G;
4 while |A| is increasing (no new edge to remove) and H_ccs is not empty do

Algorithm using UNA
Since the iUNA and the nUNA require the same parameters, we present the algorithm using the iUNA. The qUNA variant can be derived simply by removing the parameters for the redirect graph and the graph of encoding equivalence. Algorithm 1 takes as input a graph G, the corresponding redirect graph G^R, the graph of equivalence under various encodings G^E, and a weighting scheme w. As a first step, we load H_ccs with the connected components of G. For each connected component G_cc in H_ccs, we obtain the corresponding subgraphs G^R_cc and G^E_cc from G^R and G^E respectively. G_cc, together with G^R_cc, G^E_cc and the weighting scheme, is then taken as the input of Algorithm 2. The removed edges are collected in A. The algorithm stops when no more edges can be removed.
In the while-loop of Algorithm 1, there is a repeated call to Algorithm 2, which examines the graph of each connected component in H_ccs (line 7). Algorithm 2 takes advantage of an SMT solver's power of reasoning over weighted relations of equality and returns a solution within a given time bound. We first randomly sample some pairs of nodes. We keep those that violate the iUNA, denoted P (line 2). If there is at most one pair in graph G_cc that violates the iUNA, we keep the graph as it is (line 4). Otherwise, we initiate an SMT solver (line 5).
For each node, we introduce an integer variable. For each variable, we encode two hard clauses to ensure that its value is between 0 and M. These integer variables will eventually be assigned an integer value in the model m after solving.
Next, we explain how the soft clauses are generated. For each pair (s, t) in P, we obtain a clause NOT(I_s = I_t) and associate it with a weight according to the weighting scheme w (line 10). Instead of taking all the edges of G_cc, we take the edges of its minimum spanning forest and a small sample of the remaining edges to reduce the load on the SMT solver. In line 11, we obtain the minimum spanning forest F. For efficiency, we keep a set of edges in B (line 12) for the back propagation process of the SMT solver's internal algorithm. The edges of F ∪ B form the set of edges in G_cc to examine in this round (lines 11-14). Recall that our analysis in Section 4.3 showed that redirection and equivalence under different encodings provide relatively reliable information. We therefore encode the edges of the redirection graph (lines 15-18) as soft clauses. The undirected graph is used for checking whether the redirections of two entities converge (lines 15 and 17).
While not every soft clause is true in the model, all the hard clauses must be satisfied. The goal is to maximise the sum of weights over all soft clauses while satisfying all the hard clauses. Note that if the SMT solver fails to find an optimal solution within the timeout, it will return the best sub-optimal solution (line 21). The edge (s, t) remains if and only if I_s equals I_t in the model m (line 22).
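The objective behind this encoding can be illustrated on a toy graph. The sketch below deliberately replaces the SMT solver with an exhaustive search over all assignments, which is only feasible for a handful of nodes, and it omits the clauses for redirection and encoding equivalence. Function and variable names are ours; the weights 5 and 2 are those of the first weighting scheme.

```python
from itertools import product


def refine_by_exhaustive_search(nodes, edges, violating_pairs,
                                edge_weight, pair_weight, upper_bound):
    """Maximise the total weight of satisfied soft clauses by brute force.

    Hard clauses: every node variable takes an integer value in [0, upper_bound].
    Soft clauses: I_s == I_t for each edge, NOT(I_s == I_t) for each violating pair.
    """
    best_score, best_assignment = -1, None
    for values in product(range(upper_bound + 1), repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        score = sum(edge_weight for s, t in edges
                    if assignment[s] == assignment[t])
        score += sum(pair_weight for s, t in violating_pairs
                     if assignment[s] != assignment[t])
        if score > best_score:
            best_score, best_assignment = score, assignment
    # An edge remains iff both endpoints received the same value.
    removed = [(s, t) for s, t in edges
               if best_assignment[s] != best_assignment[t]]
    return removed, best_score


# Hypothetical toy component: two clusters {a, b} and {c, d} joined by edge (b, c).
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
violating_pairs = [("a", "c"), ("a", "d"), ("b", "d")]

removed, score = refine_by_exhaustive_search(
    nodes, edges, violating_pairs, edge_weight=5, pair_weight=2, upper_bound=1)
print(removed, score)  # prints [('b', 'c')] 16
```

Keeping all three edges scores 15, whereas cutting (b, c) loses one edge clause (5) but satisfies three pair clauses (6), for a total of 16. This matches the intuition that an edge is removed only when the violating pairs pushing two clusters apart outweigh the edges holding them together.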
The weighting scheme w consists of a series of functions that map clauses to weights: w = (f_G, f_R, f_E, f_P). We used the training dataset to fine-tune the weighting scheme. For a soft clause c_e corresponding to an edge e, the weight is f_G(c_e) + f_R(c_e) + f_E(c_e) + f_P(c_e). The first weighting scheme w_1 consists of four functions: f_G assigns the clause of each edge in F ∪ B a weight of 5 and all other clauses 0. Similarly, f_P assigns the clauses corresponding to pairs in P a weight of 2. f_R and f_E each increase the weight by 1 for the clauses of edges in G'^R_cc and G^E_cc respectively. After some manual tuning, we provide an alternative weighting scheme w_2 with the corresponding values being 31, 16, 5, and 5, respectively. Other parameters and hyperparameters were set according to Section 4.1 and fine-tuned. The upper bound M was set to 2 + |G_cc|/50. A random selection of 12% of the edges from the original graph was kept in B. Finally, based on our experience with Z3, the timeout bound for SMT solving was set to (|G_cc|/100 + 0.5) seconds.
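To make these settings concrete, the following sketch computes the upper bound M, the solver timeout and the weight of an edge clause under w_1. It assumes that |G_cc| denotes the number of nodes of the component and leaves M unrounded, since no rounding rule is stated; the function names are ours.

```python
def integer_upper_bound(num_nodes):
    """Upper bound M = 2 + |G_cc| / 50 (assumption: |G_cc| is the node count)."""
    return 2 + num_nodes / 50


def smt_timeout_seconds(num_nodes):
    """Timeout for SMT solving: |G_cc| / 100 + 0.5 seconds."""
    return num_nodes / 100 + 0.5


def edge_clause_weight_w1(in_forest_or_sample, in_redirect_graph, in_encoding_graph):
    """Weight of an edge clause under w_1: f_G contributes 5, f_R and f_E 1 each."""
    weight = 5 if in_forest_or_sample else 0
    weight += 1 if in_redirect_graph else 0
    weight += 1 if in_encoding_graph else 0
    return weight


# Under w_1, each clause for a pair in P receives a weight of 2.
PAIR_CLAUSE_WEIGHT_W1 = 2

# Example for a hypothetical component with 500 nodes.
print(integer_upper_bound(500))   # prints 12.0
print(smt_timeout_seconds(500))   # prints 5.5
print(edge_clause_weight_w1(True, True, False))  # prints 6
```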

Implementation
We used the networkx Python package for the computation of the connected components and the minimum spanning forests. For the manual annotation of the entities, we used ANNit. We used the implementations of the Leiden algorithm and the Louvain algorithm in CDlib. As the SMT solver, we employed Z3 and used its Python binding [5]. We published all the code as an open source project. All our experiments were conducted on the LOD Labs machine. It has 32 64-bit Intel Xeon CPUs (E5-2630 v3 @ 2.40GHz) and 264 GB of RAM.

Evaluation Metrics
While precision and recall are commonly used evaluation metrics [17], the presence of 'unknown' annotations makes them less suitable for this task, since no edge involving an entity annotated 'unknown' counts toward precision or recall. Thus, precision and recall do not adequately capture the quality of a refinement. Moreover, we noticed that 11 graphs in our gold standard have no erroneous edges except those involving nodes labelled "unknown". Therefore, we provide an additional metric. In its design, we focus on two properties that the equivalence classes should possess within the CCs resulting from refinement: (a) an equivalence class should not be separated over multiple CCs; (b) two equivalence classes should not share the same CC. This leads to the following metric for the graph G′ that results from applying a refinement algorithm to G:

$$\Omega(G') = \sum_{C \in G'} \; \sum_{Q_e \in E(C)} \frac{|Q_e|}{|V|} \cdot \frac{|Q_e|}{|O_e|} \cdot \frac{|Q_e|}{|C|}$$

Here, C iterates over all connected components in G′, and E(C) is a partitioning of the nodes in C by equivalence class, so that Q_e always represents the set of nodes within a given C that refer to the same real-world entity e. |V| is the total number of vertices, and O_e is the set of all nodes in G′ referring to e.
Within the summation, there are three factors. The first, |Q_e|/|V|, is the proportion of the current set of vertices to the total. This turns Ω(G′) into a weighted sum over all subsets Q_e, with the weights summing to the total proportion of nodes not annotated "unknown". The second, |Q_e|/|O_e|, is 1 if all references to e are in C, and lower if there are more references in other connected components. This penalizes deviating from (a). The third, |Q_e|/|C|, is 1 if all nodes in C refer to e, and lower if the connected component is shared with nodes referring to other entities. This penalizes deviating from (b). Note that if the graph contains no "unknown" nodes, the maximum of Ω is 1.
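The metric can be computed directly from the connected components of the refined graph and the gold standard. The sketch below assumes that Ω is the sum, over every connected component C and every equivalence class Q_e inside it, of the product of the three factors just described, and that nodes annotated "unknown" contribute to |C| and |V| but form no Q_e. The toy data is hypothetical.

```python
from collections import Counter


def omega(components, gold, unknown="unknown"):
    """Sum of |Q_e|/|V| * |Q_e|/|O_e| * |Q_e|/|C| over all CCs and their ECs."""
    num_vertices = sum(len(c) for c in components)              # |V|
    entity_totals = Counter(gold[n] for c in components         # |O_e|
                            for n in c if gold[n] != unknown)
    score = 0.0
    for component in components:
        per_entity = Counter(gold[n] for n in component         # |Q_e|
                             if gold[n] != unknown)
        for entity, q in per_entity.items():
            score += ((q / num_vertices)
                      * (q / entity_totals[entity])
                      * (q / len(component)))
    return score


gold = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
print(omega([{"a", "b"}, {"c", "d"}], gold))    # perfect refinement: 1.0
print(omega([{"a", "b", "c", "d"}], gold))      # two ECs share one CC: 0.5
print(omega([{"a"}, {"b"}, {"c", "d"}], gold))  # one EC split over two CCs: 0.75
```

The three calls show the two penalties at work: merging two equivalence classes into one CC lowers the third factor, while splitting an equivalence class over several CCs lowers the second.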

Evaluation Results
We compare our algorithm using two variants of sources (implicit label-like and comment-like sources) with two weighting schemes (w_1 and w_2, as defined in Section 5) against the Louvain algorithm [6], the Leiden algorithm [1], as well as the result of MetaLink with two threshold values [3,16]. Table 4 presents the average of 5 runs for each method, with the best results highlighted. The Louvain algorithm removes the most edges. It has the highest recall but relatively low precision. As the example in Figure 2 suggests, Louvain can split the graph into smaller isolated components. This problem also shows in our evaluation: because of the large number of edges removed, its Ω values are low regardless of the resolution parameter, which we varied from 0.01 to 1.0. Compared with Louvain, the result of the Leiden algorithm shows obvious improvements. Fewer edges are removed, while the precision and Ω have improved for both the training set and the evaluation set. As for MetaLink, we run the algorithm with two thresholds: 0.9 and 0.99 (only links with an error degree higher than the threshold are considered erroneous). Fewer edges are removed in both cases, with higher Ω values than those of Leiden and Louvain. In almost all cases, using comment-like sources results in better precision values while having fewer edges removed. The difference in Ω between using label-like sources and comment-like sources is minor. In general, fewer links were removed when using the UNA and MetaLink for refinement. Comparing the nUNA with the iUNA, we can see that using the nUNA results in more edges removed with a lower precision. When comparing the qUNA with the iUNA, we find as well that the qUNA removes a larger number of edges, which leads to a slightly higher recall. In almost all settings, using the iUNA results in higher precision, which could be the benefit of better modeling using exceptions.
The best Ω values in both sets are obtained using the qUNA, while using the iUNA results in better precision with similar Ω values. Compared with Metalink, our algorithm shows higher precision and better Ω values. Overall, our evaluation indicates that different algorithms have different advantages, but using the UNA shows clear benefits.
As for time efficiency, the Louvain and Leiden algorithms complete processing both the training and evaluation sets within 40 seconds. The algorithm using the UNA takes around 8 minutes to process the training set, in contrast to up to 27 minutes for the evaluation set. In addition, we note that up to three graphs in the evaluation set can suffer from a timeout using our algorithm. When there is a timeout, the SMT solver returns a sub-optimal solution. Our manual examination shows that some "harder" and larger graphs were distributed to the evaluation set when constructing the two sets.

Discussion and Future Work
In this paper, we studied three definitions of UNA and proposed a UNA-based identity refinement approach. RQ1 was answered by defining the iUNA that considers certain exceptions that are common in large integrated graphs. For RQ2 and RQ3, we created a gold standard and compared the reliability of iUNA against the qUNA and the nUNA. For RQ4, we proposed an identity refinement algorithm and evaluated its performance on different definitions of UNA.
Strictly speaking, our gold standard is not large enough for an accurate estimate of the error rate of the entire identity graph. Using our sample, we found that among the 3,678 erroneous edges, only 5 entities have multiple label-like or comment-like sources. This indicates that the UNA can be used for refinement but redundancy is not the direct cause of error. This is contrary to the conclusion of [12] (see type 2 error: consistency and conciseness error).
The performance of our algorithm is sensitive to the parameters and hyperparameters. For example, the upper bound for each integer value M can significantly influence the results if too small. Future work includes studying how our algorithm scales with different time limits, automatic tuning of the parameters, and extending the gold standard. The results of some other parametric settings are included in the supplementary material in the repository.
The performance of MetaLink is comparable with the best outcome of our algorithms. However, our analysis shows that no more than 10% of the removed edges are shared between MetaLink and our algorithms in various settings. It could be promising to explore a hybrid approach in future work. Since our evaluation confirms the superiority of the communities detected using the Leiden algorithm compared to Louvain, it is also reasonable to ask how far the results can be improved if MetaLink uses Leiden's outputs for calculating its error degree.
The identity graph we study contains a large number of connected components of size two, as well as two very large connected components. The biggest CC in this dataset has 177,794 entities and 2,849,426 edges (No. 4073). The second biggest has 21,191 entities and 101,269 edges (No. 142063). The rest are significantly smaller, with no more than 5,076 nodes. Some past attempts using SMT solvers have also encountered this scalability bottleneck [20,21]. Our initial experiments show that removing the disambiguation entities has some potential to reduce the size of connected components. In future work, we plan to design scalable algorithms following a divide-and-conquer approach for the handling of large connected components, using pairs of entities that violate the UNA as heuristics.