LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction



Introduction
It is well-known that network nodes can be organized into communities [1,2,3,4] identified through either ground truth structural characteristics or shared node attributes [5,6,7]. A common task in network analysis is to rank all network nodes based on their relevance to such communities, especially of the second type [8,9], which are commonly referred to as metadata groups. Ranking nodes is particularly important in large social networks, where metadata group boundaries can be vague [10,11]. Node ranks can also be used by recommender systems that combine them with other characteristics, in which case they need to be of high quality across the whole network. Some of the most well-known algorithms that discover communities with only a few known members also rely on ranking mechanisms and work by thresholding their outcome [12,13].
Node ranks for metadata groups are a form of recommendation and their quality is usually (e.g. in [14]) evaluated with well-known recommender system measures [15,16,17], such as AUC and NDCG. Since calculating these measures requires knowledge of node labels, the efficacy of ranking algorithms needs to be demonstrated on labeled networks, such as those of the SNAP repository. However, different algorithms and parameters are more suited to different networks, for example based on how well their assumptions match structural or metadata characteristics. At the same time, large real-world networks are often sparsely labeled, prohibiting supervised evaluation. In such cases, there is a need to evaluate ranking algorithms on the network at hand using unsupervised procedures.
A first take on unsupervised evaluation would be to generalize traditional structural community measures, such as density [18], modularity [19] and conductance [20], to support ranks. However, these measures are designed with structural ground truth communities in mind and often fail to assess hierarchical dependencies or other meso-scale (instead of local) features [6,11,21] that may characterize metadata groups. To circumvent this problem, we propose utilizing the network's structure and the existence of multiple metadata groups; under the assumption that network edges are influenced by node metadata similarity [6], a phenomenon known as homophily in social networks [22], we assess the quality of ranks for multiple metadata groups based on their ability to predict network edges. We show that this practice enriches density-based evaluation and that it agrees with supervised measures better than other unsupervised ones.

LinkAUC
The main idea behind our approach is that, if there is little information to help evaluate node ranks, we can evaluate other related structural characteristics instead. To this end, we propose using node rank distributions across metadata groups to derive link ranks between nodes. Link ranks can in turn be evaluated through their ability to predict the network's edges. An overview of the proposed scheme is shown in Figure 1. In this section, we first justify why we expect node rank quality to follow link rank quality (Subsection 2.1) and formally describe the evaluation process of the latter using AUC (Subsection 2.2). We then show that link rank quality enriches density-based evaluation (Subsection 2.3).

Link Ranks
Let $r_i$ be vectors whose elements $r_{ij}$ estimate the relevance of network nodes $j$ to metadata groups $i = 1, \dots, n$. Motivated by latent factor models for link prediction [23] and collaborative filtering [24], we consider $R = [r_1 \dots r_n]$ a matrix factorization of the network. Its rows $R_j = [r_{1j} \dots r_{nj}]$ represent the distribution of ranks of network nodes $j$ across metadata groups. Following the principles of previous link prediction works [25,26], if network construction is influenced predominantly by structure-based and metadata-based characteristics, this factorization can help predict network edges by linking nodes with similar rank distributions. We calculate the similarities of rank distributions between nodes $j, k$ using the dot product as $\hat{M}_{jk} = R_j \cdot R_k$. These form a matrix of link ranks:
$\hat{M} = R R^T \qquad (1)$
Accurate link prediction using link ranks implies good metadata group representations. To empirically understand this claim, let us consider ranking algorithms that can be expressed as network filters $f(M) = \sum_{n=0}^{\infty} a_n M^n$ [27] of the network's adjacency matrix $M$, where $a_n$ are the weights placed on random walks of length $n$. For example, personalized PageRank and Heat Kernels arise from exponentially degrading weights and the Taylor expansion coefficients of an exponential function respectively. If applied on query vectors $q_i$, where $q_{ij}$ are proportional to probabilities that nodes $j$ belong to metadata groups $i$, network filters produce ranks $r_i = f(M) q_i$ of how much nodes pertain to the metadata groups. Organizing multiple queries into a matrix $Q = [q_1 \dots q_n]$:
$R = f(M) Q \Rightarrow \hat{M} = f(M) Q Q^T f^T(M) \qquad (2)$
This is a quadratic form of $f(M)$ around the kernel $Q Q^T$ and, as such, propagates link ranks between queries to the rest of link candidates. Therefore, if queries adequately predict the links between involved query nodes and link ranks can predict the network's edges, then the algorithm with filter $f(M)$ is a good rank propagation mechanism.
At best, queries form an orthonormal basis of ranks $Q Q^T = I$ and this process can express any symmetric link prediction filter [25,26,28] by decomposing it to $f(M) f^T(M)$.
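The relation between node ranks and link ranks described above can be checked numerically on a toy case. The following is a minimal sketch, assuming dense matrices stored as Python lists; the filter output `F`, the query matrix `Q` and the helper names are hypothetical values chosen for illustration and do not come from the experiments.

```python
def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def link_ranks(R):
    """Link ranks M_hat = R R^T, so M_hat[j][k] is the dot product R_j . R_k."""
    return matmul(R, transpose(R))


# Hypothetical symmetric filter output f(M) for 3 nodes (assumed, not computed here).
F = [[0.6, 0.3, 0.1],
     [0.3, 0.5, 0.2],
     [0.1, 0.2, 0.7]]
# Hypothetical queries: node 0 seeds group 0 and node 2 seeds group 1.
Q = [[1.0, 0.0],
     [0.0, 0.0],
     [0.0, 1.0]]

R = matmul(F, Q)          # node ranks r_i = f(M) q_i, one column per group
M_hat = link_ranks(R)     # link ranks derived from rank distributions
# The same matrix written as the quadratic form f(M) Q Q^T f(M)^T.
quadratic = matmul(matmul(F, matmul(Q, transpose(Q))), transpose(F))
```

Both `M_hat` and `quadratic` hold the same values up to floating point error, which is the identity the text relies on.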

Link Rank Evaluation using AUC
When evaluating link ranks, it is often desirable to exclude certain links, such as withheld test edges or those absent due to systemic reasons (e.g. users may not be allowed to befriend themselves in social networks). To model this, we devise the notion of a network group that uses a binary matrix $\mathcal{M}$ to remove non-comparable links of the network's adjacency matrix $M$ by projecting the latter to $M \odot \mathcal{M}$, where $\odot$ is the Hadamard product performing elementwise multiplication. A robust measure that compares operating characteristic trade-offs at different decision thresholds is the Area Under Curve (AUC) [29], which has been previously used to evaluate link ranks [26]. When network edges are not weighted, if $TPR(\theta)$ and $FPR(\theta)$ are the true positive and false positive rates of a decision threshold $\theta$ on $\mathrm{vec}_{\mathcal{M}}(\hat{M})$ predicting $\mathrm{vec}_{\mathcal{M}}(M)$, the AUC of link ranks becomes the area under the curve traced by $TPR$ against $FPR$ as $\theta$ varies:
$\mathrm{LinkAUC} = \int_0^1 TPR(\theta) \, \mathrm{d}FPR(\theta) \qquad (3)$
This evaluates whether actual linkage is assigned higher ranks across the network [30] without being affected by edge sparsity. These properties make LinkAUC preferable to precision-based evaluation of link ranks, which assesses the correctness of only a fixed number of top predictions [26].
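The evaluation described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: it assumes small dense matrices, computes AUC through its equivalent pairwise form (the probability that a positive link outranks a negative one, with ties counted as one half), and uses a hypothetical 4-node network and rank matrices.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative link")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def link_auc(R, M, mask):
    """AUC of link ranks R R^T against adjacency M, restricted to mask entries."""
    scores, labels = [], []
    for j in range(len(M)):
        for k in range(len(M)):
            if mask[j][k]:  # keep only comparable links of the network group
                scores.append(sum(a * b for a, b in zip(R[j], R[k])))
                labels.append(M[j][k])
    return auc(scores, labels)


# Hypothetical network with edges 0-1 and 2-3.
M = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
# Network group that excludes self-links.
mask = [[int(j != k) for k in range(4)] for j in range(4)]

R_good = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]  # ranks that follow the edges
R_bad = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # ranks that group non-neighbors

good_score = link_auc(R_good, M, mask)
bad_score = link_auc(R_bad, M, mask)
```

The pairwise form is quadratic in the number of links, so it only suits toy inputs; a sort-based AUC would be needed at the scale of the networks considered here.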

Relation to Rank Density
The density of a network is defined as the portion of edges compared to the maximum number of possible ones [31,32]. Using the notion of volume $\mathrm{vol}(M)$ to denote the number of edges in a network with adjacency matrix $M$, the density of its projection inside the network group $\mathcal{M}$ becomes $D_{\mathcal{M}}(M) = \frac{\mathrm{vol}(M \odot \mathcal{M})}{\mathrm{vol}(\mathcal{M})}$. We similarly define rank density by substituting the volume with the expected volume $\mathrm{vol}(M, r)$ of the fuzzy set of subgraphs arising from ranks being proportional to the probabilities that nodes are involved in links: [...] (4)
where $\|\cdot\|_1$ is the L1 norm, calculated as the sum of vector elements, and $v$ are binary vectors of vertices sampled with probabilities $r$.
We first examine the qualitative relation between link ranks and rank density for a single metadata group $R =$ [...] where $TP$ and $P$ denote the number of true positive and positive thresholded link ranks respectively. At worst, every new positive link after a certain point would be a false positive. Using big-O notation, this can be written as $\frac{\partial FPR(\theta)}{\partial P(\theta)} \in O(1)$ and hence: [...]
We next consider the case where discovered ranks form non-overlapping metadata groups, i.e. each node has non-zero rank only for one group. This may happen when query propagation stops before it reaches other metadata groups. [...], similarly to before: [...] This averages group densities and weights them by a term in the expected volume $\mathrm{vol}(\cdot, r_i)$ and the squared norm $\|r_i\|_1^2$. Hence, when metadata groups are non-overlapping, high LinkAUC indicates high rank density.
Finally, for overlapping metadata groups, LinkAUC involves inter-group links in its evaluation. Since averaging density-based evaluations across groups ignores these links, LinkAUC can be considered an enrichment of rank density in the sense that it bounds it when metadata groups do not overlap but accounts for more information when they do.
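For reference, the unranked density of a projection inside a network group, as defined at the start of this subsection, can be computed as follows. This is a minimal sketch for dense binary matrices; it covers only the plain density and not its rank-based extension, and the example network is hypothetical.

```python
def volume(A):
    """Number of non-zero entries of a binary matrix (ordered node pairs)."""
    return sum(sum(row) for row in A)


def hadamard(A, B):
    """Elementwise product of two matrices of the same shape."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def density(M, mask):
    """Share of the links allowed by the network group that are actual edges."""
    return volume(hadamard(M, mask)) / volume(mask)


# Hypothetical network with edges 0-1 and 2-3, excluding self-links.
M = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
mask = [[int(j != k) for k in range(4)] for j in range(4)]

d = density(M, mask)  # 2 of the 6 possible undirected links exist
```

Counting ordered pairs doubles both the numerator and the denominator for undirected networks, so the ratio is unaffected.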

Experiments
To assess the merit of evaluating node ranks using LinkAUC, we devise a series of experiments where we test a number of different algorithms on several ranking tasks of varying degrees of difficulty across labeled networks. We use the ranks produced by these experiments to compare various unsupervised measures with supervised ones; the latter form the ground truth that unsupervised measures need to reproduce, but would not be computable if node labels were sparse or missing.
For every network, we start with known binary vectors $c_i$, whose elements $c_{ij}$ show whether nodes $j$ are members of metadata groups $i$. We use a uniform sampling process $U$ to withhold a small set of evaluation nodes $eval_i \sim U(c_i, 1\%)$ and the edges of $M$ within $eval_i \times eval_i$, which, by virtue of their small number, do not significantly affect the ranking algorithm outcomes. We also procure query vectors of varying length $q_i \sim U(c_i - eval_i, f)$ that serve as inputs to the ranking algorithms, where their size relative to the group is selected among $f \in \{0.1\%, 1\%, 10\%\}$. Depending on whether query nodes are adequately many or too few, we expect algorithms to encounter low and high difficulty respectively.
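The withholding and query sampling described above could be sketched as follows. This is a minimal illustration, assuming group members are plain integer ids; `sample_fraction`, the group size of 10,000 and the fixed seed are hypothetical choices, and the query fraction is taken here relative to the members that remain after withholding.

```python
import random


def sample_fraction(members, fraction, rng):
    """Uniformly sample round(fraction * len(members)) members, at least one."""
    size = max(1, round(fraction * len(members)))
    return set(rng.sample(sorted(members), size))


rng = random.Random(0)            # fixed seed so the sketch is repeatable
group = set(range(10000))         # hypothetical members of one metadata group

eval_nodes = sample_fraction(group, 0.01, rng)   # withheld evaluation nodes
remaining = group - eval_nodes                   # members available for queries

# One query set per seed fraction f in {0.1%, 1%, 10%}.
queries = {f: sample_fraction(remaining, f, rng) for f in (0.001, 0.01, 0.1)}
```

Sampling queries only from `remaining` keeps evaluation nodes out of the algorithm inputs, which is what prevents overlap between rank calculation and evaluation.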

Networks
Experiments are conducted on three networks: a synthetic one constructed through a stochastic block model [33] and two real-world ones often used to evaluate metadata group detection, namely the Amazon co-purchasing [34] and the DBLP author co-authorship networks. These networks were selected on merit of being fully labeled, hence enabling supervised evaluation to serve as ground truth. They also comprise the multiple metadata groups and unweighted edges needed for LinkAUC. The stochastic block model is a popular method to construct networks of known communities [35,36], where the probability of two nodes being linked is determined by which communities they belong to. Our synthetic network uses the randomly generated 5 × 5 block probability matrix of Figure 2 with blocks of 2K-5K nodes. The Amazon network comprises links between frequently co-purchased products that form communities based on their type (e.g. Book, CD, DVD, Video). We use the 2011 version of the DBLP dataset, which comprises 1.6M papers from the DBLP database, from which we extracted an author network based on co-authorship relations. In this network, authors form overlapping metadata groups based on the academic venues (journals, conferences) they have published in. To experiment with smaller portions of query nodes and limit the running time of experiments, we select only the metadata groups with ≥ 5K nodes for the real-world networks. A summary of these is presented in
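To make the stochastic block model construction concrete, the sketch below samples an undirected network from block sizes and a block probability matrix. It is an illustration under stated assumptions: the function name, block sizes and probabilities are hypothetical (the actual 5 × 5 matrix is the one of Figure 2), and the quadratic loop is only practical for networks far smaller than the 2K-5K node blocks used in the experiments.

```python
import random


def stochastic_block_model(block_sizes, P, rng):
    """Sample node labels and undirected edges from a stochastic block model.

    Nodes j < k are linked with probability P[label[j]][label[k]].
    """
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = set()
    for j in range(len(labels)):
        for k in range(j + 1, len(labels)):
            if rng.random() < P[labels[j]][labels[k]]:
                edges.add((j, k))
    return labels, edges


rng = random.Random(0)
# Hypothetical two-block example with dense blocks and sparse inter-block links.
labels, edges = stochastic_block_model([30, 40], [[0.5, 0.02], [0.02, 0.4]], rng)

# Degenerate probabilities give a deterministic result: two disjoint cliques.
_, clique_edges = stochastic_block_model([30, 40], [[1.0, 0.0], [0.0, 1.0]], rng)
```

With probabilities of 1 inside blocks and 0 between them, the result is two cliques with 30·29/2 + 40·39/2 = 1215 edges, which is a convenient sanity check.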

Ranking Algorithms
We use both heuristic and established algorithms to rank the relation of network nodes to metadata groups. Our goal is not to select the best algorithm but to obtain ranks with many different methods and then use these ranks to compute the evaluation measures to be compared. The considered algorithms are:
PPR [12,38]. Personalized PageRank with symmetric matrix Laplacian normalization arising from a random walk with restart strategy. It iterates $r_i \leftarrow a D^{-1/2} M D^{-1/2} r_i + (1 - a) q_i$, where $D$ is the diagonal matrix of node degrees. Throughout our experiments, we select the well-performing parameter $a = 0.99$.
PPR+Inflation [13]. Adds all neighbors of the original query nodes to the query to further spread PPR.
PPR+Oversampling [39]. Adds nodes with high PPR ranks to the query vector before rerunning the algorithm.
HK [40]. Heat Kernel ranks obtained through an exponential degradation filter $r_i = e^{-t} \sum_{k=0}^{\infty} \frac{t^k}{k!} (D^{-1/2} M D^{-1/2})^k q_i$. This places higher weights on shorter paths instead of uniformly spreading them across longer random walks. Hence, it discovers denser local structures at the cost of not spreading ranks too much. We selected $t = 5$ and stopped iterations when $(D^{-1/2} M D^{-1/2})^k q_i$ converged.
HPR. A heuristic adaptation of PPR that borrows assumptions of heat kernels to place emphasis on short random walks $r_i \leftarrow \frac{t}{k} a (D^{-1/2} M D^{-1/2} - I) r_i + (1 - a) q_i$, where $k$ is the current iteration, $t = 5$ and $a = 0.99$.
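The PPR iteration and the heat kernel filter can be sketched on a small dense network as follows. This is a simplified illustration and not the code used in the experiments: the convergence tolerance, the fixed number of heat kernel terms (instead of a convergence check) and the example network of two triangles joined by one edge are assumptions made for the sketch.

```python
import math


def normalize(M):
    """Symmetric normalization D^{-1/2} M D^{-1/2} of a dense adjacency matrix."""
    deg = [sum(row) for row in M]
    return [[M[j][k] / math.sqrt(deg[j] * deg[k]) if M[j][k] else 0.0
             for k in range(len(M))] for j in range(len(M))]


def matvec(W, x):
    """Product of a dense matrix with a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def ppr(W, q, a=0.99, tol=1e-9, max_iter=10000):
    """Iterate r <- a W r + (1 - a) q until the L1 change drops below tol."""
    r = list(q)
    for _ in range(max_iter):
        nxt = [a * wr + (1 - a) * qj for wr, qj in zip(matvec(W, r), q)]
        if sum(abs(x - y) for x, y in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


def heat_kernel(W, q, t=5.0, terms=50):
    """Truncated series e^{-t} sum_k (t^k / k!) W^k q."""
    r = [0.0] * len(q)
    term = list(q)  # holds (t^k / k!) W^k q for the current k
    for k in range(terms):
        r = [x + math.exp(-t) * y for x, y in zip(r, term)]
        term = [t / (k + 1) * y for y in matvec(W, term)]
    return r


# Hypothetical network: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
M = [[0] * 6 for _ in range(6)]
for j, k in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    M[j][k] = M[k][j] = 1

W = normalize(M)
q = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # query with node 0 as the only seed
ppr_ranks = ppr(W, q)
hk_ranks = heat_kernel(W, q)
```

In both outputs the seed node 0 ranks above its mirror image, node 5, in the other triangle, although with a = 0.99 the PPR ranks are spread almost evenly across the network.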

Measures
The following measures are calculated for the node ranks of metadata groups produced in each experiment. Recall that, when network labels are sparse, supervised measures that serve as the ground truth of evaluation may be inapplicable. Unsupervised measures other than LinkAUC are computed on the training edges, as the sparsity of withheld group members $eval_i$ does not allow meaningful structural scores. LinkAUC, on the other hand, is applicable regardless of the evaluation edge set's sparsity. To avoid data overlap between rank calculation and evaluation, which could overestimate the latter, supervised measures and LinkAUC use only the test group members and edges.

Unsupervised Measures
Conductance - Compares the probability of a random walk to move outside a community vs. to return to it [41]. Using the same probabilistic formulation as for rank density, we define rank conductance: [...] where $C = 1$ is a max-probability parameter. (Comparisons are preserved for any value.) Lower conductance indicates better community separation.
Gap Conductance - Conductance of binarily cutting the network on the maximal percentage gap between $\frac{r_{ij}}{\mathrm{degree}(j)}$ for each community $i$ [42,43]. We use this as an alternative to sweeping strategies [12,13], which took too long to run.
Density - The rank-based extension of density in (4).
LinkAUC - AUC of link ranks calculated through (1), where columns are divided by their maximal value and then each node's row representation is L2-normalized, making link ranks represent cosine similarity between edge nodes. This is our proposed unsupervised measure.

Supervised Measures (Ground Truth)
NodeAUC - AUC of node ranks, averaged across metadata groups $i$.
NDCG - Normalized discounted cumulative gain across all network nodes. For this non-parametric statistic, ranks derive ordinalities $ord[j]$ for nodes $j$ (i.e. the highest ranked node is assigned $ord[j] = 1$). For each metadata group $i$, assigning to nodes $j$ relevance scores of 1 if they belong to it and 0 otherwise: [...] NDCG is usually used to evaluate whether a fixed top-k nodes are relevant to the metadata group. However, we are interested in evaluating the relevant nodes of the whole network and hence we make this measure span all nodes. This makes it similar to AUC in that values closer to 1 indicate that metadata group members are ranked as more relevant to the group compared to non-group members. Its main difference is that more emphasis is placed on the top discoveries.
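Two of the computations above lend themselves to a short sketch: the normalization applied before LinkAUC and NDCG spanning all nodes. The code assumes the common NDCG formulation with binary relevance and a logarithmic discount of $\log_2(ord[j] + 1)$, which may differ in detail from the formula used in the experiments; function names and example values are hypothetical.

```python
import math


def cosine_normalize(R):
    """Divide columns of R by their maximum, then L2-normalize each row."""
    col_max = [max(col) for col in zip(*R)]
    scaled = [[x / m if m else 0.0 for x, m in zip(row, col_max)] for row in R]
    out = []
    for row in scaled:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm if norm else 0.0 for x in row])
    return out


def ndcg(ranks, members):
    """NDCG over all nodes with relevance 1 for group members and 0 otherwise."""
    order = sorted(range(len(ranks)), key=lambda j: -ranks[j])
    # A node at 0-based position pos has ord = pos + 1 and discount log2(ord + 1).
    dcg = sum(1 / math.log2(pos + 2) for pos, j in enumerate(order) if members[j])
    ideal = sum(1 / math.log2(pos + 2) for pos in range(sum(members)))
    return dcg / ideal


# Hypothetical rank matrix: one row per node, one column per metadata group.
R = [[2.0, 0.0],
     [1.0, 1.0],
     [0.0, 4.0]]
R_norm = cosine_normalize(R)

ranks = [0.9, 0.8, 0.1, 0.3]
perfect = ndcg(ranks, [1, 1, 0, 0])   # both members ranked on top
partial = ndcg(ranks, [1, 0, 0, 1])   # one member ranked third
```

After `cosine_normalize`, the dot product of two rows is the cosine similarity of the corresponding rank distributions, which matches the description of link ranks used by LinkAUC.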

Results
In Figure 3 we present the outcome of evaluating different algorithms on the various experiment setups, i.e. tuples of networks, seed node fractions and ranking algorithms. Each point corresponds to a different pair of unsupervised (vertical axes) and supervised (horizontal axes) measures calculated for a different experiment setup (i.e. combination of seed node sizes and ranking algorithms) and is obtained by averaging the measures across 5 repetitions of the setup. Unsupervised measures are considered to yield descriptive evaluations when they correlate with supervised ones for the same network (each network is involved in 15 experiment setups arising from the combination of $|f| = 3$ different seed node sizes with one of the 5 different ranking algorithms).
We can see that LinkAUC is the unsupervised measure whose behavior most closely resembles that of the supervised ones. In particular, Table 2 shows that LinkAUC has a strong positive correlation with NodeAUC and a positive correlation with NDCG for all three networks, outperforming the other metrics in all but one experiment. To make sure that these findings cannot be attributed to non-linear relations with other measures, we confirm them using both Pearson and Spearman correlation, where the latter is a non-parametric metric that compares the ordinality of measure outcomes. The slightly weaker correlation of LinkAUC with NDCG can be attributed to the latter's tendency to place more emphasis on the top predictions, which makes it overstate rank quality compared to AUC when the rest of the ranks are inaccurate.
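The difference between the two correlation coefficients can be illustrated with a short sketch. The measure values below are made up for illustration and are not results from Table 2; the helper names are likewise hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up supervised and unsupervised scores with a monotonic, non-linear relation.
supervised = [0.6, 0.7, 0.8, 0.9]
unsupervised = [0.36, 0.49, 0.64, 0.81]

linear_agreement = pearson(supervised, unsupervised)
ordinal_agreement = spearman(supervised, unsupervised)
```

Here the Spearman coefficient reaches 1 because the ordering is fully preserved, while the Pearson coefficient stays slightly below 1 because the relation is not linear.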
Looking at the other unsupervised measures, fuzzy definitions of conductance and density sometimes degrade for higher NodeAUC values. This can be attributed to these metrics measuring local-scale features, which are not always a good indication of the quality of larger metadata groups. It must be noted that gap conductance also exhibits strong correlation with the supervised measures on the real-world networks. However, especially for the synthetic network, it frequently assumes a value of 1 that reflects its inability to discover clear-cut boundaries. This casts doubt on the validity of using it for evaluating ranks in new networks, since similar structural deficiencies can render it uninformative.

Conclusions and Future Work
In this work we proposed a new unsupervised procedure that evaluates node ranks of multiple metadata groups based on how well they predict network edges. We explained the intuitive motivation behind this approach and experimentally showed that it closely follows supervised rank evaluation across a number of different experiments, many of which are inadequately evaluated by other unsupervised community quality measures. Based on our findings, our approach can be a better alternative to existing rank evaluation strategies in unlabeled networks whose metadata propagation mechanisms are unknown. This indicates that network structure and awareness of multiple metadata groups are two promising types of ground truth that can help evaluate metadata group ranks.
In the future, we are interested in performing experiments across more networks and comparing our approach with additional unsupervised measures.