Random-Walk Graph Embeddings and the Influence of Edge Weighting Strategies in Community Detection Tasks

Graph embedding methods have been developed in recent years with the goal of mapping graph data structures into low-dimensional vector spaces so that conventional machine learning tasks can be evaluated efficiently. In particular, random-walk-based methods sample the graph using random walk sequences that capture its structural properties. In this work, we study the influence of edge weighting strategies that bias the random walk process and demonstrate that, under several settings, the biased random walks enhance downstream community detection tasks.


INTRODUCTION
Over the past few years, there has been a notable increase in the volume of data produced and exploited by applications and services that handle various types of networks. Most of these networks, such as citation networks, sensor networks and, most notably, social networks, can be naturally modelled through graph data structures, with the networks' entities and relationships being represented by a graph's nodes and edges respectively. Subsequently, by performing graph analytics tasks, such as node classification [2], link prediction [15], and community detection [9], we can discover inherent characteristics of the network's nature and gain additional insight regarding the relationships of its entities. For instance, community detection tasks in social networks [11] can be used to enhance the targeting of marketing campaigns, recommendation systems, the identification of criminal groups, and more [19].
Recently, graph embedding methods that provide a latent representation of the graph data in a low-dimensional vector space have been developed. These methods employ the graph's components (nodes, edges, and features or attributes) and produce a mapping into an embedding space that aims to preserve the graph's topology and overall structural properties (such as the pairwise distance between nodes). The resultant graph embeddings can then be utilized for analytics tasks that are based on conventional machine learning mechanisms (e.g. executing the k-means algorithm to obtain a partition of the graph's nodes).
Graph embedding methods that map graph nodes to vector spaces can be categorized into three types [7]: (i) matrix factorization methods, (ii) deep learning methods, and (iii) methods based on random walks. Factorization methods attempt to decompose the graph's adjacency matrix into eigenvectors and eigenvalues, while deep learning methods employ multi-layer architectures to capture structural similarity between nodes. Finally, random walk methods sample node sequences by executing random walks among the graph's nodes and adopting the intuition that similar nodes will tend to coexist in several of the sampled sequences.
The two most prominent random walk based methods are DeepWalk [20] and node2vec [8]. The DeepWalk method samples a number of fixed-length random walks from each graph node which are then supplied as input to the skip-gram model of the word2vec word embedding technique [17,18]. The skip-gram model learns vector representations such that words with a similar meaning in a corpus will end up closer in the embedding space, while less similar words will end up further apart. DeepWalk intuitively uses a "corpus" of sampled sequences so that nodes that frequently appear together in a random walk (given a context window of a user-defined size) are characterized by a small distance in the final embedding.
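The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DeepWalk implementation: the graph is a plain adjacency dictionary, walks are uniform, and the function names `sample_walks` and `context_pairs` are hypothetical. The resulting (center, context) pairs are the kind of input a skip-gram model is trained on.

```python
import random

def sample_walks(adj, num_walks, walk_length, seed=0):
    """Sample `num_walks` uniform random walks of `walk_length` nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            # Stop early only if the walk reaches a node without neighbors.
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def context_pairs(walk, window):
    """Yield (center, context) pairs for nodes at most `window` steps apart in a walk."""
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, walk[j]

# Toy graph (an assumption for illustration): two triangles joined by the edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
walks = sample_walks(adj, num_walks=2, walk_length=5)
pairs = [pair for walk in walks for pair in context_pairs(walk, window=2)]
```

Nodes inside the same triangle tend to co-occur within the window more often than nodes on opposite sides of the bridge, which is the intuition the embedding exploits.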
Node2vec [8] builds upon the core idea of DeepWalk, with the main difference being the introduction of bias in the random walk process. In particular, in each transition during a random walk, node2vec adds bias to the transition probabilities of the node's neighbors according to two user-defined parameters p and q. The return parameter p controls the likelihood of immediately revisiting the previous node, which keeps the walk local in a Breadth-First-Search-like manner, while the in-out parameter q controls the tendency of the walk to move away from the previous node, which enables a Depth-First-Search-like exploration.
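To make the bias concrete, the following minimal sketch computes the transition weights for a single step. It assumes the search-bias rule of the node2vec paper [8]: the edge weight is multiplied by 1/p when the walk returns to the previous node, by 1 when the candidate is also a neighbor of the previous node, and by 1/q otherwise. The function names and the toy graph are illustrative only.

```python
def node2vec_weights(adj, prev, curr, p, q):
    """Unnormalized transition weights out of `curr`, having arrived from `prev`.

    `adj` maps each node to a dict {neighbor: edge weight}.
    """
    weights = {}
    for nxt, w in adj[curr].items():
        if nxt == prev:          # returning to the previous node
            bias = 1.0 / p
        elif nxt in adj[prev]:   # candidate is adjacent to the previous node
            bias = 1.0
        else:                    # candidate moves further away from the previous node
            bias = 1.0 / q
        weights[nxt] = bias * w
    return weights

def normalize(weights):
    """Turn unnormalized weights into a probability distribution."""
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}

# Toy graph (assumed): triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0},
}
weights = node2vec_weights(adj, prev=0, curr=2, p=2, q=0.5)
# weights == {0: 0.5, 1: 1.0, 3: 2.0}
probabilities = normalize(weights)
```

Because the bias multiplies the edge weight, any reweighting of the graph's edges directly changes which neighbors the walk prefers.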
In this work, we focus on random walk methods and study the utilization of edge weighting strategies as a means of inducing bias to the random walk generation phase. Edge weighting strategies recalibrate and modify the edge weights of a graph with the end goal of enhancing a particular downstream analytics task. To the best of our knowledge, this work constitutes the first attempt at enhancing specifically the community detection downstream task by utilizing edge weighting strategies that attempt to guide the random walks into having predominantly members that belong in the same community. The experimental evaluation showcases that, for a variety of configurations, our approach yields more accurate and coherent results than those executed on graph embeddings derived from state-of-the-art random walk embedding methods.

FRAMEWORK
We begin by providing an outline of the broad framework and the proposed methodology before discussing the individual edge weighting strategies and their overall rationale.

Outline
Given an unweighted graph G = (V, E), where V and E correspond to the graph's node and edge set, respectively, the objective is to provide a graph embedding that enhances community detection tasks performed by typical machine learning techniques. Thus, we employ weighting strategies that reweight edges between nodes according to a perceived likelihood of the nodes belonging to the same community.

Algorithm 1 DetectCommunities(G, S, k)
Input: unweighted graph G, weighting strategy S, number of communities k
Output: community designations CD for all nodes in G
1: G′ ← reweight the edges of G according to S
2: compute the embedding of G′ using node2vec
3: CD ← k-means(embedding of G′, k)
4: return CD

The outline of our framework can be seen in Algorithm 1. Initially, we reweight the graph according to a weighting strategy S and obtain the weighted graph G′. After obtaining the embedding of G′ using the node2vec algorithm, we execute the k-means algorithm on the embedding to obtain community designations for each node in G.
Note that Algorithm 1 is an indicative description of the overall framework and the implementation details such as the graph embedding technique (e.g. DeepWalk, node2vec, etc.) or the community detection algorithm (e.g. k-means, GMM [3], etc.) may vary depending on the graph domain or the application requirements. In this work, we opted for the combination of node2vec and k-means on account of their well-established practicality and applicability.

Edge Weighting Strategies
Most of the edge weighting strategies presented in this work focus on enhancing algorithms based on community detection through modularity maximization. Additionally, they attempt to handle the resolution limit problem [6] that exists in modularity maximization approaches. In the remainder of the section, we use an edge e between two nodes u and v as a running example. The four well-established and effective methods presented in this work are:

EBC_CNR. The "EBC_CNR" method [12] weights a graph's edges according to two measures: their edge betweenness centrality (EBC) and common neighbor ratio (CNR). EBC corresponds to the number of shortest paths that go through e, while CNR reflects the percentage of common neighbors shared between nodes u and v. Both EBC and CNR contribute to the weight of the edge through two parameters α and β, which are set either in a manner that attempts to maximize the variance of the weight distribution or through heuristics. Thus, the weight of e is determined by α, β > 0, the adjacency matrix A of the graph, the normalized EBC of e, and the CNR between nodes u and v; the exact expression is given in [12].

SimRank. The "SimRank" approach is based on the SimRank similarity measure [10], which states that "two objects are similar if they are related to similar objects". SimRank scores each node pair based on the structural functionality or purpose they exhibit in the whole graph. Conceptually, in its iterative form the SimRank score s_t(u, v) between two nodes u and v in the t-th iteration of computation is equal to:

$$s_t(u, v) = \frac{C}{|N(u)|\,|N(v)|} \sum_{i=1}^{|N(u)|} \sum_{j=1}^{|N(v)|} s_{t-1}\big(N_i(u), N_j(v)\big)$$

where N(u) corresponds to the neighbor set of u, N_i(u) refers to a particular neighbor of u, and C signifies a decay constant. Additionally, s_0(u, v) = 1 if u = v and s_0(u, v) = 0 otherwise. The weight of an edge in the graph is set equal to the SimRank score between the edge's two endpoints.

κ-path. The "κ-path" method is based on the calculation of the κ-path edge centrality measure [5] along with additional operations [4].
The κ-path edge centrality measure assigns weights to the edges according to their centrality and is defined as:

$$L^{\kappa}(e) = \sum_{s \in V} \frac{\sigma^{\kappa}_{s}(e)}{\sigma^{\kappa}_{s}}$$

with σ^κ_s(e) being the number of simple random paths of at most κ nodes initiating from node s that pass through edge e, and σ^κ_s being the number of simple random paths of at most κ nodes that originate from s. Finally, the weight between two nodes u and v is set equal to the Euclidean distance of their κ-path centrality measures:

$$r_{uv} = \sqrt{\sum_{w \in V} \frac{\big(L^{\kappa}(e_{uw}) - L^{\kappa}(e_{wv})\big)^{2}}{d(w)}}$$

where e_{uw} denotes the edge between nodes u and w, and d(w) is the degree of node w.

AdaptiveMM. Finally, the "AdaptiveMM" approach [16] follows a three-step approach to generating weights for an unweighted graph. At first, an artificial network is generated with topological characteristics that resemble the original graph. This artificial graph is equipped with generated ground-truth communities and is then used as a basis for extracting a selection of local topological features from each edge, such as the difference in clustering coefficients of the edge's endpoints or the Adamic-Adar index [1]. In the last step, the edge features are supplied as input to a regression model that weights the edges in a way that a modularity maximization approach would be able to efficiently detect the ground-truth communities of the artificial network.
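As an illustration of the local edge features mentioned for AdaptiveMM, the sketch below computes two of them for a single edge. It is not the feature extractor of [16]: the helper names are hypothetical, the clustering-coefficient difference is taken as an absolute value (an assumption), and the Adamic-Adar index uses its standard definition, the sum of 1/log(degree) over the common neighbors of the two endpoints.

```python
import math
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of pairs of `node`'s neighbors that are themselves connected."""
    neighbors = adj[node]
    degree = len(neighbors)
    if degree < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (degree * (degree - 1))

def adamic_adar(adj, u, v):
    """Adamic-Adar index: common neighbors weighted by 1 / log(degree)."""
    # A common neighbor of two distinct nodes has degree >= 2, so log(...) > 0.
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])

def edge_features(adj, u, v):
    """Feature dictionary for the edge (u, v); the key names are illustrative."""
    return {
        "clustering_diff": abs(clustering_coefficient(adj, u) - clustering_coefficient(adj, v)),
        "adamic_adar": adamic_adar(adj, u, v),
    }

# Toy graph (assumed): triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
features_inside = edge_features(adj, 0, 1)   # edge inside the triangle
features_pendant = edge_features(adj, 2, 3)  # edge leading out of the triangle
```

In a full pipeline, feature vectors of this kind would be the inputs to the regression model described above.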
Even though our approach is not related to the problem of modularity maximization in its general form, the methods presented above can be used as intuitive heuristics for the purpose of assigning significant weights to edges between nodes that could potentially belong to the same community.
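As one concrete instance of the strategies above, the SimRank weighting can be sketched with a naive iteration over all node pairs. This is a minimal sketch for small graphs, not an optimized implementation: the function names are hypothetical, the default decay of 0.8 and the iteration count are illustrative, and the score of a node with itself is kept at 1 in every iteration, as in the usual SimRank definition [10].

```python
from itertools import product

def simrank(adj, decay=0.8, iterations=10):
    """Naive iterative SimRank on an undirected graph given as {node: [neighbors]}."""
    nodes = list(adj)
    scores = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                updated[(a, b)] = 1.0
            elif not adj[a] or not adj[b]:
                updated[(a, b)] = 0.0
            else:
                total = sum(scores[(x, y)] for x in adj[a] for y in adj[b])
                updated[(a, b)] = decay * total / (len(adj[a]) * len(adj[b]))
        scores = updated
    return scores

def simrank_edge_weights(adj, decay=0.8, iterations=10):
    """Weight each edge (u, v), u < v, by the SimRank score of its endpoints."""
    scores = simrank(adj, decay, iterations)
    return {(u, v): scores[(u, v)] for u in adj for v in adj[u] if u < v}

# Toy graph (assumed): triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
weights = simrank_edge_weights(adj)
```

Note that on bipartite structures such as a star, adjacent nodes always receive a SimRank score of zero, so this strategy can produce zero-weight edges that the random walk sampler has to handle.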

EXPERIMENTAL EVALUATION
In this section, we experimentally evaluate the framework presented in Section 2 against the baselines of DeepWalk and node2vec. Since node2vec performed better than DeepWalk in all the experiments, we regard node2vec as the highest performing baseline. We begin by discussing implementation details before presenting the results on both synthetic and real-world datasets.

Implementation Details
We begin with an unweighted graph that is assigned weights through an edge weighting strategy. Following that, the graph embedding is obtained using node2vec and the k-means algorithm is executed to obtain the final communities.
During the weighting process, some existing edges may end up with a zero weight, which would prevent the random walk from ever traversing them. We handle this in two cases. In the first case, an edge with zero weight is assigned a weight equal to the smallest non-zero weight among the edges of the node being traversed divided by that node's neighbor count. In the second case, if all the edges of the node being traversed have zero weight, then its neighbors are all equally probable to be selected during a random walk.
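The two cases can be sketched as a small helper that turns the edge weights of the node being traversed into transition probabilities. This is one reading of the rule above, under stated assumptions: "smallest weight" is taken to mean the smallest non-zero weight among the node's edges, and the divisor is the node's neighbor count. The function names are hypothetical.

```python
import random

def transition_probabilities(neighbor_weights):
    """Map {neighbor: edge weight >= 0} of the current node to probabilities."""
    count = len(neighbor_weights)
    positive = [w for w in neighbor_weights.values() if w > 0]
    if not positive:
        # Second case: every edge has zero weight, so all neighbors are equally likely.
        return {nbr: 1.0 / count for nbr in neighbor_weights}
    # First case: zero-weight edges receive (smallest non-zero weight) / (neighbor count).
    floor = min(positive) / count
    adjusted = {nbr: (w if w > 0 else floor) for nbr, w in neighbor_weights.items()}
    total = sum(adjusted.values())
    return {nbr: w / total for nbr, w in adjusted.items()}

def next_node(neighbor_weights, rng):
    """Sample the next node of a walk from the adjusted distribution."""
    probabilities = transition_probabilities(neighbor_weights)
    nodes = list(probabilities)
    return rng.choices(nodes, weights=[probabilities[n] for n in nodes], k=1)[0]

# Illustrative weights: one of three edges has been assigned zero weight.
probs = transition_probabilities({"a": 0.6, "b": 0.3, "c": 0.0})
step = next_node({"a": 0.6, "b": 0.3, "c": 0.0}, random.Random(0))
```

With this adjustment every existing edge keeps a small, non-zero chance of being traversed, so no neighbor becomes unreachable purely because of the weighting strategy.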
The parameters used in the node2vec technique are an embedding dimension of 128, a walk length of 80, 10 walks from each node, and a context window size of 10. The parameters p and q were evaluated for each dataset using a grid search over [0.25, 0.5, 1, 2, 4] as per the suggestions in [8]. In the "κ-path" method we set κ to 20 and in the "SimRank" method we set the decay constant C to 0.8. In the case of "EBC_CNR" the parameters α and β were evaluated explicitly for each dataset using the heuristics described in [12]. All values selected above follow the suggestions of the authors in their respective original work.
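The grid search over p and q amounts to evaluating all 25 combinations of the five candidate values. The sketch below shows only the search loop; the dictionary keys are illustrative names rather than the parameter names of a specific node2vec implementation, and `toy_score` is a stand-in for running the full embedding and clustering pipeline and returning a quality measure.

```python
from itertools import product

# Illustrative names for the fixed settings listed above (not a library's API).
NODE2VEC_SETTINGS = {"dimensions": 128, "walk_length": 80, "num_walks": 10, "window_size": 10}
GRID = [0.25, 0.5, 1, 2, 4]

def grid_search(score_fn, grid=GRID):
    """Return the (p, q) pair with the highest score, together with that score."""
    best = max(product(grid, repeat=2), key=lambda pq: score_fn(*pq))
    return best, score_fn(*best)

def toy_score(p, q):
    """Placeholder objective; a real run would embed, cluster, and evaluate."""
    return -abs(p - 1) - abs(q - 0.5)

best_pq, best_score = grid_search(toy_score)
```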

Synthetic Datasets
We generated a selection of LFR networks [13] with varying node counts and community sizes, and tested the performance of our framework for different values of the mixing parameter μ. Table 1 details the synthetic datasets used, listing for each the number of nodes, the average and maximum vertex degree, the minimum and maximum community sizes, and the mixing parameter μ ∈ {0.25, 0.35, 0.45, 0.55}. The exponent for the degree power law sequence was 2, while that for the community size sequence was 3. In each experiment we measure the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) [21], along with the graph's modularity on the final partition. The ARI and NMI measures are estimated over ten runs of the k-means algorithm with different centroid seeds and with k equal to the respective dataset's ground-truth community count.

Figure 1 presents our results, from which several observations can be made. "AdaptiveMM" consistently outperforms the rest of the methods and the baselines, while "DeepWalk" and "SimRank" achieve similar effectiveness, but are outperformed by the rest of the methods in the majority of the experiments across all measures. The effectiveness of our framework in the ARI measure increases for graphs with a higher node count. Finally, all methods, except "SimRank", achieve higher modularity than the baselines for μ < 0.5 (i.e. communities with strong connections, where a node has more neighbor nodes inside its community than in the rest of the graph), while "κ-path" achieves the highest modularity for μ = 0.55 among all methods. Table 2 summarizes the best results depicted in Figure 1 for each metric in each dataset.
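For reference, the three measures used above can be computed as in the following minimal sketch. It is written from the standard definitions rather than taken from the evaluation code: NMI is normalized by the arithmetic mean of the two entropies (an assumption, since several normalizations exist), and modularity is computed on the unweighted graph (also an assumption).

```python
import math
from collections import Counter

def _pairs(counts):
    """Number of unordered pairs that can be formed within each group."""
    return sum(math.comb(c, 2) for c in counts)

def adjusted_rand_index(truth, pred):
    n = len(truth)
    if n < 2:
        return 1.0
    together = _pairs(Counter(zip(truth, pred)).values())
    a, b = _pairs(Counter(truth).values()), _pairs(Counter(pred).values())
    expected = a * b / math.comb(n, 2)
    maximum = (a + b) / 2
    if maximum == expected:  # degenerate case, e.g. both partitions are trivial
        return 1.0
    return (together - expected) / (maximum - expected)

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_info(truth, pred):
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    mutual = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
                 for (t, p), c in Counter(zip(truth, pred)).items())
    mean_entropy = (_entropy(truth) + _entropy(pred)) / 2
    return mutual / mean_entropy if mean_entropy > 0 else 1.0

def modularity(edges, community):
    """Modularity of a partition; `edges` are undirected pairs, `community` maps node -> label."""
    m = len(edges)
    internal, degree_sum = Counter(), Counter()
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Toy example (assumed): two triangles joined by one edge, split into two communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_value = modularity(edges, partition)
ari = adjusted_rand_index([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
nmi = normalized_mutual_info([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
```

ARI and NMI compare a predicted partition with the ground truth and are invariant to how the clusters are labelled, whereas modularity needs only the graph and the predicted partition.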

Real-world Datasets
Complementary to the experiments on synthetic datasets, we also performed experiments on real-world datasets equipped with ground-truth communities and, more specifically, a product network and two social networks from the SNAP Dataset Collection [14].
The "ego-Facebook" dataset represents a set of social circles in the Facebook social network. Nodes and edges in this network represent users and friendship relationships between them respectively. The "Amazon" dataset consists of products found on the Amazon website that are linked if they are frequently bought together. Products belong to the same ground-truth community if they are characterized by the same product category defined by Amazon. Similarly to "ego-Facebook", the "Youtube" dataset contains friendship links between users of the video-sharing website YouTube. Ground-truth communities correspond to user-formed group communities. Note that in all three datasets a node may belong to more than one ground-truth community, so for the purposes of this experimental evaluation we restrict each node to one ground-truth community assignment and disregard the rest of the assignments. Similarly to the synthetic experiments, we set k equal to the respective dataset's ground-truth community count.
In the "ego-Facebook" dataset we omitted nodes without a community assignment and nodes without any edges. In the "Amazon" and "Youtube" datasets we focused on the top 5000 communities with the highest quality [23], discarding nodes and edges that did not belong to any of the top 5000 communities while also removing duplicate communities with completely identical members. Table 3 showcases the resulting real-world datasets used in the evaluation.

Table 4 details the results of the experimental evaluation on the real-world datasets. The two best performing approaches across all datasets and metrics are "EBC_CNR" and "AdaptiveMM", while in each dataset the best values for each metric are achieved by the same approach. With the exception of the modularity metric in the "Amazon" dataset, each metric is improved by a statistically significant (p < 0.05) margin by at least one edge weighting strategy. The highest difference is on the "Youtube" dataset, where "EBC_CNR" achieves 8.4% higher modularity than node2vec, while the lowest difference is on the "Amazon" dataset, where "AdaptiveMM" and node2vec have nearly identical modularity.
The key observations from the experimental evaluation in both synthetic and real-world datasets are threefold: i) the use of edge weighting strategies generally enhances community detection tasks that are performed on embeddings generated by random-walk graph embedding methods; ii) with the exception of "SimRank", each strategy offers the best performance for at least one dataset and metric combination; and iii) in each dataset the best performances for both the ARI and NMI measures are achieved by the same strategy.

Table 4: Experimental evaluation on real-world datasets. The first number in each cell refers to the mean metric value (over ten iterations), while the second number refers to two standard deviations. The best performance in each metric for each dataset is denoted in bold. Results marked with " * " provide a statistically significant (p < 0.05) increase over the results of node2vec.

CONCLUSIONS
The ubiquitous nature of networks and their ease of representation as graphs has led to several graph analytics tasks that seek to discern information about the characteristics of the network. By mapping graphs into vector spaces, classical machine learning algorithms can be efficiently applied to gain additional insight about a network's features. In this work, we studied random walk graph embedding methods and the influence of edge weighting strategies in community detection. We used four intuitive state-of-the-art strategies and experimentally demonstrated that under several settings the utilization of edge weighting strategies can lead to improved performance according to the ARI, NMI and modularity measures.
Future work could focus on exploring the influence of weighting strategies while using community detection approaches other than k-means. Alternatively, the influence of weighting strategies in other analytics domains (such as node classification or link prediction) constitutes another interesting future work direction.