Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach

ABSTRACT


INTRODUCTION
Graphs enable us to visualize connected data and to rely on visual perception to uncover valuable hidden knowledge, which can be used to improve the decision-making process of an organization. Visualizing large graphs built from the continuously growing amount of available data has become a very complex task that has outpaced the human ability to process, analyze, visualize and even understand them. Therefore, a process for reducing large graphs into smaller, more representative ones is needed. To address this challenge, graph clustering has recently emerged as a promising research area.
Graph clustering can be defined as the problem of collecting similar nodes into the same groups, called "clusters." It is a widely known technique with applications in various fields such as social media [1], Web search results optimization [2], wireless sensor networks [3] and biochemical neural networks [4], among others. In most cases, the number of clusters to form is known in advance and is given as an input to the clustering algorithm. However, with the prevalence of Big Data, it has become harder to know the number of clusters beforehand. This also applies to large graph clustering, where the decision-maker's visual perception is not sufficient to provide even an approximate prior idea of the eventual number of clusters. Therefore, it has become imperative to propose solutions where the clustering algorithm can automatically "guess" the correct number of clusters before proceeding with the clustering operation. This research field, called "Automatic Graph Clustering," started in the late 1990s but did not blossom until the late 2000s and early 2010s, with the introduction of artificial intelligence concepts such as nature-inspired algorithms [5], [6].

In most papers on automatic graph clustering based on nature-inspired algorithms, the main idea is to define a "Similarity Measure" and then set the clusters according to it. Several papers adopted a discrete formulation of the problem and proposed adaptations of the originally continuous nature-inspired algorithms to solve it.
In the present work, we proceed differently: we adapt the graph clustering problem itself so that it can be represented as a continuous problem. This adaptation relies on "Bat-Cluster," a combination of "FFDP," a large graph visualization algorithm developed by our team [7], and the "Bat Algorithm," a nature-inspired optimization algorithm developed by Xin-She Yang [8] based on the behavior of bats. FFDP sets an equilibrium positioning of the large graph and then provides the final node positions as a vector of coordinates. The Bat Algorithm takes this vector as input and tries to find the best possible clustering configuration.
After reviewing the related works in Section 2, we describe, in Section 3, the similarity measure we used as an objective function and how it is optimized by the "Bat-Cluster" (BC) algorithm.
The testing and results of the clustering provided by "Bat-Cluster," compared with other well-known solutions such as PSO, Differential Evolution and Ant Colony Optimization, are discussed in Section 4. Section 5 concludes the paper and outlines our future work.

RELATED WORKS
In this section, we explore some of the most important nature-inspired solutions proposed for automated graph clustering, before introducing Bat-Cluster in Section 3.

Particle Swarm Optimization
The literature contains several approaches to using PSO in graph clustering, often referred to as "Community Detection." Most of these approaches adapt PSO, an algorithm originally designed to solve continuous optimization problems, so that it can solve discrete problems. Cai et al. proposed in [9] and [10] an alteration of the definition of the position and velocity terms, where the position vector represents a partition of a signed network and the velocity represents an eventual permutation of the partition. Suganthi and Rajagopalan [11] applied PSO in its continuous form, but they suggested using a multiple-population swarm instead of the standard PSO with one population. Rejina Parvin and Vasanthanayaki [12] used PSO to prevent residual nodes (nodes that do not belong to any cluster) in wireless sensor networks. Their idea has been applied to optimize the energy consumption, throughput, packet delivery ratio, and network lifetime of wireless sensor networks.

Ant Colony Optimization
Mandala et al. [13] proposed an ACO-based technique for graph clustering and applied it to detecting customer communities in the e-marketing field. Ji et al. [14] suggested a solution to the problem of complex community detection in large graphs, based on the strategy of ant pheromone diffusion and update, to search for an optimal graph partitioning. Zhou et al. [15] followed a similar process, but they took the overlapping of large communities into consideration. Moradi and Rostami [16] used ACO along with feature selection to define clusters of features. Gao et al. [17] proposed a combination of ACO and K-Means as a solution to the dynamic location routing problem. K-Means is used to define the location of depots (cluster centers), while ACO is used to handle the vehicle routing problem (VRP) in dynamic environments.

Differential Evolution
Paterlini et al. [18] proposed a direct application of DE to solve the problem of graph partitioning; a comparative study with the Genetic Algorithm (GA) showed that DE was more efficient. Cai et al. [19] proposed an adaptation of DE inspired by the phenomenon of social learning in animal societies. They improved the traditional DE by introducing the strategic ASL selection, which allows the algorithm to rely on the information extracted from the neighborhood relationships of its population individuals to guide the selection of the eligible parents for the crossover. Attempts to hybridize DE with other algorithms can be found in recent literature. For instance, Zorarpaci and Özil [20] suggested a combination of DE and the Artificial Bee Colony algorithm and applied it to the problem of feature selection.

PROPOSED SOLUTION: "BAT-CLUSTER"

Objective Function
The objective function of the algorithm is the quality measure that helps it decide which clustering configuration is the best. Nanda and Panda [21] provided a list of several clustering quality metrics available in the literature. What we want is a clustering that highlights, on the one hand, the closeness between similar nodes and, on the other hand, the separation between different nodes. Therefore, distance should play a fundamental role in choosing our quality metric. However, relying only on the distance from the cluster center, as in the traditional K-Means, or only on the distance between cluster centers, may not be sufficient.
We need a metric that combines these two distances, so that it ensures that similar nodes are close to each other and far from the nodes that are different from them.
One of the most popular metrics in the literature is called "DBIndex" [22]. It was developed by Davies and Bouldin, and it provides a ratio between the intra-cluster distance (the distance between the nodes in the same cluster) and the inter-cluster distance (the distance between the centers of each cluster).
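In its standard form, DBIndex is defined as:
$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \qquad (1)$$
where $K$ is the number of clusters, $c_i$ is the center of cluster $i$, $\sigma_i$ is the average distance between the nodes of cluster $i$ and $c_i$, and $d(c_i, c_j)$ is the distance between the centers of clusters $i$ and $j$.
According to Davies and Bouldin [22], a correct clustering minimizes the DBIndex given in Equation (1). The objective function for our clustering algorithm should therefore be:
$$\min \; DB \qquad (2)$$
To solve Equation (2), we propose a hybridization of the standard Bat Algorithm by Yang [8] with the FFDP algorithm that we developed in a previous work [7]. We chose to call this hybridization "Bat-Cluster," or BC.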
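To make the metric concrete, the following is a minimal sketch of the standard Davies-Bouldin index computed on node coordinates. It is an illustration and not the implementation used in this work: the function names `centroid` and `db_index`, the use of the Euclidean distance, and the choice of the mean distance to the center as intra-cluster scatter are all assumptions.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def db_index(positions, labels):
    """Davies-Bouldin index of a clustering (lower is better).

    positions: list of coordinate tuples (assumed to come from a layout step)
    labels:    list of cluster ids, one per node
    Assumes at least two non-empty clusters with distinct centers.
    """
    clusters = {}
    for p, lab in zip(positions, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("DBIndex needs at least two non-empty clusters")

    centers = {k: centroid(pts) for k, pts in clusters.items()}
    # Intra-cluster scatter: mean distance of the members to their center.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in pts) / len(pts)
        for k, pts in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst (largest) scatter-to-separation ratio for cluster i.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Four nodes forming two well-separated pairs.
nodes = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = db_index(nodes, [0, 0, 1, 1])  # compact, distant clusters
bad = db_index(nodes, [0, 1, 0, 1])   # stretched, nearby clusters
```

In this toy example the grouping that keeps each pair together yields a much smaller index than the grouping that splits the pairs, which is the behavior the objective function relies on.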

Bat-Cluster
Bat-Cluster, or BC, is a combination of two algorithms: FFDP and the Bat Algorithm (BA). FFDP runs first to set an optimized equilibrium positioning of the nodes of the graph. These node positions are then passed to the Bat Algorithm.
BA starts by generating a population of bats. Each of these bats has its own initial loudness, pulse rate, position and velocity. The initial bat positions represent the initial cluster centers. When the algorithm starts running, each bat is assigned to a cluster center location. For each cluster center, the algorithm calculates the mean position of the nodes closest to it. The cluster center's position is then updated, and the objective function is calculated as in Equation (2). If the value of the objective function has converged, we return the cluster center locations; otherwise, we reassign each bat to the corresponding cluster center once again.
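The center update described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `assign_to_nearest` and `update_centers` are hypothetical, the distance is Euclidean, and a center that attracts no node is simply left where it is.

```python
import math


def assign_to_nearest(positions, centers):
    """Return, for each node, the index of its closest cluster center."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in positions
    ]


def update_centers(positions, labels, centers):
    """Move each center to the mean position of the nodes assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(positions, labels) if lab == k]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(tuple(old))
            continue
        dim = len(old)
        new_centers.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
        )
    return new_centers


nodes = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = assign_to_nearest(nodes, [(1, 1), (9, 1)])
centers = update_centers(nodes, labels, [(1, 1), (9, 1)])
```

Here the two initial centers each attract the pair of nodes on their side and then move to the mean of that pair.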
If a random value rand is greater than the bat's pulse rate, the algorithm selects a solution among the best solutions and generates a local solution around it. If rand is smaller than the loudness $A_i$ and the value of the objective function for the current bat position is better (smaller, in our case) than the value of the best solution found so far, the new solution is accepted, the bat's pulse rate is increased, and its loudness is decreased. The solutions found are sorted, and the current best solution is stored. The algorithm keeps running until the stop criterion is met. In our case, the algorithm stops when the iteration number $t$ reaches the maximum number of iterations $M$.
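The loop described above can be sketched as a generic continuous minimizer that follows the commonly published form of Yang's Bat Algorithm. It is not the Bat-Cluster pseudo code itself. The function name `bat_minimize`, the frequency range, the initial loudness and pulse rate, the coefficients `alpha` and `gamma`, and the clamping to the search bounds are all assumptions chosen for illustration. In a clustering setting, the position vector would hold the coordinates of the cluster centers and the objective would be the DBIndex; here a simple sphere function stands in for it.

```python
import math
import random


def bat_minimize(objective, dim, bounds, n_bats=20, n_iter=100,
                 f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, seed=None):
    """Minimize `objective` over a box; returns (best_position, best_value)."""
    rng = random.Random(seed)
    low, high = bounds

    def clamp(v):
        return min(max(v, low), high)

    pos = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    loud = [1.0] * n_bats      # loudness A_i (assumed initial value)
    rate0 = [0.5] * n_bats     # initial pulse rate (assumed value)
    rate = list(rate0)

    val = [objective(p) for p in pos]
    b = min(range(n_bats), key=val.__getitem__)
    best, best_val = list(pos[b]), val[b]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-driven move relative to the current best solution.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - g) * freq
                      for v, x, g in zip(vel[i], pos[i], best)]
            cand = [clamp(x + v) for x, v in zip(pos[i], vel[i])]
            # Local solution generated around the best solution.
            if rng.random() > rate[i]:
                cand = [clamp(g + rng.uniform(-1.0, 1.0) * mean_loud)
                        for g in best]
            cand_val = objective(cand)
            # Accept: the bat moves, gets quieter, and pulses more often.
            if rng.random() < loud[i] and cand_val < best_val:
                pos[i], val[i] = cand, cand_val
                loud[i] *= alpha
                rate[i] = rate0[i] * (1.0 - math.exp(-gamma * t))
            # Store the current best solution.
            if cand_val < best_val:
                best, best_val = list(cand), cand_val
    return best, best_val


def sphere(x):
    """Toy objective with its minimum (0) at the origin."""
    return sum(c * c for c in x)


start_x, start_val = bat_minimize(sphere, 2, (-5.0, 5.0), n_iter=0, seed=1)
best_x, best_val = bat_minimize(sphere, 2, (-5.0, 5.0), n_iter=50, seed=1)
```

Because the stored best solution is only ever replaced by a strictly better one, the value returned after 50 iterations cannot be worse than the best value of the initial population generated with the same seed.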
The pseudo code of the Bat-Cluster algorithm summarizes the steps described above.
For the tests, we compare Bat-Cluster with PSO, Differential Evolution and Ant Colony Optimization. We use the continuous version of all these algorithms, and the function to optimize is the DBIndex, as given in Equation (2). This approach enables us to compare the performances of these algorithms on an equal footing.

Benchmark Graphs
The graphs used in our tests are four benchmark graphs of different sizes that come from different domains. These graphs are available in the Gephi standard dataset, accessible at the following link: https://github.com/medialab/benchmarkForceAtlas2/blob/master/dataset.zip
Table 1 displays the layouts of the four benchmark graphs.

Parameters Setting
Defining the correct parameters for a nature-inspired algorithm generally requires rigorous prior testing. The same goes for BC and for all of the algorithms that we test it against.
After several experiments, we retained the parameter values that answered our needs correctly.

Experimental Results
Tables 2 to 5 show the performance of Bat-Cluster compared with each of the other algorithms on the four benchmark graphs. On "facebook_ego_686", Bat-Cluster provided the smallest DBIndex value, closely followed by PSO. Moreover, BC was the only algorithm to provide more clusters than the other algorithms, which provided only 3 clusters. The results on the "yeast" graph may seem debatable at first. Indeed, based on the DBIndex alone, BC was the best, but the fact that PSO provided more clusters raises the possibility that PSO might eventually find a better DBIndex value. However, the evolution of the best values depicted in Figure 1 shows that PSO started stagnating after iteration 100 at a DBIndex value higher than the one provided by BC. This suggests that having 5 clusters may not be the best clustering scenario.
Figure 1. The evolution of the best DBIndex values provided by PSO for the "yeast" graph
On the "arxiv_general_relativity" graph, BC gave the smallest DBIndex value and many more clusters (6, against 4 provided by the runner-up, PSO). Regarding the "oregon2_010331" graph, BC and PSO were able to provide 3 clusters, while the other two algorithms could only provide 2 clusters. The DBIndex values of BC and PSO were very close, with a small advantage for BC.
Overall, Bat-Cluster provided the best DBIndex values on all the benchmark graphs. The fact that it was closely followed by PSO shows the ability of swarm optimization algorithms to tackle this kind of problem. However, the results provided by ACO were poorer than expected. When we look at the evolution of the best value provided by ACO on "facebook_ego_686," as in Figure 2, for example, we can see that the algorithm kept finding better values. This suggests that the configuration we gave to ACO may not have been the best one.

CONCLUSION
This paper presented the Bat-Cluster (BC) algorithm, a combination of the FFDP algorithm developed by our team [7] and the Bat Algorithm developed by Xin-She Yang [8]. BC is designed to answer the need for automated large graph clustering. In contrast with several clustering algorithms available in the literature, which tend to formulate automated large graph clustering as a discrete problem, BC translates it into a continuous problem. The idea was to run a large graph layout algorithm, FFDP, and have it provide the coordinates of the equilibrium positions of the graph's nodes. Having these coordinates enabled us to translate the graph into a standard real-valued vector that is easily handled by the continuous version of the Bat Algorithm. The quality metric we used to measure the quality of our clustering was the DBIndex by Davies and Bouldin [22]. The Bat-Cluster algorithm was tested on four benchmark graphs of different sizes and from different domains. BC proved to be a good alternative for solving the automated large graph clustering problem when compared with algorithms considered among the best in the literature.
The Bat-Cluster algorithm will be integrated into XEWGraph [23], the large graph visualization service of the Competitive Intelligence tool Xplor EveryWhere [24]. Coupled with the out-of-the-box categorization provided by XEWGraph's hypergraph approach, BC will enable the user to have large graphs clustered and expanded on demand in both the web and the mobile-oriented interfaces of XEWGraph.