Fuzzy Community Detection Model in Social Networks

In this paper, a fuzzy graph clustering model is presented to identify overlapping communities in a complex network. A center-based fuzzy clustering model is developed based on the possibilistic c-means clustering model, and the distance measure is defined based on the similarity to the clusters' centers. The performance of the clustering process is evaluated by intracluster and intercluster density. In addition, experimental results from two artificially generated networks and two real-world networks (social interactions between karate club members and a part of the Twitter network) demonstrate the new model's performance.


INTRODUCTION
Networks (or graphs) have long been the subject of many studies in the fields of mathematics, sociology, biology, information science, and quantitative geography. 1,2 A graph's structure is formed by a set of nodes (or vertices) and a set of links (or edges) that connect the vertices. 3 A group of nodes that probably share common properties and/or play similar roles within the graph is called a community or cluster. 1,4 The task of grouping nodes that are connected to one another by edges but have no connection outside the group is referred to as graph clustering or community detection. 1,3 The graph clustering methods in the literature are divided into two main groups. The first group consists of global methods such as hierarchical clustering, divisive clustering, and agglomerative clustering, while the second group includes local clustering methods such as local search and fitness function methods. In global clustering, each node of the input graph is assigned to a cluster in the output of the method, whereas in local clustering, cluster assignments are made only for a certain subset of nodes, commonly only one node. 3

Community structure is an important feature of real-world networks. 5 A graph's nodes may be shared among different communities and form overlapping communities. Discovering and detecting overlapping communities, which exist in most real social networks, is an important topic in social network analysis. 6,7 The number of studies in this area involving either heuristic or local search procedures has increased considerably. 1,5,8-10 In recent studies, several methods and algorithms have been proposed for detecting overlapping communities, such as the method that uses belief propagation and conflict to occupy communities, 5 a modification of the Girvan and Newman model, 11 the Q function overlap community detection method, 12 local optimization, 13,14 methods based on fuzzy relations and theory, 3,7,15-18 fuzzy c-means (FCM) clustering, 19 and some others. 20-22

The fuzzy objective function-based clustering methods comprise a family of local graph clustering methods that can be formulated as the problem of minimizing an objective function. These methods assign the nodes to communities with different belonging values and thereby form overlapping communities. Fuzzy clustering algorithms were proposed by Dunn 23 and extended by Bezdek. 24 Although fuzzy clustering has been developed and applied widely to general clustering tasks, little research can be found on fuzzy clustering in graph clustering. 3 In general, the past decade has been quiet concerning the application of fuzzy clustering in this area. 25,26 Although some methods for discovering fuzzy overlapping communities have been presented recently, there is still room for improving their performance and universality. 3,25 The most popular fuzzy clustering model is FCM, which is mostly used in combination with other techniques for detecting communities. 19,27,28 The structure of the fuzzy clustering model in these studies is not well adapted to graph clustering, specifically in the determination of the clusters' centers.

In this paper, a fuzzy overlapping community detection model and algorithm are developed based on an analysis of the semantics of data in social networks. The distance matrix describing the distances between the nodes in a network is established based on the closeness to the clusters' centers. We conducted experiments on both synthetic and real networks to evaluate our detection algorithm. The real networks included the karate club network and a part of the Twitter network. Our proposed approach modifies the structure of the possibilistic c-means (PCM) clustering model and adapts it for graph clustering. Our experimental results indicate the effectiveness of the proposed approach.

This paper is organized as follows. Section 2 briefly describes the different graph distance matrices and the most popular center-based clustering models, including crisp clustering models (c-means and c-medoid) and fuzzy clustering models (FCM and PCM). The proposed fuzzy graph clustering model is described in Section 3. The applied clustering validity indices are addressed in Section 4. The experimental results are presented in Section 5 to demonstrate the performance of the proposed model. Section 6 summarizes the conclusions with suggestions for further research in this area.

Distance Matrix
The sociomatrix, or adjacency matrix, is the primary matrix used in social network analysis and is denoted by $A$. Graph theory refers to this matrix as an adjacency matrix because its entries indicate whether two nodes are adjacent or not. For one-mode networks, the size of the adjacency matrix $A$ is $g \times g$, where $g$ is the number of graph nodes. The entries $a_{ij}$ record which pairs of nodes are adjacent: if nodes $n_i$ and $n_j$ are adjacent, then $a_{ij} = 1$, and if they are not adjacent, then $a_{ij} = 0$. 29 In this paper, we focus on graphs whose links are not directed and are neither signed nor valued. If a link between two nodes is present, it goes both from $n_i$ to $n_j$ and from $n_j$ to $n_i$; thus, $a_{ij} = 1$ and $a_{ji} = 1$. In other words, the adjacency matrix for a nondirectional relation is symmetric: $a_{ij} = a_{ji}$, $\forall i, j$. 29

Defining or selecting an appropriate similarity or distance function depends on the task at hand, and a very large number of similarity measures have been used in the literature. 30,31 Given a data set, a distance measure $D_{ij}$ should fulfill criteria such as the following 3 :
• The distance from a datum to itself is zero: $D_{ii} = 0$.

For points in an n-dimensional Euclidean space, possible distance measures for two data points $A_i = (a_{i1}, a_{i2}, \ldots, a_{ig})$ and $A_k = (a_{k1}, a_{k2}, \ldots, a_{kg})$ include the Euclidean distance 1 :

$$D_{ik} = \sqrt{\sum_{j=1}^{g} (a_{ij} - a_{kj})^2} \qquad (1)$$

which is the $L_2$ norm, the Manhattan distance:

$$D_{ik} = \sum_{j=1}^{g} |a_{ij} - a_{kj}| \qquad (2)$$

which is the $L_1$ norm, and the $L_\infty$ norm:

$$D_{ik} = \max_{1 \le j \le g} |a_{ij} - a_{kj}| \qquad (3)$$
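The three distance measures can be illustrated on the rows of a small adjacency matrix. The following is a minimal sketch, assuming NumPy is available; the helper name `row_distances` and the 4-node path graph are hypothetical and serve only as an illustration.

```python
import numpy as np

def row_distances(A, i, k):
    """Return the (L2, L1, L-infinity) distances between rows i and k of A."""
    diff = np.abs(A[i].astype(float) - A[k].astype(float))
    euclidean = float(np.sqrt((diff ** 2).sum()))
    manhattan = float(diff.sum())
    l_inf = float(diff.max())
    return euclidean, manhattan, l_inf

# Hypothetical undirected, unweighted graph: the path 0 - 1 - 2 - 3.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Nodes 0 and 2 share the neighbor 1 and differ only in the link to node 3.
euclid, manhattan, chebyshev = row_distances(A, 0, 2)
print(euclid, manhattan, chebyshev)
```

Because the rows are binary, the Manhattan distance simply counts the positions in which two nodes' neighborhoods differ, which is why it is a natural choice for unweighted graphs.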

Center-Based Clustering
Cluster analysis is the task of grouping objects so that the objects within a group are similar (or relevant) to one another and different from (or irrelevant to) the objects in other groups. 32 The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clusters. 33 Prototype-based clustering techniques create a one-level partitioning of data objects. There are a number of such techniques, but two of the most prominent are c-means and c-medoid. c-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space. 34 If $X = \{x_1, x_2, \ldots, x_n\} \subset R^s$ is a set of feature vectors ($n$ and $s$ are the number and dimension of data points, respectively), c-means partitions the data set into $c$ clusters by minimizing an evaluation function $J$ defined over a prototypical set of clusters' centers $V = \{v_1, v_2, \ldots, v_c\}$, where $L$ is any distance norm. 40 Objective function-based clustering can be treated as an optimization problem and solved by the gradient descent technique. 34 If the distances between the objects and the clusters' centers are measured by the Euclidean distance $L_2$, the updating formula for the cluster center $v_i$ is obtained by setting the derivative of $J$ with respect to $v_i$ to zero.

Different distance measures can be applied in the clustering objective function. The c-medoid objective function (6) minimizes the Manhattan ($L_1$) distance of points from the cluster center, and the updating formula for the cluster center is obtained by setting the derivative of (6) to 0 and solving. c-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points, and can be applied to a wide range of data since it requires only a proximity measure for a pair of objects. While a centroid almost never corresponds to an actual data point, a medoid by definition must be an actual data point. 34
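The objective functions described in this section can be written in their usual textbook forms. The block below is a sketch under that assumption: the crisp partition matrix $U = [u_{ik}]$, the squared Euclidean norm in the c-means objective, and the layout are assumptions and may differ from the notation of the original equations.

```latex
% c-means: crisp memberships u_{ik} \in \{0,1\}, each point in exactly one cluster
\begin{align}
J(U, V) &= \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}\, \lVert x_k - v_i \rVert_2^2,
  \qquad \sum_{i=1}^{c} u_{ik} = 1 \;\; \forall k, \\
% setting \partial J / \partial v_i = 0 gives the mean of the cluster members
v_i &= \frac{\sum_{k=1}^{n} u_{ik}\, x_k}{\sum_{k=1}^{n} u_{ik}}, \\
% same structure with the Manhattan (L_1) distance
J_{L_1}(U, V) &= \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}\, \lVert x_k - v_i \rVert_1 .
\end{align}
```

In the medoid setting, the minimization over $v_i$ is restricted to the data points themselves, so the center is always an actual member of the cluster.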

Fuzzy Center-Based Clustering
In classical clustering approaches, each object is assigned to a single cluster. There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by overlapping or fuzzy clustering. In fuzzy clustering, clusters are treated as fuzzy sets, and every object belongs to every cluster with a membership degree between 0 (absolutely does not belong) and 1 (absolutely belongs). 32 Fuzzy clustering allows an object to belong to several clusters with different membership degrees and defines belonging to the clusters as a value in the interval [0, 1]. 35 The most well-known fuzzy clustering method is the FCM clustering algorithm proposed by Dunn 23 and extended by Bezdek. 24 If $X = \{x_1, x_2, \ldots, x_n\} \subset R^s$ is a set of feature vectors ($n$ and $s$ are the number and dimension of data points, respectively), FCM partitions the data set into $c$ clusters by minimizing a membership-weighted evaluation function, subject to the condition that the membership values of each point over all clusters sum to 1. To reduce the effect of outliers, this condition was relaxed, and the PCM 39 objective function was defined with an additional term scaled by $\beta_i$, the average fuzzy intracluster distance of cluster $i$.
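For reference, the FCM and PCM objective functions and membership updates are sketched below in their commonly cited forms. This is an assumption-laden sketch: the shorthand $d_{ik} = \lVert x_k - v_i \rVert$, the constant $K$, and the exact layout are not taken from this paper, and $\beta_i$ plays the role of the scale parameter of the possibilistic model.

```latex
\begin{align}
% FCM: memberships of each point sum to 1 over the clusters
J_m^{\mathrm{FCM}}(U, V) &= \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m}\, d_{ik}^{2},
  \qquad \sum_{i=1}^{c} u_{ik} = 1, \\
u_{ik} &= \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{2/(m-1)} \right]^{-1}, \\
% PCM: the sum-to-one constraint is dropped and a second term penalizes small memberships
J_m^{\mathrm{PCM}}(U, V) &= \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m}\, d_{ik}^{2}
  + \sum_{i=1}^{c} \beta_i \sum_{k=1}^{n} (1 - u_{ik})^{m}, \\
u_{ik} &= \frac{1}{1 + \left( d_{ik}^{2} / \beta_i \right)^{1/(m-1)}},
  \qquad
  \beta_i = K\, \frac{\sum_{k=1}^{n} u_{ik}^{m}\, d_{ik}^{2}}{\sum_{k=1}^{n} u_{ik}^{m}} .
\end{align}
```

The PCM membership of a point in cluster $i$ depends only on its distance to that cluster's center, not on its distances to the other centers, which is what lets distant points receive low memberships everywhere.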

PROPOSED FUZZY CENTER-BASED GRAPH CLUSTERING MODEL
If data are represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected entity, that is, a group of objects that are connected to one another but have no connection to objects outside the group. 34 By considering the Manhattan distance (2), the center-based graph clustering objective function derived from (6) becomes the sum of $u_{ik} D_{ik}$ over all clusters $i$ and nodes $k$ (12), with $D_{ik} = \sum_{j=1}^{g} |a_{ij} - a_{kj}|$, where $u_{ik} \in \{0, 1\}$, $n_i$ is the center of cluster or subgraph $G_i$, and $g_i$ is the number of members of cluster $i$. Node $n_k$ is assigned to cluster $i$, and $u_{ik} = 1$, if the distance from $n_k$ to the central node $n_i$ is the minimum over all centers; otherwise $u_{ik} = 0$. The updating formula for the cluster center is defined by setting the derivative of (12) with respect to $n_i$ to 0 and solving, which yields (14). Formula (14) is solved if $n_i$ is the center of $G_i$ and has the minimum distance to the cluster members. The central node of cluster $i$, $n_i$, is therefore defined as the member of $G_i$ with the minimum total distance to the other members of the cluster.

As shown in Figure 1, if communities are well separated, each node is assigned to only one community and the communities do not overlap. However, a different version of the community detection problem allows nodes to belong to more than one community, leading to the concept of overlapping communities. The intuition behind overlapping clustering is that real complex networks are not usually divided into sharp subnetworks; instead, nodes may naturally belong to more than one community. 38 As shown in Figure 2, $n_4$ is shared between cluster 1 and another cluster.
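The crisp center-based procedure described above (assign each node to its nearest central node, then move each center to the member with the smallest total distance to its cluster) can be sketched as follows. This is an illustrative sketch assuming NumPy, not the authors' implementation; the function names, the tie-breaking by lowest index, and the test graph of two 4-cliques joined by one link are hypothetical.

```python
import numpy as np

def manhattan_matrix(A):
    """D[i, k] = sum_j |a_ij - a_kj| for every pair of nodes (O(g^3) memory, small graphs only)."""
    A = np.asarray(A, dtype=float)
    return np.abs(A[:, None, :] - A[None, :, :]).sum(axis=2)

def center_based_graph_clustering(A, centers, max_iter=100):
    """Alternate crisp assignment and medoid-style center updates until the centers stop changing."""
    D = manhattan_matrix(A)
    centers = list(centers)
    for _ in range(max_iter):
        # u_ik = 1 for the nearest center (ties go to the lowest cluster index).
        labels = D[centers].argmin(axis=0)
        new_centers = []
        for i, old in enumerate(centers):
            members = np.flatnonzero(labels == i)
            if members.size == 0:
                new_centers.append(old)  # keep the old center of an empty cluster
                continue
            totals = D[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[totals.argmin()]))
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical test graph: cliques {0,1,2,3} and {4,5,6,7} joined by the link 3-4.
A = np.zeros((8, 8), dtype=int)
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1
A[3, 4] = A[4, 3] = 1

labels, centers = center_based_graph_clustering(A, centers=[0, 7])
print(labels.tolist(), centers)
```

On this graph the bridging nodes 3 and 4 are never chosen as centers, because their extra link makes their adjacency rows less typical of their own clique.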

The FCM objective function formulation for detecting overlapping communities weights the distance $D_{ik}$ between each node $n_k$ and each cluster center $n_i$ by $u_{ik}^m$, where $u_{ik} \in [0, 1]$ and $m \in [1, \infty)$ is a weighting exponent. The probabilistic constraint of FCM that the memberships of a data point across clusters sum to 1, $\sum_{i=1}^{c} u_{ik} = 1$, causes considerable trouble in noisy environments. The following simple examples show the problems associated with the probabilistic constraint used in FCM. Figure 3a shows an example of two clusters. In this case, node $n_4$ is shared between cluster 1 and cluster 2, whereas $n_9$ belongs to neither cluster. The probabilistic constraint in FCM forces both $n_4$ and $n_9$ to have a membership of 0.5 in each cluster. Figure 3b shows another situation containing two clusters. While $n_5$ typically belongs to cluster 2, the FCM membership values indicate that $n_5$ is shared equally between the two clusters. This may cause misclassification in some pattern recognition applications.

Now, we define a new fuzzy center-based clustering model to detect overlapping communities in complex networks. The method is based on the PCM clustering model and adjusted to the discovery of overlapping communities. The proposed objective function $J_m$ (17) consists of two terms and depends on the memberships $u_{ik}$, the cluster centers $n_i^c$, and the density of cluster $i$, which will be discussed later. The first term minimizes the distance from cluster centers as much as possible, whereas the second term forces $u_{ik}$ to be as large as possible, thus avoiding the trivial solution. To obtain the membership updating formula (18), the derivatives of $J_m$ with respect to $u_{ik}$ are set to zero. Turning to the problem of finding the optimal node as the cluster center, the updating formula for the cluster center is defined by setting the derivative of (17) with respect to $n_i$ to 0 and solving.
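The effect of the probabilistic constraint can be reproduced numerically. The sketch below assumes NumPy and uses the standard FCM and PCM membership updates for an objective in which the distance enters unsquared; the distances and the scale value 4.0 are made-up numbers, and the `scale` argument is a hypothetical stand-in for the per-cluster parameter of the possibilistic term.

```python
import numpy as np

def fcm_memberships(d, m=2.0):
    """FCM update for distances d of shape (c, n); each column sums to 1."""
    inv = d ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0)

def pcm_memberships(d, scale, m=2.0):
    """PCM update: each membership depends only on the distance to its own center."""
    scale = np.asarray(scale, dtype=float)[:, None]
    return 1.0 / (1.0 + (d / scale) ** (1.0 / (m - 1.0)))

# Column 0: a node equally close to both centers (a shared node).
# Column 1: a node equally far from both centers (an outlier).
d = np.array([[2.0, 10.0],
              [2.0, 10.0]])

u_fcm = fcm_memberships(d)
u_pcm = pcm_memberships(d, scale=[4.0, 4.0])
print(u_fcm)  # both nodes get 0.5 in each cluster
print(u_pcm)  # the shared node gets a clearly higher membership than the outlier
```

FCM cannot tell the two nodes apart because only the ratio of distances matters, whereas the possibilistic update keeps the absolute distance and therefore separates the shared node from the outlier.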
Formula (19) is solved if $n_i$ is the center of cluster $i$ and has the minimum distance to the other cluster members with respect to their membership values $u_{ik}$; the center of cluster $i$ is defined accordingly. When $D_{ik} = \sum_{j=1}^{g} |a_{ij} - a_{kj}|$, similarity or connection to the central node has the same impact as connections to other nodes. But in some cases, having a connection to the central node has significant value. Therefore, the distance function could be defined as a weighted distance $\hat{D}_{ik}$ (21) with $0 \le \omega \le 1$: if $\omega \to 1$ in Equation 21, only the connection to the central node is considered, and if $\omega \to 1/g_i$, the connection to the central node is as important as connections to the other nodes. As for the density of cluster $i$, our experiments suggest setting it to the proportion of the cluster center's links to all the links present in the graph. The density goes to 0 if there are no links from a cluster center and goes to 1 if all links go to the centers. The higher the density, the more connections exist from the centers to other nodes, resulting in a denser cluster. Figure 4 illustrates the proposed algorithm for detecting overlapping communities.
Choosing proper initial central nodes is a key step in a center-based clustering procedure. A common approach is to choose the initial centers randomly, but the results are often poor. 34 We consider the nodal degree of a node as the criterion for choosing the initial centers. The nodal degree of a node is equal to the number of lines incident with the node in the graph. It may be found by summing the appropriate elements of the adjacency matrix, 29 and it determines the initial center $n_i^0$ of each cluster $i$.
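A degree-based initialization can be sketched as follows, assuming NumPy. Taking the `c` nodes with the highest degree is an assumption about how the criterion is applied, and the function name and 5-node example graph are hypothetical.

```python
import numpy as np

def initial_centers_by_degree(A, c):
    """Return the indices of the c highest-degree nodes and the degree vector."""
    A = np.asarray(A)
    degrees = A.sum(axis=1)  # nodal degree = row sum of the adjacency matrix
    order = np.argsort(-degrees, kind="stable")  # ties keep the lower node index first
    return [int(i) for i in order[:c]], degrees

# Hypothetical graph with links 0-1, 0-2, 0-3, 3-4.
A = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (0, 2), (0, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1

centers, degrees = initial_centers_by_degree(A, c=2)
print(centers, degrees.tolist())
```

One caveat of this simple rule is that two high-degree nodes may sit in the same community, so in practice an additional check that the chosen centers are not too close to each other may be worthwhile.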

CLUSTERING VALIDATION INDEX
The numbers of links inside a community and links with the rest of the graph are a reference guideline that serves as the basis of most community definitions and as a proper criterion for evaluating the performance of the clustering process. Let us consider a subgraph $G_i$ of a graph $G$, with $|G_i| = g_i$ and $|G| = g$, respectively. We define the internal and external degree of subgraph $G_i$ as the number of links connecting nodes inside the subgraph to one another and to the rest of the graph, respectively. The intracluster density, $\delta_{int}(G_i)$, is defined as the ratio between the number of internal links of $G_i$ and the number of possible internal edges:

$$\delta_{int}(G_i) = \frac{\#\ \text{internal edges of } G_i}{g_i (g_i - 1)/2} \qquad (24)$$

Similarly, the intercluster density, $\delta_{ext}(G_i)$, is the ratio between the number of edges running from the nodes of $G_i$ to the rest of the graph and the maximum number of intercluster edges possible, i.e.,

$$\delta_{ext}(G_i) = \frac{\#\ \text{intercluster edges of } G_i}{g_i (g - g_i)} \qquad (25)$$

For $G_i$ to be a community, we expect $\delta_{int}(G_i)$ to be appreciably large and $\delta_{ext}(G_i)$ to be small. Searching for the best tradeoff between a large $\delta_{int}(G_i)$ and a small $\delta_{ext}(G_i)$ is implicitly or explicitly the goal of most clustering algorithms. A simple way to do that is, e.g., maximizing the sum of the differences $\delta_i = \delta_{int}(G_i) - \delta_{ext}(G_i)$. 1 Additionally, the rate of change could be determined by the ratio of $\delta_{int}(G_i)$ to $\delta_{ext}(G_i)$, $\delta_{int}(G_i)/\delta_{ext}(G_i)$. Similar to the case of the difference, the lower the value of the ratio, the better the clustering results. We consider the summation of the differences and the summation of the ratios as criteria to evaluate the clustering performance and determine the optimum number of clusters.
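The two densities can be computed directly from the adjacency matrix. The following is a minimal sketch assuming NumPy and an undirected, unweighted graph without self-loops; the function name and the test graph (two 4-cliques joined by a single link) are hypothetical.

```python
import numpy as np

def cluster_densities(A, members):
    """Return (intracluster density, intercluster density) of the node set `members`."""
    A = np.asarray(A)
    g = A.shape[0]
    inside = np.zeros(g, dtype=bool)
    inside[list(members)] = True
    g_i = int(inside.sum())

    internal = A[np.ix_(inside, inside)].sum() / 2  # each internal link is counted twice
    external = A[np.ix_(inside, ~inside)].sum()     # links leaving the cluster

    max_internal = g_i * (g_i - 1) / 2
    max_external = g_i * (g - g_i)
    d_int = internal / max_internal if max_internal else 0.0
    d_ext = external / max_external if max_external else 0.0
    return float(d_int), float(d_ext)

# Hypothetical graph: cliques {0,1,2,3} and {4,5,6,7} joined by the link 3-4.
A = np.zeros((8, 8), dtype=int)
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1
A[3, 4] = A[4, 3] = 1

d_int, d_ext = cluster_densities(A, members=[0, 1, 2, 3])
print(d_int, d_ext)
```

For the first clique, all 6 possible internal links are present and only 1 of the 16 possible links to the rest of the graph exists, so the cluster is dense inside and sparse outside.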

EXPERIMENTAL RESULTS
In this section, the performance of our model is tested on two artificial and two real-world networks. First, we consider a simple data set (1) containing 12 nodes, as shown in Figure 5a. There are two communities in this sample. Node $n_6$ is shared between the two clusters, while $n_7$ is not connected to either cluster. The membership values generated by the FCM and the PCM for $n_6$ and $n_7$ are shown in Figures 5b and 5c. The membership values generated by the PCM reflect the difference between these two nodes. Although FCM generates the same membership values for these two nodes in each cluster, the membership value generated by the PCM for $n_6$ is considerably higher than the membership value of $n_7$.
In Figure 6, the membership values generated by PCM for $\omega = 1, 0.5, 1/g$ in Equation 21 are shown. As the value of $\omega$ increases, the membership value is increasingly determined by the closeness to the center of the cluster alone. In other words, by setting $\omega = 1$, the membership value is determined based only on the connection to the center. Therefore, $u_{ik}$ equals 1 for a node connected to the center, and $u_{ik} = 0$ otherwise. If closeness to the center is as important as closeness to other nodes, we can set $\omega = 1/g$ or any other ratio, based on the application and the role of the center in the cluster.
The second example, containing 18 nodes, is shown in Figure 7a. There are two communities in this sample. Node $n_6$ is shared between the two clusters. The membership values generated by FCM and PCM for the second sample set are shown in Figures 7b and 7c. The membership values generated by FCM for node $n_6$ are equal for both clusters, while PCM generates a higher membership value for cluster 2 than for cluster 1. Consequently, PCM is able to detect that $n_6$ typically belongs to cluster 2 rather than cluster 1. Figure 8 shows the membership values generated by PCM for sample set 2 when $\omega = 1, 0.5, 1/g$ in Equation 21. By increasing the value of $\omega$, the closeness and connections to the central node become more important. Similar to sample 1, when $\omega = 1$, the membership value is determined based only on the connection to the center. Therefore, $u_{ik} = 1$ for a node connected to the center, and $u_{ik} = 0$ otherwise. If closeness to the center is as important as closeness to other nodes, we can set $\omega = 1/g$ or any other ratio based on the application.
The third sample is Zachary's karate club network, which describes the social interactions among 34 members of a karate club over a period of two years. 39 The instructor of this club split the members into two groups based on their social interactions. This network has been used in several papers to evaluate cluster analysis in general. The result of segmentation by center-based graph clustering is illustrated in Figure 9a, which shows no difference in assigning the members to groups compared to similar studies in the literature. Figure 9b shows the results of our proposed model. In this figure, the exclusive members of cluster 1 and cluster 2 are shown in white and black, whereas the nodes colored gray are shared between the two clusters and have high membership values in both.
As mentioned before, in fuzzy clustering, instead of each object belonging to only one cluster, an object belongs to all clusters with membership values in the interval [0, 1]. The results of the segmentation of Zachary's karate club network are shown in Figure 9b. In fuzzy clustering, each node belongs to each cluster based on its relative similarity and, consequently, its membership value. By setting $\omega = 1$ in (21), the network is segmented into clusters based on centers with a high degree of centrality, as shown in Figure 10a. By setting $\omega = 1/g$ in (21), as shown in Figure 10b, similarity to all members of a cluster is as important as similarity to the cluster center.
The intracluster and intercluster densities of the clusters that resulted from our model for $\omega = 1$ and $\omega = 1/g$ are tabulated in Table I. As summarized in Table I, the difference between intracluster and intercluster density is lower when the number of clusters is smaller. The trade-off between intracluster and intercluster density indicates that four is the optimum number of clusters. Additionally, the $\delta$ trend is lower for $\omega \to 1$ compared to $\omega \to 1/g$. This means that when we consider only the relation to cluster centers as the basis for measuring similarity, $\omega = 1$, a node with more connections within the cluster is selected as a center. However, when $\omega \to 1/g$, a central node is one that has the same or similar connection(s) in the cluster as the other members of the cluster.

The fourth example for testing our model is a part of the Twitter network. 40 Twitter is an online social networking and microblogging service that enables users to send and read tweets. Our sample is a part of this social network, which consists of 3656 nodes with 16,383 links, as shown in Figure 11a. The average degree and density of this experimental graph are 51.62 and 0.014, respectively. Also, the probability distribution of the nodes' degrees over the whole network is illustrated in Figure 12a.
The intracluster and intercluster densities were calculated using (24) and (25) for the results of the center-based fuzzy graph clustering over $1 \le c \le 50$. In Figure 12b, $\delta$ shows a downward trend as the number of clusters increases; however, its reduction rate is lower for $c \ge 7$. Figure 12c shows the trend of $\delta$ over $1 \le c \le 50$. Its first local minimum is at $c = 7$. The results of segmentation by center-based fuzzy clustering are shown in Figure 11b. However, there are some other options for the optimum number of clusters, such as $c = 13, 22, 30, 38, \ldots$, in the case that a higher number of clusters is desired.
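Reading candidate cluster numbers off a validity curve amounts to locating its local minima. The sketch below shows one simple way to do this in plain Python; the function name is hypothetical and the index values are made-up numbers for illustration, not results from the experiments.

```python
def local_minima(values, start=1):
    """Return the cluster counts c at which `values` has a strict local minimum.

    values[0] is the index value for c = start, values[1] for c = start + 1, and so on.
    """
    minima = []
    for idx in range(1, len(values) - 1):
        if values[idx] < values[idx - 1] and values[idx] < values[idx + 1]:
            minima.append(start + idx)
    return minima

# Made-up validity index values for c = 1..8 (illustration only).
index_values = [5.0, 4.0, 4.5, 3.0, 2.5, 2.8, 2.6, 2.7]
candidates = local_minima(index_values)
print(candidates)
```

The first entry of the returned list corresponds to the smallest candidate number of clusters, and the later entries are the alternatives available when a finer partition is desired.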

CONCLUSIONS
In this paper, a fuzzy clustering model is defined for detecting overlapping communities in complex networks. Our proposed model is developed based on the PCM clustering model and assigns each node to each cluster with a degree of belonging in the interval [0, 1]. Therefore, instead of belonging to exactly one cluster, a node can belong to more than one cluster, and a set of membership levels is associated with each node. Since the constraint on the sum of the membership values of each node over all clusters is relaxed, our proposed model is able to recognize outliers and noisy nodes in the network and, consequently, avoid misclassification in pattern recognition applications. The results indicate that our model is more powerful in detecting communities when some nodes are either shared between clusters or disconnected from communities.
In addition, in this paper we study the role of clusters' centers in the community detection process. By adjusting the parameter $\omega$ in the defined distance function, Equation 21, our model can select either the most central member of each cluster or the most typical node of the cluster as the cluster center. The performance of our proposed method was examined by applying it to some artificial and real-world networks. The numbers of links inside the community and links with the rest of the graph were used to evaluate the model. The intracluster and intercluster densities of the clusters were used to evaluate the performance of our method and determine the optimum number of clusters. The results of our proposed model show better performance in community detection and provide a new understanding of clusters' centers in network analysis. In future work, we will focus on more details of fuzzy clustering and study the role of clusters' centers in overlapping communities.