Welcome to CluSim’s documentation!

Installation

This package is available on PyPI. Run the following command in a terminal to install it:

$ pip install clusim

You can also get the source code directly from the GitHub [project page](https://github.com/Hoosier-Clusters/clusim).

Examples and Usage

A first comparison

We start by importing the required modules

>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim

The simplest way to make a Clustering is to use an elm2clu_dict, which maps each element to the list of clusters to which it belongs.

>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[1], 5:[2]})
>>> print_clustering(c1)
>>> print_clustering(c2)

Finally, the similarity of the two Clusterings can be found using the Jaccard Index.

>>> sim.jaccard_index(c1, c2)
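To see where this value comes from, the Jaccard index can be worked out by hand from pairwise co-occurrence counts. The following pure-Python sketch (an illustration of the underlying arithmetic, not CluSim code) counts the element pairs grouped together by each of the two example clusterings above:

```python
from itertools import combinations

# Element memberships of the two example clusterings (elements 0..5)
c1 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
c2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2}

# Pairs of elements placed in the same cluster by each clustering
same1 = {p for p in combinations(c1, 2) if c1[p[0]] == c1[p[1]]}
same2 = {p for p in combinations(c2, 2) if c2[p[0]] == c2[p[1]]}

N11 = len(same1 & same2)   # together in both clusterings
N10 = len(same1 - same2)   # together only in c1
N01 = len(same2 - same1)   # together only in c2

jaccard = N11 / (N11 + N10 + N01)
print(N11, N10, N01, jaccard)   # 1 2 5 0.125
```

Only one pair, (2, 3), is grouped together by both clusterings, so the Jaccard index is 1/8.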

Basics of element-centric similarity

>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim
>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[1], 5:[2]})

The basic element-centric similarity score with a fixed alpha:

>>> sim.element_sim(c1, c2, alpha = 0.9)

We can also retrieve the element-wise scores. Note that since non-numeric elements are allowed, the function also returns a dict which maps each element to its index in the elementScores array.

>>> elementScores, relabeled_elements = sim.element_sim_elscore(c1, c2, alpha = 0.9)

The “Clustering”

class clusim.clustering.Clustering(elm2clu_dict=None, clu2elm_dict=None, hier_graph=None)

Base class for clusterings.

Parameters:
  • elm2clu_dict (dict) – optional. Initialize from an elm2clu_dict: { elementid: [clu1, clu2, … ] }. Each element is a key whose value is the list of clusters to which it belongs.
  • clu2elm_dict (dict) – optional. Initialize from a clu2elm_dict: { clusid: [el1, el2, … ] }. Each cluster is a key whose value is the list of elements which belong to it.
  • hier_graph (networkx.Graph) – optional. Initialize from a hierarchical acyclic graph capturing the cluster membership at each scale.
copy()

Return a deep copy of the clustering.

Returns:deep copy of the clustering
>>> from clusim.clustering import Clustering, print_clustering
>>> clu = Clustering()
>>> clu2 = clu.copy()
>>> print_clustering(clu)
>>> print_clustering(clu2)
from_elm2clu_dict(elm2clu_dict)

This method creates a clustering from an elm2clu_dict dictionary: { elementid: [clu1, clu2, … ] } where each element is a key with value a list of clusters to which it belongs. Clustering features are then calculated.

Parameters:elm2clu_dict (dict) – { elementid: [clu1, clu2, … ] }
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering()
>>> clu.from_elm2clu_dict(elm2clu_dict)
>>> print_clustering(clu)
from_clu2elm_dict(clu2elm_dict)

This method creates a clustering from a clu2elm_dict dictionary: { clusid: [el1, el2, … ] } where each cluster is a key with value a list of elements which belong to it. Clustering features are then calculated.

Parameters:clu2elm_dict (dict) – { clusid: [el1, el2, … ] }
>>> from clusim.clustering import Clustering, print_clustering
>>> clu2elm_dict = {0:[0,1,2], 1:[2,3], 2:[4,5]}
>>> clu = Clustering()
>>> clu.from_clu2elm_dict(clu2elm_dict)
>>> print_clustering(clu)
from_cluster_list(cluster_list)

This method creates a clustering from a cluster list: [ [el1, el2, …], [el5, …], … ], a list of lists, where each inner list corresponds to the elements in a cluster. Clustering features are then calculated.

Parameters:cluster_list (list) – list of lists [ [el1, el2, …], [el5, …], … ]
>>> from clusim.clustering import Clustering, print_clustering
>>> cluster_list = [ [0,1,2], [2,3], [4,5]]
>>> clu = Clustering()
>>> clu.from_cluster_list(cluster_list)
>>> print_clustering(clu)
to_cluster_list()

This method returns a clustering in cluster list format: [ [el1, el2, …], [el5, …], … ], a list of lists, where each inner list corresponds to the elements in a cluster.

Returns:cluster_list : list of lists, [ [el1, el2, …], [el5, …], … ]
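The cluster-list and elm2clu_dict representations carry the same information, so converting between them is a matter of inverting the mapping. A plain-Python illustration (not the CluSim implementation); note that an element appearing in two inner lists, like element 2 below, is an overlap:

```python
def cluster_list_to_elm2clu_dict(cluster_list):
    """Invert a list of clusters into {element: [cluster ids]}."""
    elm2clu = {}
    for clu_id, members in enumerate(cluster_list):
        for el in members:
            elm2clu.setdefault(el, []).append(clu_id)
    return elm2clu

cluster_list = [[0, 1, 2], [2, 3], [4, 5]]
print(cluster_list_to_elm2clu_dict(cluster_list))
# {0: [0], 1: [0], 2: [0, 1], 3: [1], 4: [2], 5: [2]}
```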
from_membership_list(membership_list)

This method creates a clustering from a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element. Clustering features are then calculated.

Note

Membership Lists can only represent partitions (no overlaps)

Parameters:membership_list (list) – list of cluster names [ clu_for_el1, clu_for_el2, … ]
>>> from clusim.clustering import Clustering, print_clustering
>>> membership_list = [0,0,0,1,2,2]
>>> clu = Clustering()
>>> clu.from_membership_list(membership_list)
>>> print_clustering(clu)
to_membership_list()

This method returns the clustering as a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element.

Note

Membership Lists can only represent partitions (no overlaps)

Returns:list of element memberships, [ clu_for_el1, clu_for_el2, … ]
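Because a membership list can only encode a partition, the round trip between the membership-list and cluster-list representations is lossless for partitions. A minimal pure-Python sketch of the two conversions (an illustration, not CluSim code):

```python
def membership_to_cluster_list(membership_list):
    """Group element indices by cluster label, in order of first appearance."""
    clusters = {}
    for el, clu in enumerate(membership_list):
        clusters.setdefault(clu, []).append(el)
    return list(clusters.values())

def cluster_list_to_membership(cluster_list, n_elements):
    """Inverse mapping; assumes a partition (each element in exactly one cluster)."""
    membership = [None] * n_elements
    for clu_id, members in enumerate(cluster_list):
        for el in members:
            membership[el] = clu_id
    return membership

membership_list = [0, 0, 0, 1, 2, 2]
clusters = membership_to_cluster_list(membership_list)
print(clusters)                                 # [[0, 1, 2], [3], [4, 5]]
print(cluster_list_to_membership(clusters, 6))  # [0, 0, 0, 1, 2, 2]
```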
clustering_from_igraph_cover(igraphcover)

This method creates a clustering from an igraph VertexCover object. See the igraph.Cover.VertexCover class. Clustering features are then calculated.

Parameters:igraphcover (igraph.Cover.VertexCover) – the igraph VertexCover
to_clu2elm_dict()

Create a clu2elm_dict: {clusterid: [el1, el2, … ]} from the stored elm2clu_dict.

Returns:dict
to_elm2clu_dict()

Create a elm2clu_dict: {elementid: [clu1, clu2, … ]} from the stored clu2elm_dict.

Returns:dict
find_clu_size_seq()

This method finds the cluster size sequence for the clustering.

Returns:list of integers A list where the ith entry corresponds to the size of the ith cluster.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Cluster Size Sequence:", clu.find_clu_size_seq())
* Cluster Size Sequence: [3, 2, 2]
find_num_overlap()

This method finds the number of elements which are in more than one cluster in the clustering.

Returns:The number of elements in at least two clusters.
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Overlap size:", clu.find_num_overlap())
* Overlap size: 1
merge_clusters(c1, c2, new_name=None)

This method merges the elements of two clusters in the clustering. The merged cluster is named new_name if provided; otherwise it assumes the name of cluster c1.

Returns:self
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print_clustering(clu)
>>> clu.merge_clusters(1,2, new_name = 3)
>>> print_clustering(clu)
downstream_elements(cluster)

This method finds all elements contained in a cluster from a hierarchical clustering by visiting all downstream clusters and adding their elements.

Parameters:cluster – the name of the parent cluster
Returns:element list
from_scipy_linkage(linkage_matrix, dist_rescaled=False)

This method creates a clustering from a scipy linkage matrix resulting from agglomerative hierarchical clustering. Clustering features are then calculated.

Parameters:
  • linkage_matrix (numpy.matrix) – the linkage matrix from scipy
  • dist_rescaled (Boolean) – (default False) if True, the linkage distances are linearly rescaled to lie between 0 and 1
>>> from clusim.clustering import Clustering, print_clustering
>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> import numpy as np
>>> np.random.seed(42)
>>> data1 = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]],
                                          size=[100,])
>>> data2 = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]],
                                          size=[50,])
>>> Xdata = np.concatenate((data1, data2), )
>>> Z = linkage(Xdata, 'ward')
>>> clu = Clustering()
>>> clu.from_scipy_linkage(Z, dist_rescaled=False)
exception clusim.clustering.ClusterError(expression, message)
clusim.clustering.print_clustering(clustering)

A function to print a clustering. Clusters are separated by ‘|’. The function will only print the leaf layer of a hierarchical clustering.

Parameters:clustering (Clustering) – The clustering to print
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)

Clustering Generation

clusim.clugen.make_equal_clustering(n_elements, n_clusters)

This function creates a random clustering with equally sized clusters. If n_elements % n_clusters != 0, cluster sizes will differ by one element.

Parameters:
  • n_elements (int) – The number of elements
  • n_clusters (int) – The number of clusters
Returns:

The new clustering with equally sized clusters.

>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)
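The size sequence produced when n_elements % n_clusters != 0 can be sketched in plain Python (an illustration of the stated behaviour, not CluSim's implementation):

```python
def equal_cluster_sizes(n_elements, n_clusters):
    """Split n_elements into n_clusters cluster sizes differing by at most one."""
    base, extra = divmod(n_elements, n_clusters)
    # the first `extra` clusters receive one additional element
    return [base + 1 if i < extra else base for i in range(n_clusters)]

print(equal_cluster_sizes(9, 3))    # [3, 3, 3]
print(equal_cluster_sizes(10, 3))   # [4, 3, 3]
```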
clusim.clugen.make_random_clustering(n_elements=1, n_clusters=1, clu_size_seq=[1, 2], random_model='all', tol=1e-15)

This function creates a random clustering according to one of three random models. It is a wrapper around the specific functions for each random model.

Parameters:
  • n_elements (int) – The number of elements
  • n_clusters (int) – The number of clusters
  • random_model (str) –

    The random model to use:

    ’all’ : uniform distribution over the set of all clusterings of
    n_elements

    ’num’ : uniform distribution over the set of all clusterings of
    n_elements into n_clusters

    ’perm’ : the Permutation Model

  • tol (float) – optional The tolerance used by the algorithm for ‘all’ clusterings
Returns:

The new clustering.

>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_random_clustering(n_elements = 9, n_clusters = 3,
                                 random_model = 'num')
>>> print_clustering(clu)
clusim.clugen.make_singleton_clustering(n_elements)

This function creates a clustering with each element in its own cluster.

Parameters:n_elements (int) – The number of elements
Returns:The new clustering.
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_singleton_clustering(n_elements = 9)
>>> print_clustering(clu)
clusim.clugen.make_random_dendrogram(n_elements)

This function creates a random Hierarchical Clustering.

Parameters:n_elements (int) – The number of elements

Returns:The new clustering.
clusim.clugen.shuffle_memberships(clustering, percent=1.0)

This function creates a new clustering by shuffling the element memberships from the original clustering.

Parameters:
  • clustering (Clustering) – The original clustering.
  • percent (float) – optional (default 1.0) The fraction (between 0.0 and 1.0) of the elements to shuffle.
Returns:

The new clustering.

>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements = 9, n_clusters = 3,
                                      random_model = 'num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships(orig_clu, percent = 0.5)
>>> print_clustering(shuffle_clu)
clusim.clugen.shuffle_memberships_pa(clustering, n_steps=1, constant_num_clusters=True)

This function creates a new clustering by shuffling the element memberships from the original clustering according to the preferential attachment model.

See [GA17] for a detailed explanation of the preferential attachment model.

Parameters:
  • clustering (Clustering) – The original clustering.
  • n_steps (int) – optional (default 1) The number of times to run the preferential attachment algorithm.
  • constant_num_clusters (Boolean) – optional (default True) Reject a shuffling move if it leaves a cluster with no elements. Set to True to keep the number of clusters constant.
Returns:

The new clustering with shuffled memberships.

>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                      random_model='num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships_pa(orig_clu, n_steps=10,
                                         constant_num_clusters=True)
>>> print_clustering(shuffle_clu)
clusim.clugen.generate_random_partition_all(n_elements, tol=1e-15)

This function creates a random clustering according to the ‘All’ random model by uniformly selecting a clustering from the ensemble of all clusterings with n_elements.

Parameters:
  • n_elements (int) – The number of elements
  • tol (float) – (optional) The tolerance used by the algorithm to approximate the probability distribution
Returns:

The randomly generated clustering.

>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.generate_random_partition_all(n_elements = 9)
>>> print_clustering(clu)
clusim.clugen.enumerate_random_partition_num(n_elements, n_clusters)

A generator for every partition in ‘Num’, the ensemble of all clusterings of n_elements grouped into n_clusters non-empty clusters.

Based on the solution provided by Adeel Zafar Soomro (link), which was itself based on Algorithm U, described by Knuth in The Art of Computer Programming, Volume 4, Fascicle 3B.

Parameters:
  • n_elements (int) – The number of elements
  • n_clusters (int) – The number of clusters
Returns:

The new clustering as a cluster list.

>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> for clu in clugen.enumerate_random_partition_num(n_elements=5, n_clusters=3):
>>>     print_clustering(clu)
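The size of the ‘Num’ ensemble is the Stirling number of the second kind, S(n_elements, n_clusters). A simple recursive enumeration (a sketch of the same counting idea, not Algorithm U itself) confirms the count for 5 elements in 3 clusters:

```python
def partitions_into_k(elements, k):
    """Yield all partitions of `elements` into exactly k non-empty clusters."""
    if not elements:
        if k == 0:
            yield []
        return
    first, rest = elements[0], elements[1:]
    # place `first` into each existing cluster of a partition of the rest
    for partial in partitions_into_k(rest, k):
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
    # or open a new cluster containing only `first`
    for partial in partitions_into_k(rest, k - 1):
        yield [[first]] + partial

parts = list(partitions_into_k(list(range(5)), 3))
print(len(parts))   # 25, the Stirling number S(5, 3)
```

The recursion mirrors the Stirling recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1).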

Clustering Similarity

The different clustering similarity measures available.

Pairwise Counting Measures

clusim.sim.contingency_table(clustering1, clustering2)

This function creates the contingency table between two clusterings.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The clustering1.n_clusters by clustering2.n_clusters contingency table as a list of lists

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                         random_model = 'num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                         random_model = 'num')
>>> cont_table = sim.contingency_table(clustering1, clustering2)
>>> print(cont_table)
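As an illustration of what the table contains (a plain-Python sketch for non-overlapping clusterings, not the CluSim implementation), entry [i][j] counts the elements shared by cluster i of the first clustering and cluster j of the second:

```python
def contingency(clusters1, clusters2):
    """Contingency table: entry [i][j] is |cluster1_i intersect cluster2_j|."""
    return [[len(set(a) & set(b)) for b in clusters2] for a in clusters1]

clusters1 = [[0, 1, 2], [3, 4], [5]]
clusters2 = [[0, 1], [2, 3], [4, 5]]
print(contingency(clusters1, clusters2))
# [[2, 1, 0], [0, 1, 1], [0, 0, 1]]
```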
clusim.sim.count_pairwise_cooccurence(clustering1, clustering2)

This function finds the pairwise cooccurence counts between two clusterings.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

N11 : int

The number of element pairs assigned to the same clusters in both clusterings

N10 : int

The number of element pairs assigned to the same clusters in clustering1, but different clusters in clustering2

N01 : int

The number of element pairs assigned to different clusters in clustering1, but the same clusters in clustering2

N00 : int

The number of element pairs assigned to different clusters in both clusterings

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> from clusim.clustering import print_clustering
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                         random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                         random_model='num')
>>> N11, N10, N01, N00 = sim.count_pairwise_cooccurence(clustering1,
                                                    clustering2)
>>> print_clustering(clustering1)
>>> print_clustering(clustering2)
>>> print(N11,
        "element pairs assigned to the same clusters in both clusterings")
>>> print(N10,
        "element pairs assigned to the same clusters in clustering1, but "
        "different clusters in clustering2")
>>> print(N01, "element pairs assigned to different clusters in "
               "clustering1, but the same clusters in clustering2")
>>> print(N00, "element pairs assigned to different clusters in both "
               "clusterings")
clusim.sim.jaccard_index(clustering1, clustering2)

This function calculates the Jaccard index between two clusterings [Jac12].

J = N11/(N11+N10+N01)

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Jaccard index (between 0.0 and 1.0)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9,
                                                n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9,
                                                n_clusters=3,
                                                random_model='num')
>>> print(sim.jaccard_index(clustering1, clustering2) )
clusim.sim.rand_index(clustering1, clustering2)

This function calculates the Rand index between two clusterings [Ran71].

RI = (N11 + N00) / (N11 + N10 + N01 + N00)

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Rand index (between 0.0 and 1.0)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.rand_index(clustering1, clustering2))
clusim.sim.fowlkes_mallows_index(clustering1, clustering2)

This function calculates the Fowlkes and Mallows index between two clusterings [FM83].

FM = N11 / sqrt( (N11 + N10) * (N11 + N01) )

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Fowlkes and Mallows index (between 0.0 and 1.0)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.fowlkes_mallows_index(clustering1, clustering2))
clusim.sim.fmeasure(clustering1, clustering2)

This function calculates the F-measure between two clusterings.

Also known as: the Czekanowski index, the Dice Symmetric index, and the Sorensen index

F = 2*N11 / (2*N11 + N10 + N01)

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The F-measure (between 0.0 and 1.0)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.fmeasure(clustering1, clustering2))
clusim.sim.purity_index(clustering1, clustering2)

This function calculates the Purity index between two clusterings.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Purity index (between 0.0 and 1.0)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.purity_index(clustering1, clustering2))
clusim.sim.classification_error(clustering1, clustering2)

This function calculates the Classification Error between two clusterings.

CE = 1 - PI

where PI is the Purity index.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Classification Error (between 0.0 and 1.0)

Note

CE is a distance measure, it is 0 for identical clusterings

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.classification_error(clustering1, clustering2))
clusim.sim.czekanowski_index(clustering1, clustering2)

This function calculates the Czekanowski index between two clusterings.

See Fmeasure

clusim.sim.dice_index(clustering1, clustering2)

This function calculates the Dice index between two clusterings.

See Fmeasure

clusim.sim.sorensen_index(clustering1, clustering2)

This function calculates the Sorensen index between two clusterings.

See Fmeasure

clusim.sim.rogers_tanimoto_index(clustering1, clustering2)

This function calculates the Rogers and Tanimoto index between two clusterings.

RT = (N11 + N00)/(N11 + 2*(N10+N01) + N00)

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Rogers and Tanimoto index (between 0.0 and 1.0)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.rogers_tanimoto_index(clustering1, clustering2))
clusim.sim.southwood_index(clustering1, clustering2)

This function calculates the Southwood index between two clusterings.

SW = N11 / (N10 + N01)

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Southwood index (between 0.0 and inf)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.southwood_index(clustering1, clustering2))
clusim.sim.pearson_correlation(clustering1, clustering2)

This function calculates the Pearson Correlation between two clusterings.

PC = (N11*N00 - N01*N10) / sqrt( (N11+N10) * (N11+N01) * (N00+N10) * (N00+N01) )

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

The Pearson Correlation (between -1.0 and 1.0)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.pearson_correlation(clustering1, clustering2))
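As a worked check of the formula, here is a pure-Python sketch using the standard phi-coefficient form, with the square root in the denominator (an illustration from the pairwise counts, not the CluSim implementation):

```python
from math import sqrt

def pearson_from_counts(N11, N10, N01, N00):
    """Phi coefficient computed from pairwise co-occurrence counts."""
    denom = sqrt((N11 + N10) * (N11 + N01) * (N00 + N10) * (N00 + N01))
    return (N11 * N00 - N01 * N10) / denom

# Counts for two identical clusterings: every pair agrees, so PC = 1
print(pearson_from_counts(3, 0, 0, 12))   # 1.0
```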

Information Theoretic Measures

clusim.sim.nmi(clustering1, clustering2, norm_type='sum')

This function calculates the Normalized Mutual Information (NMI) between two clusterings [DDiazGDA05].

NMI = (S(c1) + S(c2) - S(c1, c2)) / norm(c1, c2)

where S(c1) is the Shannon Entropy of the clustering size distribution, S(c1, c2) is the Shannon Entropy of the joint clustering size distribution, and norm(c1, c2) is a normalization term.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
  • norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type: ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
Returns:

The Normalized Mutual Information index (between 0.0 and inf)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.nmi(clustering1, clustering2, norm_type='sum'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='max'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='min'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='sqrt'))
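The formula above can be checked with a minimal pure-Python sketch for hard partitions, assuming the 'sum' normalization (the average of the two clustering entropies); the helper names are illustrative, not part of clusim.

```python
# NMI for two hard partitions given as label lists.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the cluster size distribution, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def nmi_sum(labels1, labels2):
    h1, h2 = entropy(labels1), entropy(labels2)
    h12 = entropy(list(zip(labels1, labels2)))  # joint entropy
    mi = h1 + h2 - h12
    return mi / (0.5 * (h1 + h2))  # 'sum' normalization

print(nmi_sum([0, 0, 1, 1], [0, 0, 1, 1]))  # identical partitions -> 1.0
```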
clusim.sim.vi(clustering1, clustering2, norm_type='none')

This function calculates the Variation of Information (VI) between two clusterings [Mei03].

VI is technically a distance measure and can assume values in the range [0, inf), where 0 denotes identical clusterings.

VI = 2*S(c1, c2) - S(c1) - S(c2)

where S(c1) is the Shannon Entropy of the clustering size distribution, and S(c1, c2) is the Shannon Entropy of the joint clustering size distribution.

The VI can be transformed into a clustering similarity measure via the appropriate normalization.

VI_{sim} = 1 - 0.5*((S(c1,c2) - S(c1))/S(c2) + (S(c1,c2) - S(c2))/S(c1))

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
  • norm_type (str) – ‘none’ (default) or ‘entropy’ The normalization type: ‘none’ returns the standard VI as a distance metric, ‘entropy’ returns the normalized VI as a similarity measure
Returns:

The Variation of Information index (between 0.0 and inf)

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='num')
>>> print(sim.vi(clustering1, clustering2))
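The VI formula above can likewise be sketched in pure Python for hard partitions; the helper names are illustrative, not part of clusim.

```python
# Variation of Information for two hard partitions given as label lists.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the cluster size distribution, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def vi(labels1, labels2):
    h12 = entropy(list(zip(labels1, labels2)))  # joint entropy
    return 2 * h12 - entropy(labels1) - entropy(labels2)

print(vi([0, 0, 1, 1], [0, 0, 1, 1]))  # identical partitions -> 0.0
print(vi([0, 0, 1, 1], [0, 1, 0, 1]))  # maximally disagreeing halves -> 2.0
```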

Correction for Chance

clusim.sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm', norm_type='sum', n_samples=100)

This function calculates the adjusted Similarity for one of six random models.

Note

Clustering 2 is considered the gold-standard clustering for one-sided expectations

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
  • measure (str) – The similarity measure to evaluate. Must be one of the available_similarity_measures.
  • random_model (str) –

    The random model to use:

    ’all’ : uniform distribution over the set of all clusterings of
    n_elements
    ’all1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements
    ’num’ : uniform distribution over the set of all clusterings of
    n_elements in n_clusters
    ’num1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements in n_clusters
    ’perm’ : the permutation model for a fixed cluster size sequence
    ’perm1’ : one-sided selection from the permutation model for a fixed
    cluster size sequence, same as ‘perm’
  • n_samples (int) – The number of random Clusterings sampled to determine the expected similarity.
Returns:

The adjusted Similarity measure

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='all')
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
                                  random_model='all'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
                                  random_model='all1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
                                  random_model='num'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
                                  random_model='num1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
                                  random_model='perm'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
                                  random_model='perm1'))
clusim.sim.sample_expected_sim(clustering1, clustering2, measure='jaccard_index', random_model='perm', n_samples=1, keep_samples=False)

This function calculates the expected Similarity for all pair-wise comparisons between Clusterings drawn from one of six random models.

Note

Clustering 2 is considered the gold-standard clustering for one-sided expectations

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
  • measure (str) – The similarity measure to evaluate. Must be one of the measures listed in sim.available_similarity_measures.
  • random_model (string) –

    The random model to use:

    ’all’ : uniform distribution over the set of all clusterings of
    n_elements
    ’all1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements
    ’num’ : uniform distribution over the set of all clusterings of
    n_elements in n_clusters
    ’num1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements in n_clusters
    ’perm’ : the permutation model for a fixed cluster size sequence
    ’perm1’ : one-sided selection from the permutation model for a fixed
    cluster size sequence, same as ‘perm’
  • n_samples (int) – The number of random Clusterings sampled to determine the expected similarity.
Returns:

The expected Similarity measure for all pair-wise comparisons under a random model

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> c1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> c2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm1', n_samples=50))
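The sampling idea behind the 'perm' expectation can be sketched in pure Python: shuffle the labels of one partition (preserving its cluster sizes) and average the pairwise Jaccard index over the samples. All helper names below are illustrative; clusim samples Clustering objects instead of label lists.

```python
# Monte Carlo estimate of the expected Jaccard index under the
# permutation model for two hard partitions given as label lists.
import random
from itertools import combinations

def copairs(labels):
    """The set of co-clustered element pairs (i < j)."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def jaccard(labels1, labels2):
    p1, p2 = copairs(labels1), copairs(labels2)
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 1.0

def expected_jaccard_perm(labels1, labels2, n_samples=200, seed=0):
    rng = random.Random(seed)
    shuffled = list(labels1)
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(shuffled)  # permutation model: cluster sizes preserved
        total += jaccard(shuffled, labels2)
    return total / n_samples

e = expected_jaccard_perm([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 1, 2])
print(0.0 <= e <= 1.0)  # True
```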
clusim.sim.expected_rand_index(n_elements, random_model='num', n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None)

This function calculates the expectation of the Rand index between all pairs of clusterings drawn from one of six random models.

See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.

Note

Clustering 2 is considered the gold-standard clustering for one-sided expectations

Parameters:
  • n_elements (int) – The number of elements
  • n_clusters1 (int) – optional The number of clusters in the first clustering
  • n_clusters2 (int) – optional The number of clusters in the second clustering, considered the gold-standard clustering for the one-sided expectations
  • clu_size_seq1 (list) – optional The cluster size sequence of the first clustering as a list of ints
  • clu_size_seq2 (list) – optional The cluster size sequence of the second clustering as a list of ints
  • random_model (str) –

    The random model to use:

    ’all’ : uniform distribution over the set of all clusterings of
    n_elements
    ’all1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements
    ’num’ : uniform distribution over the set of all clusterings of
    n_elements in n_clusters
    ’num1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements in n_clusters
    ’perm’ : the permutation model for a fixed cluster size sequence
    ’perm1’ : one-sided selection from the permutation model for a fixed
    cluster size sequence, same as ‘perm’
Returns:

The expected Rand index (between 0.0 and 1.0)

>>> import clusim.sim as sim
>>> print(sim.expected_rand_index(n_elements=5, random_model='all'))
>>> print(sim.expected_rand_index(n_elements=5, random_model='all1',
                                     clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num',
                                     n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num1',
                                     n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm',
                                     clu_size_seq1=[2,3],
                                     clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm1',
                                     clu_size_seq1=[2,3],
                                     clu_size_seq2=[1,1,3]))
clusim.sim.adjrand_index(clustering1, clustering2, random_model='perm')

This function calculates the adjusted Rand index for one of six random models.

See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.

Note

Clustering 2 is considered the gold-standard clustering for one-sided expectations

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
  • random_model (str) –

    The random model to use:

    ’all’ : uniform distribution over the set of all clusterings of
    n_elements
    ’all1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements
    ’num’ : uniform distribution over the set of all clusterings of
    n_elements in n_clusters
    ’num1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements in n_clusters
    ’perm’ : the permutation model for a fixed cluster size sequence
    ’perm1’ : one-sided selection from the permutation model for a fixed
    cluster size sequence, same as ‘perm’
Returns:

The adjusted Rand index

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
                                                random_model='all')
>>> print(sim.adjrand_index(clustering1, clustering2,
                               random_model='all'))
>>> print(sim.adjrand_index(clustering1, clustering2,
                               random_model='all1'))
>>> print(sim.adjrand_index(clustering1, clustering2,
                               random_model='num'))
>>> print(sim.adjrand_index(clustering1, clustering2,
                               random_model='num1'))
>>> print(sim.adjrand_index(clustering1, clustering2,
                               random_model='perm'))
>>> print(sim.adjrand_index(clustering1, clustering2,
                               random_model='perm1'))
clusim.sim.adj_mi(clustering1, clustering2, random_model='perm', norm_type='sum', logbase=2)

This function calculates the adjusted Mutual Information for one of six random models.

See [GA17] for a detailed derivation and explanation of the different random models.

Note

Clustering 2 is considered the gold-standard clustering for one-sided expectations

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
  • random_model (string) –

    The random model to use:

    ’all’ : uniform distribution over the set of all clusterings of
    n_elements
    ’all1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements
    ’num’ : uniform distribution over the set of all clusterings of
    n_elements in n_clusters
    ’num1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements in n_clusters
    ’perm’ : the permutation model for a fixed cluster size sequence
    ’perm1’ : one-sided selection from the permutation model for a fixed
    cluster size sequence, same as ‘perm’
  • norm_type (str) – ‘sum’ (default), ‘max’, ‘min’, ‘sqrt’, ‘none’ The normalization type: ‘sum’ uses the average of the two clustering entropies, ‘max’ uses the maximum of the two clustering entropies, ‘min’ uses the minimum of the two clustering entropies, ‘sqrt’ uses the geometric mean of the two clustering entropies, ‘none’ returns the Mutual Information without a normalization
  • logbase (float) – (default) 2 The base of all logarithms (recommended to use 2 for bits).
Returns:

The adjusted Mutual Information

>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,random_model='all')
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all1'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='num'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='num1'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='perm'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='perm1'))
clusim.sim.expected_mi(n_elements, n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None, logbase=2, random_model='num')

This function calculates the expectation of the Mutual Information between all pairs of clusterings drawn from one of six random models.

See [GA17] for a detailed derivation and explanation of the different random models.

Note

Clustering 2 is considered the gold-standard clustering for one-sided expectations

Parameters:
  • n_elements (int) – The number of elements
  • n_clusters1 (int) – optional The number of clusters in the first clustering
  • n_clusters2 (int) – optional The number of clusters in the second clustering, considered the gold-standard clustering for the one-sided expectations
  • clu_size_seq1 (list) – optional The cluster size sequence of the first clustering as a list of ints.
  • clu_size_seq2 (list) – optional The cluster size sequence of the second clustering as a list of ints.
  • random_model (str) –

    The random model to use:

    ’all’ : uniform distribution over the set of all clusterings of
    n_elements
    ’all1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements
    ’num’ : uniform distribution over the set of all clusterings of
    n_elements in n_clusters
    ’num1’ : one-sided selection from the uniform distribution over the
    set of all clusterings of n_elements in n_clusters
    ’perm’ : the permutation model for a fixed cluster size sequence
    ’perm1’ : one-sided selection from the permutation model for a fixed
    cluster size sequence, same as ‘perm’
Returns:

The expected MI (between 0.0 and inf)

>>> import clusim.sim as sim
>>> print(sim.expected_mi(n_elements=5, random_model='all'))
>>> print(sim.expected_mi(n_elements=5, random_model='all1',
                                     clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='num',
                                     n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_mi(n_elements=5, random_model='num1',
                                     n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm',
                                     clu_size_seq1=[2,3],
                                     clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm1',
                                     clu_size_seq1=[2,3],
                                     clu_size_seq2=[1,1,3]))

Overlapping Clustering Similarity

clusim.sim.onmi(clustering1, clustering2)

This function calculates the overlapping normalized mutual information.

See [LFK09] for a detailed derivation and explanation of the measure.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

the overlapping normalized mutual information

clusim.sim.omega_index(clustering1, clustering2)

This function calculates the omega index between two clusterings.

See [CD88] for a detailed derivation and explanation of the measure.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

the omega index
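The idea behind the omega index can be sketched for overlapping clusterings given as elm2clu_dict-style mappings: for each pair of elements, count the clusters they share; reward pairs that share the same count in both clusterings; and correct for chance. This is an illustrative sketch, not clusim's implementation.

```python
# Omega index sketch for two overlapping clusterings given as
# {element: [cluster labels]} dicts over the same element set.
from collections import Counter
from itertools import combinations

def omega_index(elm2clu1, elm2clu2):
    elements = sorted(elm2clu1)
    pairs = list(combinations(elements, 2))
    # number of clusters jointly containing each pair, per clustering
    t1 = [len(set(elm2clu1[i]) & set(elm2clu1[j])) for i, j in pairs]
    t2 = [len(set(elm2clu2[i]) & set(elm2clu2[j])) for i, j in pairs]
    n = len(pairs)
    observed = sum(a == b for a, b in zip(t1, t2)) / n
    counts1, counts2 = Counter(t1), Counter(t2)
    expected = sum(counts1[k] * counts2.get(k, 0) for k in counts1) / n ** 2
    return (observed - expected) / (1 - expected)

c = {0: [0], 1: [0], 2: [1], 3: [1]}
print(omega_index(c, c))  # identical clusterings -> 1.0
```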

clusim.sim.geometric_accuracy(clustering1, clustering2)

This function calculates the geometric accuracy between two (overlapping) clusterings.

See [NYP12] for a detailed derivation and explanation of the measure.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

the geometric accuracy

clusim.sim.overlap_quality(clustering1, clustering2)

This function calculates the overlap quality between two (overlapping) clusterings.

See [ABL10] for a detailed derivation and explanation of the measure.

Parameters:
  • clustering1 (Clustering) – The first clustering.
  • clustering2 (Clustering) – The second clustering.
Returns:

the overlap quality

Element-centric Clustering Similarity

clusim.clusimelement.element_sim(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max')

The element-centric clustering similarity.

See [GWHA18] for a detailed explanation of the measure.

Parameters:
  • clustering1 (Clustering) – The first Clustering
  • clustering2 (Clustering) – The second Clustering
  • alpha (float) – The personalized page-rank return probability as a float in [0,1].
  • r (float) – The hierarchical scaling parameter for clustering1.
  • r2 (float) – The hierarchical scaling parameter for clustering2. This defaults to None, forcing r2 = r
  • rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root, ‘min’ : the minimum path from the root, ‘linkage’ : use the linkage distances in the clustering
  • relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
Returns:

The element-wise similarity between the two clusterings

>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1],
                                           3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1],
                                           3:[1], 4:[2], 5:[1,2]})
>>> print(sim.element_sim(clustering1, clustering2, alpha=0.9))
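For hard partitions, the element-centric similarity can be sketched analytically: within a cluster of size n the PPR affinity row is alpha/n for every member plus an extra (1 - alpha) at the element itself, and each element's score is the L1 difference of its two rows, corrected for the guaranteed (1 - alpha) self-overlap [GWHA18]. The function below is an illustrative pure-Python sketch for label-list partitions, not clusim's implementation.

```python
# Element-centric similarity sketch for two hard partitions.
def element_sim_partition(labels1, labels2, alpha=0.9):
    elements = range(len(labels1))

    def affinity_row(labels, i):
        # analytic PPR affinity for a partition: alpha spread over the
        # cluster, plus the (1 - alpha) restart mass at the element itself
        cluster = [j for j in elements if labels[j] == labels[i]]
        row = {j: alpha / len(cluster) for j in cluster}
        row[i] += 1 - alpha
        return row

    scores = []
    for i in elements:
        r1, r2 = affinity_row(labels1, i), affinity_row(labels2, i)
        l1 = sum(abs(r1.get(j, 0) - r2.get(j, 0)) for j in set(r1) | set(r2))
        scores.append(1 - l1 / (2 * alpha))  # corrected L1 similarity
    return sum(scores) / len(scores)

print(element_sim_partition([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```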
clusim.clusimelement.element_sim_elscore(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max', relabeled_elements=None)

The element-centric clustering similarity for each element.

See [GWHA18] for a detailed explanation of the measure.

Parameters:
  • clustering1 (Clustering) – The first Clustering
  • clustering2 (Clustering) – The second Clustering
  • alpha (float) – The personalized page-rank return probability as a float in [0,1].
  • r (float) – The hierarchical scaling parameter for clustering1.
  • r2 (float) – The hierarchical scaling parameter for clustering2. This defaults to None, forcing r2 = r
  • rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root, ‘min’ : the minimum path from the root, ‘linkage’ : use the linkage distances in the clustering
  • relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
Returns:

The element-centric similarity between the two clusterings for each element as a 1d numpy array

Returns:

a dict mapping each element to its index of the elementScores array.

>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1],
                                           3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1],
                                           3:[1], 4:[2], 5:[1,2]})
>>> elementScores, relabeled_elements = sim.element_sim_elscore(clustering1,
                                                          clustering2,
                                                          alpha = 0.9)
>>> print(elementScores)
clusim.clusimelement.cL1(x, y, alpha)

The normalized similarity value based on the L1 probability metric, corrected for the guaranteed overlap in probability between the two vectors, alpha.

See [GWHA18] for a detailed explanation of the need to correct the L1 metric.

Parameters:
  • x (2d-numpy-array) – The first list of probability vectors
  • y (2d-numpy-array) – The second list of probability vectors
  • alpha (float) – The guaranteed overlap in probability between the two vectors in [0,1].
Returns:

The 1d numpy array of L1 similarities between the affinity matrices x and y

clusim.clusimelement.make_affinity_matrix(clustering, alpha=0.9, r=1.0, rescale_path_type='max', relabeled_elements=None)

The element-centric clustering similarity affinity matrix for a clustering. This function automatically determines the most efficient method to calculate the affinity matrix.

See [GWHA18] for a detailed explanation of the affinity matrix.

Parameters:
  • clustering (Clustering) – The clustering
  • alpha (float) – The personalized page-rank return probability.
  • relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
Returns:

The element-centric affinity representation of the clustering as a 2d numpy array

>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[1], 3:[1],
                                           4:[2], 5:[2]})
>>> pprmatrix = sim.make_affinity_matrix(clustering1, alpha=0.9)
>>> print(pprmatrix)
>>> clustering2 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1],
                                           4:[2], 5:[2]})
>>> pprmatrix2 = sim.make_affinity_matrix(clustering2, alpha=0.9)
>>> print(pprmatrix2)
clusim.clusimelement.ppr_partition(clustering, alpha=0.9, relabeled_elements=None)

The element-centric clustering similarity affinity matrix for a partition found analytically.

Parameters:
  • clustering (Clustering) – The Clustering
  • alpha (float) – The personalized page-rank return probability as a float in [0,1].
  • relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
Returns:

2d numpy array The element-centric affinity representation of the clustering

>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> pprmatrix = sim.ppr_partition(clustering1, alpha=0.9)
>>> print(pprmatrix)
clusim.clusimelement.make_cielg(clustering, r=1.0, rescale_path_type='max', relabeled_elements=None)

Create the cluster-induced element graph for a Clustering.

Parameters:
  • clustering (Clustering) – The clustering
  • r (float) – The hierarchical scaling parameter.
  • rescale_path_type (str) – rescale the hierarchical height by: ‘max’ : the maximum path from the root, ‘min’ : the minimum path from the root, ‘linkage’ : use the linkage distances in the clustering
  • relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
Returns:

The cluster-induced element graph for a Clustering as an igraph.WeightedGraph

>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> cielg = sim.make_cielg(clustering1, r=1.0)
>>> print(cielg)
clusim.clusimelement.find_groups_in_cluster(clustervs, elementgroupList)

A utility function to find vertices with the same cluster memberships.

Parameters:
  • clustervs (igraph.vertex) – an igraph vertex instance
  • elementgroupList (list) – a list containing the vertices to group
Returns:

a list-of-lists containing the groupings of the vertices
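The grouping step can be illustrated with a generic pure-Python analogue that collects items sharing the same membership key; the helper name is hypothetical and the function operates on a plain dict rather than igraph vertices.

```python
# Group items that share the same membership key.
from collections import defaultdict

def group_by_membership(memberships):
    """Return a list-of-lists grouping keys of `memberships` by value."""
    groups = defaultdict(list)
    for vertex, key in memberships.items():
        groups[key].append(vertex)
    return [sorted(g) for g in groups.values()]

print(group_by_membership({0: 'a', 1: 'a', 2: 'b'}))  # [[0, 1], [2]]
```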

clusim.clusimelement.numerical_ppr_scores(cielg, clustering, alpha=0.9, relabeled_elements=None)

The element-centric clustering similarity affinity matrix for a partition.

Parameters:
  • cielg (igraph.WeightedGraph) – An igraph Weighted Graph representation of the cluster-induced element graph
  • clustering (Clustering) – The Clustering
  • alpha (float) – The personalized page-rank return probability as a float in [0,1].
  • relabeled_elements (dict) – (optional) The elements mapped to indices of the affinity matrix.
Returns:

2d numpy array The element-centric affinity representation of the clustering

DAG and Dendrogram

class clusim.dag.DAG(incoming_graph_data=None, **attr)
class clusim.dag.Dendrogram(incoming_graph_data=None, **attr)

References

[CD88]Linda M. Collins and Clyde W. Dent. Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivariate Behavioral Research, 23(2):231–242, 1988.
[DDiazGDA05]Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008–P09008, September 2005.
[FM83]Edward B. Fowlkes and Colin L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
[GA17](1, 2, 3, 4, 5) Alexander J. Gates and Yong-Yeol Ahn. The impact of random models on clustering similarity. Journal of Machine Learning Research, 18(87):1–28, 2017.
[GWHA18](1, 2, 3, 4) Alexander J. Gates, Ian B. Wood, William P. Hetrick, and Yong-Yeol Ahn. On comparing clusterings: an element-centric framework unifies overlaps and hierarchy. arXiv preprint arXiv:1706.06136, 2018.
[HA85](1, 2) Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
[Jac12]Paul Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.
[LFK09]Andrea Lancichinetti, Santo Fortunato, and János Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, 3 2009.
[Mei03]Marina Meilă. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187. Springer, 2003.
[NYP12]Tamás Nepusz, Haiyuan Yu, and Alberto Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods, 9(5):471–472, 2012.
[Ran71]William M Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846, 1971.