Welcome to CluSim’s documentation!
This package is available on PyPI. Run the following command in a terminal to install it:
$ pip install clusim
You can also get the source code directly from the GitHub [project page](https://github.com/Hoosier-Clusters/clusim).
We start by importing the required modules:
>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim
The simplest way to make a Clustering is to use an elm2clu_dict, which maps each element to the list of clusters to which it belongs.
>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[1], 5:[2]})
>>> print_clustering(c1)
>>> print_clustering(c2)
Finally, the similarity of the two Clusterings can be found using the Jaccard Index.
>>> sim.jaccard_index(c1, c2)
Next, we compute the element-centric similarity, using the same setup:
>>> from clusim.clustering import Clustering, print_clustering
>>> import clusim.sim as sim
>>> c1 = Clustering(elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]})
>>> c2 = Clustering(elm2clu_dict = {0:[0], 1:[1], 2:[1], 3:[1], 4:[1], 5:[2]})
The basic element-centric similarity score with a fixed alpha:
>>> sim.element_sim(c1, c2, alpha = 0.9)
We can also get the per-element scores. Note that since non-numeric elements are allowed, the function also returns a dict which maps each element to its index in the elementScores array.
>>> elementScores, relabeled_elements = sim.element_sim_elscore(c1, c2, alpha = 0.9)
clusim.clustering.Clustering(elm2clu_dict=None, clu2elm_dict=None, hier_graph=None)¶Base class for clusterings.
| Parameters: | elm2clu_dict (dict, optional) – { elementid: [clu1, clu2, … ] }; clu2elm_dict (dict, optional) – { clusterid: [el1, el2, … ] }; hier_graph (optional) – the hierarchical clustering graph |
|---|---|
copy()¶Return a deep copy of the clustering.
| Returns: | deep copy of the clustering |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> clu = Clustering()
>>> clu2 = clu.copy()
>>> print_clustering(clu)
>>> print_clustering(clu2)
from_elm2clu_dict(elm2clu_dict)¶This method creates a clustering from an elm2clu_dict dictionary: { elementid: [clu1, clu2, … ] } where each element is a key with value a list of clusters to which it belongs. Clustering features are then calculated.
| Parameters: | elm2clu_dict (dict) – { elementid: [clu1, clu2, … ] } |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering()
>>> clu.from_elm2clu_dict(elm2clu_dict)
>>> print_clustering(clu)
from_clu2elm_dict(clu2elm_dict)¶This method creates a clustering from a clu2elm_dict dictionary: { clusid: [el1, el2, … ] } where each cluster is a key with value a list of elements which belong to it. Clustering features are then calculated.
| Parameters: | clu2elm_dict (dict) – { clusid: [el1, el2, … ] } |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> clu2elm_dict = {0:[0,1,2], 1:[2,3], 2:[4,5]}
>>> clu = Clustering()
>>> clu.from_clu2elm_dict(clu2elm_dict)
>>> print_clustering(clu)
from_cluster_list(cluster_list)¶This method creates a clustering from a cluster list: [ [el1, el2, …], [el5, …], … ], a list of lists, where each inner list corresponds to the elements in a cluster. Clustering features are then calculated.
| Parameters: | cluster_list (list) – list of lists [ [el1, el2, …], [el5, …], … ] |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> cluster_list = [ [0,1,2], [2,3], [4,5]]
>>> clu = Clustering()
>>> clu.from_cluster_list(cluster_list)
>>> print_clustering(clu)
to_cluster_list()¶This method returns a clustering in cluster list format: [ [el1, el2, …], [el5, …], … ], a list of lists, where each inner list corresponds to the elements in a cluster.
| Returns: | cluster_list : list of lists, [ [el1, el2, …], [el5, …], … ] |
|---|---|
from_membership_list(membership_list)¶This method creates a clustering from a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element. Clustering features are then calculated.
Note
Membership Lists can only represent partitions (no overlaps)
| Parameters: | membership_list (list) – list of cluster names [ clu_for_el1, clu_for_el2, … ] |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> membership_list = [0,0,0,1,2,2]
>>> clu = Clustering()
>>> clu.from_membership_list(membership_list)
>>> print_clustering(clu)
to_membership_list()¶This method returns the clustering as a membership list: [ clu_for_el1, clu_for_el2, … ], a list of cluster names where the ith entry corresponds to the cluster membership of the ith element.
Note
Membership Lists can only represent partitions (no overlaps)
| Returns: | list of element memberships, [ clu_for_el1, clu_for_el2, … ] |
|---|---|
clustering_from_igraph_cover(igraphcover)¶This method creates a clustering from an igraph VertexCover object.
See the igraph.Cover.VertexCover class.
Clustering features are then calculated.
| Parameters: | igraphcover (igraph.Cover.VertexCover) – the igraph VertexCover |
|---|---|
to_clu2elm_dict()¶Create a clu2elm_dict: {clusterid: [el1, el2, … ]} from the stored elm2clu_dict.
| Returns: | dict |
|---|---|
to_elm2clu_dict()¶Create a elm2clu_dict: {elementid: [clu1, clu2, … ]} from the stored clu2elm_dict.
| Returns: | dict |
|---|---|
find_clu_size_seq()¶This method finds the cluster size sequence for the clustering.
| Returns: | list of integers – a list where the ith entry corresponds to the size of the ith cluster |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Cluster Size Sequence:", clu.find_clu_size_seq())
Cluster Size Sequence: [3, 2, 2]
find_num_overlap()¶This method finds the number of elements which are in more than one cluster in the clustering.
| Returns: | The number of elements in at least two clusters. |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0,1], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print("Overlap size:", clu.find_num_overlap())
Overlap size: 1
merge_clusters(c1, c2, new_name=None)¶This method merges the elements of two clusters in the clustering. The merged cluster is named new_name if provided; otherwise it assumes the name of cluster c1.
| Returns: | self |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[0], 3:[1], 4:[2], 5:[2]}
>>> clu = Clustering(elm2clu_dict = elm2clu_dict)
>>> print_clustering(clu)
>>> clu.merge_clusters(1,2, new_name = 3)
>>> print_clustering(clu)
downstream_elements(cluster)¶This method finds all elements contained in a cluster from a hierarchical clustering by visiting all downstream clusters and adding their elements.
| Parameters: | cluster – the name of the parent cluster |
|---|---|
| Returns: | element list |
from_scipy_linkage(linkage_matrix, dist_rescaled=False)¶This method creates a clustering from a scipy linkage object resulting from the agglomerative hierarchical clustering. Clustering features are then calculated.
| Parameters: | linkage_matrix – the scipy linkage matrix; dist_rescaled (bool, default False) |
|---|---|
>>> from clusim.clustering import Clustering, print_clustering
>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> import numpy as np
>>> np.random.seed(42)
>>> data1 = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]],
size=[100,])
>>> data2 = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]],
size=[50,])
>>> Xdata = np.concatenate((data1, data2))
>>> Z = linkage(Xdata, 'ward')
>>> clu = Clustering()
>>> clu.from_scipy_linkage(Z, dist_rescaled=False)
clusim.clustering.ClusterError(expression, message)¶
clusim.clustering.print_clustering(clustering)¶A function to print a clustering. Clusters are separated by '|'. The function will only print the leaf layer of a hierarchical Clustering.
| Parameters: | clustering (Clustering) – The clustering to print |
|---|---|
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)
clusim.clugen.make_equal_clustering(n_elements, n_clusters)¶This function creates a random clustering with equally sized clusters. If n_elements % n_clusters != 0, cluster sizes will differ by one element.
| Parameters: | n_elements (int) – the number of elements; n_clusters (int) – the number of clusters |
|---|---|
| Returns: | The new clustering with equally sized clusters. |
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_equal_clustering(n_elements = 9, n_clusters = 3)
>>> print_clustering(clu)
clusim.clugen.make_random_clustering(n_elements=1, n_clusters=1, clu_size_seq=[1, 2], random_model='all', tol=1e-15)¶This function creates a random clustering according to one of three random models. It is a wrapper around the specific functions for each random model.
| Parameters: | n_elements (int) – the number of elements; n_clusters (int) – the number of clusters; clu_size_seq (list) – the cluster size sequence; random_model (str) – the random model to use; tol (float) – numerical tolerance |
|---|---|
| Returns: | The new clustering. |
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_random_clustering(n_elements = 9, n_clusters = 3,
random_model = 'num')
>>> print_clustering(clu)
clusim.clugen.make_singleton_clustering(n_elements)¶This function creates a clustering with each element in its own cluster.
| Parameters: | n_elements (int) – The number of elements |
|---|---|
| Returns: | The new clustering. |
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.make_singleton_clustering(n_elements = 9)
>>> print_clustering(clu)
clusim.clugen.make_random_dendrogram(n_elements)¶This function creates a random Hierarchical Clustering.
| Parameters: | n_elements (int) – The number of elements |
|---|---|
| Returns: | The new clustering. |
clusim.clugen.shuffle_memberships(clustering, percent=1.0)¶This function creates a new clustering by shuffling the element memberships from the original clustering.
| Parameters: | clustering (Clustering) – the original clustering; percent (float) – the fraction of element memberships to shuffle |
|---|---|
| Returns: | The new clustering. |
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements = 9, n_clusters = 3,
random_model = 'num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships(orig_clu, percent = 0.5)
>>> print_clustering(shuffle_clu)
clusim.clugen.shuffle_memberships_pa(clustering, n_steps=1, constant_num_clusters=True)¶This function creates a new clustering by shuffling the element memberships from the original clustering according to the preferential attachment model.
See [GA17] for a detailed explanation of the preferential attachment model.
| Parameters: | clustering (Clustering) – the original clustering; n_steps (int) – the number of shuffling steps; constant_num_clusters (bool) – whether to keep the number of clusters constant |
|---|---|
| Returns: | The new clustering with shuffled memberships. |
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> orig_clu = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print_clustering(orig_clu)
>>> shuffle_clu = clugen.shuffle_memberships_pa(orig_clu, n_steps=10,
constant_num_clusters=True)
>>> print_clustering(shuffle_clu)
clusim.clugen.generate_random_partition_all(n_elements, tol=1e-15)¶This function creates a random clustering according to the ‘All’ random model by uniformly selecting a clustering from the ensemble of all clusterings with n_elements.
| Parameters: | n_elements (int) – the number of elements; tol (float) – numerical tolerance |
|---|---|
| Returns: | The randomly generated clustering. |
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> clu = clugen.generate_random_partition_all(n_elements = 9)
>>> print_clustering(clu)
clusim.clugen.enumerate_random_partition_num(n_elements, n_clusters)¶A generator for every partition in 'Num', the ensemble of all clusterings with n_elements grouped into n_clusters non-empty clusters.
Based on the solution provided by Adeel Zafar Soomro, which was itself based on Algorithm U, described by Knuth in The Art of Computer Programming, Volume 4, Fascicle 3B.
| Parameters: | n_elements (int) – the number of elements; n_clusters (int) – the number of clusters |
|---|---|
| Returns: | The new clustering as a cluster list. |
>>> import clusim.clugen as clugen
>>> from clusim.clustering import print_clustering
>>> for clu in clugen.enumerate_random_partition_num(n_elements=5, n_clusters=3):
...     print_clustering(clu)
The different clustering similarity measures available.
clusim.sim.contingency_table(clustering1, clustering2)¶This function creates the contingency table between two clusterings.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The clustering1.n_clusters by clustering2.n_clusters contingency table, as a list of lists |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model = 'num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model = 'num')
>>> cont_table = sim.contingency_table(clustering1, clustering2)
>>> print(cont_table)
clusim.sim.count_pairwise_cooccurence(clustering1, clustering2)¶This function finds the pairwise co-occurrence counts between two clusterings.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | the four pairwise co-occurrence counts N11, N10, N01, N00 |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> from clusim.clustering import print_clustering
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> N11, N10, N01, N00 = sim.count_pairwise_cooccurence(clustering1,
clustering2)
>>> print_clustering(clustering1)
>>> print_clustering(clustering2)
>>> print(N11,
"element pairs assigned to the same clusters in both clusterings")
>>> print(N10,
"element pairs assigned to the same clusters in clustering1, but "
"different clusters in clustering2")
>>> print(N01, "element pairs assigned to different clusters in "
"clustering1, but the same clusters in clustering2")
>>> print(N00, "element pairs assigned to different clusters in both "
"clusterings")
clusim.sim.jaccard_index(clustering1, clustering2)¶This function calculates the Jaccard index between two clusterings [Jac12].
J = N11/(N11+N10+N01)
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Jaccard index (between 0.0 and 1.0) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9,
n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9,
n_clusters=3,
random_model='num')
>>> print(sim.jaccard_index(clustering1, clustering2))
clusim.sim.rand_index(clustering1, clustering2)¶This function calculates the Rand index between two clusterings [Ran71].
RI = (N11 + N00) / (N11 + N10 + N01 + N00)
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Rand index (between 0.0 and 1.0) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.rand_index(clustering1, clustering2))
clusim.sim.fowlkes_mallows_index(clustering1, clustering2)¶This function calculates the Fowlkes and Mallows index between two clusterings [FM83].
FM = N11 / sqrt( (N11 + N10) * (N11 + N01) )
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Fowlkes and Mallows index (between 0.0 and 1.0) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.fowlkes_mallows_index(clustering1, clustering2))
clusim.sim.fmeasure(clustering1, clustering2)¶This function calculates the F-measure between two clusterings.
Also known as the Czekanowski index, the Dice symmetric index, and the Sorensen index.
F = 2*N11 / (2*N11 + N10 + N01)
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The F-measure (between 0.0 and 1.0) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.fmeasure(clustering1, clustering2))
clusim.sim.purity_index(clustering1, clustering2)¶This function calculates the Purity index between two clusterings.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Purity index (between 0.0 and 1.0) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.purity_index(clustering1, clustering2))
clusim.sim.classification_error(clustering1, clustering2)¶This function calculates the classification error between two clusterings.
CE = 1 - PI, where PI is the Purity index.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Classification Error (between 0.0 and 1.0) |
Note
CE is a distance measure; it is 0 for identical clusterings.
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.classification_error(clustering1, clustering2))
clusim.sim.czekanowski_index(clustering1, clustering2)¶This function calculates the Czekanowski index between two clusterings.
See Fmeasure
clusim.sim.dice_index(clustering1, clustering2)¶This function calculates the Dice index between two clusterings.
See Fmeasure
clusim.sim.sorensen_index(clustering1, clustering2)¶This function calculates the Sorensen index between two clusterings.
See Fmeasure
clusim.sim.rogers_tanimoto_index(clustering1, clustering2)¶This function calculates the Rogers and Tanimoto index between two clusterings.
RT = (N11 + N00)/(N11 + 2*(N10+N01) + N00)
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Rogers and Tanimoto index (between 0.0 and 1.0) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.rogers_tanimoto_index(clustering1, clustering2))
clusim.sim.southwood_index(clustering1, clustering2)¶This function calculates the Southwood index between two clusterings.
SW = N11 / (N10 + N01)
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Southwood index (between 0.0 and inf) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.southwood_index(clustering1, clustering2))
clusim.sim.pearson_correlation(clustering1, clustering2)¶This function calculates the Pearson Correlation between two clusterings.
PC = (N11*N00 - N01*N10) / sqrt((N11+N10) * (N11+N01) * (N00+N10) * (N00+N01))
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | The Pearson Correlation (between -1.0 and 1.0) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.pearson_correlation(clustering1, clustering2))
clusim.sim.nmi(clustering1, clustering2, norm_type='sum')¶This function calculates the Normalized Mutual Information (NMI) between two clusterings [DDiazGDA05].
NMI = (S(c1) + S(c2) - S(c1, c2)) / norm(c1, c2)
where S(c1) is the Shannon entropy of the clustering size distribution, S(c1, c2) is the Shannon entropy of the joint clustering size distribution, and norm(c1, c2) is a normalization term.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering; norm_type (str) – the normalization: 'sum', 'max', 'min', or 'sqrt' |
|---|---|
| Returns: | The Normalized Mutual Information index (between 0.0 and inf) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.nmi(clustering1, clustering2, norm_type='sum'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='max'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='min'))
>>> print(sim.nmi(clustering1, clustering2, norm_type='sqrt'))
clusim.sim.vi(clustering1, clustering2, norm_type='none')¶This function calculates the Variation of Information (VI) between two clusterings [Mei03].
VI is technically a distance measure and can assume values in the range [0, inf), where 0 denotes identical clusterings.
VI = 2*S(c1, c2) - S(c1) - S(c2)
where S(c1) is the Shannon entropy of the clustering size distribution, and S(c1, c2) is the Shannon entropy of the joint clustering size distribution.
The VI can be transformed into a clustering similarity measure via the appropriate normalization:
VI_{sim} = 1 - 0.5*((S(c1,c2) - S(c1))/S(c2) + (S(c1,c2) - S(c2))/S(c1))
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering; norm_type (str) – the normalization |
|---|---|
| Returns: | The Variation of Information index (between 0.0 and inf) |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='num')
>>> print(sim.vi(clustering1, clustering2))
clusim.sim.corrected_chance(clustering1, clustering2, measure='jaccard_index', random_model='perm', norm_type='sum', n_samples=100)¶This function calculates the similarity corrected for chance under one of six random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second (gold-standard) clustering; measure (str) – the similarity measure to adjust; random_model (str) – one of 'all', 'all1', 'num', 'num1', 'perm', 'perm1'; norm_type (str) – the normalization; n_samples (int) – the number of random samples |
|---|---|
| Returns: | The adjusted similarity measure |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='all')
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
random_model='all'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
random_model='all1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
random_model='num'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
random_model='num1'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
random_model='perm'))
>>> print(sim.corrected_chance(clustering1, clustering2, measure='jaccard_index',
random_model='perm1'))
clusim.sim.sample_expected_sim(clustering1, clustering2, measure='jaccard_index', random_model='perm', n_samples=1, keep_samples=False)¶This function calculates the expected Similarity for all pair-wise comparisons between Clusterings drawn from one of six random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second (gold-standard) clustering; measure (str) – the similarity measure; random_model (str) – the random model; n_samples (int) – the number of random samples; keep_samples (bool) |
|---|---|
| Returns: | The expected Similarity measure for all pair-wise comparisons under a random model |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> c1 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> c2 = clugen.make_random_clustering(n_elements=9, n_clusters=3, random_model='all')
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='all1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='num1', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm', n_samples=50))
>>> print(sim.sample_expected_sim(c1, c2, measure='jaccard_index', random_model='perm1', n_samples=50))
clusim.sim.expected_rand_index(n_elements, random_model='num', n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None)¶This function calculates the expectation of the Rand index between all pairs of clusterings drawn from one of six random models.
See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
| Parameters: | n_elements (int) – the number of elements; random_model (str) – the random model; n_clusters1 (int), n_clusters2 (int) – the number of clusters in each clustering; clu_size_seq1 (list), clu_size_seq2 (list) – the cluster size sequence of each clustering |
|---|---|
| Returns: | The expected Rand index (between 0.0 and 1.0) |
>>> import clusim.sim as sim
>>> print(sim.expected_rand_index(n_elements=5, random_model='all'))
>>> print(sim.expected_rand_index(n_elements=5, random_model='all1',
clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num',
n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_rand_index(n_elements=5, random_model='num1',
n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm',
clu_size_seq1=[2,3],
clu_size_seq2=[1,1,3]))
>>> print(sim.expected_rand_index(n_elements=5, random_model='perm1',
clu_size_seq1=[2,3],
clu_size_seq2=[1,1,3]))
clusim.sim.adjrand_index(clustering1, clustering2, random_model='perm')¶This function calculates the adjusted Rand index for one of six random models.
See [HA85] and [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second (gold-standard) clustering; random_model (str) – the random model |
|---|---|
| Returns: | The adjusted Rand index |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,
random_model='all')
>>> print(sim.adjrand_index(clustering1, clustering2,
random_model='all'))
>>> print(sim.adjrand_index(clustering1, clustering2,
random_model='all1'))
>>> print(sim.adjrand_index(clustering1, clustering2,
random_model='num'))
>>> print(sim.adjrand_index(clustering1, clustering2,
random_model='num1'))
>>> print(sim.adjrand_index(clustering1, clustering2,
random_model='perm'))
>>> print(sim.adjrand_index(clustering1, clustering2,
random_model='perm1'))
clusim.sim.adj_mi(clustering1, clustering2, random_model='perm', norm_type='sum', logbase=2)¶This function calculates the adjusted Mutual Information for one of six random models.
See [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second (gold-standard) clustering; random_model (str) – the random model; norm_type (str) – the normalization; logbase – the base of the logarithm |
|---|---|
| Returns: | The adjusted Mutual Information |
>>> import clusim.clugen as clugen
>>> import clusim.sim as sim
>>> clustering1 = clugen.make_random_clustering(n_elements=9, n_clusters=3,random_model='all')
>>> clustering2 = clugen.make_random_clustering(n_elements=9, n_clusters=3,random_model='all')
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all'))
>>> print(sim.adj_mi(clustering1, clustering2, random_model='all1'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='num'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='num1'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='perm'))
>>> print(sim.adj_mi(clustering1, clustering2,random_model='perm1'))
clusim.sim.expected_mi(n_elements, n_clusters1=2, n_clusters2=2, clu_size_seq1=None, clu_size_seq2=None, logbase=2, random_model='num')¶This function calculates the expectation of the Mutual Information between all pairs of clusterings drawn from one of six random models.
See [GA17] for a detailed derivation and explanation of the different random models.
Note
Clustering 2 is considered the gold-standard clustering for one-sided expectations
| Parameters: | n_elements (int) – the number of elements; n_clusters1 (int), n_clusters2 (int) – the number of clusters in each clustering; clu_size_seq1 (list), clu_size_seq2 (list) – the cluster size sequence of each clustering; logbase – the base of the logarithm; random_model (str) – the random model |
|---|---|
| Returns: | The expected MI (between 0.0 and inf) |
>>> import clusim.sim as sim
>>> print(sim.expected_mi(n_elements=5, random_model='all'))
>>> print(sim.expected_mi(n_elements=5, random_model='all1',
clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='num',
n_clusters1=2, n_clusters2=3))
>>> print(sim.expected_mi(n_elements=5, random_model='num1',
n_clusters1=2, clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm',
clu_size_seq1=[2,3],
clu_size_seq2=[1,1,3]))
>>> print(sim.expected_mi(n_elements=5, random_model='perm1',
clu_size_seq1=[2,3],
clu_size_seq2=[1,1,3]))
clusim.sim.onmi(clustering1, clustering2)¶This function calculates the overlapping normalized mutual information.
See [LFK09] for a detailed derivation and explanation of the measure.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | the overlapping normalized mutual information |
clusim.sim.omega_index(clustering1, clustering2)¶This function calculates the omega index between two clusterings.
See [CD88] for a detailed derivation and explanation of the measure.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | the omega index |
clusim.sim.geometric_accuracy(clustering1, clustering2)¶This function calculates the geometric accuracy between two (overlapping) clusterings.
See [NYP12] for a detailed derivation and explanation of the measure.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | the geometric accuracy |
clusim.sim.overlap_quality(clustering1, clustering2)¶This function calculates the overlap quality between two (overlapping) clusterings.
See [ABL10] for a detailed derivation and explanation of the measure.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering |
|---|---|
| Returns: | the overlap quality |
clusim.clusimelement.element_sim(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max')¶The element-centric clustering similarity.
See [GWHA18] for a detailed explanation of the measure.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering; alpha (float), r (float), r2 (float), rescale_path_type (str) – the element-centric parameters; see [GWHA18] |
|---|---|
| Returns: | The element-wise similarity between the two clusterings |
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1],
3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1],
3:[1], 4:[2], 5:[1,2]})
>>> print(sim.element_sim(clustering1, clustering2, alpha=0.9))
clusim.clusimelement.element_sim_elscore(clustering1, clustering2, alpha=0.9, r=1.0, r2=None, rescale_path_type='max', relabeled_elements=None)¶The element-centric clustering similarity for each element.
See [GWHA18] for a detailed explanation of the measure.
| Parameters: | clustering1 (Clustering) – the first clustering; clustering2 (Clustering) – the second clustering; alpha (float), r (float), r2 (float), rescale_path_type (str) – the element-centric parameters; see [GWHA18]; relabeled_elements (dict, optional) – a precomputed element-to-index mapping |
|---|---|
| Returns: | The element-centric similarity between the two clusterings for each element, as a 1d numpy array |
|---|---|
| Returns: | a dict mapping each element to its index in the elementScores array |
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1],
3:[1], 4:[2], 5:[2]})
>>> clustering2 = Clustering(elm2clu_dict={0:[0,2], 1:[0], 2:[0,1],
3:[1], 4:[2], 5:[1,2]})
>>> elementScores, relabeled_elements = sim.element_sim_elscore(clustering1,
clustering2,
alpha = 0.9)
>>> print(elementScores)
clusim.clusimelement.cL1(x, y, alpha)¶The normalized similarity value based on the L1 probability metric, corrected for the guaranteed overlap in probability, alpha, between the two vectors.
See [GWHA18] for a detailed explanation of the need to correct the L1 metric.
| Parameters: | x, y (numpy arrays) – the two affinity matrices; alpha (float) – the guaranteed overlap in probability |
|---|---|
| Returns: | The 1d numpy array of L1 similarities between the affinity matrices x and y |
clusim.clusimelement.make_affinity_matrix(clustering, alpha=0.9, r=1.0, rescale_path_type='max', relabeled_elements=None)¶The element-centric clustering similarity affinity matrix for a clustering. This function automatically determines the most efficient method to calculate the affinity matrix.
See [GWHA18] for a detailed explanation of the affinity matrix.
| Parameters: | clustering (Clustering) – the clustering; alpha (float), r (float), rescale_path_type (str) – the element-centric parameters; see [GWHA18]; relabeled_elements (dict, optional) – a precomputed element-to-index mapping |
|---|---|
| Returns: | The element-centric affinity representation of the clustering as a 2d numpy array |
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> clustering1 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[1], 3:[1],
4:[2], 5:[2]})
>>> pprmatrix = sim.make_affinity_matrix(clustering1, alpha=0.9)
>>> print(pprmatrix)
>>> clustering2 = Clustering(elm2clu_dict={0:[0], 1:[0], 2:[0,1], 3:[1],
4:[2], 5:[2]})
>>> pprmatrix2 = sim.make_affinity_matrix(clustering2, alpha=0.9)
>>> print(pprmatrix2)
clusim.clusimelement.ppr_partition(clustering, alpha=0.9, relabeled_elements=None)¶The element-centric clustering similarity affinity matrix for a partition found analytically.
| Parameters: | clustering (Clustering) – the partition; alpha (float) – the element-centric alpha parameter; relabeled_elements (dict, optional) – a precomputed element-to-index mapping |
|---|---|
| Returns: | The element-centric affinity representation of the clustering as a 2d numpy array |
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> pprmatrix = sim.ppr_partition(clustering1, alpha=0.9)
>>> print(pprmatrix)
clusim.clusimelement.make_cielg(clustering, r=1.0, rescale_path_type='max', relabeled_elements=None)¶Create the cluster-induced element graph for a Clustering.
| Parameters: | clustering (Clustering) – the clustering; r (float), rescale_path_type (str) – the element-centric parameters; relabeled_elements (dict, optional) – a precomputed element-to-index mapping |
|---|---|
| Returns: | The cluster-induced element graph for a Clustering as an igraph.WeightedGraph |
>>> import clusim.sim as sim
>>> from clusim.clustering import Clustering
>>> elm2clu_dict = {0:[0], 1:[0], 2:[1], 3:[1], 4:[2], 5:[2]}
>>> clustering1 = Clustering(elm2clu_dict=elm2clu_dict)
>>> cielg = sim.make_cielg(clustering1, r=1.0)
>>> print(cielg)
clusim.clusimelement.find_groups_in_cluster(clustervs, elementgroupList)¶A utility function to find vertices with the same cluster memberships.
| Parameters: | clustervs – the cluster vertices; elementgroupList – the element groupings |
|---|---|
| Returns: | a list-of-lists containing the groupings of the vertices |
clusim.clusimelement.numerical_ppr_scores(cielg, clustering, alpha=0.9, relabeled_elements=None)¶The element-centric clustering similarity affinity matrix for a partition.
| Parameters: | cielg – the cluster-induced element graph; clustering (Clustering) – the clustering; alpha (float) – the element-centric alpha parameter; relabeled_elements (dict, optional) – a precomputed element-to-index mapping |
|---|---|
| Returns: | The element-centric affinity representation of the clustering as a 2d numpy array |
clusim.dag.DAG(incoming_graph_data=None, **attr)¶
clusim.dag.Dendrogram(incoming_graph_data=None, **attr)¶
| [ABL10] | Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, June 2010. |
| [CD88] | Linda M. Collins and Clyde W. Dent. Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivariate Behavioral Research, 23(2):231–242, 1988. |
| [DDiazGDA05] | Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, September 2005. |
| [FM83] | Edward B. Fowlkes and Colin L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983. |
| [GA17] | Alexander J. Gates and Yong-Yeol Ahn. The impact of random models on clustering similarity. Journal of Machine Learning Research, 18(87):1–28, 2017. |
| [GWHA18] | Alexander J. Gates, Ian B. Wood, William P. Hetrick, and Yong-Yeol Ahn. On comparing clusterings: an element-centric framework unifies overlaps and hierarchy. arXiv:1706.06136, 2018. |
| [HA85] | Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985. |
| [Jac12] | Paul Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912. |
| [LFK09] | Andrea Lancichinetti, Santo Fortunato, and János Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, March 2009. |
| [Mei03] | Marina Meilă. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187. Springer, 2003. |
| [NYP12] | Tamás Nepusz, Haiyuan Yu, and Alberto Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods, 9(5):471–472, 2012. |
| [Ran71] | William M Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846, 1971. |