Hypercluster: a flexible tool for parallelized unsupervised clustering optimization


	Implementation



	Requirements

The hypercluster package uses scikit-learn [35], python-igraph [36], leidenalg [37] and louvain-igraph [38] to assign cluster labels. It then uses scikit-learn and custom metrics to compare clustering algorithms and hyperparameters and to find optimal clusters for any given input data (Fig. 1). Hypercluster requires Python 3, pandas [39], numpy [40], scipy [41], matplotlib [42], seaborn [43], scikit-learn [35], python-igraph [36], leidenalg [37], louvain-igraph [38] and SnakeMake [34].


	General workflow and examples

Hypercluster can be run independently of SnakeMake, as a standalone Python package. Input and output structures, as well as example workflows on a breast cancer RNA-seq dataset [43] and an scRNA-seq dataset [45], can be found at https://github.com/ruggleslab/hypercluster/tree/master/examples. Briefly, the workflow starts by instantiating an AutoClusterer (for a single algorithm) or MultiAutoClusterer (for multiple algorithms) object with default or user-defined hyperparameters (Fig. 1a). To run through the hyperparameters for a dataset, users provide a pandas DataFrame to the “fit” method of either object (Fig. 1b). Users then evaluate the labeling results with a variety of metrics by running the “evaluate” method (Fig. 1c). Clustering labels and evaluations are aggregated into convenient tables (Fig. 1d), which can be visualized with built-in functions (e.g. Additional file 1: Fig. S1, Additional file 2: Fig. S2).
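The four steps of this workflow (define a search space, fit, evaluate, aggregate) can be illustrated with a minimal sketch written directly against scikit-learn and pandas. This is not the hypercluster API: it does not use AutoClusterer or MultiAutoClusterer, and the toy data, the search space, the column-naming scheme and the variable names are all assumptions made for illustration.

```python
import itertools

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Toy input: samples as rows, features as columns (names are illustrative).
X, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)
data = pd.DataFrame(
    X,
    index=[f"sample_{i}" for i in range(60)],
    columns=[f"feature_{j}" for j in range(4)],
)

# (a) A user-defined hyperparameter search space for a single algorithm.
search_space = {"n_clusters": [2, 3, 4], "n_init": [5, 10]}

labels = {}
evaluations = {}
for values in itertools.product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    # Hypothetical naming scheme: one column per hyperparameter combination.
    name = ";".join(f"{key}-{value}" for key, value in params.items())

    # (b) Fit one clustering per hyperparameter combination.
    model = KMeans(random_state=0, **params).fit(data)
    labels[name] = model.labels_

    # (c) Evaluate each labeling with several metrics.
    evaluations[name] = {
        "silhouette_score": silhouette_score(data, model.labels_),
        "calinski_harabasz_score": calinski_harabasz_score(data, model.labels_),
    }

# (d) Aggregate labels (samples x conditions) and evaluations (metrics x conditions).
labels_df = pd.DataFrame(labels, index=data.index)
evaluations_df = pd.DataFrame(evaluations)
```

Storing one column per hyperparameter combination keeps the label and evaluation tables aligned, so a condition selected in the evaluation table can be looked up directly in the label table.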


	Configuring the SnakeMake pipeline

The SnakeMake pipeline allows users to parallelize clustering calculations across multiple threads on a single computer, multiple compute nodes on a high-performance cluster or a cloud cluster [34]. The pipeline is configured through a config.yml file (Table 1), which contains user-specified input and output directories and files (Table 1, lines 1–3, 5–7) and the hyperparameter search space (Fig. 1a, Table 1, line 18). This file contains predefined defaults for the search space that allow the pipeline to be used “out of the box.” Further, users can specify whether to use exhaustive grid search or random search; if random search is selected, probability weights for each hyperparameter can be chosen (Table 1, line 9). The pipeline then schedules each clustering calculation and evaluation as a separate job (Fig. 1b). Users can specify which evaluation metrics to apply (Fig. 1c, Table 1, line 10) and add keyword arguments to tune several steps in the process (Table 1, lines 4, 8–9, 11–16). Clustering and evaluation results are then aggregated into final tables (Fig. 1d). Users can consult the documentation and examples for more information.
As input, users provide a data table with samples to be clustered as rows and features as columns. Users can then run “snakemake -s hypercluster.smk --configfile config.yml” on the command line, with any additional SnakeMake flags appropriate for their system. Applying the same configuration to new files, or testing new algorithms on old data, only requires editing the inputs in the config.yml file and rerunning the SnakeMake command.
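The difference between exhaustive grid search and weighted random search over a hyperparameter space can be sketched as follows. This is an illustration of the two strategies only, not the pipeline's own code: the function names, the search space, the weight format and the choice to sample with replacement are all assumptions.

```python
import itertools

import numpy as np

# Hypothetical search space: each hyperparameter maps to its candidate values.
search_space = {
    "n_clusters": [2, 3, 4, 5],
    "linkage": ["ward", "average"],
}


def grid_search_combinations(space):
    """Return every combination of hyperparameter values (exhaustive grid)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def random_search_combinations(space, weights, n_draws, seed=0):
    """Draw combinations at random, one value per hyperparameter per draw.

    `weights` maps a hyperparameter name to probabilities over its candidate
    values; hyperparameters without an entry are sampled uniformly.
    """
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        combo = {}
        for key, options in space.items():
            idx = rng.choice(len(options), p=weights.get(key))
            combo[key] = options[idx]
        draws.append(combo)
    return draws


grid = grid_search_combinations(search_space)

# Favor intermediate cluster numbers; "linkage" falls back to uniform sampling.
weights = {"n_clusters": [0.1, 0.4, 0.4, 0.1]}
sampled = random_search_combinations(search_space, weights, n_draws=5)
```

Grid search grows multiplicatively with each added hyperparameter, whereas random search keeps the number of scheduled jobs fixed at `n_draws`, which is why weighting the draws toward plausible values can be useful for large search spaces.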


	Extending hypercluster

Currently, hypercluster can run any clustering algorithm and calculate any evaluation metric available in scikit-learn [35,46], as well as non-negative matrix factorization (NMF) [47], Louvain [38] and Leiden [37] clustering. Users can add further clustering classes and evaluation metric functions in the additional_clusterers.py and additional_metrics.py files, respectively, provided they accommodate the same inputs, outputs and methods (see additional_clusterers.py and additional_metrics.py for examples).
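As a rough sketch of what such an extension might look like, the block below defines a clustering class that follows the scikit-learn estimator convention (hyperparameters in the constructor, a fit method, a labels_ attribute) and a metric function that maps a labeling to a single number. The exact interface that hypercluster expects is defined by the examples in additional_clusterers.py and additional_metrics.py; the class name, the function name and their signatures here are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class QuantileClusterer(BaseEstimator, ClusterMixin):
    """Hypothetical clusterer: bins samples by quantiles of their row means."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        row_means = X.mean(axis=1)
        # Interior quantile cut points that split samples into n_clusters bins.
        quantiles = np.linspace(0, 1, self.n_clusters + 1)[1:-1]
        cuts = np.quantile(row_means, quantiles)
        # Assumption: labels are exposed as `labels_`, as in scikit-learn.
        self.labels_ = np.digitize(row_means, cuts)
        return self


def smallest_cluster_fraction(labels):
    """Hypothetical metric: fraction of samples in the smallest cluster."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.min() / counts.sum()


# Usage on a toy matrix with 10 samples (rows) and 4 features (columns).
toy = np.arange(40).reshape(10, 4)
toy_labels = QuantileClusterer(n_clusters=2).fit(toy).labels_
toy_score = smallest_cluster_fraction(toy_labels)
```

Inheriting from BaseEstimator and ClusterMixin provides get_params, set_params and fit_predict without extra code, which is one reason to follow the scikit-learn convention when writing a new clusterer.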


	Outputs

For each set of labels, hypercluster generates a file with sample labels and a file containing evaluations of those labels. It also outputs aggregated tables of all labels and evaluations. Hypercluster can also generate several helpful visualizations, including a heatmap showing the evaluation metrics for each set of hyperparameters (Fig. 1c) and a table and heatmap of pairwise comparisons of labeling similarities with a user-specified metric (Additional file 1: Fig. S1). The latter visualization is particularly useful for finding labels that are robust to differences in hyperparameters. Hypercluster can also optionally output a table and heatmap showing how often each pair of samples was assigned to the same cluster (Additional file 2: Fig. S2). Other useful custom visualizations, which are simple for users to create from the aggregated clustering results, are available in our examples (https://github.com/ruggleslab/hypercluster/tree/dev/examples).
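The two comparison tables described above can be approximated from an aggregated label table with a few lines of pandas and scikit-learn. This is a minimal sketch rather than hypercluster's own implementation: the toy label table, the function names, the choice of the adjusted Rand index as the similarity metric and the use of a fraction (rather than a count) for co-assignment are assumptions.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy aggregated label table: samples as rows, one column per clustering condition.
labels_df = pd.DataFrame(
    {
        "kmeans;n_clusters-2": [0, 0, 1, 1, 1],
        "kmeans;n_clusters-3": [0, 0, 1, 1, 2],
        "agglo;n_clusters-2": [1, 1, 0, 0, 0],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)


def pairwise_label_similarity(labels, metric=adjusted_rand_score):
    """Compare every pair of labelings with a user-specified metric."""
    cols = labels.columns
    values = [[metric(labels[a], labels[b]) for b in cols] for a in cols]
    return pd.DataFrame(values, index=cols, columns=cols)


def sample_cooccurrence(labels):
    """Fraction of labelings in which each pair of samples shares a cluster."""
    arr = labels.to_numpy()
    # same[i, j, k] is True when samples i and j share a label in labeling k.
    same = arr[:, None, :] == arr[None, :, :]
    return pd.DataFrame(same.mean(axis=2), index=labels.index, columns=labels.index)


similarity = pairwise_label_similarity(labels_df)
cooccurrence = sample_cooccurrence(labels_df)
```

Either table can be passed to a heatmap function such as seaborn's `clustermap` to group similar conditions or samples. Labelings that agree with many others in the similarity table are candidates for clusters that are robust to hyperparameter choice.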
