PyBDA: a command line tool for automated analysis of big biological data sets


	Overview

PyBDA provides various statistical methods and machine learning algorithms that scale to very large, high-dimensional data sets. Since most machine learning algorithms are computationally expensive and big, high-dimensional data sets do not fit into the memory of standard desktop computers, PyBDA uses Apache Spark’s DataFrame API for computation. Spark automatically partitions data across the nodes of a computing cluster or, if no cluster environment is available, uses the resources of the local machine.
In contrast to other data analysis libraries, for instance [8,9], which require the user to program against the provided API, PyBDA is a command line tool that does not require extensive programming knowledge. Instead, the user only needs to define a config file that specifies the algorithms to be used. PyBDA then automatically builds a workflow and executes the specified methods one after another, using Snakemake [10] to run these workflows.
Specifically, PyBDA implements the following workflow to enable pipelining of multiple data analysis tasks (Fig. 1): PyBDA builds an abstract Petri net from a config file containing a list of statistical methods or machine learning algorithms to be executed. A Petri net is a bipartite, directed graph in which one set of nodes represents conditions (in our case data sets) and the other set represents transitions (in our case operations such as machine learning methods and statistical models). A transition in a Petri net model is only enabled if its condition is met, i.e., in our case when the data set that is used as input for a method exists on the file system. Firing a transition leads to the creation of a new condition, i.e., a new data set. Every operation in the Petri net, i.e., every triple of input file, method and output file, is then executed by Snakemake. The method of every triple is a Python module whose main functionality is implemented with Spark’s DataFrame and RDD API or MLlib. By using Spark, data sets are automatically chunked into smaller pieces and processed in parallel on multiple cores of a distributed high performance computing (HPC) cluster. Through distributed, parallel computing it is possible to fit models and apply methods even to big, high-dimensional data sets.
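The firing rule described above can be illustrated with a minimal sketch. It is not PyBDA's or Snakemake's implementation: the names `Transition` and `fire_all` as well as the file names are hypothetical, and the sketch only records which operation would run instead of executing any method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One operation of the net: a triple of input file, method and output file."""
    infile: str
    method: str
    outfile: str


def fire_all(transitions, available):
    """Fire transitions until no further transition is enabled.

    A transition is enabled when its input condition (a data set) is
    available. Firing it creates a new condition (its output data set).
    Returns the transitions in the order in which they were fired.
    """
    available = set(available)
    pending = list(transitions)
    fired = []
    progress = True
    while pending and progress:
        progress = False
        for t in list(pending):
            if t.infile in available:
                # A real tool would run the method here; we only record it.
                available.add(t.outfile)
                fired.append(t)
                pending.remove(t)
                progress = True
    return fired


# Hypothetical file and method names, chosen only for illustration.
net = [
    Transition("pca.tsv", "kmeans", "pca_kmeans.tsv"),
    Transition("data.tsv", "pca", "pca.tsv"),
]
order = [t.method for t in fire_all(net, {"data.tsv"})]
print(order)
```

Although the clustering step is listed first, it only fires after the dimension reduction has produced its input, which is the dependency resolution that a workflow engine performs on the (input, method, output) triples.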



	Comparison to other big data tools

In the last decade, several big data analysis and machine learning frameworks have been proposed, yet none of them allows for easy, automated pipelining of multiple data analysis or machine learning tasks. Here, we briefly compare the pros and cons of PyBDA with some of the most popular frameworks, including TensorFlow [11], scikit-learn [8], mlr [9], MLlib [6] and H2O [12]. Many other machine learning tools that are comparable in functionality to these frameworks exist, such as PyTorch [13], Keras [14] or Edward [15]. For the sake of completeness, we also mention tools for probabilistic modelling, such as PyMC3 [16], GPFlow [17] or greta [18], which are primarily designed for statistical modelling and probabilistic programming and not for big data analysis.
We compare the different tools using the following criteria (Table 1): (1) how easily can the tool be used, especially w.r.t. programming knowledge (usability), (2) how much time does it take to implement a method/model once the API has been learned (time to implement), (3) how much knowledge of machine learning (ML), optimization, modelling and statistics is needed to use the tool (ML knowledge), (4) is it possible to use big data with the tool, i.e., does it scale well to big and high-dimensional data sets (big data), (5) how many methods are supported from scratch without the need to implement them (supported methods), and (6) is the tool easily extended with new methods, e.g., using the provided API (extensibility).

In comparison to PyBDA, the other tools we considered here are either complex to learn, take some time to get used to, or are not able to cope with big data sets. For instance, TensorFlow scales well to big, high-dimensional data sets and allows for the implementation of basically any numerical method. However, while it is the most advanced of the compared tools, it has a huge, complex API and requires extensive knowledge of machine learning to be usable, for instance to implement the evidence lower bound of a variational autoencoder or to choose an optimizer for minimizing a custom loss function. On the other hand, tools such as scikit-learn and mlr are easy to use and support a large range of methods, but do not scale well, because some of their functionality cannot be distributed on HPC clusters and is consequently not suitable for big data. The two tools that are specifically designed for big data, namely MLlib and H2O, are very similar to each other. A drawback of both is that their range of models and algorithms is rather limited in comparison to tools such as scikit-learn and mlr. In comparison to H2O’s H2OFrame API, we think Spark not only provides a superior DataFrame/RDD API that has more capabilities and makes it easier to extend a code base with new methods, but also has better integration for linear algebra. For instance, computing basic descriptive statistics using map-reduce or performing matrix multiplication is easier to implement using Spark.
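To make the map-reduce pattern mentioned above concrete, the following sketch computes a mean and a population variance from chunked data in plain Python. It deliberately does not use Spark: each inner list stands in for one partition of a distributed data set, and the function names `summarize`, `combine` and `mean_and_variance` are hypothetical.

```python
from functools import reduce


def summarize(chunk):
    """Map step: reduce one chunk to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(chunks):
    """Combine per-chunk summaries into a global mean and population variance."""
    n, s, ss = reduce(combine, map(summarize, chunks), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean ** 2
    return mean, variance


# Each inner list plays the role of one data partition.
chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(chunks)
print(mean, variance)
```

Because `combine` is associative and commutative, the partial summaries can be merged in any order, which is what allows such statistics to be computed across the nodes of a cluster. The sum-of-squares formula is used here for brevity; it can lose precision for large values, so a numerically more careful update may be preferable in practice.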
PyBDA is the only tool specifically built to not require much knowledge of programming or machine learning. It can be used right away, without spending much time getting used to an API. Furthermore, because it uses Spark, it scales well and can be extended easily.


	Implementation



	Supported algorithms

PyBDA comes with a variety of algorithms for analysing big data from which the user can choose (Table 2). Unless already provided by MLlib, we implemented the algorithms against Spark’s DataFrame API. In particular, efficient, scalable implementations of common dimension reduction methods included in PyBDA, such as kernel principal component analysis (kPCA), independent component analysis (ICA), linear discriminant analysis (LDA) and factor analysis (FA), have so far been missing entirely from open source software. PyBDA primarily supports simple models that do not trade biological interpretability for mathematical complexity and performance.



	Running PyBDA

In order to run PyBDA on a Spark cluster, the user needs to provide an IP address to which Spark sends its jobs. Consequently, users need to either set up a cluster (standalone, Kubernetes, etc.) or submit jobs to the local host, although the strength of PyBDA lies in computation on a distributed HPC environment. Given the IP of the Spark cluster, the user needs to provide a config file with methods, data files, and parameterization. For instance, the config file provided in Fig. 2a will first trigger dimension reductions to 5 dimensions using principal component analysis (PCA) and ICA on a data set called single_cell_samples.tsv, with feature names provided in feature_columns.tsv. PyBDA then uses the outputs of both methods, fits Gaussian mixture models (GMM) and runs k-means on each output with 50 or 100 cluster centers (resulting in four different results). In addition, a generalized linear model (GLM) and a random forest (RF) with a binomial response variable (named is_infected) will be fitted on the same features. Thus, PyBDA automatically parses all combinations of methods and executes each combination (Fig. 2b shows the corresponding Petri net of files and operations). The results of all methods are written to a folder called results. For each job, PyBDA allows Spark to use 15 GB of driver memory (for the master) and 50 GB of memory for each executor (the main process run by a worker node).
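How the combinations in this example expand into individual operations can be sketched as follows. The sketch does not reproduce PyBDA's config file format or its parser: the variable names, the method labels and the output paths are hypothetical, and it assumes that the regression methods read the original input file while the clustering methods read the outputs of the dimension reductions.

```python
from itertools import product

# Hypothetical representation of the choices described in the text;
# this is not PyBDA's config file format.
infile = "single_cell_samples.tsv"
dimension_reductions = ["pca", "ica"]
clusterings = ["gmm", "kmeans"]
regressions = ["glm", "forest"]


def build_triples(infile, dimension_reductions, clusterings, regressions):
    """Return (input, method, output) triples for all method combinations."""
    triples = []
    # Dimension reductions read the original data set.
    for dr in dimension_reductions:
        triples.append((infile, dr, f"results/{dr}.tsv"))
    # Every clustering method is applied to every dimension reduction output.
    for dr, cl in product(dimension_reductions, clusterings):
        triples.append((f"results/{dr}.tsv", cl, f"results/{dr}_{cl}.tsv"))
    # Regressions are assumed to read the original data set as well.
    for reg in regressions:
        triples.append((infile, reg, f"results/{reg}.tsv"))
    return triples


triples = build_triples(infile, dimension_reductions, clusterings, regressions)
clustering_jobs = [t for t in triples if t[1] in clusterings]
for t in triples:
    print(t)
```

Under these assumptions the example expands into eight operations, four of which are the clustering results mentioned in the text. Each triple corresponds to one transition of the Petri net.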

