DeepStream: autoencoder-based stream temporal clustering

This paper presents DeepStream, a novel data stream temporal clustering algorithm that dynamically detects sequential and overlapping clusters. DeepStream is tuned to classify contextual information in real time and is capable of coping with a high-dimensional feature space. DeepStream utilizes stacked autoencoders to reduce the dimensionality of unbounded data streams and for cluster representation. This method detects contextual behavior and captures nonlinear relations of the input data, giving it an advantage over existing methods that rely on PCA. We evaluated DeepStream empirically using four sensor and IoT datasets and compared it to five state-of-the-art stream clustering algorithms. Our evaluation shows that DeepStream outperforms all of these algorithms.


INTRODUCTION
Data stream clustering is a special type of clustering that has recently received increased attention from researchers motivated by the challenges posed by applications involving large datasets and streams of data, and the need to learn tasks in real time [1].
Several stream clustering algorithms have been proposed, most of which have two phases: an online phase, which summarizes the data into many micro-clusters, and an offline phase, in which these micro-clusters are re-clustered into a smaller number of final clusters [3]. However, none of these algorithms is suitable for clustering contextual sensor streams, since they are designed to produce clusters that are agnostic to the order of arriving records. In contrast, the pcStream algorithm [19] is capable of capturing the temporal importance of the incoming data. We propose a stream clustering method called DeepStream, which is designed to address the abovementioned limitations of existing stream clustering algorithms. DeepStream leverages deep learning techniques to capture the nonlinear relations of input data by using the embedded representation output of a pretrained stacked autoencoder (SAE). By integrating an SAE into the temporal stream clustering algorithm, our method improves the quality of the resulting clusters and copes better with high-dimensional data streams. We evaluated our proposed method using four different datasets and compared its performance with that of five existing stream clustering algorithms. The results show a significant improvement in clustering capabilities.
Summary of contributions. a) We propose a stream clustering algorithm which is capable of capturing temporal contexts. As a result of integrating SAEs, our method is able to (1) capture nonlinear relations between features, and (2) handle high-dimensional input vectors by using a smaller amount of input features. b) We empirically compare our algorithm with state-of-the-art algorithms. Our evaluation shows that our algorithm outperforms the other algorithms in clustering temporal context streams which are present in four datasets.

BACKGROUND AND RELATED WORK

Stream Clustering
Data stream clustering aims to find and maintain a set of valid clusters within a continuous and possibly unbounded stream of records [13]. Handling a huge amount of unbounded data poses a great challenge due to memory and computational constraints [2,13]. Stream clustering algorithms differ from ordinary clustering algorithms in several aspects: First, since stream clustering algorithms are exposed to the data incrementally, they cannot iterate over all of the data or access different data locations. In addition, most of the data needs to be discarded over time because of memory limitations. Second, the data distribution may change over time and have concept drifts. For this reason, stream clustering models should adapt themselves to new data and have the ability to disregard older, no longer relevant data. Third, incoming data streams accumulate over time, and stream clustering models do not have the time or storage space to handle large amounts of data. Therefore, stream clustering algorithms should only save the minimal amount of data needed to retain their ability to continuously cluster new data efficiently.
In this paper, we empirically evaluate DeepStream and compare it with the following state-of-the-art stream clustering algorithms: (i) DenStream [6], which is a density-based stream clustering algorithm. It uses the clustering feature (CF) form to determine whether a group of micro-clusters is a legitimate cluster or a collection of outliers; (ii) D-Stream [8], which also performs density-based stream clustering, but across a grid; (iii) DBStream [11], which captures the shared density between two micro-clusters in order to decide whether or not micro-clusters belong to the same cluster; (iv) CluStream [3], which uses a two-phased approach (dividing the clustering process into online and offline macro-clustering) that provides flexibility to explore the nature of the evolution of the clusters over different time periods; and (v) pcStream [19], which summarizes clusters using the mean and principal components (vectors of highest variance) of the cluster's last records. Unlike the other methods evaluated, pcStream considers the temporal relation between arriving records when making clustering decisions.
Of these algorithms, ours is most similar to pcStream in the way decisions are made regarding cluster affiliation and the formation of new clusters.

Autoencoder-Based Clustering
An autoencoder (AE) is a type of artificial neural network used to learn efficient data encodings in an unsupervised manner. Traditionally, AEs were used for dimensionality reduction or feature learning. AEs take structured input data and try to reconstruct the input after encoding it. Some variants focus on the network architecture, such as the number of layers, the latent dimension size, or the use of CAEs (convolutional autoencoders) for image data. Other methods are aimed at preventing overfitting and improving robustness by adding noise to the input [22] or by using dropout [21] to generalize the network. For example, DEC [25] uses a joint deep neural network (DNN) and clustering approach that simultaneously learns feature representations and cluster assignments with an SAE network. Dimensionality reduction using a deep belief network with nonparametric maximum margin clustering was proposed in [7]. In contrast to some of the abovementioned methods, which couple cluster assignment with the neural network architecture, our method does not assume a certain number of clusters.

PROPOSED METHOD
Our proposed method, called DeepStream, consists of an offline and online phase. In the offline phase a neural network learns the data in the training set in order to generate a model that produces a compact representation of each record. The online phase continuously generates clusters from incoming data streams using the model trained in the offline phase.

DeepStream's Offline Phase
One of the key components of DeepStream is a stacked autoencoder (SAE), which is a more advanced architecture than a basic autoencoder. Research has shown that SAEs consistently produce semantically meaningful and well-separated representations on real-world datasets [14,23]. Thus, the unsupervised representation learned by an SAE captures the nonlinear relationships between the input variables and naturally helps DeepStream better separate different contexts.
An SAE network is built layer by layer, and each layer is a denoising AE which is trained separately to reconstruct the previous layer's output [23]. Each denoising AE is made up of two layers.
First, the encoder input x passes through Dropout(); dropout is a regularization technique that prevents overfitting by dropping out units during training [21]. The hidden representation is h = g1(W1 x + b1), where g1 is the activation function and W1 and b1 are neural network parameters. L2 normalization is applied to the hidden layer h (the latent space) during training; in [5], the authors showed that adding L2 normalization to the latent space improves the separability of clusters in a variety of deep autoencoder models. The decoder is built the same way as the encoder but in reverse order. In the training phase, the mean squared error, MSE = (1/n) Σ_{i=1}^{n} (x_i − x̂_i)^2, is minimized. We use the output of h as the input for training the next autoencoder.
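The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the layer sizes (64 → 16), the dropout rate, the random seed, and the random (untrained) weights are all assumptions for demonstration purposes.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.2):
    """Randomly zero input units during training (regularization)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def relu(z):
    return np.maximum(0.0, z)

def l2_normalize(h, eps=1e-12):
    """Project the latent vector onto the unit sphere (L2 normalization)."""
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)

# Hypothetical layer sizes: 64-dim input, 16-dim latent space.
n_in, n_hid = 64, 16
W1, b1 = rng.normal(0, 0.1, (n_hid, n_in)), np.zeros(n_hid)  # encoder params
W2, b2 = rng.normal(0, 0.1, (n_in, n_hid)), np.zeros(n_in)   # decoder (reverse)

x = rng.normal(size=n_in)
x_noisy = dropout(x)                       # denoising: corrupt the input
h = l2_normalize(relu(W1 @ x_noisy + b1))  # h = g1(W1 x + b1), L2-normalized
x_hat = W2 @ h + b2                        # decoder mirrors the encoder

mse = np.mean((x - x_hat) ** 2)            # training objective to minimize
```

In a real training loop, the MSE would be minimized with gradient descent; the output h would then serve as the training input for the next autoencoder in the stack.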
Each internal layer uses ReLU as an activation function, because it is simple and computationally inexpensive, and it does not suffer from the vanishing gradient problem like sigmoid does. However, ReLU suffers from a different problem known as the "dying ReLU" problem [17]. In order to lower the chances of "dying ReLU," we use SELU as the internal layer activation function in some of the datasets when evaluating DeepStream's performance. After training each layer separately, we build an AE model from all of the hidden layers (h). Next, the SAE undergoes additional training in its entirety to fine-tune the parameters simultaneously. This results in a trained deep AE with the capability of producing a reduced representation of the input; the output of the encoder (i.e., the SAE's hidden layer) is then used in the online clustering phase.
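After layer-wise pretraining, the per-layer encoders are composed into one deep encoder whose bottleneck output feeds the online phase. The sketch below shows only this composition step; the layer widths are illustrative assumptions, and the random weights stand in for weights that would come from the greedy pretraining and fine-tuning described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical layer widths: each AE halves the previous layer's output.
widths = [64, 32, 16, 8]

# In the real pipeline each (W, b) pair comes from separately pretraining one
# denoising AE on the previous layer's hidden output, followed by fine-tuning
# the whole stack; random values here are placeholders for those weights.
encoder = [(rng.normal(0, 0.1, (w_out, w_in)), np.zeros(w_out))
           for w_in, w_out in zip(widths[:-1], widths[1:])]

def encode(x, layers):
    """Compose the stacked encoders into one deep encoder (the SAE bottleneck)."""
    h = x
    for W, b in layers:
        h = relu(W @ h + b)
    return h

x = rng.normal(size=widths[0])
z = encode(x, encoder)   # 8-dim representation used by the online phase
```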

DeepStream's Online Clustering Phase
DeepStream's online clustering phase is based on its predecessor, the pcStream algorithm [19], except for the way in which the input is manipulated. While pcStream relies on the soft independent modelling by class analogy method (SIMCA) [24] for similarity score calculation, DeepStream uses an SAE to encode high-dimensional input to lower dimensions in order to obtain semantically meaningful context separation by leveraging the nonlinear relationship between the input features.
DeepStream's clustering phase (Algorithm 1) has five hyperparameters in addition to the input stream X: threshold (t) represents the sensitivity level of the algorithm; drift buffer size (d) sets the minimum size of each context; memory size (m) limits the maximum size of each context; encoder stands for a pretrained encoder that encodes each input from stream X; and context limit (lim) limits the number of the algorithm's contexts.
In lines 1-2 of Algorithm 1, a list of contexts C is initialized with a new context that contains the first d instances from stream X. This operation is performed by calling the function updateContext(X[0..d]). It is assumed that the first d records belong to the same cluster, to provide the algorithm with a starting point. In lines 5-16, every record (x) in the data stream is processed. First, the record is encoded to a lower-dimensional representation by the encode(x, encoder) function.
Algorithm 1: DeepStream online clustering algorithm. Input: threshold (t), drift buffer size (d), memory size (m), trained encoder (encoder), data stream (X), context limit (lim). Output: context list (C).

In line 7, the Mahalanobis distance between x̂ and each context in list C is measured. We chose the Mahalanobis distance over the Euclidean distance, since we cannot assume that the clusters have identical covariances; clusters with elliptical covariances are modeled much better by the Mahalanobis distance measure. In rare cases when features have zero correlations, the Euclidean and Mahalanobis distances coincide, except that the Euclidean distance is faster to compute [9]. After measuring all of the distances between x̂ and the cluster centers, the closest cluster index is chosen (line 7). In cases in which the closest cluster distance is less than threshold (t): (1) the attempt to create a new cluster is abandoned, and therefore, if the drift buffer (denoted as Buffer) is not empty, its records are dumped into the C[bestContext] cluster by the updateContext() function; and (2) the current instance x̂ joins the C[bestContext] cluster. In order to save memory, the function updateContext(C[bestContext], x̂, m) updates cluster C[bestContext] with the m newest incoming records. In addition, the use of a FIFO policy allows clusters to adapt themselves to changes over time (i.e., concept drift). Then, the inverse covariance matrix is recalculated for the Mahalanobis distance measurement. In cases in which the closest context distance is greater than threshold (t) (lines 13-16), instance x̂ is appended to the drift buffer (Buffer), and once the maximum number (d) of instances in the drift buffer has been reached, a new context is created from those instances.
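The per-record decision step can be sketched as follows: summarize each context by its mean and inverse covariance, compute the Mahalanobis distance from the encoded record to every context, and join the nearest one if the distance falls below threshold t. The 2-D contexts, threshold value, and regularization term below are illustrative assumptions.

```python
import numpy as np

def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance of x from a context summarized by (mean, inv_cov)."""
    d = x - mean
    return float(np.sqrt(d @ inv_cov @ d))

def nearest_context(x, contexts):
    """Return (index, distance) of the closest context to encoded record x."""
    dists = [mahalanobis(x, c["mean"], c["inv_cov"]) for c in contexts]
    i = int(np.argmin(dists))
    return i, dists[i]

def summarize(points):
    """Summarize a context's recent records; regularize to keep cov invertible."""
    cov = np.cov(points.T) + 1e-6 * np.eye(points.shape[1])
    return {"mean": points.mean(axis=0), "inv_cov": np.linalg.inv(cov)}

# Two hypothetical 2-D contexts with elliptical covariances.
rng = np.random.default_rng(2)
c0 = summarize(rng.normal([0, 0], [1.0, 0.2], size=(200, 2)))
c1 = summarize(rng.normal([5, 5], [0.2, 1.0], size=(200, 2)))

t = 3.0                           # sensitivity threshold
x = np.array([0.5, 0.1])          # an encoded record
idx, dist = nearest_context(x, [c0, c1])
joins = dist < t                  # below t: join context; above t: drift buffer
```

Elliptical clusters like c0 and c1 are exactly the case where the Mahalanobis distance outperforms the Euclidean distance, since it rescales each direction by the context's own variance.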
To prevent an explosion of contexts, parameter lim limits the number of possible contexts by merging the two closest contexts in the merge(C) function. Merging contexts is carried out by taking the newest m instances from the two merging contexts. This requires that each instance in DeepStream is saved with its arrival timestamp, in order to enable an accurate merge operation.
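The timestamp-based merge can be sketched as below; storing each record as a (timestamp, record) pair is the representation described above, while the concrete values are hypothetical.

```python
def merge_contexts(ctx_a, ctx_b, m):
    """Merge two contexts by keeping only the m most recent instances overall.

    Each context holds (timestamp, record) pairs; the timestamps make an
    order-accurate merge possible.
    """
    combined = sorted(ctx_a + ctx_b, key=lambda tr: tr[0])
    return combined[-m:]   # newest m instances, consistent with the FIFO policy

# Hypothetical contexts: (timestamp, 1-D record) pairs.
a = [(1, 0.1), (4, 0.4), (6, 0.6)]
b = [(2, 0.2), (5, 0.5), (7, 0.7)]

merged = merge_contexts(a, b, m=4)
# merged -> [(4, 0.4), (5, 0.5), (6, 0.6), (7, 0.7)]
```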

EXPERIMENTS

Evaluation Setup
Datasets. In order to evaluate the performance and generality of our proposed algorithm, we tested DeepStream on four real-world contextual sensor datasets, each of which captures a different context domain; together, they provide extensive coverage in terms of the number of input features and the levels of context. In Table 1, we present a summary of the datasets that we used to empirically evaluate and compare the stream clustering algorithms.
Benchmarking. In our experiments, we compared DeepStream with five state-of-the-art stream clustering algorithms: pcStream [19], DBstream [11], DenStream [6], D-Stream [8] and CluStream [3]. To enable a fair comparison of the algorithms, each of them was trained on exactly the same training set and was later applied to the same test set. For each algorithm evaluated, we performed a grid search for hyperparameter tuning in order to optimize the algorithm's performance.
In order to evaluate D-Stream, DenStream, DBStream, and CluStream we used an R package called streamMOA [12]. In streamMOA, the algorithms classify records as noise if they do not fit into any cluster. In order to perform a fair comparison with pcStream and DeepStream, we only took into account results with less than 5% of noisy records. We demonstrate the effectiveness of DeepStream mainly by comparing it with pcStream, as these two algorithms are designed to find temporal contexts in data streams but use different information extraction methods.
Parameter setting. In DeepStream, the parameter tuning process is divided into two parts: (1) the offline training phase, in which the AE is trained to encode the input data, and (2) the online phase, in which the DeepStream model is trained to cluster the stream data in the training set. For the offline phase, we trained SAEs so that the encoder could later be used for dimensionality reduction of the input stream. For each of the HAR, HearO, and IoT datasets, we trained an SAE whose architecture (specifically, the size of each layer) was derived from the input size of the dataset. All of the internal layers are activated by the ReLU nonlinearity function [10], except for the IoT dataset, which uses SELU (scaled exponential linear unit) [16] as the internal activation function. For the online phase, we used the encoder trained in the offline phase and executed DeepStream with a grid search on two parameters: threshold (t) and drift buffer size (d). Both the memory size (m) and the context limit were set to 1,000.
Evaluation metric. In many stream clustering experiments the adjusted Rand index (ARI) and sum of squares (SSQ) are the most common evaluation metrics. However, the SSQ metric assumes that clusters do not overlap, and therefore it is not suitable for our clustering problem. All of the clustering methods were evaluated using the ARI metric [15] which is widely used to compare the quality of different partitions given the ground truth. The ARI is a measure of similarity between two data clustering assignments regardless of their spatial qualities.
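The ARI can be computed from the pair-counting contingency table with only the standard library. The sketch below is a textbook implementation of the metric, not code from the paper; the toy labelings are illustrative.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the pair-counting contingency table."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # row sums (ground-truth cluster sizes)
    b = Counter(labels_pred)   # column sums (predicted cluster sizes)

    index = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)   # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]   # imperfect clustering; label names may be permuted
score = adjusted_rand_index(truth, pred)
```

Because the ARI compares pair assignments rather than label names, a perfect clustering scores 1.0 even when the cluster labels are permuted, and a random clustering scores near 0, which is what makes it suitable for comparing partitions against ground truth.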

Results
The best results are presented in Table 2, where it is evident that DeepStream outperforms all of the other methods on every dataset. Since we only report results with less than 5% of noisy records, no results are reported for D-Stream on the HAR, IoT, and PAMAP2 datasets. All algorithms obtained low ARI scores on the HearO-Low-High and PAMAP2 datasets, since these two datasets consist of real-life data, which makes them noisy and hard to cluster.
On the HearO dataset, the results (see Table 2) show that the higher the granularity of the labels, the more sensitive the algorithm needs to be. Accordingly, the values selected for the drift buffer size (d) and threshold (t) parameters were lower for the 'low-level' labels than for the 'high-level' labels. Finally, our evaluation supports the empirical evidence reported by Mirsky et al. [19], as pcStream outperforms most of the state-of-the-art stream clustering methods evaluated (except DeepStream) on a wide range of temporal context datasets.

CONCLUSIONS
In this paper, we presented DeepStream, a stream clustering algorithm that uses deep learning to detect temporal contexts on unbounded temporal data streams. Among its strengths, we note that DeepStream (and particularly the AEs it relies on) does not require any labeled data for training. In addition, DeepStream leverages the capability of AEs to model complex nonlinear functions, compress a high-dimensional feature space into a low-dimensional latent space, and improve stream clustering performance on highdimensional IoT datasets. Our empirical evaluation showed that DeepStream performs better than state-of-the-art stream clustering algorithms, particularly on high-dimensional sensor and IoT temporal datasets.