Data-driven Kernel-based Probabilistic SAX for Time Series Dimensionality Reduction

The ever-increasing volume and complexity of time series data, emerging in various application domains, necessitate efficient dimensionality reduction for facilitating data mining tasks. Symbolic representations, among them symbolic aggregate approximation (SAX), have proven very effective in compacting the information content of time series while exploiting the wealth of search algorithms used in bioinformatics and text mining communities. However, typical SAX-based techniques rely on a Gaussian assumption for the underlying data statistics, which often deteriorates their performance in practical scenarios. To overcome this limitation, this work introduces a method that negates any assumption on the probability distribution of time series. Specifically, a data-driven kernel density estimator is first applied on the data, followed by Lloyd-Max quantization to determine the optimal horizontal segmentation breakpoints. Experimental evaluation on distinct datasets demonstrates the superiority of our method, in terms of reconstruction accuracy and tightness of lower bound, when compared against the conventional and a modified SAX method.


I. INTRODUCTION
Representing and interpreting complex time-varying phenomena is a challenging task in several application domains. Such issues become even more demanding in view of the large volumes of time series data, emerging thanks to the advances of computing technologies. Typical examples include healthcare data, microarray gene expression data in genetics, and large panel and e-commerce data in finance and marketing, just to name a few. Efficiently mining this data deluge necessitates the extraction of descriptive motifs in appropriate lower-dimensional spaces, which provide a meaningful, yet compact, representation of the original inherent information to be further employed for executing high-level tasks, such as event detection and classification [1]-[3], among others.
The family of symbolic aggregate approximation (SAX) methods [4] has a prominent role among the several existing motif discovery techniques. Due to its conceptual simplicity and computational tractability, SAX has been widely used in monitoring, processing and mining data of numerous sources, including physiological data [5], smart grids [6], building systems [7], and stock market [8].
Although SAX and its variants [9]-[13] can lead to high-precision results in the case of data characterized by Gaussian statistics, their performance may degrade dramatically in more general cases. Indeed, in practical scenarios, where the underlying probability distribution of a time series deviates significantly from a Gaussian, or when the distribution changes across time, the previous SAX-based techniques are not capable of adapting to the time-evolving statistics. As a result, their low-dimensional representation and motif interpretation power diminishes.
Motivated by the above limitations, this work aims at negating any assumptions regarding the probability distribution of a given time series by designing a map between the time series space and the space of symbols from a predefined alphabet, which adapts directly to the data statistics. To this end, our proposed method first applies a kernel density estimator (KDE) [14] on the time series to approximate accurately the underlying probability density function (pdf). Then, the output of the KDE is coupled with a Lloyd-Max quantizer [15] to estimate the optimal horizontal segmentation breakpoints, which are further used to define the map between the time series samples and the alphabet's symbols.
Overall, the contribution of this paper is an efficient data-driven SAX-like method, hereafter referred to as probabilistic SAX (pSAX), which: (i) achieves a more accurate low-dimensional representation of a time series, and (ii) enables the construction of a distance measure in the symbol space that is the closest to the Euclidean distance in the raw data space, when compared with its SAX-based alternatives.
The rest of the paper is organized as follows: Section II introduces briefly the conventional SAX and discusses the differences between our proposed methodology and prior studies. Section III describes in detail our proposed pSAX method, along with a new distance measure. Section IV evaluates the performance of our method on real datasets and compares its accuracy with the conventional SAX method and one well-established SAX-based counterpart. Finally, Section V summarizes the main outcomes of this work and gives directions for further extensions.

II. BACKGROUND AND RELATED WORK
This section describes briefly the conventional SAX method, and discusses how our proposed approach differs from prior SAX-based representations.

A. Symbolic approximations and lower bounding distance
In the following, $\mathcal{T}^N$ denotes the set of time series of length $N$, $\mathcal{Y}^M$ the set of real vectors of length $M$, and $\mathcal{C}^M_{\mathcal{A}}$ the set of all vectors of $M$ codewords belonging to an alphabet $\mathcal{A}$ of size $|\mathcal{A}| = \alpha$. Let $U = (u_1, u_2, \ldots, u_N) \in \mathcal{T}^N$ be a discrete time series of $N$ samples. Without loss of generality, $U$ is first Z-normalized to zero mean and unit variance. Then, the core of a symbolic aggregate approximation (SAX) [4] consists of two steps, coupled with the definition of an appropriate distance measure in the lower-dimensional space, which lower bounds the Euclidean distance in the original space.
The first step of SAX implements a piecewise aggregate approximation (PAA) $\mathcal{T}^N \to \mathcal{Y}^M$, which transforms a given time series $U \in \mathcal{T}^N$ into a vector $Y = (y_1, \ldots, y_M) \in \mathcal{Y}^M$, with $M < N$. For this, $U$ is divided into $M$ equal-size segments and the average value is calculated for each segment. In the second step, a discretization $\mathcal{Y}^M \to \mathcal{C}^M_{\mathcal{A}}$ is applied to $Y$, which maps the averages into a predefined set of symbols. More precisely, the Z-normalized time series is assumed to follow a standard Gaussian distribution. Under this assumption, the $M$ averages in $Y$ are quantized within $\alpha$ equiprobable intervals under the standard Gaussian pdf curve. Each quantization interval is bounded by two cutlines and is assigned a codeword from the alphabet $\mathcal{A}$ (ref. top plot in Fig. 1 for a visualization of the two-step process). The two-step transformation $\mathcal{T}^N \to \mathcal{Y}^M \to \mathcal{C}^M_{\mathcal{A}}$ produces the SAX representation of length $M$ from the alphabet $\mathcal{A}$.
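As a concrete sketch, the two SAX steps above can be written in a few lines of Python (the helper names are ours, for illustration; the simple reshape-based PAA assumes $N$ is divisible by $M$):

```python
import numpy as np
from scipy.stats import norm

def paa(u, M):
    """Piecewise aggregate approximation: mean over M equal-size segments."""
    N = len(u)
    return u.reshape(M, N // M).mean(axis=1)

def sax_symbols(y, alpha):
    """Map PAA averages to symbol indices 0..alpha-1 using the alpha-quantile
    cutlines of the standard Gaussian (equiprobable intervals)."""
    cutlines = norm.ppf(np.arange(1, alpha) / alpha)  # alpha-1 interior cutlines
    return np.searchsorted(cutlines, y)

# Z-normalize, then apply the two-step transform T^N -> Y^M -> C^M_A.
u = np.sin(np.linspace(0, 4 * np.pi, 64))
u = (u - u.mean()) / u.std()
y = paa(u, M=8)
c = sax_symbols(y, alpha=4)
```

Here each symbol index stands in for a codeword of the alphabet $\mathcal{A}$; in practice the indices are mapped to letters.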
Given two distinct time series $U, S \in \mathcal{T}^N$, their Euclidean distance is defined by
$$d(U, S) = \sqrt{\sum_{i=1}^{N} (u_i - s_i)^2}. \tag{1}$$
To guarantee that no false dismissals occur when performing high-level tasks, such as similarity search, it is desirable to define a distance measure in the lower-dimensional space $\mathcal{C}^M_{\mathcal{A}}$ that lower bounds the Euclidean distance in the original space $\mathcal{T}^N$ [16]. Let $C, Q \in \mathcal{C}^M_{\mathcal{A}}$ be the symbolic representations of the time series $U$ and $S$, respectively. Then, a distance measure in the quantized space of alphabet symbols, which lower bounds the Euclidean distance in the original time series space, is defined as follows:
$$\mathrm{MINDIST}(C, Q) = \sqrt{\tfrac{N}{M}} \sqrt{\sum_{i=1}^{M} \mathrm{dist}(c_i, q_i)^2}, \tag{2}$$
where $\mathrm{dist}(c_i, q_i)$ is the absolute difference of the two closest cutlines that respectively bound $c_i$ and $q_i$ (ref. Fig. 1 for an example). Furthermore, if $Y \in \mathcal{Y}^M$ is the PAA of $U$ and $Q \in \mathcal{C}^M_{\mathcal{A}}$ is the SAX representation of $S$, a tighter lower bounding distance measure is defined by
$$d_{\min}(Y, Q) = \sqrt{\tfrac{N}{M}} \sqrt{\sum_{i=1}^{M} \begin{cases} (\beta_{L_i} - y_i)^2, & \text{if } \beta_{L_i} > y_i \\ (\beta_{U_i} - y_i)^2, & \text{if } \beta_{U_i} < y_i \\ 0, & \text{otherwise,} \end{cases}} \tag{3}$$
where $\beta_{L_i}$ and $\beta_{U_i}$ are the lower and upper cutlines of codeword $q_i$. By combining (1) and (3), the tightness of lower bound (TLB) measure is defined as
$$\mathrm{TLB} = \frac{d_{\min}(Y, Q)}{d(U, S)}. \tag{4}$$
Note that conventional SAX-based techniques compute the optimal cutlines based on a Gaussian assumption for the data distribution. As such, the larger the deviation of the true distribution from a Gaussian, the greater the information loss, which severely decreases the accuracy of the symbolic representation. To alleviate this drawback, our method adapts directly to the data using a kernel density estimator (KDE). Regarding the computation of the optimal cutlines, the Lloyd-Max quantizer is employed, which minimizes the distortion based on a mean squared error (MSE) criterion.
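A minimal sketch of the symbol-space lower-bounding distance and the TLB measure (4), assuming standard Gaussian cutlines and integer symbol indices (function names are ours, for illustration):

```python
import numpy as np
from scipy.stats import norm

def mindist(c, q, cutlines, N, M):
    """Symbol-space distance that lower bounds the Euclidean distance:
    each pair contributes the gap between the closest cutlines bounding
    the two symbols; adjacent or equal symbols contribute zero."""
    c, q = np.asarray(c), np.asarray(q)
    lo, hi = np.minimum(c, q), np.maximum(c, q)
    gap = np.zeros(len(lo))
    mask = hi - lo > 1
    gap[mask] = cutlines[hi[mask] - 1] - cutlines[lo[mask]]
    return np.sqrt(N / M) * np.sqrt(np.sum(gap ** 2))

def tlb(lb_dist, u, s):
    """Tightness of lower bound (4): ratio of a lower-bounding distance
    to the true Euclidean distance (closer to 1 is better)."""
    return lb_dist / np.linalg.norm(np.asarray(u) - np.asarray(s))

# Interior cutlines of the standard Gaussian for alphabet size alpha = 4.
cutlines = norm.ppf(np.arange(1, 4) / 4)
```

Since adjacent symbols share a cutline, their closest-cutline gap is zero, which is what makes the measure a (possibly loose) lower bound.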

B. Relation to prior work
The method presented here takes advantage of the model-free, data-adaptive nature of KDEs, along with the optimality (in the MSE sense) of Lloyd-Max quantization, for the samples-to-symbols assignment. Previous works also perform a data-driven discretization of time series by employing Lloyd's algorithm [17], [18], self-organizing maps [19], or other clustering methods [20]. A modified version of SAX (a.k.a. aSAX) [9] employs the k-means algorithm for the discretization, which is, to the authors' knowledge, the only data-adaptive extension of SAX. While the present study is related to prior works in data-driven discretization, our methodology capitalizes on the fact that Lloyd-Max quantization partitions the data according to the estimated underlying probability distribution. In earlier studies, k-means and the other clustering methods partition the observed samples directly, overlooking the overall behavior of the data source. As a result, our method increases the symbolic representation's accuracy by better adapting to the probabilistic structure of time series data.

III. DATA-DRIVEN PROBABILISTIC SAX
In contrast to the conventional SAX-based approaches, our method exploits the efficiency of KDEs [14] to approximate accurately the true distribution by adapting directly to the data, without any prior probabilistic assumption. To further enhance the generalization capability of our method, the KDE-based step is coupled with a Lloyd-Max quantizer [15] to compute the optimal cutlines. Fig. 2 illustrates the overall pipeline of our method to be analyzed below, whilst the bottom plot in Fig. 1 visualizes the outcome of our proposed symbolic representation, as opposed to the conventional SAX (top plot).

A. Kernel density estimator
By definition, a kernel density estimator is the summation of a set of translated and dilated kernel functions centered at each observed sample, as follows:
$$\hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right), \tag{5}$$
where $x_i$, $i = 1, \ldots, N$, are the observed samples, $K(\cdot)$ is the kernel function, which controls the weight given to the neighboring samples of $x \in \mathcal{X}$, and $h$ is a smoothness parameter. In our implementation, we employ the Epanechnikov kernel [21]. This choice is motivated by (i) the asymptotic optimality of this kernel (it minimizes the asymptotic mean integrated squared error) and (ii) its compact support, which yields increased computational efficiency. In addition, the smoothness parameter was chosen according to Silverman's rule of thumb [22, Sec. 3], reduced by a factor.
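A compact sketch of this estimator, assuming the Gaussian-reference form of Silverman's rule (the paper shrinks the bandwidth by an unspecified factor, exposed here as a hypothetical `factor` parameter):

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) on [-1, 1], zero outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def kde(x, samples, h):
    """Kernel density estimate: average of kernels centered at the
    observed samples, scaled by the bandwidth h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k = epanechnikov((x[:, None] - samples[None, :]) / h)
    return k.sum(axis=1) / (len(samples) * h)

def silverman_bandwidth(samples, factor=1.0):
    """Gaussian-reference Silverman rule of thumb; `factor` stands in
    for the paper's unspecified reduction factor."""
    return factor * 1.06 * samples.std() * len(samples) ** (-0.2)
```

The compact support of the Epanechnikov kernel means each evaluation point only receives contributions from samples within one bandwidth, which is the source of the computational savings mentioned above.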

B. Lloyd-Max quantizer
The Lloyd-Max quantization algorithm (ref. Alg. 1) iteratively reduces the MSE between the computed codewords and the source by alternating between two necessary optimality conditions regarding the update of the boundaries (line 4) and the centroids (line 5). A maximum number of iterations is chosen as the stopping criterion, which is set equal to 100 in our implementation. Furthermore, in contrast to the conventional case, we initialize the quantizer using the k-means++ algorithm [23], yielding improved convergence to a local minimum, in terms of both MSE and convergence speed. Notice that a training phase is required for our pSAX method. Specifically, the KDE module for pdf estimation is trained first with a sufficient number of PAA segments. Then, the Lloyd-Max quantization intervals are calculated based on the estimated distribution. Nevertheless, this is done only once during initialization. As such, the running time of the training phase does not contribute to the running time of the subsequent dimensionality reduction process. Besides, our algorithm is trained efficiently using a highly reduced set of training sequences (on the order of 10 in this study). Table I shows the training times of pSAX and aSAX. As expected, the training of pSAX takes longer than its aSAX counterpart, due to the KDE step and the numerical integrations in the Lloyd-Max step.
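The boundary/centroid alternation can be sketched over a pdf sampled on a uniform grid (a simplification: the paper performs numerical integration and seeds with k-means++, here replaced by simple quantile seeding for brevity):

```python
import numpy as np

def lloyd_max(pdf, grid, alpha, n_iter=100):
    """Lloyd-Max quantizer over a pdf sampled on a uniform grid:
    alternate between (i) boundaries as midpoints of adjacent centroids
    and (ii) centroids as conditional means of the pdf in each cell."""
    cdf = np.cumsum(pdf)
    cdf = cdf / cdf[-1]
    # quantile seeding (stands in for the paper's k-means++ initialization)
    centroids = np.interp((np.arange(alpha) + 0.5) / alpha, cdf, grid)
    for _ in range(n_iter):
        bounds = (centroids[:-1] + centroids[1:]) / 2   # boundary condition
        cells = np.searchsorted(bounds, grid)
        # centroid condition: pdf-weighted mean of each quantization cell
        new = np.array([np.average(grid[cells == k], weights=pdf[cells == k])
                        if pdf[cells == k].sum() > 0 else centroids[k]
                        for k in range(alpha)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return (centroids[:-1] + centroids[1:]) / 2, centroids
```

In pSAX, `pdf` would be the KDE output evaluated on the grid; the returned boundaries play the role of the cutlines and the centroids become the codeword values.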

C. A novel distance measure
The Lloyd-Max quantizer provides arithmetic values for the codewords, apart from the cutlines. Formally, these values are the centroids of the bounded regions. This feature can be further exploited to define a new distance measure between two symbolic sequences $Q, C \in \mathcal{C}^M_{\mathcal{A}}$, as follows:
$$d_{lm}(Q, C) = \sqrt{\tfrac{N}{M}} \sqrt{\sum_{i=1}^{M} (\hat{q}_i - \hat{c}_i)^2}, \tag{6}$$
where $\hat{q}_i$ and $\hat{c}_i$ denote the centroids of the quantization regions associated with $q_i$ and $c_i$, respectively. Although this measure does not lower bound the Euclidean distance, it is the closest to the true Euclidean distance in the MSE sense, up to a distortion caused by the KDE and Lloyd-Max steps. Accordingly, a distance measure between a time series and a symbolic sequence can be derived by utilizing the computed centroids. Specifically, given $U \in \mathcal{T}^N$ and $C \in \mathcal{C}^M_{\mathcal{A}}$, their distance is defined by
$$d_e(U, C) = \sqrt{\frac{1}{N} \sum_{i=1}^{M} \; \sum_{j=(i-1)\frac{N}{M}+1}^{i\frac{N}{M}} (u_j - \hat{c}_i)^2}. \tag{7}$$
In the special case when $C$ is the symbolic representation of $U$, then $d_e$ is the root mean squared error (RMSE) between $U$ and its reconstruction from $C$. It is worth noting that the k-means algorithm used by aSAX computes the centroids, too. However, these centroids are computed by using the observed samples, and not by exploiting the underlying probability distribution of the data. Our KDE is trained with the same set of training samples but, as we will show in the following section, the estimated probability distribution can better approximate the true centroids of the whole dataset.
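The two centroid-based distances described above can be sketched as follows (function names are ours; the symbol-to-symbol measure assumes the same $\sqrt{N/M}$ scaling as the PAA-based distances, and the series-to-symbols measure assumes $N$ divisible by $M$):

```python
import numpy as np

def centroid_dist(q, c, centroids, N, M):
    """Distance between two symbol sequences via their codeword centroids
    (not a lower bound, but MSE-close to the Euclidean distance)."""
    diff = centroids[np.asarray(q)] - centroids[np.asarray(c)]
    return np.sqrt(N / M) * np.linalg.norm(diff)

def series_symbol_dist(u, c, centroids):
    """Distance between a raw series and a symbol sequence; when c
    encodes u, this equals the RMSE of the centroid reconstruction."""
    u = np.asarray(u, dtype=float)
    recon = np.repeat(centroids[np.asarray(c)], len(u) // len(c))
    return np.sqrt(np.mean((u - recon) ** 2))
```

Here `centroids` is the array of codeword values returned by the Lloyd-Max step, indexed by symbol.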

IV. EXPERIMENTAL EVALUATION
This section evaluates the performance of pSAX and compares it against aSAX and the conventional SAX, with respect to the achieved TLB (ref. (4)) and RMSE (ref. (7)) values. The methods are compared for a varying symbolic sequence length M ∈ {32, 48, 64, 80} (the lower this number, the higher the dimensionality reduction), alphabet size α ∈ {8, 16, 32, 64, 128}, and time series subsequence length N ∈ {480, 1920} (short and long). Four distinct datasets are employed (Koski ECG, Muscle Activation, Rittweger EOG, and Respiration), which are characterized by both structured and complex behaviors. For each dataset, the results are averaged over 8000 Monte Carlo iterations, each one corresponding to a randomly selected segment from the associated time series. The length of the segments N and the parameters M, α are chosen in compliance with the experimental sections in [4], [9]. In order to simulate streaming scenarios, the training samples are taken from the beginning of each dataset. We note that, although our method does not require a Z-normalization of the time series, we also Z-normalize the given data for a fair comparison with the other methods.
As a first experiment, we investigate the effect of the symbolic sequence length M on the performance of our method. In particular, Fig. 3 shows the average TLB and RMSE values for the Respiration dataset (characterized by dense spikes). As can be seen, pSAX achieves a tighter lower bound, yielding a more accurate lower-bounding distance (3) in the lower-dimensional space. Additionally, pSAX achieves a more accurate reconstruction (lower RMSE) against both SAX and aSAX. Furthermore, the superiority of pSAX is more prominent as M increases. Table II shows the average TLB and RMSE values for two additional datasets, namely, the Rittweger EOG (a waveform with varying frequency) and Koski ECG (characterized by a repetitive pattern). As can be seen, similar results are obtained, demonstrating the superiority of our proposed method. Next, we study the effect of the alphabet size α on the performance of pSAX. To this end, Fig. 4 shows the average TLB and RMSE values for the Muscle Activation dataset (a noisy periodic signal), as a function of α. The experiments show that pSAX achieves a tighter lower bound, along with a better reconstruction quality, when compared against SAX. The same holds against aSAX most of the time, except for a few datasets when the alphabet size is α ≤ 8. As expected, the larger the alphabet size, the better the performance of all three methods (i.e., higher TLB and lower RMSE). Similar results are shown in Table III for the Respiration and Muscle Activation datasets, demonstrating the efficiency of pSAX in adapting to distinct data generating processes. Note that, for the calculation of RMSE (7), the pSAX codewords are computed using the Lloyd-Max algorithm, whilst for aSAX and SAX they are computed using the k-means algorithm and line 5 of Algorithm 1, respectively.
Overall, we conclude that, under all the experimental parameter settings tested herein, pSAX achieves a tighter lower bound (i.e., closer to 1) and a smaller RMSE when compared with SAX. Moreover, in the vast majority of the settings, pSAX also outperforms aSAX, except for a few cases when the alphabet size is very small. Finally, note that, when SAX-based methods are used for data indexing purposes, the achieved speedup is generally nonlinear with respect to TLB [10]. Thus, we expect that our pSAX method will require a reduced number of disk accesses, when compared against aSAX and SAX, due to its higher TLB. However, the design of a pSAX-based indexing technique is left as a thorough future study.

V. CONCLUSIONS AND FUTURE WORK
This work proposes a new symbolic representation method, which generalizes previous SAX-based techniques by adapting directly to the underlying probability distribution of a given time series, without any prior model assumption for the data generating process. To this end, our pSAX method combines the power of KDEs for accurate pdf estimation from a restricted amount of training samples with the efficiency of Lloyd-Max quantization for optimizing the cutlines and associated codewords. Furthermore, we introduce a novel distance measure in the lower-dimensional space of symbolic sequences, which can be employed in data mining tasks. Most importantly, our method can be coupled with other variants of SAX in a straightforward way, to improve their performance in the case of non-Gaussian data.
Currently, time series and their statistics are considered in a static framework by our method. As a further generalization of pSAX, an online extension over sliding windows is under investigation for real-time applications. To this end, we are interested in tracking the evolution of data statistics across time, while also applying a dynamic quantization scheme, under execution time constraints. Finally, the efficacy of our proposed distance measure will also be evaluated in data indexing scenarios.