Nonstationary Discrete Convolution Kernel for Multimodal Process Monitoring

Data-driven process monitoring has benefited from the development and application of kernel transformations, especially when various types of nonlinearity exist in the data. However, when dealing with the multimodality behavior that is frequently observed in the process operations, the most widely used radial basis function (RBF) kernel has limitations in describing process data collected from multiple normal operating modes. In this article, we highlight this limitation via a synthesized example. In order to account for the multimodality behavior and improve the fault detection performance accordingly, we propose a novel nonstationary discrete convolution kernel, which derives from the convolution kernel structure, as an alternative to the RBF kernel. By assuming the training samples to be the support of the discrete convolution, this new kernel can properly address these training samples from different operating modes with diverse properties and, therefore, can improve the data description and fault detection performance. Its performance is compared with RBF kernels under a standard kernel principal component analysis framework and with other methods proposed for multimode process monitoring via numerical examples. Moreover, a benchmark data set collected from a pilot-scale multiphase flow facility is used to demonstrate the advantages of the new kernel when applied to an experimental data set.


I. INTRODUCTION
KERNEL transformation in multivariate statistical process monitoring (MSPM) has been popular due to its ability to handle nonlinearities existing in the process data and its compatibility with various dimension reduction algorithms, such as principal component analysis (PCA) [1], partial least squares (PLS) [2], and independent component analysis (ICA) [3]. The ability of MSPM to identify new operating modes, some of which may reflect faults in the process, has been improved substantially by adopting the aforementioned kernel-based approaches. However, nonlinearity in the process data may be caused by different process behaviors and can take various forms. For example, varying loading conditions or production demands can mean that a process runs in multiple different modes even during the course of typical, healthy operation. Data recorded from such a process will itself be multimodal in nature. It is important to be able to account for this multimodality so that anomalous behavior may be distinguished from normal operations accurately and robustly. Though a variety of kernel structures have been proposed and reviewed in kernel-based learning in general, the radial basis function (RBF) kernel has been the most widely used. While its advantages in one-class classification problems have been previously discussed [4], Li and Yang [5] also highlighted that a single selected RBF kernel is usually not the most effective for detecting various faults. Moreover, the separability of data could be even worse after kernel projection when an inappropriate kernel is used [6]. Recently, several efforts have been made to apply kernel-based approaches to multimodal process monitoring [7]-[12].
Though still using RBF kernels, these approaches adopt localization factors, such as just-in-time and nearest neighbors, to calculate a revised kernel matrix, indicating that on their own, RBF kernels may not be able to fully capture the covariance existing between samples due to their inadequacy in considering varying data behavior caused by different operating modes. It will be demonstrated later in this article that a single RBF kernel is not sufficient to describe the normal data set if the data are collected from multiple normal operating modes. Similar issues that exist in data explanation and modeling based on RBF kernels, such as spatially varying length scale and inhomogeneity of covariances, have been reported in the areas of geostatistics [13], terrain surface estimation [14], and natural language modeling [15].
The reason that stationary kernels, such as the RBF kernel, are insufficient is that the multimodality issue may lead to the covariance structure of process variables varying between operating modes. The nonstationary kernel is proposed to cope with this issue. Though a generic formulation of such nonstationary kernels based on convolution is available in [16], in practice, a parameter tuning step is necessary [14], [15]. Amari and Wu [17] proposed a revised kernel structure that included a data-dependent weighting function to the original RBF kernel. This structure has been applied to MSPM [18], [19]. However, it is also necessary to determine the data-dependent weighting function via a separate optimization step [6], [20]. Other kernel formulations have been explored for various aspects of the kernel-based approach. For example, spectral mixture kernels may improve extrapolation in the Gaussian process modeling [21]. Multiple kernel learning approaches that combine existing kernels are reported in [22].
This article proposes a novel formulation of nonstationary and data-dependent kernel functions that can account for the varying covariance structure caused by multimodality without extra parameterization. Since the convolution formulation can generate both the stationary RBF kernel and nonstationary kernels, we define the nonstationary discrete convolution (NSDC) kernel as the covariance function of the outputs of the nonlinear regression with a finite number of basis functions, yielding the convolution on discrete, finite support. By using the samples from normal operation as the centers of the basis functions, the NSDC kernel can improve the accuracy of kernel-based models in kernel PCA. In contrast to other nonstationary formulations, this new kernel does not introduce additional parameters other than the kernel width used in RBF kernels. Therefore, the NSDC kernel can improve fault detection performance without causing overfitting issues.
The rest of this article proceeds as follows. First, we briefly review the algorithm structure of kernel PCA and demonstrate the limitation of the RBF kernel in accounting for multimodal data in MSPM using kernel PCA via an illustrative example. To retain the advantages of the RBF kernel while also making better use of the multimodal training set, Section III proposes the NSDC kernel and discusses the selection of the associated kernel widths. Numerical simulations are used to compare the anomaly detection performance of the NSDC-kernel PCA with the RBF-kernel PCA and other approaches proposed for monitoring multimodal processes. Section IV also compares the performance of the NSDC kernel and the RBF kernel in the process monitoring under the kernel PCA framework using the PRONTO benchmark data set. This article ends with a discussion of the qualitative comparison with other methods and the implementation considerations of the NSDC kernel and the conclusions.

A. Mathematical Preliminary
The general structure of an MSPM algorithm based on kernel PCA is shown in Fig. 1. For simplicity, the measurement x is taken as a scalar input to the algorithm.
Kernel PCA first projects the measurements x to a higher dimensional nonlinear variable space. Instead of assuming functional structures with respect to x for these nonlinear variables, kernel functions are defined for x in order to obtain K, the covariance matrix of the unknown nonlinear variables, directly. For example, K is defined by the RBF kernel function for the $i$th and $j$th samples of x as in (1), where $l^2$ is the kernel width parameter. In practice, it is common to assume $l^2 = \delta\sigma_x^2$, where $\sigma_x^2$ is the sample variance of x and $\delta$ is a scaling factor.
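As a minimal sketch of how such a kernel matrix can be computed for scalar measurements, the snippet below assumes the common RBF form $\exp(-(x_i - x_j)^2 / (2 l^2))$ with $l^2 = \delta\sigma_x^2$. The factor of 2 in the denominator and the function name `rbf_kernel_matrix` are assumptions made for illustration and may differ from the exact definition in (1).

```python
import math
import statistics


def rbf_kernel_matrix(x, delta):
    """RBF kernel matrix for scalar samples, with l^2 = delta * sample variance.

    Assumed form: K[i][j] = exp(-(x_i - x_j)^2 / (2 * l^2)).
    """
    l2 = delta * statistics.variance(x)  # sample variance (n - 1 denominator)
    return [[math.exp(-(xi - xj) ** 2 / (2.0 * l2)) for xj in x] for xi in x]


# Illustrative samples: three close together and one far away.
x = [0.0, 1.0, 2.0, 8.0]
K = rbf_kernel_matrix(x, delta=1.0)
```

A smaller `delta` shrinks the kernel width, so the off-diagonal entries of `K` decay faster with the distance between samples.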
It has been proven that PCA can be applied to K and that feature extraction may be realized in the nonlinear variable space by solving an eigenvalue problem [24]. Assume that n samples of x are available for training; in this eigenvalue problem, $\alpha = \{\alpha_1, \ldots, \alpha_n\}$ and $k_i = \{K_{1,i}, K_{2,i}, \ldots, K_{n,i}\}$, i.e., $k_i$ collects the entries of the $i$th column of K. Therefore, representative features z are extracted from the nonlinear variable space using K in kernel PCA. Similar to PCA-based MSPM, these features are further divided into principal components (PCs) for monitoring systematic errors and residuals for model-data mismatch according to their eigenvalues. The PCs and residuals are used for calculating the monitoring statistics $T^2$ and $T_E^2$, respectively [25]. When implementing this MSPM algorithm, the monitoring model is trained offline using kernel PCA, and the control limits of the monitoring statistics are set. In online monitoring, the monitoring statistics of a new sample are calculated using the same kernel PCA model and are compared with their control limits for fault detection. A detailed description of the MSPM algorithm based on kernel PCA can be found in [26].
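The feature extraction step can be sketched with an eigendecomposition of the kernel matrix. The snippet below is an illustration under stated assumptions rather than the article's exact formulation: the kernel matrix is left uncentered, the eigenvectors are not rescaled by their eigenvalues, and the function name `kernel_pca_features` is hypothetical.

```python
import numpy as np


def kernel_pca_features(K, q):
    """Project training samples onto the q leading eigenvectors of K (uncentered)."""
    eigvals, eigvecs = np.linalg.eigh(K)       # ascending eigenvalues for symmetric K
    order = np.argsort(eigvals)[::-1][:q]      # indices of the q largest eigenvalues
    lam, alpha = eigvals[order], eigvecs[:, order]
    Z = K @ alpha                              # row i holds alpha^T k_i for sample i
    return lam, alpha, Z


# Small illustrative kernel matrix built from four scalar samples.
x = np.array([0.0, 1.0, 2.0, 8.0])
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)
lam, alpha, Z = kernel_pca_features(K, q=2)
```

For a new sample, the same projection would use the vector of kernel values between that sample and the training samples in place of a column of `K`.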

B. RBF Kernel Performance in Multimodal Data
The following bivariate model of x = [x 1 , x 2 ] with four operating modes is considered, and 100 samples are drawn randomly from each to formulate the training set, as shown in Fig. 2.
Mode 1: the noise terms are $e_{11} \sim N(0, 1)$ and $e_{12} \sim N(0, 9)$.
Mode 2: $x_1 = e_{21} + 8$, where $e_{21} \sim N(0, 2.25)$ and $e_{22} \sim N(0, 0.25)$.
Mode 4: $x_1 = e_{41} + 9$, where $e_{41} \sim N(0, 1)$ and $e_{42} \sim N(0, 0.25)$.
It may be observed from the trend plot in Fig. 2 and from the scatter plot in Fig. 3 that the variance of the data set from Mode 1 is larger than the variances of the other three sets. In practice, such differences in variance may exist due to nonlinearity in the process variables or measurement instruments. For example, a flow measurement might have higher measurement variability if air is entrained in the process fluid.
The limitation of RBF kernels will be demonstrated via kernel PCA models trained on this data set. The $T^2$ statistic defined after kernel PCA projection represents the Mahalanobis distance defined in the feature space [1]. After obtaining representative features z, the first r features in z that hold more than 99% of the overall variability in the eigenvalues are selected as PCs and are used for the $T^2$ calculation. In order to show the boundaries for anomaly detection, the 95% control limits of $T^2$ obtained after kernel PCA with different scaling factors δ are visualized alongside the original samples. When inspecting Fig. 4, sample A may be considered as belonging to Mode 1, and sample B may be considered an anomaly because it cannot be clearly associated with any existing mode. Sample B can only be detected when δ = 0.1, but the same model also identifies sample A as anomalous because it overfits the training data. Therefore, monitoring models obtained by kernel PCA with RBF kernels have limitations when applied to this data set.
It can be observed that large δ values result in overly relaxed detection boundaries that cannot detect the transitions and/or deviations from normal operating modes (δ = 1, 5); on the other hand, as δ decreases (δ = 0.5, 0.1), the number of PCs retained in the kernel PCA model increases rapidly and soon results in overfit models. In summary, the model accuracy for the multimodal data set may be limited if trained using a single RBF kernel.

III. NSDC KERNEL
The major notations used in this section are as follows:
1) $x = \{x_1, \ldots, x_m\}$: a vector with m measured process variables;
2) $C \in R^{m \times n}$: the data set from normal operations with n samples of x used for training the model;
3) $y \in R$: a variable defined in the nonlinear variable space;
4) $\phi(x)$: a function of x that is used as a basis function to reconstruct y;
5) $l^2$: kernel widths for RBF and NSDC kernels.

A. NSDC Kernel as a Covariance Function
Equation (7) gives a regression model using p basis functions, where $x \in R^m$ is the model input and $y \in R$ is the model output. The regression coefficients $w_i \sim N(0, \sigma_w^2)$ are i.i.d. Gaussian and correspond to the individual basis functions. The covariance of two new outputs, $y$ and $y^*$, can be calculated as a function of the input samples, namely, $x$ and $x^*$. This is known as the kernel function. When the RBF is adopted in (7), the covariance of $y$ and $y^*$ is given by (10), where $c^{(i)}$ is the center of the $i$th RBF.
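The step from the regression model to the kernel function can be sketched as follows, assuming the linear-in-the-parameters form with zero-mean i.i.d. coefficients described above. The notation is a reconstruction for illustration and may differ from the article's numbered equations.

```latex
\begin{align*}
y &= \sum_{i=1}^{p} w_i\,\phi_{c^{(i)}}(x), \qquad w_i \sim \mathcal{N}(0, \sigma_w^2)\ \text{i.i.d.},\\
\operatorname{cov}(y, y^*) &= \mathbb{E}\Big[\sum_{i=1}^{p}\sum_{j=1}^{p} w_i w_j\,\phi_{c^{(i)}}(x)\,\phi_{c^{(j)}}(x^*)\Big]
 = \sigma_w^2 \sum_{i=1}^{p} \phi_{c^{(i)}}(x)\,\phi_{c^{(i)}}(x^*).
\end{align*}
```

The cross terms vanish because $\mathbb{E}[w_i w_j]$ equals $\sigma_w^2$ when $i = j$ and zero otherwise, so the covariance depends only on the inputs $x$ and $x^*$ through the basis functions.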
When an infinite number of basis functions is considered, the covariance takes the form in (11), where $\sigma_w^2$ is selected as $\sigma_0^2 / p$ to prevent the covariance value from approaching infinity (see the Appendix). Equation (11) is the convolution of two Gaussian functions. This convolution formulation leads to a scaled and multivariate formulation of the RBF kernel function presented in (1). The covariance matrix K in kernel-based methods can be calculated accordingly.
Like the RBF kernel [see (11)], the NSDC kernel also derives from (10). Instead of having an infinite number of centers allocated from −∞ to ∞, the data dependence in the kernel function can be reflected by selecting only training samples as the centers of the basis functions. Assuming that P clusters of normal training samples obtained from P operating modes exist in $C \in R^{m \times n}$, i.e., $C = [C_1 \cdots C_P]$, the kernel function can be defined with C as its support as in (12), where $c^{(i)} \in C$. It is important to note that the number of basis functions is then equal to n, the number of samples from normal operating modes. The univariate and multivariate solutions to the discrete convolution structure in (12) will yield the new NSDC kernel function. For conciseness, we denote $k_{NSDC}$ as $k$ from now on.

B. Univariate Formulation
For simplicity, we first assume x to be univariate. By using RBFs in (12), the NSDC kernel can be derived in closed form, where $d = x - x^*$ is the distance between $x$ and $x^*$. This new kernel has a similar formulation to the RBF kernel. However, given $c_i \in C$, the weighting coefficient that multiplies the RBF term is proportional to the conditional likelihood of $(x + x^*)/2$ given the training set C, $P_{KDE}((x + x^*)/2 \mid C)$, using kernel density estimation. Therefore, the extra weighting coefficient makes this new kernel dependent on the training set C. In addition, this kernel is nonstationary as it is dependent not only on the distance d between two input samples but also on the locations of these samples.
Moreover, when considering the autocovariance of a single sample $x^*$, $d = 0$ and the mean $(x + x^*)/2$ reduces to $x^*$. The autocovariance of $x^*$ is, therefore, proportional to the conditional likelihood of $x^*$, given the training set C.
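The univariate structure described above can be checked numerically. The sketch below assumes Gaussian basis functions of the form $\exp(-(x - c)^2 / (2 l^2))$; under that assumption, the discrete sum of basis-function products factors into an RBF term in $d$ and a density-like weight evaluated at the midpoint $(x + x^*)/2$. The normalization constants and the function names are illustrative and may differ from the article's equations.

```python
import math


def nsdc_discrete(x, x_star, centers, l2, sigma_w2=1.0):
    """Discrete convolution: sum of basis-function products over the training samples."""
    return sigma_w2 * sum(
        math.exp(-(x - c) ** 2 / (2 * l2)) * math.exp(-(x_star - c) ** 2 / (2 * l2))
        for c in centers
    )


def nsdc_factored(x, x_star, centers, l2, sigma_w2=1.0):
    """Same kernel written as an RBF term in d times a weight at the midpoint."""
    d, mid = x - x_star, (x + x_star) / 2
    weight = sum(math.exp(-(mid - c) ** 2 / l2) for c in centers)
    return sigma_w2 * math.exp(-d ** 2 / (4 * l2)) * weight


centers = [0.0, 0.5, 8.0]                         # illustrative training samples
near = nsdc_factored(0.0, 0.5, centers, l2=1.0)   # pair located among training data
far = nsdc_factored(4.0, 4.5, centers, l2=1.0)    # same distance d, far from the data
```

Both pairs have the same distance $d$, yet `near` is much larger than `far` because the weight depends on where the pair sits relative to the training samples, which is the nonstationary behavior described in the text.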

C. Multivariate Extension
For the multidimensional case, x is taken to be an m-dimensional vector, and the NSDC kernel is extended accordingly. Similar to the univariate case, this revised kernel function is the product of the RBF kernel with respect to the distance $d = x - x^*$ and the likelihood of the mean $(x + x^*)/2$, given C, using kernel density estimation.

D. Specification of Kernel Width
Due to the multimodal nature of the training set, the ideal kernel width, which represents the rate of covariance of two projected samples decreasing with respect to the distance between the original samples, may vary because of: 1) different underlying mechanisms of each operating mode and 2) different variances of individual variables in the same operating mode. Therefore, it is necessary to specify the kernel widths properly.
1) Kernel Widths for Individual Variables: RBF kernels based on the Mahalanobis distance have been investigated as an approach for estimating the kernel width of variables and avoiding the optimization of kernel width parameters [27], [28]. Similarly, in the NSDC kernel, the kernel width can be estimated by the covariance matrix of process variables and a global scaling factor. By introducing the m × m covariance matrix $\Sigma = \mathrm{cov}(x)$, the basis function $\phi_{c^{(i)}}(x)$ is revised, and the NSDC kernel is derived accordingly. The covariance matrix $\Sigma$ can be estimated using the sample covariance of the training set. The scaling factor δ is the only parameter to be specified.
2) Kernel Widths for Operating Modes: Kernels based on the Mahalanobis distance have been widely studied and applied to different areas in the literature. However, multimodal behavior in the process data is less studied because the training clusters are not represented explicitly in the kernel function. When a priori information about data clusters with respect to operating modes and transition periods is available, it is possible to assign an individual covariance matrix $\Sigma_p$ to the $p$th cluster $C_p$ with $n_p$ samples in order to represent different operating modes. The basis function for the $p$th cluster is defined with centers $c_p^{(i)} \in C_p$, and the kernel function can be constructed accordingly, where $\Sigma_p = \mathrm{cov}(x)$ for $x \in C_p$ can be estimated by the sample covariance of the $p$th data cluster and is scaled by the scaling factor δ. In particular, when the data clustering information is not available, the NSDC kernel can still be implemented by assuming P = 1, yielding (17).
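A minimal sketch of the cluster-wise construction is given below. It evaluates the kernel directly as a discrete sum over the training samples and assumes Gaussian basis functions of the form $\exp(-\tfrac{1}{2}(x - c)^T (\delta\Sigma_p)^{-1} (x - c))$; this normalization, the function name `nsdc_kernel`, and the synthetic clusters are assumptions for illustration, not the article's closed-form expression.

```python
import numpy as np


def nsdc_kernel(x, x_star, clusters, delta, sigma_w2=1.0):
    """NSDC kernel as a discrete sum over training samples, one covariance per cluster.

    clusters: list of arrays of shape (n_p, m); each row is one training sample.
    """
    total = 0.0
    for C_p in clusters:
        # Scaled sample covariance of the p-th cluster acts as its kernel width.
        S_inv = np.linalg.inv(delta * np.atleast_2d(np.cov(C_p, rowvar=False)))
        dx, dxs = x - C_p, x_star - C_p                     # offsets to every center
        mx = np.einsum("ij,jk,ik->i", dx, S_inv, dx)        # squared Mahalanobis distances
        mxs = np.einsum("ij,jk,ik->i", dxs, S_inv, dxs)
        total += np.sum(np.exp(-0.5 * mx) * np.exp(-0.5 * mxs))
    return sigma_w2 * total


# Two synthetic operating modes with different covariance structures.
rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0], [1.0, 3.0], size=(50, 2))
B = rng.normal([8.0, 8.0], [1.5, 0.5], size=(50, 2))
k_clustered = nsdc_kernel(np.zeros(2), np.zeros(2), [A, B], delta=1.0)
k_single = nsdc_kernel(np.zeros(2), np.zeros(2), [np.vstack([A, B])], delta=1.0)  # P = 1
```

Passing a single pooled cluster corresponds to the P = 1 case, in which one covariance matrix is shared by all training samples.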
To summarize, the new NSDC kernel adopts the sample covariance matrices of each data cluster in its formulation. In this formulation, the scaling factor δ regulates the overall behavior of the NSDC kernel, while the nonstationary covariance structure is captured by the varying sample covariances. Consequently, compared to the RBF kernel, the NSDC kernel can handle the nonstationary behavior caused by multiple operating modes without introducing additional parameters.

E. Monitoring Statistics
Under the kernel PCA framework presented in Section II-A, we apply PCA to the kernel matrix K obtained by the NSDC kernel for feature extraction. Assume that the training data set is $[x^{(1)}, x^{(2)}, \ldots, x^{(n)}] \in R^{m \times n}$; q features, namely, $z^{(1)}, z^{(2)}, \ldots, z^{(q)}$, are obtained by applying PCA to its kernel matrix K such that $K_{ij} = k(x^{(i)}, x^{(j)})$. Following the algorithm structure in Fig. 1, the sum of squares of Mahalanobis distances in the PC space and the sum of squares of residuals are used for quantifying the systematic error and model-data mismatch. Equations (20) and (21) define these two statistics with respect to the features, where $z_R = \{z^{(1)}, z^{(2)}, \ldots, z^{(r)}\}$ are the first r PCs that explain the majority of variability in the feature space and $z_E = \{z^{(r+1)}, z^{(r+2)}, \ldots, z^{(q)}\}$ are considered as the residual vectors with minimal variability; $D_R$ is an r × r diagonal matrix with the first r eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$ corresponding to $z_R$ in descending order. By introducing $T^2$ and $T_E^2$, both systematic and model-data mismatch faults can be detected. To ensure that potential model-data mismatch behaviors are captured by $T_E^2$, q should be sufficiently large. For fault detection, the lower control limit of $T^2$ and the upper control limit of $T_E^2$ with a certain confidence level are defined by applying kernel density estimation to the $T^2$ and $T_E^2$ values on the training set. The reason for using the lower control limit of $T^2$ is that, due to the multimodality in the data, it is not appropriate to center the data using their mean. Therefore, when the data set and the kernel matrix are not centered, the zero-mean assumption of the PCs is no longer valid. In summary, a sample is detected as faulty when its monitoring statistics violate these control limits, as specified in (22).
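The two statistics and the detection rule can be sketched as follows. The forms $T^2 = z_R^T D_R^{-1} z_R$ and $T_E^2 = z_E^T z_E$ follow the verbal description above, while combining the two limit checks with a logical "or" is an assumption, as are the function names and the numerical values.

```python
def monitoring_statistics(z, lam, r):
    """T^2 from the first r features scaled by their eigenvalues; T_E^2 from the rest."""
    t2 = sum(zi ** 2 / li for zi, li in zip(z[:r], lam[:r]))  # z_R^T D_R^{-1} z_R
    t2_e = sum(zi ** 2 for zi in z[r:])                       # z_E^T z_E
    return t2, t2_e


def is_faulty(t2, t2_e, t2_lower, t2_e_upper):
    """Flag a sample whose T^2 falls below its lower limit or whose T_E^2 exceeds its upper limit."""
    return t2 < t2_lower or t2_e > t2_e_upper


# Illustrative feature vector with q = 3 features and r = 2 PCs.
t2, t2_e = monitoring_statistics(z=[2.0, 1.0, 0.5], lam=[4.0, 1.0, 0.25], r=2)
flag = is_faulty(t2, t2_e, t2_lower=1.0, t2_e_upper=0.5)
```

In practice, the limits `t2_lower` and `t2_e_upper` would be estimated from the training-set statistics, for example by kernel density estimation as described above.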

F. Parameter Tuning
It has been shown in Section II-B that selections of the scaling factor δ will result in MSPM models varying significantly. Recent literature on kernel-based MSPM still adopts various empirical values of kernel width [29], [30] or empirical equations [31]. Reference [1] pointed out that the parameter tuning of KPCA may lead to improved model robustness, whereas model sensitivity decreases. In this work, we will determine the scaling factor for RBF and NSDC kernels by cross validation using the data set collected from normal operation; however, instead of minimizing the alarm rate on the cross-validation set, this strategy aims at a balance between the sensitivity and the robustness of monitoring models.
We propose to select the parameter that maintains a reasonable alarm level on the cross-validation set. For example, 95% control limits of T 2 statistics, which assumes 5% of the data in the training set should indicate an anomaly, might be selected as the control limit. Given that the cross-validation set is comprised of random samples from the same data as the training set, ideally, 5% of the cross-validation data should also be indicated as anomalous. If more than 5% of data from the cross-validation set are indicated to be anomalous, this would indicate that the parameters of the model are inappropriate such that, in practice, the model would be prone to false alarms, e.g., due to overfitting of the model. On the other hand, if less than 5% of data from the cross-validation set are indicated as anomalies, this would indicate that the parameters of the model are such that the model would be prone to missed alarms, e.g., due to underfitting. To further increase the confidence in the obtained anomaly detection rates, results may be averaged over multiple Monte Carlo simulations. By repeating this analysis for multiple selections of the scaling factor δ, it is possible to identify the δ that optimally describes the training data.
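The selection rule described above can be sketched as a simple search over candidate scaling factors. The snippet assumes that the alarm rate on the cross-validation set has already been computed for each candidate δ and each Monte Carlo repetition; the function name `select_scaling_factor` and the numbers are hypothetical.

```python
def select_scaling_factor(alarm_rates, target=0.05):
    """Pick the delta whose mean cross-validation alarm rate is closest to the target.

    alarm_rates: dict mapping each candidate delta to a list of alarm rates,
    one per Monte Carlo repetition.
    """
    def gap(delta):
        rates = alarm_rates[delta]
        return abs(sum(rates) / len(rates) - target)

    return min(alarm_rates, key=gap)


# Hypothetical alarm rates: a small delta overfits, a large delta underfits.
rates = {0.1: [0.30, 0.28], 1.0: [0.06, 0.05], 5.0: [0.01, 0.02]}
best_delta = select_scaling_factor(rates, target=0.05)
```

With a 95% control limit, the target alarm rate is 5%, so the candidate whose averaged alarm rate is nearest to that level is retained.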

G. NSDC-Kernel PCA in Process Monitoring
The flowchart in Fig. 5 summarizes the procedure of using the NSDC-kernel PCA for process monitoring. In comparison with Fig. 1, the NSDC-kernel PCA model is trained using the clustered multimode training data, and the kernel width δ is tuned using the strategy introduced previously. When deployed for online monitoring, the NSDC-kernel PCA model can be used without additional clustering or tuning.

A. Numerical Simulation
The data are both generated and used in batch for model training in this case study. The same data set generated in Section II-B is used for the performance comparison of the NSDC kernel and the RBF kernel. The first r PCs with 99% accumulated variability are chosen as $z_R$, i.e., r is selected according to (23) such that the first r eigenvalues account for 99% of the sum of all q eigenvalues, where $\lambda_i$ is the $i$th element of the q eigenvalues corresponding to z in descending order. The 95% control limits of NSDC-kernel PCA are visualized in Fig. 6. These control limits identify sample A as a normal sample and detect sample B as an anomaly. By comparing Figs. 4 and 6, we can conclude that NSDC-kernel PCA gives better descriptions of the multimodal training data set and will significantly improve the anomaly detection performance. Fig. 7 shows the monitoring contours generated by the NSDC kernel for two further examples. In these examples, the data are not clustered in advance. By setting the cluster number P = 1, one can implement the NSDC kernel defined by (17) and obtain proper monitoring contours. The contours also demonstrate that the NSDC kernel can cope with other types of nonlinearity without considering the varying covariance structures of each data cluster.
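The choice of r by accumulated variability can be sketched as follows, assuming that r is taken as the smallest number of leading eigenvalues whose cumulative share reaches the threshold. The function name `select_num_pcs` and the eigenvalues are illustrative.

```python
def select_num_pcs(eigenvalues, threshold=0.99):
    """Smallest r whose leading eigenvalues reach the threshold share of the total."""
    lam = sorted(eigenvalues, reverse=True)   # descending order
    total, running = sum(lam), 0.0
    for r, value in enumerate(lam, start=1):
        running += value
        if running / total >= threshold:
            return r
    return len(lam)


# Cumulative shares are 0.6, 0.9, and 0.995, so three PCs are retained.
r = select_num_pcs([6.0, 3.0, 0.95, 0.05])
```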
These results indicate that the NSDC kernel will yield a kernel PCA model that generates a better control limit than the RBF kernel for anomaly detection of multimodal data. The NSDC kernel also suffers less from the issues associated with overfitting or underfitting. Even when there is no data clustering information available, the performance of the NSDC kernel will not be significantly compromised. It also indicates that the NSDC kernel can handle other types of data nonlinearity in addition to the multiple operating modes.

1) Process Description:
We compare the fault detection performance of the RBF-kernel PCA and the NSDC-kernel PCA using the PRONTO benchmark data set that is established for algorithm design and validation for process monitoring [32]. This data set was collected sequentially during an experiment on the multiphase flow facility located at Cranfield University. Being a fully automated industrial-scale pilot plant, this facility implements mixing, transportation, and separation of multiphase flows, such as oil, water, and air. Fig. 8 presents the layout of this facility. One may refer to [32] for further details about the facility and the benchmark case study. Table I summarizes the process variables used in this test. The multimodal behavior is realized by specifying inlet water and airflow rates according to Table II. A high-density plot [33] shown in Fig. 9 visualizes the normalized time trends of process variables measured in normal operating modes.
As shown in Table III, three faults have been seeded individually by manually opening (for air leakage and diverted flow) or closing (for air blockage) corresponding valves in both operating Modes A and B. Starting from the normal operating condition, the valve opening is changed gradually in order to simulate the development of incipient faults in real-life process operations. As an example, Fig. 10 presents the high-density plot of process variables when Fault 2 was seeded in operating mode B.
2) Results:
The case study in this work uses both normal and faulty data from the experiment. Normal data from Modes A and B are randomly partitioned into training, cross validation, and test sets with an equivalent amount of samples. The training and cross-validation sets are used to train the monitoring model. The monitoring model is then applied to the test data set and the faulty data for fault detection. The data are used in batch to train the monitoring model and detect the faults. Since the temporal correlation is not considered in the kernel-based approaches, fault detection can also be conducted using sequential data.
The same strategy as in (23) is adopted for grouping the features obtained by kernel PCA into PCs and residuals. The confidence level of control limits for $T^2$ and $T_E^2$ is set as 95%. To evaluate the performance, the false alarm rate (FAR) defined in (24) on the cross-validation set is compared over the scaling factor δ values in Fig. 11
$$\mathrm{FAR} = \frac{n_{FA}}{n_{norm}} \tag{24}$$
where an anomaly is detected if the monitoring statistics of a test sample fulfill (22); otherwise, the test sample is labeled as normal. $n_{FA}$ denotes the number of normal samples being detected as anomalies, and $n_{norm}$ denotes the number of all normal samples. When applied to monitoring, δ is set to be 60 for the RBF kernel and 100 for the NSDC kernel according to the strategy proposed in Section III-F. Fig. 11 shows that reduced δ values will lead to larger FARs on cross-validation sets, of which the samples are supposed to be normal, indicating that the model is overfit. On the other hand, since the confidence level of control limits is selected as 95%, FARs below 5% imply that the monitoring model might have a higher missed detection rate (MDR) and, hence, be less sensitive to faults when the δ value is large.
Fig. 11. Alarm rates on cross-validation sets of multiphase flow data.
Monitoring statistics of the data set in normal operations obtained by linear PCA, RBF-kernel PCA, and NSDC-kernel PCA are shown in Figs. 12-14. For clarity of visualization, the $T^2$ statistics are plotted on a logarithmic scale in Figs. 13 and 14. In Fig. 12, the $T^2$ statistic obtained by the linear PCA has a large increase, and the $T_E^2$ statistic has a larger variance when the process switched from Mode A to Mode B, while the statistics obtained by the RBF-kernel PCA and the NSDC-kernel PCA do not. Therefore, it is clear that the influence of multimodality in the training set can be reduced by applying kernel approaches. Moreover, as shown in Fig. 13, the $T_E^2$ statistic obtained by the RBF-kernel PCA has a peak around sample 200 that lasts for approximately 30 samples. This may cause the detection threshold of $T_E^2$ to be overly relaxed.
The second column in Table IV compares the FARs of the NSDC and the RBF kernels on the test sets that comprise normal samples that are not used for model training and parameter tuning. Since the confidence level is set to be 95%, the tuning strategy in Section III-F ensures that both the RBF and the NSDC kernels have an FAR that is close to 5% on the test set.
The fault detection time (DT), defined by (25), provides a measure of the sensitivity of the monitoring model in this example. Since the fault severity increased gradually in the experiment (e.g., the valve opening sequence in Fig. 15) and the variation in the process variables may not be visible in the early stage (see Fig. 10), it is difficult to define the faulty period clearly. Therefore, the DT is used instead of the MDR. Due to the persistent existence of the fault, we define that the fault detection occurs when a consecutive sequence of 20 samples exceeds the control limits. This reduces the influence of noise in the process measurements. Table IV shows that the process monitoring model based on the NSDC kernels is capable of detecting the faults earlier when air blockage (F2) and diverted flow (F3) occur. Fig. 15 visualizes the sequence of valve opening adopted for seeding Fault 2 in operating mode B. In Figs. 16 and 17, the monitoring performance in mode B is compared against this sequence. In particular, the $T_E^2$ statistic can give an earlier detection when the blockage is less severe. This indicates that at the early stage of an incipient fault, when the deviation in the PCs is not significant, small model-data mismatches in the process measurements caused by the fault can be captured by the monitoring model using the NSDC kernel. As a result, the incipient fault can be detected and dealt with before severe performance degradation occurs in the process.
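The persistence rule for declaring a detection can be sketched as a scan for the first run of 20 consecutive alarms. Whether the detection time is counted from the first or the last sample of that run is not recoverable from the text; the sketch below returns the index of the first sample of the run, which is an assumption, as is the function name `detection_index`.

```python
def detection_index(alarms, run_length=20):
    """Index of the first sample of the first run of `run_length` consecutive alarms.

    alarms: sequence of booleans, True where a monitoring statistic violates its limit.
    Returns None if no such run exists.
    """
    streak = 0
    for i, alarm in enumerate(alarms):
        streak = streak + 1 if alarm else 0
        if streak == run_length:
            return i - run_length + 1
    return None


# A short burst of alarms is ignored; the persistent run starting at index 10 is detected.
alarms = [False] * 5 + [True] * 3 + [False] * 2 + [True] * 25
start = detection_index(alarms)
```

Requiring a persistent run means that isolated limit violations caused by measurement noise do not trigger a detection.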

C. Comparison With Other Methods
Comparisons with other methods proposed for multimode process monitoring are challenging because it is difficult to implement and tune each method in a rigorous manner that ensures a fair comparison. In order to provide a fair evaluation of the performance of the NSDC kernel relative to the existing methods, the method is directly compared with the results reported in recently published approaches that adapt kernel-based methods for multimode process monitoring [10]-[12]. The simulated data sets described in each of these papers form the basis of the comparison. The FAR previously defined in (24) and the MDR defined in (26) are used to evaluate the anomaly detection performance
$$\mathrm{MDR} = \frac{n_{MD}}{n_{anom}} \tag{26}$$
where $n_{MD}$ denotes the number of anomalous samples not being detected as anomalies and $n_{anom}$ denotes the total number of anomalous samples. The MDR is the rate of missed detections in a test set that includes anomalous samples. Table V presents the fault detection performance. The kernel widths of the RBF and the NSDC kernels and the control limits of $T^2$ and $T_E^2$ are tuned using the strategy proposed in Section III-F according to the confidence levels used in [10]-[12]. The confidence levels of monitoring statistics are used as the expected FARs for both the cross validation and the test sets. Therefore, the FAR obtained by NSDC and RBF [12] for the respective multimode process monitoring methods described in each article. It can be seen that the NSDC-kernel PCA approach achieves lower MDRs than both the RBF-kernel PCA and the methods presented in [10]-[12]. For the second test case, the NSDC also achieves a smaller FAR because of the artificial outliers in the training data [11].
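The two rates in (24) and (26) can be computed directly from per-sample detection flags, as in the minimal sketch below. The function names and the example flags are illustrative.

```python
def false_alarm_rate(flags_on_normal):
    """FAR = n_FA / n_norm: share of normal samples flagged as anomalies."""
    return sum(bool(f) for f in flags_on_normal) / len(flags_on_normal)


def missed_detection_rate(flags_on_anomalous):
    """MDR = n_MD / n_anom: share of anomalous samples that are not flagged."""
    return sum(not f for f in flags_on_anomalous) / len(flags_on_anomalous)


# Detection flags (True = flagged as an anomaly) for illustrative test samples.
far = false_alarm_rate([True, False, False, False])       # one false alarm in four
mdr = missed_detection_rate([True, True, False, True])    # one miss in four
```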

A. Comparison With the RBF Kernel and Other Methods
Table VI compares the new NSDC kernel with the RBF kernel. The main advantage of the NSDC kernel over the RBF kernel is that due to the new assumptions in the convolution kernel formulation, the NSDC kernel is nonstationary and data-dependent. Hence, the NSDC kernel can handle the nonstationary covariance caused by multimodality.
Recent works on multivariate approaches for multimodal process monitoring deal with the multimodality in an ad hoc way, including the locally weighted approach [11], [30], [34] and using the local statistics matrix [10] or the residuals obtained by kernel regression [12] instead of original measurements. For those methods using the data-dependent kernels, their parameters need to be optimized properly [20]. In this article, we propose a new and systematic way of formulating the nonstationary kernel function via convolution and derive the closed-form solution, i.e., the NSDC kernel. Since the NSDC kernel only requires one parameter, i.e., the kernel width to be tuned, it also avoids additional parameterization of the kernel function and makes the training and online monitoring procedure easier.

B. Implementation Considerations
One can use the NSDC kernel for data-driven monitoring of processes with multiple operating modes, especially when the variations between modes are significant and there exist other types of nonlinearity in the processes. As shown in Fig. 5, the multimode training data need to be clustered when training the NSDC-kernel PCA model. This can be achieved either by incorporating the prior information of normal operating modes or by applying unsupervised clustering approaches, such as the k-means or the Dirichlet Process. Nevertheless, these clusters do not need to be precise if all data are from the normal operating conditions when applying the method for fault detection because the NSDC kernel can deal with other types of nonlinearity, as shown in Fig. 6. Moreover, the NSDC kernel can still be used when the clustering information is unavailable by setting the cluster number to 1.
To further optimize the scaling factor δ, an optimization problem can be formulated based on the cross-validation approach proposed in Section III-F. On the other hand, the scaling factor in NSDC may also allow for domain knowledge that is usually available in practice. For instance, if one normal operating mode is considered to be critical such that any violation or unobserved behavior must be quickly identified as a fault at the cost of an increased number of false alarms, then a smaller scaling factor can be selected for this mode in order to improve the sensitivity in its neighborhood. Conversely, NSDC kernels can also be adjusted such that a variable with lower measurement reliability is downweighted by increasing the scaling factor along its direction in all operating modes.

VI. CONCLUSION
This article has presented the NSDC kernel, a novel nonstationary, data-dependent kernel function that is better suited for multimodal process monitoring. This NSDC kernel was defined as a covariance function by the discrete convolution on the normal data set only. The parameter specification of this NSDC kernel was also discussed. When compared to the RBF kernel under the kernel PCA framework, the NSDC kernel can yield a better monitoring model, which is robust to overfitting issues and more sensitive in fault detection. This approach directly benefits process monitoring by reducing false and missed alarm rates, as demonstrated in the industrial case study. Moreover, incipient faults seeded during operation of the industrial-scale multiphase flow facility were detected earlier using monitoring models based on the NSDC kernel. Being a data-dependent kernel that can account for process data from multiple operating modes, the NSDC kernel has only one parameter to be tuned, making it easier to apply to data-driven process monitoring. The results in this article also suggest that the use of the NSDC kernel may not be limited to multimodal process data. In general pattern recognition problems, if the training set is discrete with obvious multiple operating modes or the covariance structure is nonstationary, the NSDC kernel can be combined with unsupervised clustering approaches in order to address the aforementioned issues.