Quantum Expectation-Maximization Algorithm

Clustering algorithms are a cornerstone of machine learning applications. Recently, a quantum algorithm for clustering based on the k-means algorithm was proposed by Kerenidis, Landman, Luongo, and Prakash. Building on their work, we propose a quantum expectation-maximization (EM) algorithm for Gaussian mixture models (GMMs). We demonstrate the robustness and quantum speedup of the algorithm, and show numerically the advantage of the GMM over k-means for non-trivial cluster data.


I. INTRODUCTION
Quantum computing has attracted much attention since the discovery of Shor's algorithm [1,2]. Recently, with the rapid developments in machine learning, physicists have started to consider utilizing quantum computers for machine learning applications [3][4][5][6][7][8]. As a result, quantum machine learning has emerged as an interdisciplinary field between quantum computing and machine learning. Furthermore, a quantum algorithm for the k-means algorithm [9,10] with a proven quantum speedup was proposed [11].
The k-means algorithm is an essential tool in many machine learning applications [9,10]. However, the k-means algorithm is a special case of the more general Gaussian mixture model (GMM): in the k-means algorithm, each Gaussian has the same weight, and the covariance matrix of each Gaussian function is the identity. As a result, the k-means algorithm may provide poor estimates of the clusters, since its assumptions are sometimes too strong to capture all properties of complex data sets. The expectation-maximization (EM) algorithm [9,10,12] and variational Bayes (VB) inference [9,10] with the GMM are often used to improve the clustering, since the general GMM can deal with a wider class of data sets. Recently, one of the authors proposed quantum-inspired algorithms for the EM algorithm [13][14][15] and VB [16]. In Refs. [15,16], we succeeded in improving the performance of the EM algorithm and VB. However, the aim of Refs. [15,16] is to make use of quantum fluctuations as a numerical tool, not to provide a quantum speedup over a classical algorithm; as a result, the computational costs are almost the same.
In this paper, we propose a quantum algorithm to estimate the parameters of the GMM, which we call the quantum EM (q-EM) algorithm. To this end, following the spirit of Ref. [11], we first introduce a randomized variant of the EM algorithm, i.e., the δ-EM algorithm, which includes non-deterministic readout of the data. Then, we formulate a quantum algorithm that realizes a speedup of the δ-EM algorithm with respect to the number of data points. The q-EM algorithm may be an important step toward quantum machine learning, since the EM algorithm is an essential algorithm in machine learning.
This paper is organized as follows. In Sec. II, we provide classical preliminaries; in particular, we review the EM algorithm and then introduce the δ-EM algorithm. In Sec. III, we present the detailed procedure of the q-EM algorithm. Then, in Sec. IV, we show its computational cost. In Sec. V, we show numerical simulations of the δ-EM algorithm to confirm that our starting point is valid. In Sec. VI, we discuss the relationship between the EM algorithm and the k-means algorithm. Finally, Sec. VII concludes this paper.
Independently of our work, Iordanis Kerenidis, Alessandro Luongo, and Anupam Prakash proposed an extension of the q-means algorithm to Gaussian mixture models using soft clustering, similar to this work [17].

II. CLASSICAL PRELIMINARIES
In this section, we first review the EM algorithm [9,10] in detail. We then introduce a randomized variant of the EM algorithm, which we call the δ-EM algorithm. We introduce the δ-EM algorithm because we need a variant of the EM algorithm that is robust against noise resembling the non-deterministic quantum measurement of a superposed state.

A. EM algorithm
The EM algorithm is a generic approach to estimating parameters of probability distributions based on maximum likelihood estimation. For simplicity, we focus on the GMM and review the EM algorithm for the GMM. Let us consider a d-dimensional feature space and assume that we have N data points {y_i}_{i=1}^N. The GMM for x ∈ R^d is given by

  p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x | \mu_k, \Sigma_k),  (1)

where \mathcal{N}(x | \mu_k, \Sigma_k) is the d-dimensional Gaussian function with mean \mu_k and covariance \Sigma_k, and \sum_k \pi_k = 1. To simplify the notation, we define \theta := {\pi_k, \mu_k, \Sigma_k}_{k=1}^K. The EM algorithm, which estimates \theta, consists of the following two steps, which are iterated until convergence. The first step, called the E step, is to compute the responsibilities of cluster k for each data point y_i:

  r_{i,k}^t = \frac{\pi_k^t \mathcal{N}(y_i | \mu_k^t, \Sigma_k^t)}{\sum_{j=1}^K \pi_j^t \mathcal{N}(y_i | \mu_j^t, \Sigma_j^t)}.  (2)

The second step, called the M step, is to compute \theta^{t+1} by using the responsibilities, Eq. (2):

  \pi_k^{t+1} = \frac{1}{N} \sum_{i=1}^N r_{i,k}^t,  (3)

  \mu_k^{t+1} = \frac{\sum_{i=1}^N r_{i,k}^t y_i}{\sum_{i=1}^N r_{i,k}^t},  (4)

  \Sigma_k^{t+1} = \frac{\sum_{i=1}^N r_{i,k}^t (y_i - \mu_k^{t+1})(y_i - \mu_k^{t+1})^T}{\sum_{i=1}^N r_{i,k}^t}.  (5)

We iterate the E and M steps, substituting Eqs. (3), (4), and (5), until convergence. Note that the first iteration can begin with either the E step or the M step. The EM algorithm is summarized in Algo. 1. Note that the procedure of the EM algorithm can be generalized to other mixture models [9,10].

ALGORITHM 1: EM algorithm.
1: t = 0
2: while convergence criterion is not satisfied do
3:   compute the responsibilities r_{i,k}^t, Eq. (2)
4:   update \theta^{t+1} via Eqs. (3), (4), and (5)
5:   t ← t + 1
6: end while
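As a concrete reference, the E and M steps above can be sketched classically as follows. This is a minimal NumPy implementation of Eqs. (2)-(5); the function names and the random-data-point initialization are our choices, not specified in the text:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """d-dimensional Gaussian density N(x | mu, sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def em_gmm(Y, K, n_iter=100, seed=0):
    """Classical EM for a K-component GMM on data Y of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    pi = np.full(K, 1.0 / K)
    mu = Y[rng.choice(N, K, replace=False)]   # random data points as initial means
    sigma = np.array([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities r[i, k] of cluster k for point y_i, Eq. (2)
        r = np.array([[pi[k] * gaussian_pdf(y, mu[k], sigma[k]) for k in range(K)]
                      for y in Y])
        r /= r.sum(axis=1, keepdims=True)
        # M step: update theta = {pi_k, mu_k, Sigma_k}, Eqs. (3)-(5)
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mu[k]
            sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, sigma
```

Production implementations (e.g., scikit-learn's GaussianMixture) add numerically stable log-space computations and covariance regularization, which this sketch omits.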

B. δ-EM algorithm
As a prerequisite to the q-EM algorithm, we need to modify the original EM algorithm, since we have to take into account randomness associated with quantum measurement.
In Sec. II A, we explained that the EM algorithm has two steps: the E and M steps. In the δ-EM algorithm, we modify the E step in the spirit of the δ-k-means algorithm in Ref. [11].
To this end, we first introduce the square GMM distance d_G^k(y_i)^2 between a data point y_i and cluster k, Eq. (6), which generalizes the squared Euclidean distance used in the k-means algorithm by taking the weights \pi_k and covariances \Sigma_k into account. Note that, when \pi_k = 1/K and \Sigma_k is the identity matrix for all k, the square GMM distance reduces to the squared Euclidean distance up to a constant. We then define the set of near-optimal labels

  L_\delta^G(y_i) := { k : d_G^k(y_i)^2 \le \min_j d_G^j(y_i)^2 + \delta }.  (8)

In the E step of the δ-EM algorithm, we take random samples from L_\delta^G(y_i) in Eq. (8). We note that, for soft clustering, a more precise sampling scheme may be useful, but this simple sampling works well in the approach shown in Ref. [11].
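The δ-relaxed E step can be sketched as follows; the threshold form of the label set (all clusters within δ of the minimal square GMM distance) follows the δ-k-means rule of Ref. [11], and the function names are ours:

```python
import numpy as np

def delta_labels(d2, delta):
    """Indices k whose square GMM distance is within delta of the minimum,
    i.e. a sketch of the label set L_delta^G(y_i) of Eq. (8)."""
    return np.flatnonzero(d2 <= d2.min() + delta)

def delta_e_step(D2, delta, rng):
    """D2[i, k]: square GMM distance of point i to cluster k.
    Returns one uniformly sampled near-optimal label per point."""
    return np.array([rng.choice(delta_labels(D2[i], delta))
                     for i in range(D2.shape[0])])
```

For δ → 0 the label set collapses to the argmin, recovering the deterministic E step of hard-assignment clustering.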
In the M step of the δ-EM algorithm, we add small noise to the estimated parameters after their estimation. As a result, the δ-EM algorithm becomes robust, and its quantum version becomes implementable. In Sec. V, we will show the validity of this algorithm numerically. Remarkably, we find that adding noise can even improve the quality of the clustering in the studied benchmark examples.
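A minimal sketch of the noisy M step; the noise scale δ/2 and the renormalization of the weights below are our assumptions for illustration, not prescriptions from the text:

```python
import numpy as np

def noisy_m_step(pi, mu, sigma, delta, rng):
    """Perturb M-step estimates with small Gaussian noise, mimicking the
    finite-precision quantum readout (noise scale delta/2 is an assumption)."""
    pi_n = np.abs(pi + rng.normal(0.0, delta / 2, pi.shape))
    pi_n /= pi_n.sum()                                   # keep a valid weight vector
    mu_n = mu + rng.normal(0.0, delta / 2, mu.shape)
    sigma_n = sigma + rng.normal(0.0, delta / 2, sigma.shape)
    sigma_n = (sigma_n + np.swapaxes(sigma_n, -1, -2)) / 2  # keep covariances symmetric
    return pi_n, mu_n, sigma_n
```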

III. QUANTUM ALGORITHM FOR THE EM ALGORITHM WITH THE GMM
In this section, we describe the procedure of the quantum algorithm that realizes a quantum speedup of the EM algorithm for the GMM.
To simplify the notation, we add a tilde for estimates throughout this paper; that is, we denote, e.g., ã as the estimate of a.

A. Overview of the q-EM algorithm
We begin with the initialization of the q-EM algorithm. In the EM algorithm, the first iteration can begin with either the E step or the M step. For simplicity, in the case of the q-EM algorithm, we consider starting with the E step; then, we set the initial parameter set \theta^0 = {\pi_k^0, \mu_k^0, \Sigma_k^0}_{k=1}^K. The main procedure of the q-EM algorithm is composed of four steps. In step I, we compute the square GMM distance, and in step II, it is minimized for cluster assignment. Then, in step III, we generate quantum states of weight vectors, mean vectors, and covariance matrices, and in step IV, we apply quantum vector state tomography. By using the classical information on the mean vectors and covariance matrices, we repeat the whole procedure until convergence. The output of this algorithm is \theta^* = {\pi^*, \mu^*, \Sigma^*}. The q-EM algorithm is summarized in Algo. 2. In the rest of this section, we explain the four steps in detail.

ALGORITHM 2: q-EM algorithm.
1: t = 0
2: prepare the data structures
3: while convergence criterion is not satisfied do
4:   compute the square GMM distances (step I)
5:   assign clusters (step II)
6:   generate the mean and covariance states (step III)
7:   update the parameters (step IV)
8:   t ← t + 1
9: end while

B. Step I: Computing the square GMM distance
In this step, we compute the square GMM distance, Eq. (6). Mathematically, we apply the unitary operation

  |i⟩|k⟩|0⟩ → |i⟩|k⟩|d̃_G^k(y_i)⟩  for i ∈ [N], k ∈ [K],  (9)

where d_G^k(y_i) is the square GMM distance between y_i and the k-th cluster, and [K] := {k}_{k=1}^K. For this computation, we require a precision ε_1 such that |d̃_G^k(y_i) − d_G^k(y_i)| ≤ ε_1. In the next section, ε_1 will be used to analyze the runtime.
Equation ( 9) includes summation, multiplication, and inner products of quantum states.Among them, the computation of summation and multiplication is straightforward with quantum linear algebra while the implementation of inner products is more involved.Let us thus focus on the computation of inner products.
We assume that two unitary operations and their controlled versions are available as follows: where Ĝk := ( Σk ) 1/2 for k = 1, 2, . . ., K. We then begin with the state By using controlled versions of Eq. ( 10) and ( 11), we create the following state from |φ i,k : Then we apply the Hadamard gate on the third register of |φ C i,k and the resulting state is Note that Eq. ( 14) is also represented as where |tar i,k , 1 := |1 Ĝk (|y i − |µ k ) and gar i,k , 0 is a garbage state.That is, we have a unitary operator such that We also note that the probability that we get |1 by measuring the third register is expressed as We then apply all the operations except the measurement in amplitude estimation in Ref. [11,18] on Û1 in Eq. ( 16).This process realizes the following unitary operation: where pi,k and α > 8 π 2 [18].Here M is a parameter to be determined (see Sec. B 1). Next, applying the mode evaluation method [19] in Lemma 8 of Ref. [7] and Thm.2.2 of [11] to Eq. ( 18), we get |Φ i,k such that The last step is to estimate the square GMM distance of unnormalized vectors y i and µ k and to multiply the norms of them and adding ln π k .A translation operator T (r ) can conduct the adding operation: T (r ) |r = |r + r for r, r ∈ R N .Note that we have assumed that we know the norms of { y i } N i=1 and the same assumption is used in Ref. [11].

C. Step II: Assignment of clusters
The purpose of step II is cluster assignment. In this step, we utilize the unitary operation of Eq. (20), where |a_k⟩ is a (ln p)-bit state for k = 1, 2, ..., K. The computational cost of this operation is O(K ln p) [11].
To find the cluster assignment, we perform the unitary operation Û_4 given by Eq. (21), where label^t(y_i) is the optimal label of y_i at time t. Finally, by uncomputing the square GMM distances, we obtain Eq. (22). This uncomputation is required to repeat the iterations.

D. Step III: Generation of the mean and covariance states

In this step, we generate states that store information on the weights, mean vectors, and covariance matrices. Let us recall that Eq. (22) can also be expressed as Eq. (23). Thus, by measuring the label register of |ψ_t⟩ in Eq. (22), we obtain, with probability N_k^t/N, the state of Eq. (24), where C_k^t is the set of labels that belong to cluster k at time t. From this state, the mean and covariance states are then constructed for i = 1, 2, ..., N.
We finally deal with {π_k}_{k=1}^K; it is relatively easy to compute the weights of the GMM, {π_k^t}_{k=1}^K. Here we utilize Eq. (29): since measuring the label register yields cluster k with probability N_k^t/N, we can estimate N_k^t similarly. Note that we assume that the sizes of all clusters are Ω(N/K).
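Classically, estimating π_k ≈ N_k^t/N by repeatedly measuring the label register amounts to sampling labels and counting; a stand-in sketch (function name ours):

```python
import numpy as np

def estimate_weights(labels, n_samples, K, rng):
    """Estimate pi_k ~ N_k / N by sampling the label register n_samples
    times; a classical stand-in for repeated measurement of |psi_t>."""
    draws = rng.choice(labels, size=n_samples)
    return np.bincount(draws, minlength=K) / n_samples
```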

E. Step IV: Update of the parameters
At the end of each iteration, we obtain classical information on π^{t+1}, μ^{t+1}, and Σ^{t+1} by performing the quantum state tomography algorithm on |χ_k^t⟩, |μ_k^{t+1}⟩, and |vec[Σ_k^{t+1}]⟩. Quantum vector state tomography is explained in Ref. [11]. The quantum state tomography algorithm in Ref. [11] requires a unitary transformation U : |0⟩ → |x⟩; however, the procedure to find l(y_i) is not deterministic. We therefore have to devise a deterministic method to find l(y_i). One solution is to determine l(y_i) by the rule of Eq. (30) and discard the points to which no label can be assigned.

IV. ANALYSIS OF ERRORS AND RUNTIME
This section is dedicated to error and runtime analysis of the q-EM algorithm.We first state the main claim and then explain it.

A. Main result
The runtime of the q-EM algorithm is given by Eq. (34), where μ(·) is given in Eq. (B8), κ(·) is the condition number, and η_μ := max_i ||y_i||². This result states that the runtime of each iteration of the q-EM algorithm depends only polylogarithmically on the number of data points N; in this sense, it is exponentially faster than that of the classical EM algorithm.

B. Error analysis
We first summarize the errors in the q-EM algorithm, in order to analyze its total runtime in the following subsection. In step I, we compute d_G^k(y_i); the error of this computation is ε_1. For consistency between the EM algorithm and the δ-EM algorithm, we take ε_1 < δ/2.
In steps III and IV, we compute μ and Σ. The errors on ||μ_k|| and |μ_k⟩ are √η_μ ε_{μ,3} and ε_{μ,4}, respectively. Then, the error on the estimation of μ takes the form of Eq. (36); see also Appendix C for this calculation. We need to take ε_{μ,3} < δ/(4√η_μ) and ε_{μ,4} < δ/(4√η_μ). Next, we turn our attention to Σ. The errors on ||Σ_k|| and |vec[Σ_k]⟩ are √η_Σ ε_{Σ,3} and ε_{Σ,4}, respectively. Similarly to the case of μ_k, the error on the estimation of Σ is given by Eq. (37), and we need to take ε_{Σ,3} < δ/(4√η_Σ) and ε_{Σ,4} < δ/(4√η_Σ). Finally, we mention the error associated with the estimation of π_k for k = 1, 2, ..., K. The error on π_k is given by Eq. (38); we estimate {π_k}_{k=1}^K via the distribution of labels in quantum vector state tomography.

C. Runtime
In the following, we derive Eq. (34), i.e., the runtime of each iteration of the q-EM algorithm, using Hoeffding's inequality. The required number of repetitions of quantum vector state tomography for the K mean vectors follows Ref. [11], and that for the covariance matrices is obtained similarly. The definition of Õ(·) is given in Appendix A. Next, we turn our attention to the runtime to prepare single copies of |μ_k⟩ and |vec[Σ_k]⟩, as well as T_χ^μ, the time to prepare |χ_k^t⟩ for estimating {μ_k}_{k=1}^K; T_χ^Σ and T_χ^π are obtained analogously. We also need to estimate the norms of |μ_k⟩ and |vec[Σ_k]⟩. We then estimate the runtime for estimating {π_k}_{k=1}^K. Due to Hoeffding's inequality [22], the number of samples we need to perform is given by Eq. (52). Furthermore, we have to repeat the estimation process K times, compared to the estimation of μ and Σ, since we have to sample i from C_k for each k = 1, 2, ..., K; thus, the runtime for estimating π carries an additional factor of K. Combining the total runtimes for estimating {π_k}, {μ_k}, and {Σ_k}, we obtain Eq. (34).
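For reference, the two-sided Hoeffding bound behind such sample counts states that, for i.i.d. variables bounded in [0, 1], n ≥ ln(2/Δ)/(2ε²) samples suffice for the empirical mean to be within ε of the true mean except with probability Δ. A small helper (the specific constants of Eq. (52) are not reproduced here):

```python
import math

def hoeffding_samples(eps, delta_fail):
    """Samples needed so that the empirical mean of i.i.d. [0, 1] variables
    deviates from the true mean by more than eps with probability at most
    delta_fail: Hoeffding gives n >= ln(2 / delta_fail) / (2 * eps**2)."""
    return math.ceil(math.log(2.0 / delta_fail) / (2.0 * eps ** 2))
```

Note the 1/ε² scaling: halving the target precision quadruples the required number of samples, which is exactly where amplitude estimation offers its quadratic advantage.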

V. NUMERICAL SIMULATION
To devise a quantum version of the EM algorithm, we proposed the δ-EM algorithm in Sec. II. In this section, to see that the EM and δ-EM algorithms are equivalent when δ is sufficiently small and that the δ-EM algorithm improves upon the δ-k-means algorithm, we show numerical simulations of the EM algorithm, the δ-EM algorithm, the k-means algorithm, and the δ-k-means algorithm.
In Ref. [23], a comparison between the k-means algorithm and the EM algorithm with the GMM is shown; we therefore use synthetic data sets similar to those of Ref. [23]. In the numerical simulations, we set δ = 0.2, except in the simulation of the δ-dependence of the δ-EM algorithm. For simplicity, we add Gaussian noise to the parameters estimated in the M step of the δ-EM algorithm and to the centroids estimated in the δ-k-means algorithm [24].

A. Example I
We begin with an explanation of the data set used in this subsection. We generated it by drawing 1000 data points from a mixture of two Gaussian functions. The means of the two Gaussian functions are μ_1 = [0.3, 0.0]^T and μ_2 = [-0.3, 0.0]^T, respectively, with fixed covariances Σ_1 and Σ_2. We also set π_1 = π_2 = 0.5. In Figs. 1 and 2, we show the log-likelihood of the k-means algorithm and the δ-k-means algorithm, and that of the EM algorithm and the δ-EM algorithm, respectively. We plot ten trials for each algorithm. These figures show that the k-means and δ-k-means algorithms have similar performance, as do the EM and δ-EM algorithms.
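A data set of this kind can be reproduced along the following lines; the isotropic covariance 0.01·I below is a placeholder of our choosing, not the paper's exact value:

```python
import numpy as np

def make_example1(n=1000, seed=0):
    """Synthetic data in the spirit of Example I: an equal-weight mixture of
    two Gaussians with means [0.3, 0.0] and [-0.3, 0.0].  The covariance
    0.01*I is an assumed placeholder."""
    rng = np.random.default_rng(seed)
    mus = np.array([[0.3, 0.0], [-0.3, 0.0]])
    z = rng.integers(0, 2, size=n)                   # hidden labels, pi = [0.5, 0.5]
    Y = mus[z] + rng.normal(0.0, 0.1, size=(n, 2))   # std 0.1 <=> cov 0.01*I
    return Y, z
```

Keeping the hidden labels z alongside Y makes it possible to compute success rates of the kind reported in Table I.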
For clarity, we graphically show the parameters estimated by the δ-k-means algorithm and the δ-EM algorithm in Fig. 3 [25]. We have chosen the best estimates of each algorithm in one hundred trials. These figures demonstrate that the δ-EM algorithm outperforms the δ-k-means algorithm.
In Table I, we summarize the success rates of the EM algorithm, the δ-EM algorithm, the k-means algorithm, and the δ-k-means algorithm. Here, the success rate is defined as the ratio of the number of successfully predicted hidden variables [26] to the total number of data points. This table shows that the δ-EM algorithm works better than the δ-k-means algorithm; thus, we maintain that it is meaningful to devise a quantum version of the δ-EM algorithm. In Fig. 4, we show the δ-dependence of the best success rates of the δ-EM algorithm in one hundred trials. This figure shows that the δ-EM algorithm is robust for small δ, but its performance decreases rapidly for large values of δ. Since the critical value depends on the data set, δ needs to be set small.

B. Example II
We again start with the data set used in this subsection. The data points are also generated by a mixture of two Gaussian functions, but the parameters are different. We set [π_1, π_2] = [0.7, 0.3], μ_1 = [0.0, -0.5]^T, and μ_2 = [0.0, 0.0]^T, with fixed covariances Σ_1 and Σ_2.

TABLE I: Success rates of the EM algorithm, the δ-EM algorithm, the k-means algorithm, and the δ-k-means algorithm. These scores are the best ones in one hundred trials with randomized initial inputs.

EM       δ-EM     k-means   δ-k-means
93.9 %   94.3 %   72.4 %    72.5 %
Furthermore, we draw 1000 data points from this mixture. We first show the parameters estimated by the δ-k-means algorithm and the δ-EM algorithm in Fig. 5. We have chosen the best estimates of each algorithm in one hundred trials. These figures show that the δ-EM algorithm outperforms the δ-k-means algorithm. In particular, in the case of the δ-k-means algorithm, the covariances are fixed at the identity matrix; the estimated clusters therefore tend to push each other apart.
In Table II, we summarize the success rates of the EM algorithm, the δ-EM algorithm, the k-means algorithm, and the δ-k-means algorithm. Here, the success rate is defined as the ratio of the number of successfully predicted labels to the total number of data points. This table shows that the δ-EM algorithm works better than the δ-k-means algorithm.
TABLE II: Success rates of the EM algorithm, the δ-EM algorithm, the k-means algorithm, and the δ-k-means algorithm. These scores are the best ones in one hundred trials with randomized initial inputs.

EM       δ-EM     k-means   δ-k-means
88.8 %   89.2 %   57.9 %    55.4 %

VI. DISCUSSIONS
We here discuss the relationship between the EM algorithm with the GMM and the k-means algorithm. The EM algorithm with the GMM is an extension of the k-means algorithm; thus, we explain the two conditions under which the EM algorithm with the GMM becomes identical to the k-means algorithm.
The first condition is that r_{i,k}^t takes the value 1 for a single k and 0 otherwise. This reflects that the k-means algorithm performs hard clustering, while the EM algorithm performs soft clustering. The second condition is that π_k = 1/K and Σ_k = I_d for k = 1, 2, ..., K, where I_d is the d-dimensional identity matrix. This is why the k-means algorithm does not explicitly deal with weights and covariance matrices. From the viewpoint of a probability distribution, the k-means algorithm thus estimates the GMM restricted by these two conditions. As shown in Sec. V, the weights and the covariances of the GMM play an important role; thus, we maintain that the δ-EM algorithm is a meaningful extension of the δ-k-means algorithm.
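The second condition can be checked directly: with π_k = 1/K and Σ_k = I_d, the weights and Gaussian normalizations in Eq. (2) cancel, so the responsibilities reduce to a softmax of the negative squared Euclidean distances, whose argmax is the nearest centroid, i.e., the k-means assignment. A small sketch:

```python
import numpy as np

def responsibilities_isotropic(y, mus):
    """Responsibilities of Eq. (2) under pi_k = 1/K and Sigma_k = I_d:
    the common factors cancel, leaving softmax(-||y - mu_k||^2 / 2),
    so the argmax is the nearest centroid (the k-means assignment)."""
    d2 = np.sum((np.asarray(mus) - np.asarray(y)) ** 2, axis=1)
    w = np.exp(-(d2 - d2.min()) / 2.0)   # shift by the min for numerical stability
    return w / w.sum()
```

Hardening these responsibilities to a one-hot vector (the first condition) then recovers the k-means update exactly.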

VII. CONCLUSION
In this paper, we have proposed a quantum algorithm for the EM algorithm and shown that it realizes a quantum speedup compared to the classical EM algorithm. The key idea is to generalize the distance that is minimized in the k-means algorithm by also considering weights and covariances. Though we have focused on the GMM, this construction can be generalized to other mixture models. In machine learning, the EM algorithm with the GMM is used more often than the k-means algorithm; thus, this work is an important step toward quantum machine learning. The algorithm requires a QRAM oracle, which has not yet been implemented in experiments. As a future direction, we will investigate the applicability of novel superposition designs as proposed in Refs. [27,28].

Median evaluation
The time complexity of the median evaluation algorithm is given in Lemma 8 of Ref. [7].

Quantum random access memory
In the q-means and q-EM algorithms, it is crucial to prepare the data as a quantum state efficiently. To this end, we exploit the quantum random access memory (QRAM) introduced in Refs. [30,31], a device that performs the operation of Eq. (B5). We follow the application of the QRAM as in Ref. [11]. Let V_1 ∈ R^{N×d}; then, there is a data structure to store the rows of V_1 such that the time to insert, update, or delete a single entry v_{i,j} is O(ln² N), and a quantum algorithm on the data structure can be performed in time O(ln² N) that realizes the required unitaries. A useful subroutine used in q-means and q-EM is the following:

Theorem 1. Let M ∈ R^{d×d} satisfy ||M||_2 = 1, and let x ∈ R^d. If M is stored in QRAM and the time to prepare |x⟩ is T_x, then there exist quantum algorithms that return the state and norm estimate associated with Mx, where μ(M) is given in Eq. (B8) and κ(M) is the condition number of M.

FIG. 1: Log-likelihood of the k-means algorithm (red lines) and the δ-k-means algorithm (green lines). We perform the simulation ten times for each algorithm. Panels (a) and (b) differ in the scale of the vertical axis.

FIG. 2: Log-likelihood of the EM algorithm (red lines) and the δ-EM algorithm (green lines). We perform the simulation ten times for each algorithm. Panels (a) and (b) differ in the scale of the vertical axis.

FIG. 4: δ-dependence of success rates. Each success rate is the best one in one hundred trials.

FIG. 5: Best parameter estimates of the δ-k-means algorithm and the δ-EM algorithm for the data set of Example II.