Group metropolis sampling

Monte Carlo (MC) methods are widely used for Bayesian inference and optimization in statistics, signal processing and machine learning. Two well-known class of MC methods are the Importance Sampling (IS) techniques and the Markov Chain Monte Carlo (MCMC) algorithms. In this work, we introduce the Group Importance Sampling (GIS) framework where different sets of weighted samples are properly summarized with one summary particle and one summary weight. GIS facilitates the design of novel efficient MC techniques. For instance, we present the Group Metropolis Sampling (GMS) algorithm which produces a Markov chain of sets of weighted samples. GMS in general outperforms other multiple try schemes as shown by means of numerical simulations.


I. INTRODUCTION
Many applications in statistical signal processing, machine learning and statistics, require the computation of a-posteriori estimators induced by complicated posterior probability distributions [1], [2].The approximation of these estimators needs often the use of Monte Carlo methods [3]- [5].The most popular MC approaches are the Importance Sampling (IS) methods and the Markov chain Monte Carlo (MCMC) algorithms [1], [4].IS schemes produce a random discrete approximation of the posterior distribution by a population of weighted samples [4], [6], [7].MCMC techniques generate a Markov chain (i.e., a sequence of correlated samples) with a pre-established target probability density function (pdf) as invariant density [5], [8].
In this work, we introduce the Group Importance Sampling (GIS) framework where different sets of weighted samples can be properly summarized with one summary particle and one summary weight.This idea has been indirectly and implicitly employed in different Monte Carlo schemes: parallel particle filters [9], [10], particle island and related methods [11]- [13], tracking and model selection algorithms [14], nested sequential Monte Carlo schemes [15], [16] are some examples.
Furthermore, we also show that the GIS theory facilitates the design of novel efficient Monte Carlo techniques.As an example, we present the Group Metropolis Sampling (GMS) algorithm that generates a Markov chain of sets of weighted samples.All these resulting sets of samples are jointly employed obtaining a unique particle approximation of the target distribution.On the one hand, GMS can be considered as an MCMC method since it produces a Markov chain of sets of samples.On the other hand, the GMS can be also considered as an iterated importance sampler where different estimators are finally combined in order to build a unique IS estimator.This combination is obtained dynamically through random repetitions given by MCMC-type acceptance tests.GMS is closely related to Multiple Try Metropolis (MTM) techniques and Particle Metropolis-Hastings (PMH) algorithms [17]- [22], as we discuss below.The GMS algorithm can be also seen as an extension of the method in [23], for recycling auxiliary samples in a MCMC method.
The paper has the following structure.Section II recalls some background material.The GIS theory is introduced in Section III.In Section IV, we present the GMS algorithm.Section V provides some numerical results and in Section VI we discuss some conclusions.

II. PROBLEM STATEMENT AND BACKGROUND
In many applications the goal is to infer a variable of interest, x 2 X ✓ R dx , given a set of related observations or measurements, y 2 R dy .The statistical information is summarized in the posterior probability density function (pdf) given by where `(y|x) is the likelihood function, g(x) is the prior pdf and Z(y) is the marginal likelihood (a.k.a., Bayesian evidence).In general Z is unknown and often impossible to compute, so we only assume to be able to evaluate the unnormalized target function,1 The computation of integrals involving ⇡(x) = 1 Z ⇡(x) are often intractable.We consider the problem of approximating via Monte Carlo a complicated integral involving the target ⇡(x) and an integrable function h(x) with respect to ⇡, i.e.,

A. Importance Sampling
Let us consider a proposal density q(x), 2 The importance sampling (IS) method consists of drawing N samples, x 1 , . . ., x N , from q(x) (also called particles in this work), and then assign to each sample the following unnormalized weights If Z is known, a possible (unbiased) IS estimator [4], [5] is given by b If Z is unknown, defining the normalized weights, wn = wn P N i=1 wi with n = 1, . . ., N, an alternative self-normalized (biased) IS estimator is Both b I N and I N are consistent estimators of I in Eq. ( 3) [4], [5].Moreover, an unbiased estimator of marginal likelihood,

B. Concept of proper weighting
The standard IS weights in Eq. ( 4) are broadly used in the literature.However the definition of proper weighted sample can be extended as suggested in [4, Section 14.2], [5, Section 2.5.4], and [24].More specifically, given a set of samples, they are proper weighted with respect to the target ⇡ if, for any square integrable function h, where c is a constant value, also independent from the index n, and the expectation of the left hand side is performed, in general, w.r.t. the joint pdf of w(x) and x, i.e., q(w, x).Namely, the weight w(x), (for a given value of x), could even be considered a random variable.Thus, in order to obtain consistent estimators, one can design any joint q(w, x) as long as the restriction of Eq. ( 7) is fulfilled.

C. The Independent Metropolis-Hastings (IMH) algorithm
The Metropolis-Hastings (MH) method [4], [5], [25] is one of the most popular MCMC algorithm.It generates a Markov chain {x t } 1 t=1 with ⇡(x) as stationary density.Considering a proposal pdf q(x) independent from the previous state x t 1 , the independent MH method is given in Table I.
Observe that i in Eq. ( 8) involves the ratio between the importance weight of the proposed samples v 0 and the importance weight of the previous state x t 1 .Note that at each iteration only one new sample v 0 is generated to be compared with the previous state x t 1 by the acceptance probability ↵(x t 1 , v 0 ).

III. GROUP IMPORTANCE SAMPLING (GIS)
Let us consider the M disjoint sets of weighted samples (a.k.a., particles) The IMH algorithm Initialization: Choose an initial state x 0 .For t = 1, . . ., T : 1) Draw a sample v 0 ⇠ q(x).
2) Accept the new state, x t = v 0 , with probability where w(x) = ⇡(x) q(x) (importance weight).Otherwise, set where x m,n ⇠ q m (x) i.e., a different proposal pdf for each set S m .In the most general case we consider that N i 6 = N j , 8i 6 = j with i, j 2 {1, ..., M}.We can summarize the statistical information of each set using a pair of summary sample, e x m , and summary weight, W m , m = 1, . . ., M, in such a way that the following estimator is a consistent estimator of I. We denote the importance weight of the n-th sample in the m-th group as and the normalized weights within a set as wm,n = wm,n Nm b Zm , for n = 1, . . ., N and m = 1, . . ., M.

Definition 1. A summary particle e
x m for the m-group is a resampled particle, i.e., e x m is selected within {x m,1 , . . ., x m,Nm } according to the probability mass function (pmf) defined by wm,n , n = 1, . . ., N m .
It is possible to use the Liu's definition in order to assign a proper importance weight to a resampled particle [26], as stated in the following theorem.The proof of this theorem is given in [27] and further discussions in [26].
Definition 2. The summary weight for the m-th group of , defined in Eq. (11).
Given the M summary pairs {e x m , W m } M m=1 in a common computational node, we can obtain the following particle approximation of ⇡(x), i.e., b ⇡(x|e involving M weighted samples in this case.For a given function h(x), the corresponding specific GIS estimator in Eq. ( 10) is It is a consistent estimator of I. Indeed, the expression in Eq. ( 14) can be interpreted as a standard IS estimator (then consistent) since e w(e x m ) = b Z m is a proper weight of a resampled particle [26].Moreover, we are giving more importance to the resampled particle belonging to a set with more cardinality.The joint use of the concepts of summary particle and summary weight is not strictly needed.In some application, both are required whereas in other applications only one of them is employed as we shown in the next section.

IV. GROUP METROPOLIS SAMPLING (GMS)
In this section, we show how GIS facilitates the design of novel efficient techniques.More specifically, we use the concept of summary weight associated to a set of samples in order to generalize the IMH algorithm in Table I.Unlike in the IMH scheme, GMS produces a sequence of sets of weighted samples.The Group Metropolis Sampling (GMS) is shown in Table II.Note that the GMS algorithm uses the idea of summary weight for comparing sets.
Table II Group Metropolis Sampling (GSM) and b Z t = b Z 0 , with probability Otherwise, set S t = S t 1 and b Given the generated sets S t = {x n,t , ⇢ n,t } N n=1 , for t = 1, . . ., T , GMS provides the global particle approximation Relationship with IMH.The acceptance probability ↵ in Eq. ( 15) is the extension of the acceptance probability of IMH in Eq. ( 9), considering the proper GIS weighting of a set of weighted samples (note that, in GMS, all the sets are the same number of samples).
Relationship with multiple try methods.GMS is strictly related to Multiple Try Metropolis (MTM) schemes [19], [21], [28], [29] and Particle Metropolis Hastings (PMH) techniques [17], [29].The difference between GMS and the PMH and MTM methods is that GMS does not use resampling steps at each iteration for generating summary samples, indeed GMS uses the entire set.Another difference with PMH is that PMH generates sequentially the set of N candidates using a Sequential Importance Resampling (SIR) procedure (so that N candidates are correlated in PMH, in general) [29].However, considering a sequential of a batch procedure for generating the N tries at each iteration, we can recover a MTM (or PMH) chain by the GMS output applying T resampling steps, Namely, {e x t } T t=1 is the chain obtained by one run of the MTM (or PMH) technique.Ergodicity.As also discussed above, the acceptance probabilities and the dynamics of GMS exactly coincides with the PMH or MTM steps (with a sequential or batch particle generation, respectively), so that the ergodicity of the chain is ensured [17], [19], [21], [29].Indeed, we can recover the MTM (or PMH) chain as shown in Eq. ( 17).Recycling samples.The GMS algorithm can be seen as a method of recycling auxiliary weighted samples in PMH and MTM schemes.In [23], the authors show how recycling and including the rejected samples in a MH run into a unique consistent estimator.GMS can be considered an extension of this technique where, unlike in [23] N samples are generated at each iteration.Iterated IS.GMS can be also interpreted as an iterative importance sampling scheme where an IS approximation of N samples is built at each iteration and compared with the previous IS approximation.This procedure is iterated T times and all the accepted IS estimators are finally combined for providing a unique global approximation of NT samples.Note that, the temporal combination of the IS estimators is obtained dynamically by the random repetitions due to the rejections in the MH test.

V. NUMERICAL SIMULATIONS
We test the proposed GMS approach for the estimation of hyperparameters of a Gaussian process (GP) regression model [30], [31].Let us assume observed data pairs {y j , z j } P j=1 , with y j 2 R and z j 2 R L .We also denote the corresponding P ⇥ 1 output vector as y = [y 1 , . . ., y P ] > and the L ⇥ P input matrix as Z = [z 1 , . . ., z P ].We address the regression problem of inferring the unknown function f which links the variable y and z.Thus, the assumed model is y = f (z) + e, where e ⇠ N (e; 0, 2 ), and that f (z) is a realization of a GP [31].Hence f (z) ⇠ GP(µ(z), (z, r)) where µ(z) = 0, z, r 2 R L , and we consider the kernel function Given these assumptions, the vector f = [f (z 1 ), . . ., f(z P )] > is distributed as p(f |Z, , ) = N (f ; 0, K), where 0 is a P ⇥ 1 null vector, and K ij := (z i , z j ), for all i, j = 1, . . ., P , is a P ⇥ P matrix.Therefore, the vector containing all the hyper-parameters of the model is ✓ = [ , ], i.e., all the parameters of the kernel function in Eq. ( 18) and standard deviation of the observation noise.In this experiment, we focus on the marginal posterior density of the hyperparameters [31], p(✓|y, Z, ) / p(✓|y, Z, ) = p(y|✓, Z, )p(✓), which can be evaluated analytically, but we cannot compute integrals involving it.Considering a uniform prior within [0, 20] 2 and since p(y|✓, Z, ) = N (y; 0, K + 2 I), we have where clearly K depends on [31].The moments of this marginal posterior cannot be computed analytically.We generated P = 200 pairs of data, {y j , z j } P j=1 , according to the GP model setting ⇤ = 3, ⇤ = 10.L = 1, and drawing z j ⇠ U([0, 10]).Keeping fixed the generated data for each scenario, we then computed the ground-truth b ✓ ⇡ [ b ⇡ 3.5200, b ⇡ 9.2811] using an exhaustive and costly grid approximation, in order to compare the different techniques.For both GMS and MTM schemes, we consider the same adaptive Gaussian proposal pdf q t (x|µ t , 2 I) = N (x|µ t , 2 I), with = 5 and µ t is adapted considering the arithmetic mean of the outputs after a training period, t 0.2T , in the same fashion of [32], [33] (µ 0 = [1, 1] > ).First, we test both techniques fixing T = 20 and varying the number of tries N .Then, we set N = 100 and vary the number of iterations T .Figure 1 (log-log plot) shows the Mean Square Error (MSE) in the approximation of b ✓ averaged over 10 3 independent runs.Observe that always GMS outperforms the corresponding MTM scheme.

VI. CONCLUSIONS
In this work, we introduce the Group Importance Sampling (GIS) theory which facilitates the design of novel Monte Carlo algorithms.For instance, we present the Group Metropolis Sampling (GMS) method that outperforms the corresponding benchmark Monte Carlo techniques without any extra computational cost, as we have shown in an hyperparameter estimation problem for GP regression models.

Theorem 1 .
Let us consider a resampled particle e x m ⇠ b ⇡ m (x).A proper unnormalized weight following the Liu's definition in Eq. (7) for this resampled particle is e w m = e w(e x m ) = b Z m .

Figure 1 .
Figure 1.MSE (loglog-scale; averaged over 10 3 independent runs) obtained with the MTM and GMS algorithms (left) as function of N fixing T = 20 and (right) as function of T setting N = 100.