Non-local spatially varying finite mixture models for image segmentation

In this work, we propose a new Bayesian model for unsupervised image segmentation based on a combination of the spatially varying finite mixture models (SVFMMs) and the non-local means (NLM) framework. The probabilistic NLM weighting function is successfully integrated into a varying Gauss–Markov random field, yielding a prior density that adaptively imposes a local regularization to simultaneously preserve edges and enforce smoothness constraints in homogeneous regions of the image. Two versions of our model are proposed: a pixel-based model and a patch-based model, depending on the design of the probabilistic NLM weighting function. Contrary to previous methods proposed in the literature, our approach does not introduce new parameters to be estimated into the model, because the NLM weighting function is completely known once the neighborhood of a pixel is fixed. The proposed model admits a closed-form maximum a posteriori (MAP) estimation within an expectation–maximization scheme. We have compared our model with previously proposed SVFMMs using two public datasets: the Berkeley Segmentation dataset and the BRATS 2013 dataset. The proposed model performs favorably compared with previous approaches in the literature, achieving better results in terms of the Rand Index and Dice metrics in our experiments.


Introduction
Image segmentation is one of the core problems in computer vision, with a long research history. Unsupervised learning has historically played a key role in the image segmentation task, constituting one of the first paradigms to automatically identify objects and structures in an image (Zhang et al. 2008). Specifically, clustering has concentrated most of the effort in unsupervised image segmentation research. Clustering is the task of finding natural groupings of data within a population, sharing a similar set of properties (Rokach and Maimon 2005). Many clustering techniques have been proposed in the literature during the past decades (Saxena et al. 2017), ranging from distance-based techniques, such as partitional or hierarchical clustering; density-based techniques, such as DBSCAN (Ester et al. 1996) or mean shift (Cheng 1995); graph-based algorithms, such as graph cuts (Boykov et al. 2001); to probabilistic models, such as finite mixture models (FMMs) (Pal and Pal 1993).
Specifically, probabilistic models intend to learn the probability density function (pdf) of an image by fitting a multi-parametric statistical model to the data. In particular, FMMs fit a weighted sum of probability distributions, each one representing a component of the image, to capture the heterogeneous nature of the image information. Gaussian mixture models (GMMs) are the most widespread FMMs and are widely employed for image segmentation, as they have proven to successfully capture the complexity of an image (Juan-Albarracín et al. 2015). Moreover, GMMs can be efficiently estimated by means of maximum likelihood estimation via the expectation-maximization (EM) algorithm (Dempster et al. 1977).
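As a concrete illustration of the FMM/EM machinery described above, the following is a minimal numpy sketch of EM for a one-dimensional Gaussian mixture fitted to pixel intensities. The quantile initialization and the two-population toy data are illustrative assumptions, not part of any method discussed here.

```python
import numpy as np

def em_gmm_1d(x, K, iters=60):
    """Fit a 1-D Gaussian mixture to intensities x with plain EM."""
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)   # spread initial means
    var = np.full(K, x.var() / K + 1e-6)
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities z[i, j] proportional to pi_j * N(x_i | mu_j, var_j)
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)      # numerical stability
        z = np.exp(logp)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: closed-form updates (Dempster et al. 1977)
        nk = z.sum(axis=0)
        pi = nk / len(x)
        mu = (z * x[:, None]).sum(axis=0) / nk
        var = (z * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return pi, mu, var, z

# Toy bimodal "image": two intensity populations around 0.2 and 0.8
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.2, 0.05, 500), rng.normal(0.8, 0.05, 500)])
pi, mu, var, z = em_gmm_1d(x, K=2)
```

Segmenting the toy data then amounts to assigning each pixel to the component with the largest responsibility.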
However, images are structured arrangements of data in which, in addition to the pixel intensities, the locations of the pixels provide important information for properly understanding their content. Images show patterns of local regularity and spatial intensity redundancy that reflect the idea that adjacent pixels tend to belong to the same semantic object. Conventional FMMs, by contrast, do not inherently take this information into account: they make the strong assumption that the data in an image are i.i.d., ignoring spatial information that has been shown to produce more accurate and realistic results.
To overcome this limitation, several solutions have been proposed in the literature (Blake and Rother 2011). Most of them rely on the inclusion of a Markov random field (MRF) to model the local dependencies between pixels in an image. Specifically, Sanjay-Gopal and Hebert (1998) proposed a variant of the FMM, called the SVFMM, which replaces the classic mixing coefficients of the FMM with contextual mixing coefficients for each pixel of the image. This approach makes it possible to introduce a continuous MRF over the contextual mixing coefficients, encoding the idea that neighboring pixels tend to share the same component.
Many variants of MRFs have been proposed in the literature to capture the local information contained in an image. Nikou et al. (2007) proposed a family of Gauss-Markov random fields (GMRFs), achieving better results than the classic FMMs. However, this approach introduces a local isotropic smoothing over the contextual mixing coefficients that ignores the presence of edges in the image. As a result, the contextual mixing coefficients estimated under the GMRF approach are iteratively smoothed, yielding prior probability maps that lose the information of image edges. Sfikas et al. (2008) proposed a Student's t MRF that makes it possible to regulate the smoothing between pixels at an edge. However, this approach introduces new parameters to be estimated into the model, which then no longer admits a closed-form analytic solution.
In this work, we present a fully Bayesian SVFMM, called the non-local spatially varying finite mixture model (NLSVFMM), that combines the SVFMM framework with the NLM filtering scheme. Our proposed model has two variants: a pixel-wise version (NLv-SVFMM) and a patch-wise version (NLp-SVFMM). The model introduces a GMRF weighted by the probabilistic NLM function proposed by Wu et al. (2013) to adaptively adjust the spatial regularization depending on the structure of the image. This approach avoids introducing new parameters, reducing the degrees of freedom of the model and the number of samples required for a reliable estimation of its parameters. Our results show that the method performs favorably compared with other SVFMM approaches in terms of the Rand Index (RI) and Dice metrics.

Background on spatially variant finite mixture models
The SVFMM is a modification of the classic FMM, mainly oriented to image data, in which the coefficients of the mixture are extended to each pixel of the image. Let $X = \{x_1, \ldots, x_N\}$ be a set of observations corresponding to the pixels of an image, where $x_i \in \mathbb{R}^D$ represents a vector of $D$ features for the $i$th pixel. The SVFMM is defined as

$$f(x_i \mid \Pi, \Theta) = \sum_{j=1}^{K} \pi_i^j \, \phi(x_i \mid \theta_j),$$

where $\phi(x_i \mid \theta_j)$ is a pdf used to model the data, $\Pi = \{\boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_N\}$ are called the contextual mixing coefficients, with

$$0 \le \pi_i^j \le 1, \qquad \sum_{j=1}^{K} \pi_i^j = 1 \quad \forall i,$$

and $\Theta = \{\theta_1, \ldots, \theta_K, \boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_N\}$ is the set of parameters of the model. A maximum a posteriori (MAP) estimate of $\Theta$ is employed, imposing a proper prior over $\Pi$, to introduce the idea that neighboring pixels in an image tend to belong to the same component.
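The per-pixel mixture density above can be written down directly; this small numpy sketch evaluates $f(x_i)$ under contextual mixing coefficients, using one-dimensional Gaussian components and hypothetical parameter values for illustration.

```python
import numpy as np

def svfmm_pdf(x, Pi, mu, sigma):
    """f(x_i) = sum_j pi_i^j * N(x_i | mu_j, sigma_j^2).

    x: (N,) pixel features; Pi: (N, K) rows on the probability simplex
    (the contextual mixing coefficients); mu, sigma: (K,) component params.
    """
    comp = (np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
            / (sigma * np.sqrt(2.0 * np.pi)))   # N(x_i | mu_j, sigma_j^2)
    return (Pi * comp).sum(axis=1)              # weighted per pixel

# With K = 1 and pi_i^1 = 1 the SVFMM collapses to a single Gaussian
x = np.array([0.0, 1.0])
f = svfmm_pdf(x, Pi=np.ones((2, 1)), mu=np.array([0.0]), sigma=np.array([1.0]))
```

Unlike the classic FMM, each row of `Pi` is a separate set of mixing weights, which is what the MRF prior will later regularize.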
Several variants of $p(\Pi)$ have been proposed in the literature. Specifically, the directional class-adaptive GMRF, which we use as the basis of our approach, takes the form

$$p(\Pi) \propto \prod_{j=1}^{K} \prod_{d=1}^{D} \frac{1}{\beta_{j,d}^{N}} \exp\left( -\sum_{i=1}^{N} \sum_{m \in M_i^d} \frac{\left(\pi_i^j - \pi_m^j\right)^2}{2\beta_{j,d}^2} \right),$$

where $\beta_{j,d}^2$ is the variance of the corresponding Gaussian instance and $M_i^d$ is the set of neighbors of the $i$th observation lying in direction $d$. Typically, four directions are considered in 2-D images: the horizontal, the vertical, and the two diagonals.
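Up to its normalizing constant, the prior can be evaluated by accumulating squared differences of the coefficient field along the four directions. The following numpy sketch assumes the coefficients are stored as an H × W × K array and the variances as a K × 4 array; these shapes are illustrative conventions, not prescribed by the text.

```python
import numpy as np

def gmrf_energy(Pi, beta2):
    """Negative log of the directional class-adaptive GMRF prior, up to
    constants: sum over directions d, classes j and neighbor pairs (i, m) of
    (pi_i^j - pi_m^j)^2 / (2 * beta2[j, d]).

    Pi: (H, W, K) contextual mixing coefficients; beta2: (K, 4) variances
    for the horizontal, vertical and two diagonal directions.
    """
    diffs = [Pi[:, 1:] - Pi[:, :-1],     # horizontal neighbors
             Pi[1:, :] - Pi[:-1, :],     # vertical neighbors
             Pi[1:, 1:] - Pi[:-1, :-1],  # main diagonal
             Pi[1:, :-1] - Pi[:-1, 1:]]  # anti-diagonal
    return sum(((df ** 2) / (2.0 * beta2[:, d])).sum()
               for d, df in enumerate(diffs))
```

A perfectly smooth field has zero energy, while any spatial variation in the coefficients is penalized according to the class- and direction-specific variances.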
In short, the directional class-adaptive GMRF encodes the idea that the differences between adjacent mixing coefficients are Gaussian distributed, $\pi_i^j - \pi_m^j \sim \mathcal{N}(0, \beta_{j,d}^2)$. Inference on this model is not analytically tractable, so numerical optimization methods must be employed; thus, a MAP-EM algorithm is typically used to iteratively estimate the parameters of the model. The corresponding Q function for the EM scheme is

$$Q(\Theta; \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{j=1}^{K} z_i^j \left[ \log \pi_i^j + \log \phi(x_i \mid \theta_j) \right] + \log p(\Pi).$$

The posterior density of the hidden variables $Z = \{z_1, \ldots, z_N\}$, associated with each component, is computed at the E-step as

$$z_i^j = \frac{\pi_i^j \, \phi(x_i \mid \theta_j)}{\sum_{k=1}^{K} \pi_i^k \, \phi(x_i \mid \theta_k)},$$

while at the M-step the parameters of the model $\Theta$, when $\phi(x_i \mid \theta_j) \sim \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, are estimated in closed form as

$$\mu_j = \frac{\sum_i z_i^j x_i}{\sum_i z_i^j}, \qquad \Sigma_j = \frac{\sum_i z_i^j (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_i z_i^j}.$$

The updates for the contextual mixing coefficients $\pi_i^{j(t+1)}$ are obtained as the roots of a second-degree equation, derived by setting $\partial Q / \partial \pi_i^j = 0$, which always has a real nonnegative solution. However, a limitation of this estimation is that it does not take into account the constraint $\sum_j \pi_i^j = 1 \ \forall i$. Instead, reparatory techniques such as the one proposed by Blekas et al. (2005) must be employed to ensure that the probabilities sum to 1.
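The E- and M-steps above can be sketched in a few lines of numpy. Note one loud simplification: the true $\pi_i^j$ update is the nonnegative root of the per-pixel quadratic mentioned in the text; as a stand-in, this sketch takes the responsibilities and renormalizes them, which plays the role of the reparatory projection of Blekas et al. (2005) but is not the paper's exact update.

```python
import numpy as np

def map_em_sweep(x, Pi, mu, sigma):
    """One simplified MAP-EM sweep for a 1-D SVFMM with Gaussian components."""
    # E-step: z_i^j = pi_i^j * phi(x_i | theta_j) / sum_k pi_i^k * phi(x_i | theta_k)
    comp = (np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
            / (sigma * np.sqrt(2.0 * np.pi)))
    z = Pi * comp
    z /= z.sum(axis=1, keepdims=True)
    # M-step: closed-form Gaussian updates
    nk = z.sum(axis=0)
    mu = (z * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((z * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    # Surrogate pi update with the sum-to-one constraint enforced
    Pi = np.clip(z, 1e-12, None)
    Pi /= Pi.sum(axis=1, keepdims=True)
    return Pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.2, 0.05, 200), rng.normal(0.8, 0.05, 200)])
Pi = np.full((len(x), 2), 0.5)
mu, sigma = np.quantile(x, [0.25, 0.75]), np.array([0.2, 0.2])
for _ in range(40):
    Pi, mu, sigma = map_em_sweep(x, Pi, mu, sigma)
```

Without the GMRF term in the $\pi$ update, this reduces to per-pixel EM; the spatial coupling is precisely what the prior of the previous section adds.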
An interesting alternative that avoids reparatory projections is to consider that $\Pi$ is governed by a Dirichlet compound multinomial (DCM) distribution. This means that the hidden random variable $z_i$ is governed by a multinomial distribution with parameters $\boldsymbol{\pi}_i$, which in turn are governed by a Dirichlet distribution with parameters $\boldsymbol{\alpha}_i$. Nikou et al. (2010) demonstrated that, under this hierarchical model, $\pi_i^j$ can be computed in closed form from the Dirichlet parameters, yielding a fully Bayesian model that always guarantees $\sum_j \pi_i^j = 1 \ \forall i$. The parameters of the Dirichlet distribution only need to satisfy $\alpha_i^j > 0 \ \forall i, j$, which makes their optimization easier. Thus, a directional class-adaptive GMRF density can be imposed over $A = \{\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_N\}$ to enforce local regularity. Optimizing the corresponding Q function for the DCM-SVFMM yields updates for $\alpha_i^{j(t+1)}$ of the same form as before, and $\beta_{j,d}^2$ is again estimated in closed form.
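The sum-to-one guarantee follows from normalizing strictly positive Dirichlet parameters, which the following one-function sketch illustrates. Note that the full DCM update of Nikou et al. (2010) also involves the posterior responsibilities; only the normalization that enforces the constraint is shown here.

```python
import numpy as np

def mixing_from_dirichlet(alpha):
    """pi_i^j = alpha_i^j / sum_k alpha_i^k; rows sum to 1 for any alpha > 0."""
    alpha = np.asarray(alpha, dtype=float)
    assert (alpha > 0).all(), "Dirichlet parameters must be strictly positive"
    return alpha / alpha.sum(axis=1, keepdims=True)

# Two pixels, three classes: any positive alphas give valid probabilities
Pi = mixing_from_dirichlet([[2.0, 1.0, 1.0], [0.5, 0.5, 3.0]])
```

This is why the positivity constraint $\alpha_i^j > 0$ is so much easier to handle than an explicit simplex constraint on $\pi_i^j$.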

Background on probabilistic non-local means
The NLM filter (Buades et al. 2005) proposes a scheme for image filtering in which pixels are restored by a weighted sum of similar neighboring patches. The core of the NLM scheme is the computation of the weighting function between patches, which has taken many forms in the literature. In particular, Wu et al. (2013) derived a probabilistic version of the NLM algorithm and its associated probabilistic weighting function.
Linking the description of the probabilistic NLM with the SVFMM background, let us consider $d_{j,d}^{i,m}$, the normalized distance between a pair of adjacent Dirichlet parameters,

$$d_{j,d}^{i,m} = \frac{\left(\alpha_i^j - \alpha_m^j\right)^2}{\beta_{j,d}^2}.$$

Assuming that local differences are i.i.d., we have $d_{j,d}^{i,m} \sim \chi^2(1)$. For a patch-based version of the algorithm, the distance between two patches centered at the $i$th and $m$th locations is defined as

$$D_{j,d}^{i,m} = \sum_{p \in P} d_{j,d}^{i+p,\,m+p},$$

where $P$ is the set of offsets that define a local patch around a given pixel. If the patches are completely disjoint, then $D_{j,d}^{i,m} \sim \chi^2(|P|)$; however, in most cases the patches overlap, so the i.i.d. assumption does not hold. In such cases, the sum of correlated $\chi^2$ distributions can be approximated by a scaled distribution $\gamma_m \chi^2(\eta_m)$, whose scale $\gamma_m$ and effective degrees of freedom $\eta_m$ are computed from $O_{i,m}$, the set of overlapping pixels between the patches centered at the $i$th and $m$th pixels.
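The two ingredients above can be sketched as follows. The patch distance is a direct translation of the definition; the moment-matching step is a generic Satterthwaite-style approximation shown under that assumption — the exact $\eta_m$, $\gamma_m$ of Wu et al. (2013), which account for the overlap set $O_{i,m}$, may differ in form.

```python
import numpy as np

def patch_distance(A, beta2, i, m, offsets):
    """D = sum over patch offsets p of (A[i+p] - A[m+p])^2 / beta2,
    i.e. a sum of chi2(1)-distributed normalized squared differences."""
    return sum((A[i[0] + dy, i[1] + dx] - A[m[0] + dy, m[1] + dx]) ** 2 / beta2
               for (dy, dx) in offsets)

def scaled_chi2_match(mean, var):
    """Match the first two moments of a sum of correlated chi2 variables
    with gamma * chi2(eta): mean = gamma * eta, var = 2 * gamma^2 * eta."""
    gamma = var / (2.0 * mean)
    return gamma, mean / gamma

# Disjoint patches: |P| independent chi2(1) terms -> mean |P|, variance 2|P|,
# and the moment match recovers gamma = 1 and eta = |P| exactly.
gamma, eta = scaled_chi2_match(mean=9.0, var=18.0)

# Toy distance between two patch rows of a small coefficient map
A = np.arange(16.0).reshape(4, 4)
D = patch_distance(A, beta2=1.0, i=(0, 0), m=(1, 0), offsets=[(0, 0), (0, 1)])
```

In the disjoint case the approximation is exact, which is a useful sanity check for any implementation of the overlap-corrected version.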
Hence, the weighting function $u_{j,d}^{i,m}$ of the probabilistic NLM approach is defined as $u_{j,d}^{i,m} = \chi^2(D_{j,d}^{i,m} / \gamma_m \mid \eta_m)$. Sfikas et al. (2008) proposed a variant of the SVFMM in which the local differences between Dirichlet parameters follow a Student's t distribution. This approach was intended to exploit the heavy-tailed nature of the Student's t distribution to perform a robust estimation of the Dirichlet coefficients when edges and structures are present in their local neighborhoods.
Following Bishop's development (Bishop 2006), a Student's t distribution can be expressed as a Gaussian scale mixture,

$$\mathrm{St}(x \mid \mu, \lambda, \nu) = \int_0^{\infty} \mathcal{N}\!\left(x \mid \mu, (g\lambda)^{-1}\right) \mathrm{Gam}\!\left(g \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) dg.$$

This model introduces a new set of latent variables $g_{j,d}^{i,m}$, whose posterior density must be estimated at the E-step, and a new set of parameters $\nu_{j,d}$ with no closed-form analytic estimate. Therefore, numerical optimization methods must be employed to estimate $\nu_{j,d}$.
In this sense, we propose the NLSVFMM as a modification of this model, replacing the $g_{j,d}^{i,m}$ random variables by the probabilistic NLM weights $u_{j,d}^{i,m}$. We therefore reformulate the local differences between contextual Dirichlet parameters as Gaussian distributed with variance modulated by $u_{j,d}^{i,m}$, with the $D_{j,d}^{i,m}$ being latent variables of the model. Following the conventional EM scheme, the posterior densities of $D_{j,d}^{i,m}$ should be calculated at the E-step. However, this leads to a different calculation of $D_{j,d}^{i,m}$ than the one proposed by Wu et al. (2013) (see Sect. 3). Therefore, in order to preserve the use of the original NLM weights, we follow a variational EM approach (Neal and Hinton 1999; Bishop 2006). The variational EM framework introduces the concept of a partial E-step, in which a functor of the latent variables can be used when the posterior densities of these variables cannot be calculated, or when it is desirable to calculate them differently for reasons of efficiency or performance. As demonstrated by Neal and Hinton (1999), such a functor can take any form as long as the log-likelihood function increases at each iteration, effectively driving the model to a local optimum of the function, and hence to an optimum of the parameters of the model. Therefore, following this framework, the latent variables $D_{j,d}^{i,m}$ are estimated at the E-step as the standard quantitative chi-squared statistic proposed by Wu et al. (2013). Once these latent variables are estimated, the weights are calculated at the M-step as $u_{j,d}^{i,m} = \chi^2(D_{j,d}^{i,m} / \gamma_m \mid \eta_m)$. Since $u_{j,d}^{i,m}$ depends on both the $i$th and $m$th observations, the model specifies a different instance of a Gaussian distribution for each pair $\alpha_i^j - \alpha_m^j$ of contextual Dirichlet coefficients in the MRF. This allows $u_{j,d}^{i,m}$ to regulate the variance of the corresponding Gaussian between the $i$th and $m$th observations, depending on whether an edge or a homogeneous area is detected at that location. Thus, as $u_{j,d}^{i,m}$ increases, the Gaussian distribution for the corresponding pair shrinks around zero, imposing a hard smoothing between the observations. On the contrary, as $u_{j,d}^{i,m}$ decreases, the variance of the Gaussian increases, producing a lower pdf value that relaxes the smoothing.
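Reading $u_{j,d}^{i,m} = \chi^2(D_{j,d}^{i,m} / \gamma_m \mid \eta_m)$ as the $\chi^2$ pdf evaluated at the rescaled distance gives the stdlib-only sketch below; whether the original formulation uses the pdf or a tail probability is not spelled out in this text, so take the exact form as an assumption.

```python
import math

def chi2_pdf(x, k):
    """Chi-square pdf with k > 0 degrees of freedom, evaluated at x > 0."""
    return (x ** (k / 2.0 - 1.0) * math.exp(-x / 2.0)
            / (2.0 ** (k / 2.0) * math.gamma(k / 2.0)))

def nlm_weight(D, gamma, eta):
    """u = chi2(D / gamma | eta). For the pixel-wise eta = 1 case the weight
    decreases monotonically with the distance: small differences (homogeneous
    areas) give large weights and hard smoothing, large differences (edges)
    give small weights and relaxed smoothing, as described in the text."""
    return chi2_pdf(D / gamma, eta)
```

For instance, with $\gamma_m = 1$ and $\eta_m = 1$, a small normalized distance yields a much larger weight than a large one, which is exactly the edge-preserving behavior discussed above.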
This approach avoids the introduction of new parameters, since $\eta_m$ and $\gamma_m$ are completely known once $i$ and $m$ are fixed. Therefore, no approximate numerical methods are required, simplifying the model and reducing both its degrees of freedom and the number of samples required for a statistically reliable estimation.
The graphical model of the NLSVFMM is shown in Fig. 1.
Imposing the directional class-adaptive GMRF on this model yields a new density for $p(A)$, which, setting $\partial Q / \partial \alpha_i^j = 0$, leads to a new third-degree equation for the updates of $\alpha_i^j$; $\beta_{j,d}^2$ is again estimated in closed form, now weighted by the $u_{j,d}^{i,m}$ terms. Hereafter, the pixel-wise $\chi^2(1)$ version of the proposed NLSVFMM is referred to as NLv-SVFMM, while the patch-wise $\chi^2(\eta_m)$ version is referred to as NLp-SVFMM.
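Solving the per-pixel third-degree update numerically and keeping a valid root can be sketched as follows. The coefficients are placeholders standing in for the terms collected from the GMRF and the NLM weights, and picking the largest valid root is a sketch-level choice; in practice one would keep the real nonnegative root that maximizes the Q function.

```python
import numpy as np

def nonnegative_real_root(c3, c2, c1, c0):
    """Roots of c3*a^3 + c2*a^2 + c1*a + c0 = 0, keeping the real,
    nonnegative candidates (alpha must stay positive for the Dirichlet)."""
    roots = np.roots([c3, c2, c1, c0])
    real = roots.real[np.abs(roots.imag) < 1e-9]
    candidates = real[real >= 0.0]
    return candidates.max() if candidates.size else None

# Example cubic (a - 1)(a - 2)(a - 3) = a^3 - 6a^2 + 11a - 6
a = nonnegative_real_root(1.0, -6.0, 11.0, -6.0)
```

`np.roots` handles the general cubic robustly; the filtering step is what enforces the $\alpha_i^j > 0$ requirement of the DCM model.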
We also evaluated our proposed model on the 300 real-world images of the Berkeley Segmentation dataset (Martin et al. 2001). In this experiment, we employed a three-dimensional feature vector for each pixel, comprising the three channels of the L*a*b* color space. We also applied a local median smoothing to each channel using a 5 × 5 window centered at each pixel. We evaluated the performance of each algorithm for different values of K = {3, 5, 7, 10, 15, 20}.
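The 5 × 5 median smoothing step can be reproduced with a short numpy routine applied to each channel; edge padding at the borders is an assumption, since the text does not state the border handling.

```python
import numpy as np

def median_smooth(channel, size=5):
    """Median filter with a size x size window centered at each pixel,
    applied independently to each color channel; borders are edge-padded."""
    r = size // 2
    padded = np.pad(channel, r, mode="edge")
    out = np.empty(channel.shape, dtype=float)
    for y in range(channel.shape[0]):
        for x in range(channel.shape[1]):
            out[y, x] = np.median(padded[y:y + size, x:x + size])
    return out

# A single salt pixel is removed while flat regions are untouched
img = np.zeros((7, 7))
img[3, 3] = 1.0
smoothed = median_smooth(img)
```

Stacking the three smoothed L*a*b* channels per pixel then yields the three-dimensional feature vectors used in the experiments.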
We compared our proposed NLv- and NLp-SVFMM models with the conventional FMM, the SVFMM and the St-SVFMM. For the spatially varying algorithms, we employed the DCM Bayesian approximation and the directional class-adaptive GMRF prior specified in Sect. 1. All algorithms in all experiments were initialized with a deterministic version of K-means++ (Arthur and Vassilvitskii 2007) to ensure a fair comparison. As Fig. 2 shows, the U function behaves more aggressively for differences between observations than the St-SVFMM weighting, hence yielding more dichotomous weighting maps (see Fig. 3). For the sake of simplicity, each pixel of each image in Fig. 3 represents $\sum_{d=1}^{D} \sum_{m \in M_i} \lambda_{j,d}^{i,m}$, with $\lambda = g$ for the St-SVFMM and $\lambda = u$ for the NLv- and NLp-SVFMM models, respectively.
Table 1 and Fig. 4 show the superiority of the proposed NLSVFMM (in both variants) in generating higher-confidence prior probability maps for each component. An example of the contextual mixing coefficient maps for the HG0014 case of the BRATS 2013 dataset, and the associated mixing coefficient values at different pixels obtained by each method, is shown; in almost all evaluations, the NLp-SVFMM version achieves the best results. Table 2 shows the Dice coefficients obtained for the evaluation based on the BRATS 2013 dataset. Consistent with the previous results, the NLp-SVFMM variant achieves the best results for segmentations based on the maximization of the posterior probabilities (Bayes minimum classification error). An improvement of about 3 points in Dice is obtained when comparing the NLp-SVFMM with the standard SVFMM, and of more than 1 point with respect to the St-SVFMM, thanks to the proposed prior density. Moreover, in order to explore the ability of the proposed prior densities to yield accurate segmentations, we also computed the Dice coefficients for segmentations based only on the maximization of the prior probability maps generated by each method. As Table 2 shows, the NLv-SVFMM method, followed by the NLp-SVFMM, achieves the best results. Of course, these Dice coefficients are significantly lower because the segmentations do not take the pixel intensities into account. However, the aim is to evaluate which method better captures the local similarities in the images, hence producing more accurate prior information about the image.
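The Dice coefficient used in this evaluation has a standard definition for binary masks, sketched below (the toy masks are illustrative).

```python
import numpy as np

def dice(seg, gt):
    """Dice = 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    seg, gt = np.asarray(seg, bool), np.asarray(gt, bool)
    denom = seg.sum() + gt.sum()
    return 2.0 * np.logical_and(seg, gt).sum() / denom if denom else 1.0

# One overlapping pixel out of 2 + 2 foreground pixels -> Dice = 0.5
d = dice([1, 1, 0, 0], [1, 0, 1, 0])
```

A Dice of 1 means perfect overlap with the reference mask, and 0 means no overlap at all.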
Table 3 shows the results for the evaluation on the 300 images of the Berkeley dataset. The RI is employed to measure the degree of concordance between the automated segmentation and the manual segmentations. Our experiments show that the proposed methods perform favorably in terms of RI compared with the other approaches in almost all situations. The NLv-SVFMM performs comparably to the St variant in most cases, achieving very similar results; however, the NLv-SVFMM requires fewer parameters, hence reducing the degrees of freedom of the model. Moreover, the NLp-SVFMM method achieves the best average and median results in most cases. Only in the K = 3 case (the simplest segmentation) does the SVFMM outperform the rest of the models; as the segmentation complexity increases, the models including edge-preserving priors perform better in all cases.
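The RI can be computed by pair counting over the contingency between two labelings; a stdlib-only sketch:

```python
from math import comb
from collections import Counter

def rand_index(a, b):
    """RI = (pairs grouped together in both labelings + pairs separated in
    both) / (total pairs); 1.0 means perfect agreement up to relabeling."""
    n = len(a)
    total = comb(n, 2)
    # Pairs together in both = sum over contingency cells of C(n_ij, 2)
    same_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    same_a = sum(comb(c, 2) for c in Counter(a).values())
    same_b = sum(comb(c, 2) for c in Counter(b).values())
    # Agreements = together in both + apart in both
    return (total + 2 * same_both - same_a - same_b) / total

# Identical partitions up to label names score a perfect 1.0
ri = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Because it only counts pairwise agreements, the RI is invariant to permutations of the cluster labels, which is essential when comparing unsupervised segmentations against manual ones.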
It is worth noting that the standard deviations are high and similar for all methods. This is due to several images that are intrinsically difficult to segment and present poor RI results uniformly across all methods, rather than to a high variability in the methods themselves. To corroborate this, we measured the percentage of cases of the Berkeley 300 dataset that showed an RI improvement when segmented by our methods with respect to the SVFMM and the St-SVFMM. For most values of K, our NLp-SVFMM approach showed an RI improvement in more than approximately 85% of the cases compared with the SVFMM, and in more than 80% of the cases compared with the St-SVFMM. This behavior indicates a systematic improvement of our algorithms over previous approaches in the literature, rather than random fluctuations due to the high standard deviations inherent to the data. On the other hand, it is also worth noting that the differences in RI between methods are not large, which is reasonable. Under Bayes' decision rule, the prior probability p(Π) (or p(A)) acts as an initial degree of belief in each label at each location before observing the image, which ultimately represents a less informative distribution than the class conditional p(X | Θ). Therefore, the impact of changing the prior distribution on the final results is limited, and the segmentation results are less affected. In addition, prior distributions become weaker as the number of observations in a problem increases, which is the case in pixel classification of an image. In those cases, variations in the prior densities also have a lesser impact on the final results, which is what we observe in our experimentation.
A comparison in terms of the computational time required by each method was also made. Table 4 shows the average times (in seconds) and the standard deviation of each method evaluated on the Berkeley 300 dataset for different numbers of segments.
As expected, the SVFMM is the fastest method, since it does not carry the extra computation of the weights that constrain the β²_{j,d} variances. It should be noted that only the NLv-SVFMM and the St-SVFMM are directly comparable, since both compute the u and g weights, respectively, pixel-wise. Both methods perform very similarly, with no significant difference between them. Although the St-SVFMM model requires a numerical iterative approximation of the ν_{j,d} parameters, which is often a slow procedure, the computation of the g_{j,d}^{i,m} weights is lighter than that of the u_{j,d}^{i,m} weights; that is why the NLv-SVFMM is slightly slower than the St-SVFMM.

Conclusion
In this study, we have proposed a new unsupervised image clustering algorithm that successfully merges the SVFMM framework with the well-known NLM filtering scheme. The main advantage of this algorithm is the proposed new MRF density over the contextual mixing proportions, which enforces local smoothness while preserving the edges and structure of the image. This MRF improves upon the previously proposed St-MRF in terms of model performance, while also reducing the number of parameters to be estimated. Experimental results suggest that the proposed method performs favorably with respect to previous algorithms in the literature when evaluated on public reference datasets.

Fig. 2 Comparison between the behavior of the weighting functions $G = g_{j,d}^{i,m}$ for the St-SVFMM model and $U = u_{j,d}^{i,m}$ for the NLv-SVFMM model (the NLp-SVFMM weighting function is not depicted because it is not numerically comparable to the St and NLv-SVFMM functions)

Fig. 3 Comparison between the G maps of the St-SVFMM and the U maps of the NLv- and NLp-SVFMM models for a case of the BRATS 2013 dataset. Each pixel i of the images represents $\sum_{d=1}^{D} \sum_{m \in M_i} \lambda_{j,d}^{i,m}$, with $\lambda = g$ for the St-SVFMM and $\lambda = u$ for the NLv- and NLp-SVFMM models, respectively

Fig. 4 Example of contextual mixing coefficient maps for a case of the BRATS 2013 dataset, for each class of the segmentation

The non-local spatially variant finite mixture model
One of the main drawbacks of the aforementioned SVFMM is that it enforces local smoothness on the contextual mixing coefficients without taking the structure of the image into account. In other words, the SVFMM iteratively applies an isotropic local Gaussian smoothing to the contextual mixing coefficients, which finally yields an over-smoothed prior probability map that loses the information of edges and structure in the image. To overcome this limitation, Sfikas et al. (2008) proposed the Student's t MRF discussed above.

Table 2 Results in terms of the Dice coefficient over the 25 synthetic high-grade gliomas of the BRATS 2013 dataset for each algorithm evaluated. A highlighted cell indicates that the algorithm in that column achieved the best results compared with the others. Segmentations based on the maximization of the posterior and of the prior probabilities are shown

Table 3 Results in terms of the RI over the 300 images of the Berkeley dataset for each algorithm evaluated. A highlighted cell indicates that the algorithm in that column achieved the best results compared with the others