Variational Inference with Mixture Model Approximation for Applications in Robotics

We propose to formulate the problem of representing a distribution of robot configurations (e.g., joint angles) as that of approximating a product of experts. Our approach uses variational inference, a popular method in Bayesian computation, which has several practical advantages over sampling-based techniques. To represent complex and multimodal distributions of configurations, mixture models are used as the approximate distribution. We show that the problem of approximating a distribution of robot configurations while satisfying multiple objectives arises in a wide range of problems in robotics, for which the properties of the proposed approach have relevant consequences. Several applications are discussed, including learning objectives from demonstration, planning, and warm-starting inverse kinematics problems. Simulated experiments are presented with a 7-DoF Panda arm and a 28-DoF Talos humanoid.


I. INTRODUCTION
Many approaches in robotics rely on the capability to quickly and efficiently generate robot configurations fulfilling multiple objectives. These configurations are typically computed using iterative methods, which can leverage good and diversified initial guesses to improve convergence speed and to find the global optimum.
Our work addresses the problem of representing the distribution p(x) of robot configurations that satisfy multiple given objectives.¹ We first show that our goal of representing the distribution of good and diversified configurations is the same as the goal pursued by variational inference (VI), a popular and scalable method for Bayesian computation. VI methods approximate an unnormalized density by a tractable one.
Similarly to [1], which considered a 10-DoF planar robot, we propose to use a mixture model to represent the complex and often multimodal distribution of solutions. We extend the framework to conditional distributions using a mixture of experts (MoE) approximation. Conditional distributions provide greater online adaptation capabilities to the robot, as they allow us to represent solutions corresponding to distributions of parametrized objectives. We discuss how several forms of objectives in robotics can be included in this framework, illustrated by various robot applications.
The authors are with the Idiap Research Institute, CH-1920 Martigny, Switzerland. E-mail: {name.surname}@idiap.ch. The research leading to these results has received funding from the European Commission's Horizon 2020 Programme through the MEMMO project (Memory of Motion, http://www.memmo-project.eu/, grant agreement 780684) and the COLLABORATE project (https://collaborate-project.eu/, grant agreement 820767).
¹ Robot configurations are often denoted q, but we avoid this notation because q is also used in variational inference to denote approximate distributions.
The paper is accompanied by a full Python/TensorFlow library, allowing researchers to build upon the method and/or use it with other robots by providing a URDF file.

II. TWO INTERPRETATIONS OF THE INVESTIGATED PROBLEM
We start by presenting two interpretations of the problem, both leading to the optimization of the same objective.

A. Product of experts
A product of experts [2] is a model multiplying several densities p_m (the experts) together and renormalizing the resulting density. The renormalization ensures that the result is a proper density and introduces dependencies between the experts. Intuitively, each expert m gives its opinion on a different view or transformation T_m(x) of the data x. Their opinions are then fused together as the product

    p(x \mid \theta_1, \dots, \theta_M) = \frac{1}{C(\theta_1, \dots, \theta_M)} \prod_{m=1}^{M} p_m\big(T_m(x) \mid \theta_m\big),

where C(θ_1, …, θ_M) is the normalizing constant. For example, recalling that x denotes the configuration of a robot (joint angles, plus a floating-base 6-DoF transformation for a humanoid), T_m(x) can be the position of a foot computed using forward kinematics and p_m a Gaussian distribution defining where the foot should be. The product of experts then becomes the distribution of configurations where the foot follows the distribution p_m. By adding multiple experts representing various objectives, the product becomes the distribution fulfilling the set of objectives. We provide various examples of representations and transformations in the Appendix, including forward and inverse kinematics, the center of mass, relative distance spaces, orientations, and hierarchies (null-space structures).
We point out that the experts do not need to be properly normalized themselves, as the normalization already occurs at the level of the product. For compactness, we will later refer to the unnormalized product as

    \tilde{p}(x) = \prod_{m=1}^{M} p_m\big(T_m(x)\big),

where we dropped the parameters θ_1, …, θ_M of the experts from the notation. In the general case, the normalized product p(x) has no closed-form expression. Gaussian experts with linear transformations are a notable exception, for which p(x) is Gaussian, but this case is of limited interest here due to the nonlinearity of the transformations we consider. Approximation methods, as used in Bayesian computation, are thus required; the posterior distribution in Bayesian inference is indeed the renormalized product, over θ, of a likelihood and a prior.
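In code, the unnormalized product is simply a sum of expert log-densities evaluated on the corresponding views. The following minimal sketch is an illustration, not the API of the accompanying library; it assumes experts are given as pairs of a differentiable transformation T_m and a log-density:

```python
import tensorflow as tf

def make_log_p_tilde(experts):
    """Unnormalized PoE log-density: log p_tilde(x) = sum_m log p_m(T_m(x)).

    experts: list of (transform, log_prob) pairs, where transform maps a
    batch of configurations x to the expert's view T_m(x) (e.g., a foot
    position via forward kinematics) and log_prob scores that view.
    """
    def log_p_tilde(x):
        return tf.add_n([log_prob(transform(x))
                         for transform, log_prob in experts])
    return log_p_tilde
```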
Markov chain Monte Carlo: Markov chain Monte Carlo (MCMC) is a class of methods that approximate p(x) with samples. While MCMC methods can represent arbitrarily complex distributions, they suffer from limitations that are particularly constraining for our application. They are known not to scale well to high-dimensional spaces. Except for some methods [3], they require an exact evaluation of p̃(x), whereas stochastic variational inference only requires a stochastic estimate of its gradient. MCMC methods also struggle with multimodal distributions and require dedicated proposal steps to move between distant modes [4]. Designing a good proposal step is also algorithmically restrictive. Furthermore, it is difficult to obtain good acceptance rates in high dimensions, especially with a highly correlated p̃(x). When using sampling-based techniques, it is also difficult to assess whether the distribution p̃(x) is well covered.
Variational inference: Variational inference (VI) [5] is another popular class of methods, which recasts the approximation problem as an optimization. VI approximates the target density p̃(x) with a tractable density q(x; λ), where λ are the variational parameters. Tractable means that drawing samples from q(x; λ) should be easy and that q(x; λ) should be properly normalized. VI seeks to minimize the intractable KL divergence

    D_{KL}(q \,\|\, p) = \mathbb{E}_{q}\big[\log q(x; \lambda) - \log \tilde{p}(x)\big] + \log C,

where C is the normalizing constant. Instead, it minimizes the negative evidence lower bound (ELBO)

    \mathcal{L}(\lambda) = \mathbb{E}_{q}\big[\log q(x; \lambda) - \log \tilde{p}(x)\big],

which can be estimated by sampling as

    \mathcal{L}(\lambda) \approx \frac{1}{N} \sum_{n=1}^{N} \Big( \log q\big(x^{(n)}; \lambda\big) - \log \tilde{p}\big(x^{(n)}\big) \Big), \qquad x^{(n)} \sim q(\,\cdot\,; \lambda).

The reparametrization trick [6], [7] allows us to compute a noisy estimate of the gradient of L(λ), which is compatible with stochastic gradient optimizers like Adam [8]. For example, if q is Gaussian, this is done by sampling η^(n) ∼ N(0, I) and applying the continuous transformation x^(n) = μ + L η^(n), where Σ = LL^⊤ is the covariance matrix; L and μ are then the variational parameters λ. More complex mappings, such as normalizing flows, can also be used [9].

a) Zero-forcing property of minimizing D_KL(q||p): It is important to note that, due to the objective D_KL(q||p), q is said to be zero forcing. If q is not expressive enough to approximate p̃, it will miss some mass of p̃ rather than assign probability to locations where p̃ has no mass (see Fig. 1 for an illustration). In our applications, this means that we are more likely to miss some solutions than to retrieve wrong ones.
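As a concrete illustration of this procedure, the following minimal sketch optimizes a full-covariance Gaussian q against an unnormalized target with the reparametrization trick and Adam. The 2-D toy target log_p_tilde is a placeholder for the sum of expert log-densities; none of the names below come from the accompanying library:

```python
import math
import tensorflow as tf

def log_p_tilde(x):
    # Placeholder unnormalized log-density (a correlated 2-D toy target).
    return -tf.reduce_sum(x ** 2, axis=-1) - 2.0 * (x[..., 0] * x[..., 1]) ** 2

dim = 2
mu = tf.Variable(tf.zeros(dim))      # variational mean
L_chol = tf.Variable(tf.eye(dim))    # Cholesky factor, Sigma = L L^T
opt = tf.keras.optimizers.Adam(0.01)

for step in range(2000):
    with tf.GradientTape() as tape:
        eta = tf.random.normal((64, dim))       # eta^(n) ~ N(0, I)
        x = mu + eta @ tf.transpose(L_chol)     # x^(n) = mu + L eta^(n)
        # log q(x^(n)) by change of variables: N(eta; 0, I) / |det L|
        log_det = tf.reduce_sum(
            tf.math.log(tf.abs(tf.linalg.diag_part(L_chol))))
        log_q = (-0.5 * tf.reduce_sum(eta ** 2, axis=-1)
                 - log_det - 0.5 * dim * math.log(2.0 * math.pi))
        loss = tf.reduce_mean(log_q - log_p_tilde(x))   # negative ELBO
    grads = tape.gradient(loss, [mu, L_chol])
    opt.apply_gradients(zip(grads, [mu, L_chol]))
```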

B. Maximum entropy
Another interpretation is that the robot configurations should minimize a sum of costs c(x) representing the different objectives.
Computing arg min_x c(x) is equivalent to maximum a posteriori estimation in Bayesian statistics. Retrieving a distribution of configurations can instead be done by finding the distribution q under which the expected cost E_q[c(x)] is minimized while the entropy H(q) is the widest, namely

    \min_{q} \; \mathbb{E}_{q}\big[c(x)\big] - H(q).

This actually corresponds to (6), where p̃(x) is replaced with

    \tilde{p}(x) = \exp\big(-c(x)\big).

Intuitively, it also fits our initial objective of generating good (E_q[c(x)]) and diversified (H(q)) samples.
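To make the correspondence explicit, substituting p̃(x) = exp(−c(x)) into the ELBO objective yields the maximum-entropy formulation directly:

```latex
\mathcal{L}(\lambda)
  = \mathbb{E}_{q}\!\big[\log q(x;\lambda) - \log \tilde{p}(x)\big]
  = \mathbb{E}_{q}\!\big[\log q(x;\lambda)\big] + \mathbb{E}_{q}\!\big[c(x)\big]
  = \mathbb{E}_{q}\!\big[c(x)\big] - H(q).
```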

III. MIXTURE MODEL VARIATIONAL DISTRIBUTION
For computational efficiency, the approximate distribution q(x; λ) is often chosen as a factorized distribution, using the mean-field approximation [5]. Correlated distributions can be approximated by a full-covariance Gaussian distribution [10]. These approaches, however, fail to capture the multimodality and arbitrary complexity of p̃(x). The idea of using a mixture as approximate distribution for greater expressiveness is not new [11], but the approach has regained popularity recently, see, e.g., [12], [13], [1].
A mixture model is built by summing the probability of K mixture components,

    q(x; \lambda) = \sum_{k=1}^{K} \pi_k \, q_k(x; \lambda_k),

where π_k is the total mass of component k. The components q_k can be of any family admitting a continuous and invertible mapping between λ_k and the samples. The discrete sampling of the mixture components according to π_k has no such mapping. Instead, the variational objective can be rewritten as

    \mathcal{L}(\lambda) = \sum_{k=1}^{K} \pi_k \, \mathbb{E}_{q_k}\big[\log q(x; \lambda) - \log \tilde{p}(x)\big],

meaning that we only need to compute, and differentiate, expectations under each component distribution q_k(x; λ_k).
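The decomposition above can be implemented directly, since each component remains reparametrizable even though the mixture as a whole is not. A minimal sketch using TensorFlow Probability (an assumption of convenience; the accompanying library may be organized differently):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def mixture_elbo(pis, mus, chols, log_p_tilde, n_samples=32):
    """Negative ELBO of a Gaussian mixture q, estimated per component.

    pis: (K,) weights, mus: (K, d) means, chols: (K, d, d) Cholesky
    factors. log_p_tilde must accept batches of shape (..., d).
    """
    comps = tfd.MultivariateNormalTriL(loc=mus, scale_tril=chols)
    mix = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=pis),
        components_distribution=comps)
    x = comps.sample(n_samples)               # (n_samples, K, d), reparametrized
    # sum_k pi_k E_{q_k}[ log q(x) - log p_tilde(x) ]
    inner = mix.log_prob(x) - log_p_tilde(x)  # (n_samples, K)
    return tf.reduce_sum(pis * tf.reduce_mean(inner, axis=0))
```

In practice, the weights π_k would be parametrized through a softmax so that they remain a valid categorical distribution during optimization.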
Mixture component distributions: Gaussian components with full covariance matrices are a natural choice in robotics, due to the quadratic form of their log-likelihood, which can be exploited in standard approaches like linear quadratic tracking (LQT) or inverse kinematics (IK); see Sec. IV for details. For some situations where p̃(x) is highly correlated, we propose to use "banana-shaped" distributions [14], obtained by applying a differentiable bending mapping to Gaussian samples η^(n) ∼ N(0, I). This mapping can be applied along different parametrized directions. As illustrated in Fig. 1 (b), we obtain a bent Gaussian with full covariance, where κ is a supplementary variational parameter encoding the curvature.
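The exact mapping of [14] is not reproduced here; the sketch below shows one common "banana" construction (an assumption), which composes the usual affine reparametrization with a quadratic shear of curvature κ. Since the shear has unit Jacobian determinant, log q is unchanged from the affine part, keeping the ELBO estimate simple:

```python
import tensorflow as tf

def banana_transform(eta, mu, L_chol, kappa):
    """Differentiable 'banana' mapping of eta ~ N(0, I) (assumed form,
    not necessarily the exact construction of [14]).

    Affine map z = mu + L eta, then bend the second coordinate
    quadratically along the first, with curvature kappa.
    """
    z = mu + eta @ tf.transpose(L_chol)
    bend = kappa * tf.square(z[..., :1])                  # curvature term
    return tf.concat([z[..., :1], z[..., 1:2] + bend, z[..., 2:]], axis=-1)
```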
Conditional distributions: Let us consider the following problem: we want to sample humanoid configurations that are in static equilibrium, within the joint limits, and with the feet close to given poses. As this distribution differs substantially depending on the poses of the feet, we would like to retrieve the distribution conditioned on the feet locations. That way, no new computation is required when the objective changes, providing very fast online adaptability. More generally, we want to approximate p̃(x|y), where y is a task parameter with a distribution of possible values p(y).
For example, if y_m defines a forward kinematics target, the expert p_m can be a Gaussian centered on this target,

    p_m\big(T_m(x) \mid y_m\big) = \mathcal{N}\big(T_m(x) \mid y_m, \Sigma_m\big).

We propose to use as approximate distribution a mixture of experts (ME),

    q(x \mid y; \lambda) = \sum_{k=1}^{K} h_k(y) \, q_k(x \mid y; \lambda_k),

where h_k are the gates and q_k the conditional mixture components, as in (12). The q_k are normally referred to as experts.² Compared to (12), the mass π_k and the mixture components q_k now depend on the task parameter y.
In this work, we will use q_k as Gaussians around a linear function of y, but more complex functions can be considered. Samples from q_k(x|y; λ_k) can be generated by first drawing η^(n) ∼ N(0, I) and then applying the mapping x^(n) = W_k y + c_k + L_k η^(n). The variational parameters λ_k are W_k, c_k and L_k.
Given y, the objective (6) becomes

    \mathcal{L}(\lambda, y) = \mathbb{E}_{q}\big[\log q(x \mid y; \lambda) - \log \tilde{p}(x \mid y)\big],

which we would like to minimize under the distribution of possible task parameters p(y), namely

    \int_{y} p(y) \, \mathcal{L}(\lambda, y) \, dy = \mathbb{E}_{p(y)}\Big[\mathbb{E}_{q}\big[\log q(x \mid y; \lambda) - \log \tilde{p}(x \mid y)\big]\Big].

This objective can also be minimized by stochastic gradient optimization [6], [7], by first sampling y^(n) ∼ p(y) and then x^(l,n) ∼ q(x | y^(n); λ) for each y^(n).

² We will avoid this term because of possible confusion with the product of experts.
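A minimal sketch of this doubly stochastic scheme for a single linear-Gaussian component (the callables sample_task, log_q_cond and log_p_tilde_cond are hypothetical placeholders):

```python
import tensorflow as tf

def sample_component(y, W, c, L_chol, n_samples):
    """Reparametrized samples from q_k(x | y): x = W y + c + L eta."""
    eta = tf.random.normal((n_samples, tf.shape(c)[0]))
    return y @ tf.transpose(W) + c + eta @ tf.transpose(L_chol)

def conditional_step(sample_task, log_q_cond, log_p_tilde_cond,
                     params, opt, n_y=8, n_x=16):
    W, c, L_chol = params
    with tf.GradientTape() as tape:
        loss = 0.0
        for _ in range(n_y):
            y = sample_task()                          # y^(n) ~ p(y), shape (1, d_y)
            x = sample_component(y, W, c, L_chol, n_x) # x^(l,n) ~ q(x | y^(n))
            loss += tf.reduce_mean(
                log_q_cond(x, y) - log_p_tilde_cond(x, y)) / n_y
    grads = tape.gradient(loss, params)
    opt.apply_gradients(zip(grads, params))
    return loss
```

For a full mixture of experts, the per-component losses would additionally be weighted by the gates h_k(y), as in the mixture case above.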

IV. APPLICATIONS

A. Learning objectives from demonstration
It might seem intuitive to choose the transformations needed to represent a desired joint distribution for a given task, but choosing the expert parameters θ_1, …, θ_M can be more complex (if not infeasible). These parameters can instead be learned by providing samples of desired configurations, for example through kinesthetic teaching. From a given dataset of robot configurations X, maximum likelihood (or maximum a posteriori) estimation of the intractable distribution p(x|θ_1, …, θ_M) should be computed. It can be done using gradient ascent [2] with

    \frac{\partial \log p(X \mid \theta_1, \dots, \theta_M)}{\partial \theta_m} = \sum_{x \in X} \frac{\partial \log \tilde{p}(x)}{\partial \theta_m} - |X| \, \frac{\partial \log C(\theta_1, \dots, \theta_M)}{\partial \theta_m},

where C(θ_1, …, θ_M) is the intractable normalizing constant, whose log-gradient is an expectation under the model, ∂ log C / ∂θ_m = E_{p(x|θ_1,…,θ_M)}[∂ log p̃(x) / ∂θ_m]. Computing a gradient step therefore requires computing an expectation under the intractable distribution p(x|θ_1, …, θ_M), making the process computationally expensive. In [15], it is proposed to use a few sampling steps initialized at the data distribution X. Unfortunately, this approach fails when p(x|θ_1, …, θ_M) has multiple modes: the few sampling steps never move between modes, making it impossible to estimate their relative masses. Wormholes have been proposed as a solution [16], but are algorithmically restrictive. Instead, we propose to use VI with the proposed mixture distribution to approximate p(x|θ_1, …, θ_M). The training process thus alternates between minimizing D_KL(q||p) for the current p and using the current q to compute the gradient (20). q can be used either as an importance sampling distribution or directly, if it is expressive enough to represent p. The mass of the modes is directly encoded in π_k, which is much more convenient than moving samples between distant modes.
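A sketch of this alternating scheme (the callables are hypothetical placeholders standing in for the steps described above):

```python
import numpy as np

def learn_expert_parameters(X, theta, q_params, fit_mixture_vi,
                            sample_mixture, grad_log_p_tilde,
                            n_outer=100, lr=1e-3):
    """Alternate VI refitting and likelihood gradient steps.

    fit_mixture_vi minimizes D_KL(q || p_theta) for the current theta,
    sample_mixture draws configurations from q, and grad_log_p_tilde
    returns d log p_tilde(x; theta) / d theta.
    """
    for _ in range(n_outer):
        q_params = fit_mixture_vi(q_params, theta)   # approximate p_theta
        x_model = sample_mixture(q_params, 256)      # model samples via q
        # Gradient (20): data term minus model term, the latter
        # estimated under q instead of the intractable p_theta.
        data_term = np.mean([grad_log_p_tilde(x, theta) for x in X], axis=0)
        model_term = np.mean([grad_log_p_tilde(x, theta)
                              for x in x_model], axis=0)
        theta = theta + lr * (data_term - model_term)
    return theta, q_params
```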
This process may seem overly complex compared to directly encoding p(x), for example as a mixture, a non-parametric density, or with a variational autoencoder. However, it offers several advantages, especially when only small datasets are available and/or when very good generalization capabilities are required, namely: • Simple and compact explanations can be found for complex distributions.
• Domain-specific a priori knowledge can be included in the form of particular transformations and experts, which reduces the need for data. More complex transformations can still be learned if more data are available and more complex objectives are considered. • The model is more structured and interpretable for a human user, which could facilitate interactive or active learning procedures. For example, to learn the distribution shown in Fig. 1 directly, many samples spanning the whole distribution would be required. Finding a simple explanation under a known transformation, following the Occam's razor principle, increases generalization capabilities. Very sharp and precise distributions can thus be obtained with a very limited number of parameters.

B. Planning
Sampling-based robot motion planning [17] requires sampling configurations in obstacle-free space, while possibly satisfying other constraints. One of the difficulties is to generate those samples, especially under precise constraints and in high-dimensional spaces, where standard rejection sampling approaches fail. Uniform sampling followed by projection onto the constraint is proposed in [18], and dedicated approaches for closed chains are proposed in [19], [20]. Variational methods can be interesting when high-dimensional configuration spaces are involved, as they are known to scale better than sampling.
Another challenge of sampling-based planning is to make sure that the space is well covered, ensuring good connectivity. This is indeed a general problem of MCMC methods when approximating a distribution. Using a mixture model, connectivity is easier to assess through the existence of common mass between components. The overlap between two components k and l can be computed, for example, as

    \int q_k(x; \lambda_k) \, q_l(x; \lambda_l) \, dx,

which is closed-form in the case of Gaussian components. Due to the zero-forcing properties of VI, if two components have some common mass, it is very likely that they are connected.
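For Gaussian components, this overlap integral has the well-known closed form ∫ N(x|μ_k, Σ_k) N(x|μ_l, Σ_l) dx = N(μ_k | μ_l, Σ_k + Σ_l), which a short sketch makes concrete:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_overlap(mu_k, cov_k, mu_l, cov_l):
    """Closed-form overlap: int N(x|mu_k,S_k) N(x|mu_l,S_l) dx."""
    return multivariate_normal.pdf(mu_k, mean=mu_l, cov=cov_k + cov_l)
```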

C. Warm-starting inverse kinematics
Inverse kinematics problems are typically solved with iterative techniques that are subject to local optima and benefit from a good initialization. It often happens that the poses we seek should satisfy general, task-specific, and goal-specific constraints. For example, a humanoid can achieve a wide range of tasks with its hands while constantly keeping its feet on the ground and ensuring balance. To warm-start the optimization of all the constraints together, it would be beneficial to pre-compute the distribution of poses already satisfying the general constraints.
If Gaussian components are used, a nice property is the ability to compute the intersection of objectives without further optimization. For example, suppose we have q_A(x; λ) approximating the distribution of poses satisfying a set of objectives A, defined by the intractable distribution p̃_A(x), and q_B(x; λ) for a set B. We can directly compute q_{A∩B}(x; λ), the distribution of configurations satisfying both sets, as a product of mixtures of Gaussians, which is itself a mixture of Gaussians [21].

[Fig. 1: Precise forward kinematics objectives often create a highly correlated distribution of joint angles, which is difficult to sample from or to estimate as a mixture of Gaussians. (a) Joint distribution of a 2-DoF planar robot whose end-effector should stay on a line. The colormap represents exp(−c(x)) and the lines are isolines of c(x). Probability mass is located in yellow (or lighter) areas. (b) A mixture of "banana-shaped" distributions can substantially reduce the number of mixture components required to represent the distribution, thanks to its capacity to adapt to the local curvature. (c) A mixture of Gaussians with the same number of components is not able to cover the distribution as efficiently; here, for illustration, we employed fewer components than required. The zero-forcing properties of D_KL(q||p) are visible, as no mass of q is assigned where p̃ has none.]
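The product of two Gaussian mixtures follows from the pairwise products of their components [21]; a sketch of the computation (not the accompanying library's implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def product_of_gaussian_mixtures(pis_a, mus_a, covs_a, pis_b, mus_b, covs_b):
    """Product of two Gaussian mixtures, itself a Gaussian mixture.

    Component (k, l) fuses means with precision weighting; its weight is
    pi_k * pi_l * N(mu_k | mu_l, S_k + S_l), renormalized at the end.
    """
    pis, mus, covs = [], [], []
    for pi_k, mu_k, cov_k in zip(pis_a, mus_a, covs_a):
        for pi_l, mu_l, cov_l in zip(pis_b, mus_b, covs_b):
            cov = np.linalg.inv(np.linalg.inv(cov_k) + np.linalg.inv(cov_l))
            mu = cov @ (np.linalg.solve(cov_k, mu_k)
                        + np.linalg.solve(cov_l, mu_l))
            w = pi_k * pi_l * multivariate_normal.pdf(
                mu_k, mean=mu_l, cov=cov_k + cov_l)
            pis.append(w); mus.append(mu); covs.append(cov)
    pis = np.array(pis) / np.sum(pis)
    return pis, np.array(mus), np.array(covs)
```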

V. EXTENSION TO PRIORITIZED OBJECTIVES
The proposed approach is also compatible with classical null-space formulations [30]. We can exploit the fact that stochastic variational inference does not need to evaluate the unnormalized density p̃, but only its gradient. This characteristic is shared with only very few Monte Carlo methods, such as [3]. It allows us to filter the gradients of the different objectives (expert and transformation). The derivative of an expert log-probability with respect to the configuration x can be written as

    \frac{\partial \log p_m\big(T_m(x)\big)}{\partial x} = J_m(x)^\top \, \frac{\partial \log p_m(u)}{\partial u}\bigg|_{u = T_m(x)},

where the Jacobian J_m(x) of the transformation appears. When multiple tasks should be prioritized, a null-space filter N_1(x) (e.g., N_1(x) = I − J_1(x)⁺ J_1(x)) can be added such that the gradient of the secondary objective only acts perpendicularly to the main objective, namely

    \frac{\partial \log \tilde{p}(x)}{\partial x} = J_1(x)^\top \frac{\partial \log p_1(u)}{\partial u}\bigg|_{u = T_1(x)} + \, N_1(x) \, J_2(x)^\top \frac{\partial \log p_2(u)}{\partial u}\bigg|_{u = T_2(x)}.

Note that, when using an automatic differentiation library, gradients can easily be redefined with this filter.

[Fig. 2: A 5-DoF bimanual planar robot with two forward kinematics objectives. (c) The gradient of the objective of the orange arm is projected onto the null-space of the Jacobian of the forward kinematics of the blue arm. This is made possible because stochastic variational inference only requires evaluating the gradient of the unnormalized density.]

Fig. 2 shows a 5-DoF bimanual planar robot with two forward kinematics objectives. When the tasks are compatible, the filtering has no effect. The gradient of the objective of the orange arm can be projected onto the null-space of the Jacobian of the forward kinematics of the blue arm, resulting in a prioritization.
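A sketch of the filtered gradient with the standard pseudoinverse-based projector (an assumed choice for N_1; jac_m and grad_m are hypothetical callables returning J_m(x) and ∂ log p_m(u)/∂u at u = T_m(x)):

```python
import numpy as np

def nullspace_filtered_grad(x, jac_1, grad_1, jac_2, grad_2):
    """Prioritized gradient of log p_tilde: the secondary objective acts
    only in the null space of the primary task Jacobian."""
    J1, J2 = jac_1(x), jac_2(x)
    N1 = np.eye(J1.shape[1]) - np.linalg.pinv(J1) @ J1  # null-space projector
    return J1.T @ grad_1(x) + N1 @ (J2.T @ grad_2(x))
```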

VI. EXPERIMENTS
In the experiments, we evaluate the quality of the approximation q with respect to the number of components and the choice of the component distribution. The model with one component is the baseline, which corresponds to the common Gaussian variational approximation (GVA). The quality of the approximation is evaluated using different similarity and divergence measures between p̃ and q. These quantities are computed as integrals, which are evaluated by discretizing the configuration space within joint limits. We consider three scenarios:
• a 2-DoF planar robot whose end-effector should stay on a line;
• a 7-DoF Panda arm with a given position target for its gripper;
• a 28-DoF floating-base (6-DoF) Talos humanoid with given position targets for the left gripper and the feet, where the robot should also ensure static equilibrium on flat ground.
In each scenario, the robot has more degrees of freedom than required by the task, creating a rich distribution of solutions. For the floating base, the rotation is parametrized using Euler angles, under the assumption that the solution has a small variance. For a more rigorous treatment of the problem, a circular or higher-dimensional tractable density should be chosen; due to the generally intractable normalizing constant and the difficulty of sampling such directional distributions, this is still an open research question.

Table I reports the results for the 2-DoF robot, and Fig. 1 displays p̃ and q as colormaps. As expected, one Gaussian is not expressive enough to represent the highly correlated distribution created by precise forward kinematics objectives. In a 2-dimensional space, a relatively low number of components is sufficient to approximate p̃ almost exactly. Models with "banana-shaped" components reach the distribution more quickly, thanks to their ability to bend.

Table II reports the results for the 7-DoF Panda robot. With 7 dimensions, the distribution is significantly richer and more complex to represent. While the results suggest that we do not represent the distribution as well as in the previous scenario, Fig. 3 shows that 10 Gaussian components already provide a good variety of configurations. For a better approximation, an iterative training procedure can be used, as proposed in [12], [13], which incrementally builds a richer approximation by increasing the number of components. Compared to sampling, an interesting feature of the proposed method is that the insensitivity of the task with respect to the last joint of the robot (the wrist) can be directly encoded in the covariance matrices. This greatly reduces the number of units (samples or clusters) needed to represent the distribution, which can be crucial in high-dimensional spaces.
For the humanoid robot, running evaluations using discretization is too expensive. Fig. 4 therefore shows configuration samples from q, a Gaussian mixture model with K = 50 components, compared with samples obtained using Hamiltonian Monte Carlo (HMC). We allowed a limited 40 sec. of stochastic optimization for variational inference and of HMC steps (equivalent to 2000 steps). For HMC, the kernel was chosen Gaussian with a standard deviation of 0.1. Fig. 4 shows a higher variance of the right arm under variational inference; the related degrees of freedom have little to no influence on the task (only on balancing). Thanks to the encoding of Gaussian covariances and the maximization of entropy in (10), the method can quickly localize unimportant dimensions. Compared to the sampling approach, it has benefits both in terms of computational and memory efficiency, as well as in generalization capabilities.
In [25], the authors addressed a similar problem on the Talos robot using a GMM. They approximated the feasible positions of the CoM given the set of contact points, which is only a 3-dimensional problem. Learning the GMM is done with a two-step procedure involving sampling and then fitting the model, which would be problematic in higher-dimensional problems. Our approach directly copes with the full state of the robot, but further developments are required to tackle feasibility in the case of non-flat terrain.

VII. CONCLUSION
We proposed the use of variational techniques with mixture models as a general tool to represent distributions of robot configurations satisfying several forms of objectives. We emphasized their advantages over sampling methods, showing that computational and memory requirements can be significantly reduced in high-dimensional spaces, which can be crucial in many applications. The developed approach, and the associated source code, can be exploited in a wide range of applications. It is particularly promising for training unnormalized densities (e.g., energy-based models or PoEs), as it can handle multimodality better than sampling approaches. This direction will be further developed and presented in future studies.