The Shallow Gibbs Network, Double Backpropagation and Differential Machine Learning

We have built a Shallow Gibbs Network model, as a Random Gibbs Network Forest, that reaches the performance of the Multilayer feedforward Neural Network with fewer parameters and fewer backpropagation iterations. To achieve this, we propose a novel optimization framework for our Bayesian Shallow Network, called the {Double Backpropagation Scheme} (DBS), which can also fit the data perfectly with an appropriate learning rate, is convergent, and is universally applicable to any Bayesian neural network problem. The contribution of this model is broad. First, it integrates all the advantages of the Potts Model, a very rich random partition model, which we have also modified to propose its Complete Shrinkage version using agglomerative clustering techniques. The model also takes advantage of Gibbs Fields for the structure of its weight precision matrix, mainly through Markov Random Fields, and comes in five (5) variant structures: the Full-Gibbs, the Sparse-Gibbs, the Between-layer Sparse Gibbs (B-Sparse Gibbs for short), the Compound Symmetry Gibbs (CS-Gibbs for short), and the Sparse Compound Symmetry Gibbs (Sparse-CS-Gibbs) model. The Full-Gibbs mainly mirrors fully-connected models, while the other structures show how the model can be reduced in complexity through sparsity and parsimony. All these models have been evaluated on the Mulan project multivariate regression datasets, and the results arouse interest in these structures, in the sense that different structures lead to different results in terms of Mean Squared Error (MSE) and Relative Root Mean Squared Error (RRMSE). 
For the Shallow Gibbs Network model, we have found the perfect learning framework: the $(l_1, \boldsymbol{\zeta}, \epsilon_{dbs})-\textbf{DBS}$ configuration, which combines the \emph{Universal Approximation Theorem} and the DBS optimization, coupled with the (\emph{dist})-Nearest Neighbor-(h)-Taylor Series-Perfect Multivariate Interpolation (\emph{dist}-NN-(h)-TS-PMI) model [which in turn combines the search for the nearest neighbor for a good Train-Test association, the Taylor Approximation Theorem, and finally the Multivariate Interpolation Method]. It indicates that, with an appropriate number $l_1$ of neurons on the hidden layer, an optimal number $\zeta$ of DBS updates, an optimal DBS learning rate $\epsilon_{dbs}$, an optimal distance \emph{dist}$_{opt}$ in the search for the nearest neighbor in the training dataset for each test data point $x_i^{\mbox{test}}$, and an optimal order $h_{opt}$ of the Taylor approximation for the Perfect Multivariate Interpolation (\emph{dist}-NN-(h)-TS-PMI) model once the {\bfseries DBS} has overfitted the training dataset, the train and the test errors converge to zero (0). 



Introduction
We introduce a novel ensemble learning approach which combines random partition models with a non-parametric predictor such as Multilayer feedforward networks. Neural networks are known as universal approximators (Hornik et al. (1989) and Cybenko (1989)), and are very well suited to explore other learning methods. We combine them with Potts clustering models to create a bagging-boosting-averaging-like learning framework where several estimates from each random partition are aggregated into one prediction. Our approach strikes a balance between overfitting and model stability in the presence of higher-dimensional data. More precisely, our model merges the Potts model in a multivariate multiple regression task with Bayesian deep learning models to produce the ensemble learning method. Potts model clustering was introduced to the statistical community by Murua et al. (2008a). The model is applied to the set of covariate values, formalizing a proximity co-clustering. The method has been used and proven to be effective (see Murua & Wicker (2014); Murua & Quintana (2017)).
The model, called the Structured Potts Shallow Gibbs Neural Network, will be a hierarchical Bayesian model where we train individual neural nets to specialize on sub-groups (latent cluster components) while still being informed about representations of the overall population. Our Potts neural network model differs from those of Kanter (1988) and Philipsen & Cluitmans (1993), which are generalizations of the Ising neural network. We call ours a structured one because we integrate structured correlations among the weights (and offsets) of the network (Sun et al., 2017) through a Markov Random Field (MRF) process. Bayesian learning offers the opportunity to quantify posterior uncertainty on neural network (NN) model parameters. We can specify priors to inform and constrain our models and obtain structured uncertainty estimates.

Efficiency of regression clustering
Regression clustering is a learning algorithm for multiple regression settings in which the observations, consisting of a dependent response vector Y and one or more independent covariates X, are to be clustered. Recovering the implicit partitioning of the observations is the problem of regression clustering: the algorithm divides the data into two or more clusters and performs an independent multiple regression within each cluster. Many algorithms have been proposed for this approach in the past (see Späth (1979, 1985)). The ultimate aim is to discover the classes without supervision before regression is carried out in each class. Nowadays, many variations have been proposed in machine learning; clustered linear regression (CLR) is an example (Ari & Güvenir, 2002).
Regression clustering has shown a significant predictive advantage in several tasks: clustered regression trees (Torgo & da Costa, 2000), ridge regression clustering (Nagpal et al., 2013), and non-parametric Bayesian clustering regression (see Müller et al. (2008), Yang et al. (2014), and Murua & Quintana (2017)). The combination of clustering and regression approaches can mitigate predictive-performance problems caused by data heterogeneity (Chen et al., 2013). Using a clustered method, the data subsets are created with a degree of homogeneity that improves prediction accuracy.
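As a minimal sketch of the idea (not the model proposed in this chapter), the two-step pipeline can be illustrated on synthetic data with a known two-group structure; in practice the partition would be estimated rather than given, and the helper name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heterogeneous data: two latent groups with different slopes.
x = rng.uniform(-1, 1, size=(200, 1))
group = (x[:, 0] > 0).astype(int)
y = np.where(group == 0, 2.0 * x[:, 0], -3.0 * x[:, 0]) + 0.05 * rng.standard_normal(200)

def fit_per_cluster(x, y, labels):
    """Fit an independent least-squares regression within each cluster."""
    models = {}
    for k in np.unique(labels):
        mask = labels == k
        X = np.column_stack([np.ones(mask.sum()), x[mask, 0]])  # intercept + slope
        beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        models[k] = beta
    return models

# Here the partition is known; regression clustering estimates it from the data.
models = fit_per_cluster(x, y, group)
for k, beta in models.items():
    print(k, np.round(beta, 2))
```

Each per-cluster fit recovers its own slope, something a single pooled regression on this heterogeneous sample cannot do.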

Combination of neural network regression and Potts clustering model
Potts clustering has the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, and scalability of clustering algorithms (Murua et al., 2008b). As shown by Murua & Quintana (2017), Potts clustering can be a good prior for statistical models such as regression. Our model is a multivariate multiple regression clustering model, aiming to combine the Potts clustering model with a non-linear multivariate multiple regression tool: neural networks. The first main advantage of combining neural network regression and Potts clustering models is that a few of the drawbacks of linear regression can be overcome by using an artificial neural network (ANN), and clustered models can improve prediction accuracy. Their combination presents other key advantages that make them most suitable for certain problems and situations:
1. An artificial neural network is capable of fitting and modelling non-linear and complicated relationships, which is really important since many of the relationships between inputs and outputs are both non-linear and complex in real life.
2. An artificial neural network is a good predictor: it can also infer unseen relations on unseen data after the necessary fitting step from the initial inputs and their relationships, thereby allowing the model to predict on unseen data.
3. Extension of the Potts model to various forecasting problems: in addition to exploring a graph-based consensus clustering (GCC) to find cluster structures from heterogeneous data, the model proposed here offers a novel opportunity to couple neural network models with random partition models in diverse machine learning tasks, such as multivariate multiple regression (precisely: when the response Y is a vector). In fact, Potts clustering, as used here, is a random partition model; explicitly, a clustering model with a prior distribution on partitions (Müller & Quintana, 2010a). Section 3 describes Potts clustering in detail.
However, an ANN is a black-box learning technique that cannot interpret input-output relationships and cannot cope with uncertainties. Using the Bayesian framework, as done by Murua & Quintana (2017), can overcome the uncertainty issue, and the results can be easily compared. The Bayesian technique is highly important, as traditional neural network training requires a lot of labelled data to keep the possibility of overfitting under control. And when it comes to real-world regression assignments, the task gets more complicated: such regression exercises often have less training data to work with, which makes it easy for neural networks to get stuck in overfitting.
A principled approach for solving this problem is Bayesian Neural Networks (see Vehtari & Lampinen (1999), Bishop (1997)). In Bayesian Neural Networks, prior distributions are placed on the network weights to account for model uncertainty. One can fit a predictor by doing Bayesian inference on the weights, which both matches the training data and accounts for the volatility of its own estimates on the test data (Blundell et al., 2015).

Shallow Gibbs networks
Let $D = \{(y_i, x_i) : i = 1, \ldots, n\}$ denote the complete data, where $x^n = \{x_1, \ldots, x_n\} \subset R^q$ is the set of input vectors or covariables, and $y^n = \{y_1, \ldots, y_n\} \subset R^p$ the set of associated responses. The matrix of covariables will be denoted by $X = (x_1 | x_2 | \cdots | x_n)^T \in M_{n \times q}$ (here and throughout the chapter, the symbol $T$ will indicate matrix transposition). Our shallow network is, as in the general case, a feedforward network given by the equations $h^{(k)} = b^{(k)} + g_{k-1}(h^{(k-1)}) W^{(k)}$, $k = 1, 2$, where layer 0 corresponds to the input data, $h^{(0)} = x$, $b^{(0)} = 0$, $g_0(x) = x$; the first layer corresponds to the hidden layer with vector output $h^{(1)}$; and the output layer corresponds to the network's predicted values $h^{(2)} = \hat{y}$. The parameters $\{(b^{(k)}, W^{(k)}) : k = 1, 2\}$ are, respectively, the matrices of offsets, also known as biases, and weights. The function $g_{k-1}$ is a linear ($k = 2$) or non-linear ($k = 1$) function, such as the identity or a logistic sigmoid, that is applied element-wise. The top-layer output $h^{(2)}$ is used for making a prediction and is combined with the supervised target $y$ into a loss function $L(h^{(2)}, y)$, which is typically convex in $h^{(2)} = b^{(2)} + g_1(h^{(1)}) W^{(2)}$. As in any regression model, the network's predicted value $h^{(2)}$ is an estimate of $f(x) = E(y|x)$. Our model may be seen as a simplified feedforward network where the weight matrices are constrained to obey a Gibbs distribution whose neighborhood structure is explained below. Figure 1 sketches our network architecture.
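The forward pass described by these equations can be sketched as follows (a minimal NumPy illustration with the logistic sigmoid as $g_1$; the dimensions $q$, $l_1$, $p$ and the random parameter values are arbitrary assumptions for the example):

```python
import numpy as np

def sigmoid(u):
    """Logistic sigmoid, applied element-wise (the non-linear g_1)."""
    return 1.0 / (1.0 + np.exp(-u))

def shallow_forward(x, W1, b1, W2, b2):
    """Two-layer forward pass:
    h1   = b1 + x W1          (hidden layer, l1 units; g_0 = identity)
    yhat = b2 + g1(h1) W2     (output layer, p-dimensional prediction)."""
    h1 = b1 + x @ W1
    yhat = b2 + sigmoid(h1) @ W2
    return yhat

q, l1, p = 4, 3, 2
rng = np.random.default_rng(1)
x = rng.standard_normal((5, q))                    # 5 input vectors, row convention
W1, b1 = rng.standard_normal((q, l1)), np.zeros(l1)
W2, b2 = rng.standard_normal((l1, p)), np.zeros(p)
print(shallow_forward(x, W1, b1, W2, b2).shape)    # (5, 2)
```

The row-vector convention matches the text, where the activation multiplies $W^{(k)}$ on the right.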
Let $w^{(1)} = \mathrm{vec}(W^{(1)})$ be the $q l_1$-dimensional vector of stacked columns of the weight matrix $W^{(1)}$. The vector $w^{(2)} = \mathrm{vec}(W^{(2)}) \in R^{l_1 p}$ is defined similarly. Let $w = (w^{(1)}, w^{(2)})$ be the vectorized version of the weight matrices $(W^{(1)}, W^{(2)})$. Our model is set into a Bayesian framework by placing a Gaussian Markov random field prior on the parameters. The complete model is similar to the one described in (Sun et al., 2017), but ours presents significant differences in the actual architecture, because we set up two independent fields: one on all the network weights $w = (w^{(1)}, w^{(2)})$, and another on all the network biases $b = (b^{(1)}, b^{(2)})$. We assume that both random fields are zero-mean Gaussian fields.
Let $\Omega \in M_{(q+p)l_1 \times (q+p)l_1}$ and $\Gamma \in M_{(l_1+p) \times (l_1+p)}$ be the precision matrices associated with the random fields of the weights and the biases, respectively. Here we assume that the biases are independent, so that $\Gamma$ is a block-diagonal matrix with blocks $\Gamma_1 \in M_{l_1 \times l_1}$ and $\Gamma_2 \in M_{p \times p}$. The corresponding prior densities are given by $p(w|\Omega) = (2\pi)^{-l_1(p+q)/2} \det(\Omega)^{1/2} \exp\left(-\frac{1}{2} w^T \Omega w\right)$ for the weights, and $p(b|\Gamma) = (2\pi)^{-(l_1+p)/2} \det(\Gamma)^{1/2} \exp\left(-\frac{1}{2} b^T \Gamma b\right)$ for the biases. Our framework for this part is completed by assuming an Inverse-Wishart$(\Lambda, \nu)$ prior for the covariance matrix $\Sigma$. Here $\Lambda \in M_{p \times p}$ is the prior scale matrix, and $\nu > 0$ is the prior degrees of freedom.

The sparse-Gibbs network
We define the neighborhood of the random field based on the nodes of the hidden layer. All weights coming into or going out of the same hidden node are considered neighbor weights; that is, two weights are neighbors if and only if they are attached to the same hidden node. The matrix $\Omega$ is sparse, composed of $l_1$ blocks of size $q \times q$, $p$ blocks of size $l_1 \times l_1$, and $l_1$ blocks of size $q \times p$, as well as the associated transposed blocks of size $p \times q$. Note that the total number of non-zero elements in $\Omega$ is $l_1(q^2 + p^2 + 2pq) = l_1(q+p)^2$, which corresponds to a sparsity ratio of $l_1(q+p)^2 / (l_1(q+p))^2 = 1/l_1$. This ratio is substantial even for small values of $l_1$.
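The non-zero count and the $1/l_1$ sparsity ratio derived above can be checked numerically:

```python
def sparsity_ratio(l1, q, p):
    """Fraction of non-zero entries in the sparse-Gibbs precision matrix Omega:
    l1 * (q + p)^2 non-zeros out of (l1 * (q + p))^2 entries, i.e. 1 / l1."""
    nonzero = l1 * (q ** 2 + p ** 2 + 2 * p * q)  # = l1 * (q + p)**2
    total = (l1 * (q + p)) ** 2                   # size of the full matrix
    return nonzero / total

# For l1 = 3 hidden units, only a third of the entries are non-zero,
# regardless of the input and output dimensions q and p.
print(sparsity_ratio(3, 10, 4))   # 0.333...
print(sparsity_ratio(10, 10, 4))  # 0.1
```

The ratio depends only on $l_1$, which is what makes the structure attractive as the hidden layer grows.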
Let $V_{i\bullet} \in M_{q \times q}$ be the block of the precision matrix associated with column $i$ of $W^{(1)}$, $i = 1, \ldots, l_1$; let $V_{\bullet j} \in M_{l_1 \times l_1}$ be the block associated with column $j$ of $W^{(2)}$, $j = 1, \ldots, p$; and let $V_{ii} \in M_{q \times p}$ be the block associated with column $i$ of $W^{(1)}$ and row $i$ of $W^{(2)}$, $i = 1, \ldots, l_1$. Let also $e_{i;l_1} \in M_{1 \times l_1}$ be the $i$th row of the $l_1 \times l_1$ identity matrix. The precision matrix $\Omega$ is then assembled from these blocks through Kronecker products with the vectors $e_{i;l_1}$, where the symbol $\otimes$ stands for the Kronecker product. We will refer to this simplified network as the sparse-Gibbs network, as opposed to the full-Gibbs network that imposes no constraints on $\Omega$.

Compound symmetry Gibbs network
Note that the above matrix consists of four blocks, that is, $\Omega = \begin{pmatrix} \Omega_1 & \Omega_{12} \\ \Omega_{12}^T & \Omega_2 \end{pmatrix}$, with $\Omega_1 \in M_{q l_1 \times q l_1}$, $\Omega_2 \in M_{p l_1 \times p l_1}$, and $\Omega_{12} \in M_{q l_1 \times p l_1}$. Also, if all matrices $V_{i\bullet}$ were equal, then we could write $\Omega_1 = I_{l_1} \otimes V_{1\bullet}$, where $I_r$ denotes the identity matrix in $M_{r \times r}$. Similarly, if all matrices $V_{\bullet j}$ were the same, we could write $\Omega_2 = I_p \otimes V_{\bullet 1}$. We could also suppose that all matrices $V_{ii}$ are equal. These simplifications would greatly reduce the number of parameters defining the network.
A less stringent Gaussian random field model consists of replacing $I_{l_1}$ and $I_p$ in the above Kronecker products with more general matrices $U_1 \in M_{l_1 \times l_1}$, $U_2 \in M_{p \times p}$. This is the second model we consider. However, we simplify it by constraining the inverse matrices of $U_1$, $U_2$, $V_{1\bullet}$, and $V_{\bullet 1}$ to have compound symmetry. That is, the inverses of these matrices have to be of the form $\kappa_1 (1 - \kappa_2) I_{l_1} + \kappa_1 \kappa_2 \mathbf{1}\mathbf{1}^T$, for $\kappa_1 > 0$ and $\kappa_2 \in [-1, 1]$, where $\mathbf{1}$ stands for the column matrix with all elements equal to 1. It can easily be shown that this constraint implies matrices of the form $a I_{l_1} + b \mathbf{1}\mathbf{1}^T$, with $(a, b) \in R^+ \times R$. The matrices $V_{ii}$ are not constrained. We will refer to this network architecture as the sparse compound symmetry Gibbs network, or sparse-CS-Gibbs network for short.
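A quick numerical check of the compound symmetry structure: the eigenvalues of $aI_n + b\mathbf{1}\mathbf{1}^T$ are $a$ (with multiplicity $n-1$) and $a + nb$, so positive definiteness holds exactly when $a > 0$ and $a + nb > 0$ (the sample values below are arbitrary):

```python
import numpy as np

def compound_symmetry(n, a, b):
    """Matrix of the form a*I + b*1*1^T (compound symmetry)."""
    return a * np.eye(n) + b * np.ones((n, n))

# Eigenvalues: a with multiplicity n-1, plus a + n*b for the constant vector.
M = compound_symmetry(4, a=2.0, b=-0.3)
eig = np.sort(np.linalg.eigvalsh(M))
print(np.round(eig, 6))  # eigenvalues 0.8, 2.0, 2.0, 2.0 -> positive definite
```

Only two scalars per matrix need to be stored and constrained, which is the parsimony the CS structure buys.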
Two less structured models. More complex models can be conceived by (a) changing the neighborhood in the Gibbs network, and (b) allowing the diagonal blocks $\Omega_{11}$ and $\Omega_{22}$ to be unconstrained. In the Gibbs network, (a) corresponds to considering all weights coming from the same layer as neighbors, while still keeping the neighborhood restriction on weights coming into and going out of the same node. We will refer to this model as the between-layer sparse-Gibbs network. The second extended model corresponds to the compound symmetry network in which the off-diagonal block $\Omega_{12}$ is set to be of the form $U \otimes V$, for matrices $U \in M_{l_1 \times l_1}$ and $V \in M_{q \times p}$, or, as we have actually implemented in our model, for matrices $U \in M_{l_1 \times p}$ and $V \in M_{q \times l_1}$. Note that the latter matrix structure is always sparser than the former one when $l_1 \in [0, \min\{p, q\}) \cup (\max\{p, q\}, +\infty)$.
We will refer to this architecture as the compound-symmetry Gibbs network or CS-Gibbs for short.

The random-Potts partition model
The purpose of creating the sparse networks of the preceding section is to be able to train the networks with a fraction of the data used by a classical network. As mentioned earlier, our prediction model is a finite but very large mixture of shallow and sparse networks. This is the result of using a non-parametric Bayesian model for prediction. The networks are trained on subsets of data that share similar characteristics in their features. Classical Bayesian non-parametric models rely on the Dirichlet process. However, these processes do not necessarily look at the data features to create clusters or partitions of the data. Recent work has overcome this limitation (Müller & Quintana, 2010b; Müller et al., 2011). In particular, the random-Potts partition model introduced in Murua & Quintana (2017) stresses the importance of data similarities in the resulting partitions by guiding the random partition process through feature characteristics. These are incorporated in the model through a kernel which in turn induces the partition probabilities. The random-Potts partition model may be seen as a stochastic version of the label propagation approach (Tibély & Kertész, 2008).
The random-Potts partition model is a probabilistic model defined on a graph. Keeping the notation introduced before, each data point $x_i$ defines a vertex in a graph whose edges are pre-specified. For example, Murua & Quintana (2017) consider a $k$-nearest-neighbor graph; that is, a graph for which there is an edge between data points $x_i$ and $x_j$ if and only if either $x_j$ is one of the $k$ most similar data points to $x_i$, or $x_i$ is one of the $k$ most similar data points to $x_j$. Associated with each data point in the graph there is a color label $z_i \in \{1, \ldots, r\}$.
Usually $r$ is chosen to be between 10 and 30 (Blatt et al., 1996b; Stanberry et al., 2008). The set of color labels $z^n = (z_1, \ldots, z_n)$ follows a Potts model, and thus has probability mass function (pmf) $p(z^n) \propto \exp\{\beta \sum_{i \sim j} J_{ij}(\sigma) I[z_i = z_j]\}$, where $J_{ij}(\sigma) = J(x_i, x_j; \sigma)$ is the kernel evaluated at $(x_i, x_j)$; $I[\cdot] = 1$ if the condition between brackets is satisfied, and is zero otherwise; and the notation $i \sim j$ means that $x_i$ and $x_j$ are neighbors. In the Physics literature, the parameter $\beta$ is referred to as the inverse temperature. In practice, one chooses the kernel so as to penalize color labelings that do not assign the same color to very similar data points. For example, one can work with $J_{ij} = J_{ij}(\sigma) = J(\|x_i - x_j\|/\sigma)$, where $\|\cdot\|$ denotes a distance (e.g., Euclidean), and $\sigma > 0$ is a bandwidth parameter. The partitions are generated by iteratively deleting edges and re-inserting deleted edges at random. This is done through a set of random binary variables $b = \{b_{ij}\}$ named the bonds.
Each bond $b_{ij}$ is generated independently of the other bonds as a Bernoulli($p_{ij}$) variable. If $i \sim j$, the edge between $x_i$ and $x_j$ remains in the graph with probability $p_{ij}$ (i.e., $b_{ij} = 1$), and is deleted from the graph with probability $1 - p_{ij}$ (i.e., $b_{ij} = 0$). A random partition $\rho_n$ is generated as the (randomly generated) connected components $\rho_n = \{S_1, \ldots, S_{k_n}\}$ of this new random graph. In what follows, we will refer to the components of a partition as clusters. It can be shown that this procedure is governed by the random cluster pmf (Sokal, 1997). Colors are also randomly assigned once the bonds have been chosen: each connected component is assigned a color uniformly at random. This process of alternately sampling bonds and colors is known as the Swendsen-Wang algorithm. It is a special case of the Markov chain Monte Carlo (MCMC) family of stochastic algorithms. Its goal is to generate samples from the random-Potts partition model. This is achieved when the two sampling steps, bonds and colors, have been applied a sufficiently large number of times. The distribution of the random-Potts partitions is obtained by summing over all bond configurations compatible with a partition, where the notation $b \Rightarrow \rho_n$ means that the partition $\rho_n$ arises from the connected components of the associated graph with edges given by the bonds $b$ (see (Murua & Quintana, 2017) for a more in-depth overview). This calculation might be intractable. Fortunately, we can sample partitions with MCMC methods without knowing these quantities exactly. These will be detailed below in Section 6.

[Footnote 3: A fast algorithm for finding communities in a graph is the Label Propagation Algorithm (LPA). It detects communities using the network structure alone as its guide, and does not require a predefined objective function or prior knowledge about the groups. LPA is a relatively recent algorithm, proposed by Raghavan et al. (2007) in "Near linear time algorithm to detect community structures in large-scale networks". It works by propagating labels across the network and building communities based on this label-dissemination mechanism. Some further references are Xie & Szymanski (2011), Xie & Szymanski (2013), and Gregory (2010). The paper of Tibély & Kertész (2008) demonstrates the equivalence between the label propagation method of community detection and the Potts model approach.]
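One bond-sampling sweep of the procedure just described can be sketched as follows. The bond probability $p_{ij} = 1 - \exp(-\beta J_{ij})$ used here is the standard Swendsen-Wang choice for the random cluster representation; the union-find structure and the toy graph are illustrative assumptions:

```python
import numpy as np

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def sample_partition(edges, J, beta, rng):
    """One Swendsen-Wang bond sweep: keep edge (i, j) with probability
    p_ij = 1 - exp(-beta * J_ij), then return the connected components."""
    n = 1 + max(max(i, j) for i, j in edges)
    parent = list(range(n))
    for (i, j), Jij in zip(edges, J):
        if rng.random() < 1.0 - np.exp(-beta * Jij):   # bond b_ij = 1
            ri, rj = find(parent, i), find(parent, j)
            parent[ri] = rj                            # merge the two clusters
    labels = [find(parent, i) for i in range(n)]
    clusters = {}
    for i, root in enumerate(labels):
        clusters.setdefault(root, []).append(i)
    return list(clusters.values())

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (3, 4)]          # toy pre-specified neighbor graph
parts = sample_partition(edges, J=[5.0, 5.0, 5.0], beta=10.0, rng=rng)
print(sorted(sorted(c) for c in parts))   # [[0, 1, 2], [3, 4]] for beta this large
```

With a large $\beta J_{ij}$ every edge is retained almost surely, so the components of the neighbor graph itself are recovered; smaller $\beta$ fragments them at random.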

Some practical considerations
For our experiments of Section 12, we use the Gaussian kernel $J_{ij}(\sigma) = \exp(-\frac{1}{2\sigma^2} \|x_i - x_j\|^2)$, which is the most popular kernel choice for the Potts or super-paramagnetic clustering model (Blatt et al., 1996a,b; Murua et al., 2008a). Although the scale parameter $\sigma$ may be estimated through a Bayesian stochastic procedure (Murua & Wicker, 2014), we prefer to use its common estimator $\hat{\sigma}$, given by the average of the pairwise distances. Murua & Wicker (2014) show that the optimal scale is close to this simple estimator. To simplify the computations involved with the Potts model, we restrict the data neighborhood to a $k$-nearest-neighbors graph; that is, only the $k$ nearest neighbors of a given point $x_i$ are considered as neighbors of $x_i$, and the kernel is set to zero outside this neighborhood. One of the main issues encountered with the Potts model is the choice of the inverse-temperature parameter $\beta$. This parameter controls the cluster sizes in the partitions: an inappropriate value of $\beta$ produces either too many small clusters or very few, large ones. Because a shallow Gibbs network is going to be fitted to each cluster of the partition, we would rather not have clusters of too small a size. Therefore, a small value of $\beta$ is preferred. In our experiments, for each dataset, we selected a value of $\beta$ that produced large enough clusters. In general, Murua & Wicker (2014) give a simple procedure to find a nearly optimal value of $\beta$. In our case, we simply chose a value slightly smaller than the suggested one so as to ensure large cluster sizes.
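These practical choices (Gaussian kernel, average-distance scale estimator, $k$-nearest-neighbor restriction) can be sketched as follows; the helper names are ours, and whether the average runs over all pairs or only neighboring pairs is an implementation choice:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma):
    """J_ij(sigma) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigma_hat(X):
    """Simple scale estimator: the average of all pairwise distances."""
    n = X.shape[0]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return d[np.triu_indices(n, k=1)].mean()

def knn_neighbors(X, k):
    """Indices of the k nearest neighbors of each point; the kernel is
    set to zero outside this neighborhood."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
s = sigma_hat(X)
print(round(gaussian_kernel(X[0], X[1], s), 4), knn_neighbors(X, 5).shape)
```

For two points the estimator reduces to their distance, which gives a quick sanity check.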

The Potts Clustering Model with Complete Shrinkage
The drawback of the bonds approach is the increasing number of small clusters generated in a given partition. To deal with it, we apply a modified agglomerative clustering approach (Kurita, 1991), merging each small cluster of size $\leq h$ with its closest cluster in terms of minimal distance, where $h$ is an integer greater than or equal to 2. The algorithm uses a technique in which the distances of all pairs of observations are stored. The nearest cluster (with size $\geq h$) is then the cluster containing the closest node, in terms of minimal distance, to the cluster to be merged. This is what we call Potts Clustering with Complete Shrinkage (PCCS).
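A minimal sketch of the PCCS merging step, assuming stored pairwise distances; here clusters of size below the threshold $h$ are merged into the retained cluster holding their closest node (a simplified reading of the rule above):

```python
import numpy as np

def complete_shrinkage(X, clusters, h):
    """Merge every cluster of size < h into the retained cluster (size >= h)
    that contains its closest node, using stored pairwise distances."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    clusters = [list(c) for c in clusters]
    small = [c for c in clusters if len(c) < h]
    big = [c for c in clusters if len(c) >= h]
    for c in small:
        # nearest retained cluster = the one holding the closest node to c
        best = min(big, key=lambda b: d[np.ix_(c, b)].min())
        best.extend(c)
    return [sorted(b) for b in big]

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [2.0]])
merged = complete_shrinkage(X, [[0, 1, 2], [3, 4], [5]], h=2)
print(merged)  # the singleton {5} joins the cluster with its closest node
```

Point 5 (at 2.0) is closer to the first cluster's node at 0.2 than to the cluster around 5.0, so it is absorbed there.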

The shallow Potts Gibbs-network mixture model
Combining the shallow Gibbs networks introduced in Section 2 with the random-Potts partition model described in the previous section, we obtain the shallow Potts Gibbs-network mixture model. This model will be referred to as the Potts Gibbs-network for short, since "Potts" already conveys the idea of mixture.
Keeping the notation introduced in the previous section, $\rho_n$ denotes a partition of the data, $\psi_\sigma = (w, b, \Sigma)$ denotes the parameters of a shallow network, and $(\Omega, \Gamma, \Lambda, \nu)$ denotes the parameters defining the prior on $\psi_\sigma$. For a given partition $\rho_n = (S_1, \ldots, S_{k_n})$, the variables $s_1, \ldots, s_n$ will denote the cluster memberships of the data points. That is, $s_i = \ell$ if and only if $x_i \in S_\ell$, $i = 1, \ldots, n$, $\ell = 1, \ldots, k_n$. Note that $k_n$, the number of clusters or connected components of $\rho_n$, is a random variable. Proceeding as in (Murua & Quintana, 2017), we assume a hierarchical random partition model for the data $y_1, \ldots, y_n$ given $x^n$, $\rho_n$, and $\psi^*_{\sigma 1}, \ldots, \psi^*_{\sigma k_n}$, where $(\psi^*_{\sigma 1}, \ldots, \psi^*_{\sigma k_n})$ denote the unique values of $(\psi_{\sigma 1}, \ldots, \psi_{\sigma n})$ given the $k_n$-cluster partition $\rho_n$. Thus, $\psi^*_{\sigma i}$ are the parameters of the shallow Gibbs network associated with the $i$-th data cluster $S_i$ of $\rho_n$. The notation $\stackrel{ind}{\sim}$ indicates that the variables are independent.
We will refer to the above steps as Besag's algorithm. To use this scheme, we need to be able (i) to sample from the posteriors $p(\psi_\sigma | y^n, x^n, \rho_n)$, and (ii) to compute the ratio $\alpha$. In general, the probabilities $p(y^n | \rho_n, x^n)$ are intractable; in fact, computing $\alpha$ is equivalent to computing Bayes factors. This algorithm works exactly only for certain problems, such as the multivariate multiple regression with conjugate priors used in (Murua & Quintana, 2017). In fact, if $p(\psi_\sigma | y^n, x^n, \rho_n)$ is known, we can compute the marginal probabilities using the identity $p(y^n | \rho_n, x^n) = p(y^n | \psi_\sigma, \rho_n, x^n) \, p(\psi_\sigma | \rho_n) / p(\psi_\sigma | y^n, x^n, \rho_n)$.
In our case, the posterior $p(\psi_{\sigma j} | y^n, x^n, \rho_n)$ is proportional to the product of the cluster likelihood and the prior, $\prod_{x_i \in S_j} p(y_i | S_j, \psi_{\sigma j}) \, p(\psi_{\sigma j})$, where $T_j = \{y_i : x_i \in S_j\}$. The normalizing constants of these posteriors are not known because of the nonlinearity in the neural network likelihood $p(y_i | S_j, \psi_{\sigma j})$. To avoid integrating over the posterior distribution so as to compute the normalizing constants, it is common to use Markov chain Monte Carlo (MCMC) methods (Neal, 1996; Rasmussen, 1995). Another line of research for posterior inference uses stochastic gradient Markov chain Monte Carlo (Chen et al., 2015). However, one major drawback of sampling is that it is often very slow, especially for high-dimensional models (Robert & Casella, 2005). Faster and easy-to-parallelize methods may be drawn from Bayesian variational inference algorithms. These have been developed to be as flexible as MCMC (Zhai et al., 2012). Therefore, this is the route we follow to overcome the computational burden of posterior and Bayes factor calculations.

Bayesian variational inference
In this section we explain how we find a Bayesian variational approximation of the posterior distributions. We choose a family of distributions of the model parameters $\psi_\sigma = (w, b, \Sigma)$ indexed by a hyper-parameter $\lambda$, $Q(\psi_\sigma; \lambda)$. The idea is to approximate the posterior of $\psi_\sigma$ with its closest distribution from this family. This corresponds to an optimization over the hyper-parameter $\lambda$. In this section, we suppose that we work with a given cluster $S$ of the partition $\rho_n$, and its associated set of response variables $T$. The core idea is to minimize over $\lambda$ the Kullback-Leibler divergence (KL) between the posterior and $Q(\psi_\sigma; \lambda)$. That is, we need to minimize $KL(Q \| p) = \int Q(\psi_\sigma; \lambda) \log\{Q(\psi_\sigma; \lambda) / p(\psi_\sigma | T, S, \rho_n)\} \, d\psi_\sigma$.
The optimal value λ opt gives rise to the approximating variational posterior distribution Q(ψ σ ; λ opt ).
The ELBO. Note that the optimal value is the solution of an equivalent maximization problem, in which the term to be maximized is known as the evidence lower bound, or ELBO. The negative of this quantity is also known as the description length cost, or the variational free energy (see, for example, Friston et al. (2007)). The direct optimization of this quantity instead of the Kullback-Leibler divergence was introduced by Hinton & van Camp (1993) (see also Graves (2011)). In general, the ELBO cannot be computed exactly; the expectations are usually approximated through Monte Carlo methods (Blundell et al., 2015).
Choice of variational family. If the likelihood $p(T | \psi_\sigma, S, \rho_n)$ were linear in the weight and bias parameters, the posterior of the parameters would be conjugate to the prior (Murua & Quintana, 2017). Inspired by this observation, we consider $\lambda = (\lambda_w, \lambda_b, \lambda_\sigma)$ and the family of distributions conjugate to the prior. More specifically, we suppose that $Q(w; \lambda_w)$ is of the same form as the prior $p(w|\Omega)$. That is, $Q(w; \lambda_w)$ is a zero-mean Gaussian random field with covariance structure dictated by the Gibbs network induced by $\Omega$. Thus, if the Gibbs field given by the prior is a compound symmetry Gibbs network, then so is $Q(w; \lambda_w)$, etc. Similarly, $Q(b; \lambda_b)$ is a zero-mean Gaussian field with block-diagonal covariance $\lambda_b = \mathrm{diag}(\lambda_{b1}, \lambda_{b2})$, just as $\Gamma$. Also, just as the prior, the variational posterior approximation $Q(\Sigma; \lambda_\sigma)$ is assumed to be an Inverse-Wishart$(\nu_\sigma, \Lambda_\sigma)$. With this family of distributions, the ELBO takes a closed form up to the integrated log-likelihood term. It is not difficult to show that the Kullback-Leibler divergence between the two sets of Gaussian distributions for the weights is $KL(Q(w; \lambda_w) \| p(w|\rho_n)) = \frac{1}{2}\left[\mathrm{trace}(\Omega^{-1} \lambda_w) - (q+p)l_1 + \log\{\det(\Omega)/\det(\lambda_w)\}\right]$, and similarly for the biases. For the last divergence we have the following result.
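The Gaussian Kullback-Leibler term has the closed form displayed above; the sketch below evaluates exactly that expression, with $\lambda_w$ and $\Omega$ passed as the two matrices appearing in it (it vanishes when the two fields coincide):

```python
import numpy as np

def kl_gaussian_term(lam, Omega):
    """Closed-form KL term between two zero-mean Gaussian fields, written
    as in the displayed divergence:
    0.5 * [trace(Omega^{-1} lam) - d + log det(Omega) - log det(lam)]."""
    d = Omega.shape[0]
    tr = np.trace(np.linalg.solve(Omega, lam))      # trace(Omega^{-1} lam)
    _, logdet_O = np.linalg.slogdet(Omega)          # stable log-determinants
    _, logdet_l = np.linalg.slogdet(lam)
    return 0.5 * (tr - d + logdet_O - logdet_l)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(kl_gaussian_term(A, A))                 # 0.0 when the fields coincide
print(kl_gaussian_term(np.eye(2), A) >= 0.0)  # divergence is non-negative
```

Using `solve` and `slogdet` instead of explicit inverses and determinants is our implementation choice, made for numerical stability.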
The optimal solution is found using the stochastic gradient ascent variational Bayes algorithm (Paisley et al., 2012;Kucukelbir et al., 2014;Ye et al., 2020;Duchi et al., 2011). To obtain the gradient of ELBO(λ), the gradients of the Kullback-Leibler divergence terms are needed. In practice, these derivatives are usually computed through automatic differentiation. That is, calculating numerically the value of the derivatives (e.g., using Chebyshev polynomial approximation (Press et al., 1996, Ch. 5.7)). These methods achieve very good accuracies (Bartholomew-Biggs et al., 2000).
The integrated log-likelihood, the first term in (12), is intractable, so we use Monte Carlo sampling to estimate it. However, we only need to estimate the score (Kingma et al., 2015; Mohamed et al., 2019). The integral on the right-hand side of the corresponding equation is estimated by Monte Carlo sampling of $\psi_\sigma$ from $Q(\psi_\sigma; \lambda)$, which is a multivariate Gaussian-inverse-Wishart distribution.

Regularization on the CS-Gibbs model
For the compound symmetry Gibbs network model described at the end of Section 2.2, the off-diagonal block is $\Omega_{12} = U \otimes V$, for $U \in M_{l_1 \times p}$ and $V \in M_{q \times l_1}$. These matrices are unconstrained. For the variational approximation, we suppose that the corresponding block of the precision matrix has the same structure. Let $U_{12} \in M_{l_1 \times p}$ and $V_{12} \in M_{q \times l_1}$ be the corresponding matrices in the variational approximation.
Note that our prior for the weights corresponds to a Gaussian graphical model (Giudici & Green, 1999; Rajaratnam et al., 2008). The usual procedure to find sparse covariance models for Gaussian random fields is the so-called graphical lasso (Friedman et al., 2008). This method is a penalized log-likelihood approach in which the sizes of the entries of $\Omega$ are penalized, so as to render this matrix sparse. Translating this method to our already sparse model leads to the penalization of the $L_1$-norms of the matrices $U_{12}$ and $V_{12}$. That is, we consider the maximization of the penalized log-likelihood, with penalty terms weighted by the regularization constants $\beta_u$ and $\beta_v$. Introducing the variational approximation, we can regard the resulting expression as a lower bound for the expected graphical-lasso log-likelihood function. The goal is then to solve the corresponding maximization problem. This result may also be embedded as a special case of sparse variational inference (Campbell & Beronov, 2019) or of the Generalized ELBO with Constrained Optimization (Rezende & Viola, 2018), where we build a regularized framework for our compound symmetry Gibbs network in order to set sparsity constraints.
The optimization problem is not straightforward to solve. First, the $L_1$-norm is not differentiable. More importantly, the solution for the covariance matrix must still be a positive definite matrix. There are several algorithms designed specially to solve the usual graphical lasso model (Mazumder & Hastie, 2015), but they are not directly applicable to our model. We have addressed these two issues. In this section, we only explain how we have modified the maximization problem for this special case of our model. The problem of the positive-definiteness of the matrix applies to all our models; it is addressed in the next section.
To deal with the non-differentiability of the L_1 norm, we have implemented two modifications to the original problem. The first consists of replacing the absolute values of the elements of the two matrices by a smooth version of them. These are given by the smooth approximation (Schmidt et al., 2007) of the absolute value, |u|_α = (1/α){log(1 + exp(−αu)) + log(1 + exp(αu))}, for a given constant α. It is easily seen that |u|_α approaches |u| as α increases. In fact, Schmidt et al. (2007) show that ||u| − |u|_α| ≤ 2 log(2)/α. Therefore, the constant α must be fixed to a large enough value (a value of about 10^6 is large enough, as suggested in Schmidt et al. (2007), but smaller values also give good approximations). Our second modification is to replace the L_1-norm by the Frobenius (that is, Euclidean) norm ‖·‖ of the matrices U_12 and V_12. This modification can be thought of as a group-lasso version of the graphical lasso. Unfortunately, the Frobenius norm is also non-differentiable. Instead, we consider the square of the Frobenius norm, which is a smooth function; its derivative with respect to U_12 is simply 2U_12 (and similarly 2V_12 for V_12). The parameters β_u and β_v must be chosen adequately. We perform K-fold cross-validation to establish appropriate values for them. We stress that this is done during the fitting; we then fit the model with all data (T, S) and the selected parameters β. In our implementation we have used M ≥ 100. The β values of the sequence were chosen automatically by the two-dimensional Golden-Section Search (GSS) algorithm (Chang, 2009). To initialize the search, several random values for each component of the vector β were drawn, each chosen uniformly at random from one of a collection of consecutive candidate intervals (in practice, at most 100 intervals, with widths of at most 1/10000); for each draw, the interval itself was also chosen uniformly at random.
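As a minimal illustration (not part of the original implementation; the function name is ours), the smooth absolute value |u|_α and its error bound can be sketched in Python as:

```python
import numpy as np

def smooth_abs(u, alpha):
    # |u|_alpha = (1/alpha) * {log(1 + exp(-alpha*u)) + log(1 + exp(alpha*u))}
    # np.logaddexp(0, x) computes log(1 + exp(x)) without overflow,
    # which matters for the large alpha values suggested above.
    return (np.logaddexp(0.0, -alpha * u) + np.logaddexp(0.0, alpha * u)) / alpha

# Schmidt et al. (2007): the approximation error is bounded by 2*log(2)/alpha
err = abs(smooth_abs(0.5, 1e3) - 0.5)
```

As α grows, the bound 2 log(2)/α shrinks, so |u|_α can be made arbitrarily close to |u| while staying differentiable everywhere.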
Technically, the GSS is based on the fact that the minimum lies within the interval defined by the two points adjacent to the last observed value. The method operates by successively narrowing the range of values in the specified search space, which makes it relatively slow, but very robust. It finds the best extrema (β_u^opt, β_v^opt) after going through all the regions. GSS presents an oracle complexity which converges to an ε-accurate solution at a linear rate, also known as geometric or exponential convergence in the numerical literature (Bertolazzi, 2008; Sebastien, 2013). In fact, GSS makes O(log(1/ε)) calls to an Oracle Query Optimizer (Wijesiriwardana & Firdhous, 2019) to compute the optimum.
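A one-dimensional sketch of the GSS idea (the paper uses a two-dimensional search over (β_u, β_v); this toy version and its quadratic objective are our own illustration):

```python
import math

def golden_section_search(f, a, b, tol=1e-8):
    """Minimize a unimodal f on [a, b] by successive interval narrowing;
    reaches a tol-accurate solution in O(log(1/tol)) function calls."""
    invphi = (math.sqrt(5) - 1) / 2          # 1/phi, the golden ratio inverse
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):                      # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                                # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# toy regularization-constant search with a known optimum at 0.25
beta_opt = golden_section_search(lambda b: (b - 0.25) ** 2, 0.0, 1.0)
```

Each iteration shrinks the bracketing interval by the constant factor 1/φ ≈ 0.618, which is exactly the linear (geometric) convergence rate mentioned above.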

Keeping positive definiteness of the precision matrix
Let λ_w be the precision matrix associated with the variational density approximation to the posterior of the weight parameters w. The gradient of the ELBO with respect to the variational precision matrix λ_w, denoted ∇ELBO(λ_w), is computed component-wise or block-wise in all sparse structures. For example, in the Sparse-Gibbs network, where the precision matrix λ_w has the form given by (4), we compute each gradient ∇ELBO(V_{i·}), ∇ELBO(V_{·j}), and ∇ELBO(V_{ii}) separately in order to build the full update of λ_w. The same differentiation principle is applied for the sparse compound symmetry Gibbs model (sparse-CS-Gibbs network), where we constrain U_1, U_2, V_{1·}, and V_{·1} to have compound symmetry.
One of the issues that arose in the estimation of λ_w is how to ensure in practice that the matrix is kept positive definite during the gradient updates. Contrary to the graphical lasso (Mazumder & Hastie, 2015), the positive definiteness property can be lost. There is no known algorithm that can ensure this property for our models while optimizing the matrix. Finding a very good or optimal solution might require heavy exploration of the matrix space at each gradient update. Hence, it might be impractical to optimize the ELBO over λ_w with simultaneous constraints on both the sparse structure and the positive definiteness property. Given the complexity of the model, and from a statistical viewpoint, it might be sufficient to obtain an approximate solution for λ_w that is both sparse, as imposed, and positive definite. Therefore, betting on this observation, we based our matrix updates on the search for the nearest positive definite matrix (Higham, 1988). A simple procedure to find the nearest positive definite matrix λ̂_w, where proximity is measured by the Frobenius norm, is based on the spectral decomposition of the matrix (Jewbali & Ore, 2009). We first determine the spectral decomposition λ_w = QDQ^T, where Q is a unitary matrix and D is a diagonal matrix with the eigenvalues of λ_w. An estimate of the nearest positive definite matrix to λ_w is given by λ̂_w = QĎQ^T, where Ď is the diagonal matrix D modified with all negative eigenvalues set to a small positive constant. However, λ̂_w might not be the nearest positive definite matrix: the optimal small constant needs to be searched for. One algorithm that does this search is the Iterative Spectral Method (Marée, 2012). Unfortunately, this algorithm and its variants are costly, taking on the order of O(d^3) operations for a d × d matrix.
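A sketch of the spectral-decomposition approximation described above, with a small positive clipping constant (the function name and example matrix are ours):

```python
import numpy as np

def nearest_pd_approx(M, eps=1e-8):
    """Spectral approximation to the nearest positive definite matrix:
    symmetrize, eigendecompose, and clip negative eigenvalues to eps."""
    M = (M + M.T) / 2                 # work with the symmetric part
    eigval, Q = np.linalg.eigh(M)     # M = Q diag(eigval) Q^T
    d_check = np.maximum(eigval, eps) # the modified diagonal "D-check"
    return (Q * d_check) @ Q.T        # Q diag(d_check) Q^T

A = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1: not PD
A_pd = nearest_pd_approx(A)
```

As noted above, the result depends on the clipping constant eps, so this is an approximation rather than the exact Frobenius-nearest positive definite matrix.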
Two practical ways to verify whether the above approximation is necessary are, first, to try a Cholesky decomposition of λ_w after each update, or to apply the well-known Sylvester's criterion (Gilbert, 1991) to each update. The latter verifies whether all leading principal minors of λ_w are positive. This criterion is doubly useful, since it also helps to build the nearest symmetric positive definite matrix with the imposed model structure: the minors can be used for non-positive-definiteness correction. In fact, the kth principal minor is computed from the (k − 1)th principal minor augmented with the corresponding kth column and row. If the (k − 1)th principal minor is positive and the kth is not, then one can make the latter positive with some slight modifications to the kth column and/or row. This can be done sequentially until all principal minors have been computed. However, one drawback of this method, besides its computational cost, is that it does not always guarantee to find the nearest positive definite matrix in terms of the Frobenius norm.
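The Cholesky-based check mentioned above can be sketched as (the helper name is ours):

```python
import numpy as np

def is_positive_definite(M):
    """Cheap PD test: for a symmetric matrix, the Cholesky factorization
    succeeds if and only if the matrix is positive definite."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False
```

Running this after each gradient update tells us whether the projection step is actually needed, at the O(d^3/3) cost of one factorization.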
In our implementation, we use the nearest positive definite matrix computation, which is mathematically equivalent to the gradient projection method known as projected gradient updates (Cruz et al., 2011; Hassani et al., 2017). The combination of the gradient updates and the projection onto the nearest positive definite matrix space is called Iterative Projected Gradient (IPG). It iterates between calculating the gradient and calculating the projection onto the positive definite matrix space S_+. More explicitly, at iteration k, the IPG computes λ_w^(k+1) = P_{S_+}(λ_w^(k) + α_k ∇ELBO(λ_w^(k))), where α_k is the step size, and P_{S_+} denotes the Euclidean projection onto the positive definite matrix space, that is, P_{S_+}(λ_w) = argmin_{A ∈ S_+} ‖A − λ_w‖_F, where ‖·‖_F stands for the Frobenius norm. To compute P_{S_+}(λ_w), we have applied an approximation by positive semi-definite matrices on the subspace S_+, as detailed in Hayden & Wells (1988). From the properties of the Frobenius norm, we can minimize ‖A − λ_w‖_F by minimizing ‖B − A‖_F, where B is the symmetric part of λ_w. Our approximation in the subspace S_+ then goes as follows; it is a transformation technique that forces A to have the same properties as the symmetric matrix B.

• Compute the spectral decomposition of A, say A = QΛQ^T.
Then the unique best approximation A_+ to A is given by A_+ = QΛ_+Q^T, where Λ_+ is obtained from Λ by replacing each negative eigenvalue by a number a > 0. We denote the new matrix by A(a). If there are no negative eigenvalues, A itself is taken as our approximation.
• If A has negative eigenvalues, we generate multiple values of a to find a_opt, the optimal value of a.
However, we have also found that this projection was not robust. Our algorithm has therefore been reinforced with the mbend package of Nilforooshan (2020), which increases robustness by adding a weight matrix to A_+ (Jorjani et al., 2003), or by replacing each of the m negative eigenvalues λ_i with ρ(s − λ_i)^2/(100s^2 + 1), where ρ is the smallest positive eigenvalue and s = 2 Σ_{i=1}^m λ_i. This is called method LRS14 in the R package (Nilforooshan & Nilforooshan, 2020). To obtain the (ultimate) best solution P_{S_+}(λ_w), these procedures are repeated until convergence is reached.
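Putting the pieces together, the IPG iteration (a gradient step followed by a projection onto S_+) can be sketched on a toy problem; the concave objective log det(Λ) − tr(SΛ), whose maximizer is S⁻¹, stands in for the ELBO and is our own choice, not the paper's:

```python
import numpy as np

def project_pd(M, eps=1e-6):
    """Euclidean-style projection onto S_+ via eigenvalue clipping."""
    M = (M + M.T) / 2
    w, Q = np.linalg.eigh(M)
    return (Q * np.maximum(w, eps)) @ Q.T

# Toy concave objective standing in for the ELBO:
#   f(Lam) = log det(Lam) - tr(S @ Lam),  gradient: inv(Lam) - S
# whose maximizer over positive definite matrices is inv(S).
S = np.array([[2.0, 0.3], [0.3, 1.0]])
Lam = np.eye(2)
step = 0.05
for _ in range(3000):
    grad = np.linalg.inv(Lam) - S
    Lam = project_pd(Lam + step * grad)   # ascent step, then projection
```

Every iterate stays positive definite by construction, which is the whole point of interleaving the projection with the gradient updates.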

Predictive Posterior
In this section we describe how we estimate the predictive posterior distribution, that is, the estimation of p(y^(n+1) | x^(n+1), y^n, x^n) for a new data item x^(n+1), which is the goal of our Potts Gibbs-network. The mixture is over all possible partitions ρ_{n+1} of the data x^n ∪ {x^(n+1)}. As in Murua & Quintana (2017), we consider the collection A(ρ_{n+1}) of all partitions ρ_n of x^n giving rise to the partition ρ_{n+1} of x^n ∪ {x^(n+1)} by generating appropriate bonds b_{n+1} in the extended graph with vertices x^n and x^(n+1). The extended graph includes all edges of the original graph plus the edges added to link the new data point (e.g., if the graph is a k-nearest-neighbors graph, then all edges between x^(n+1) and its k nearest neighbors are added). Elements in the collection A(ρ_{n+1}) are denoted [ρ_n, b_{n+1}] to make clear that ρ_{n+1} can be generated from ρ_n and the bonds b_{n+1}. The predictive posterior p(y^(n+1) | x^(n+1), y^n, x^n) can then be written as a mixture over these partitions. This expression is intractable, since the number of possible partitions of x^n is the well-known Bell number B_n. Fortunately, having sampled partitions and parameters from the posterior of ρ_n and ψ_σ, we can use these samples to estimate the predictive posterior as follows. Let {(ρ_{n,ℓ}, ψ_{σ,ℓ}) : ℓ = 1, ..., M} be a sample from the posterior distribution. Also, let {b_{n+1,ℓ} : ℓ = 1, ..., M} be a sample of bonds linking the sampled partitions ρ_{n,ℓ} to x^(n+1). Consider the cluster in which x^(n+1) falls in the partition ρ_{n+1}. All the terms cancel in (17) except for the term associated with x^(n+1). Therefore, using any value of ψ_σ in expression (9), we obtain the estimate of the predictive posterior.

Remark
In practice, we suppose that we can approximate p(ψ_σ | T ∪ {t}, S, ρ_{n+1}) by p(ψ_σ | T, S, ρ_n), and that, in turn, this last posterior is well approximated by the variational posterior approximation Q(ψ_σ; λ). Then each one of the expectations can be approximated by Monte Carlo, using our samples {ψ_{σ,ℓ}}_{ℓ=1}^M from the variational posterior, restricted in each case to the cluster of ρ_{n+1,ℓ} in which x^(n+1) falls, ℓ = 1, ..., M. This yields the estimate. The integral over t, which is a univariate integral, may be computed numerically, for example, using Romberg's algorithm (Li & Hu, 2017).
In particular, the value of the prediction E(y | x^(n+1), y^n, x^n) may be estimated by the corresponding Monte Carlo average.

The Shallow Potts as a Random Gibbs Network Forest

There is an intuitive mathematical equivalence between Random Forest and the Shallow Gibbs. Random Forest is an ensemble learning method: a machine learning method that combines many simple learners (base models) to construct an optimized, powerful predictive model.
Random Forest does both row sampling and column sampling, with a decision tree as the base learner (Svetnik et al., 2003). Analogously, the Shallow Gibbs draws random clusters and, rather than column sampling, it projects the data x into a lower-dimensional space on the shallow layer (one hidden layer with a very small number of neurons).
In Random Forests, the variance decreases as the number of base learners τ increases; when τ is reduced, the variance rises. The bias, however, remains unchanged throughout. It is commonly said that Random Forest = Decision Trees (as base learners) + bagging (row sampling with replacement) + feature bagging (column sampling) + aggregation (mean/median, majority vote) (Skurichina & Duin, 2001).

The Shallow Potts as a Mixture Model (a reminder of some practical computations)

Computing the expectation E(y^(n+1) | x^(n+1), y^n, x^n) in a proper way would require an exact probability for each partition in (15). The value of the prediction E(y^(n+1) | x^(n+1), y^n, x^n) needs to be adjusted with an estimate of p(ρ_n | y^n, x^n): it simply means that each partition has a probability that needs to be taken into account in the final estimation, plus an estimate of the conditional probability of the new bonds b_{n+1}. To solve this problem, we have approximated p(ρ_n | y^n, x^n) ≈ 1/M, and have introduced a smoothing parameter 0 ≤ δ ≤ 1 (δ = p̂(b_{n+1} | ρ_n, x^(n+1), x^n)) that represents an estimate of this probability. This is not the optimal way, in the sense that p̂(b_{n+1} | ρ_n, x^(n+1), x^n) differs from one partition to another, but it helps to transform the final prediction accordingly, where M is the number of partitions generated during the training step.

On Double backpropagation
Double backpropagation has various use cases. Under the name double backpropagation, the concept of penalizing the gradient of the output with respect to the input was introduced (Ross & Doshi-Velez, 2018), and later used to locate large connected areas of the error function called flat minima. The ultimate aim was improving model generalization, which is our goal here as well. It has effectively improved generalization performance (Drucker & Le Cun, 1992; Drucker & Le Cun, 1991), and has also been applied to adversarial models (Seck et al., 2019; Sun et al., 2020), to character recognition (Kamruzzaman & Syed, 1998; Kamruzzaman et al., 1997), and to robustness and saliency-map interpretability (Etmann, 2019). We present in the following lines our novel double backpropagation scheme (DBS).

A Proposed Double backpropagation scheme
Similarly to our Iterative Projected Gradient (IPG) applied to our model variational parameter λ_w, we found that the Mean Squared Error (MSE) of our regression model can also be back-propagated (Rumelhart et al., 1986; Hecht-Nielsen, 1992) with respect to each of the model parameters, i.e., ψ = (b^(1), W^(1), b^(2), W^(2)), the main parameters of the network, with b^(1) ∈ R^{l_1}, W^(1) ∈ M_{q×l_1}, b^(2) ∈ R^p, W^(2) ∈ M_{l_1×p}, l_0 = q, l_2 = p, and Σ ∈ M_{p×p} being the variance-covariance matrix of y. First, we know that y | x, ψ, Σ is distributed as a multivariate normal distribution. Then, by applying the sampling method described in the Cholesky decomposition section B, we can draw ŷ = f_ψ(x) + Lu, where L ∈ M_d(R) is a lower triangular matrix such that Σ = LL^T, and u ∼ N(0, I). The double backpropagation scheme for the Shallow Gibbs, which is another main contribution of this research, goes as follows: 1. Using the IPG (in 14), apply the backpropagation method on the hyper-parameter λ_w to reduce its Kullback-Leibler (KL) estimation error. Once done, generate an estimate ψ̂_0 of ψ = (b^(1), W^(1), b^(2), W^(2)) using Monte Carlo sampling from the variational distribution of the parameters.
2. Use equation 19 to backpropagate the MSE(y_i − ŷ_i) = ‖y_i − ŷ_{est,i}‖^2 to update ψ̂_0 in the Potts cluster and per observation, as ψ̂_{t,i} = ψ̂_{t−1,i} − ε_{ψ,t} ∇_ψ MSE(y_i − ŷ_{est,i}), where ε_{ψ,t} is the learning rate schedule [See C.1] for this gradient update of ψ at step t, and ψ̂_{t,i} is the value of ψ at iteration t for observation i. Equivalently, for the i-th data point x_i in the training data, this corresponds to component-wise changes, where [z]^T denotes the transpose of z, g'_1 the derivative of the activation g_1, and ψ_{i,t} = (W^(1)_{i,t}, W^(2)_{i,t}, b^(1)_{i,t}, b^(2)_{i,t}, Σ_{i,t}) represents the value of ψ for the i-th training observation at step t; ε_ψ is the learning rate of those gradient updates; L_{i,t} is the Cholesky factor at step t obtained from Σ_{i,t}; and finally u_{i,t} ∼ N(0, I). The change for the i-th data point in the testing set is the change of its nearest neighbor in the current cluster with respect to the Euclidean distance.
The product symbol means that the vector [f_{ψ_{i,t−1}}(x_i) − y_i]^T multiplies each line of the final computation on its left. We call the optimization scheme above the Double Backpropagation Scheme (DBS) optimizer. The DBS is a universal Bayesian neural network optimizer: given an initial ψ_{i,0} = (W^(1)_{i,0}, W^(2)_{i,0}, b^(1)_{i,0}, b^(2)_{i,0}, Σ_{i,0}) and ŷ_{est,(i,0)}, it can make ŷ_{est,(i,t)} converge to the right response value y_i with an appropriate learning rate schedule. (We have updated the nearest-neighbor choice above during our experiments.) Finding this appropriate learning schedule is open work, and one can study its convergence properties using the stochastic gradient learning convergence theorems, to show how ŷ_{est,(i,t)} converges, and its speed of convergence.
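A minimal deterministic sketch of the per-observation MSE backpropagation step of the DBS (the Cholesky noise term Lu is dropped, i.e., u = 0, and the dimensions, seed, and learning rate are our hypothetical choices); it shows ŷ_est converging to y on a single training observation, as claimed above:

```python
import numpy as np

rng = np.random.default_rng(0)
q, l1, p = 4, 6, 2                        # input, hidden, output dims (hypothetical)
x = rng.normal(size=q)
y = np.array([1.0, -2.0])

W1 = rng.normal(scale=0.3, size=(q, l1)); b1 = np.zeros(l1)
W2 = rng.normal(scale=0.3, size=(l1, p)); b2 = np.zeros(p)
eps = 0.02                                # fixed DBS-style learning rate

def softplus(z): return np.logaddexp(0.0, z)      # smooth ReLU used in the paper
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))   # derivative of softplus

for _ in range(5000):
    a = x @ W1 + b1
    h = softplus(a)
    y_hat = h @ W2 + b2
    r = y_hat - y                          # residual drives the MSE gradient
    # backpropagate MSE = ||y_hat - y||^2 through the single hidden layer
    gW2 = np.outer(h, 2 * r);  gb2 = 2 * r
    dh = (2 * r) @ W2.T * sigmoid(a)
    gW1 = np.outer(x, dh);     gb1 = dh
    W2 -= eps * gW2; b2 -= eps * gb2; W1 -= eps * gW1; b1 -= eps * gb1

h = softplus(x @ W1 + b1)
mse = float(np.sum((h @ W2 + b2 - y) ** 2))
```

With a small fixed learning rate the per-observation residual shrinks geometrically, which is the overfitting-to-zero behavior the DBS relies on.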

On Differentiation Methods
Differentiation occurs everywhere, from the backpropagation algorithm in deep neural networks to the motion equations in physics to almost any area that needs to calculate a rate of change. When addressing differentiation in machine learning, we encounter automatic differentiation, symbolic differentiation, and numerical differentiation. In this paper we have introduced empirical differentiation as a novel differential machine learning framework.

Automatic differentiation.
Automatic differentiation (also called autodiff for short) is a set of techniques to automatically construct a procedure and numerically evaluate the derivative of a function, given the analytical expression of the function itself (Griewank et al., 1989; Baydin et al., 2018; Rall & Corliss, 1996; Bücker et al., 2006; Bischof et al., 2008a).
Automatic differentiation can only calculate the partial derivative of an expression at a certain point, once values have been assigned to each of the variables. It has many applications, such as numerical integration algorithms for ordinary differential equations (Bischof et al., 2008b), wave-function positivization (Torlai et al., 2020), inverse design of photonic crystals (Minkov et al., 2020), numerical solution of the Burgers' equation (Asaithambi, 2010), nonlinear Mathematical Programming Language models (Gay, 1991), periodic orbits and their bifurcations (Guckenheimer & Meloon, 2000), Runge-Kutta methods for optimal control (Walther, 2007), engineering design (Barthelemy & Hall, 1995), the computation of sparse Jacobian matrices (Coleman & Verma, 1998), and Fourier series and the radii polynomial (Lessard et al., 2016); it has already been implemented in Matlab (Bischof et al., 2003; Forth, 2006; Verma, 1999) and in PyTorch (Paszke et al., 2017), as well as in the Stan Math library (Carpenter et al., 2015).

Symbolic differentiation (SD).
With respect to a defined variable, symbolic differentiation (Knott, 2017) is a program that seeks the derivative of a given formula, generating a new formula as its output (it is only applicable to closed-form mathematical functions). For example, suppose we are given function (27).
There is also a link between Symbolic Differentiation (SD) and Automatic Differentiation (AD). In The Simple Essence of Automatic Differentiation, Conal Elliott writes that automatic differentiation is symbolic differentiation performed by a compiler (Elliott, 2018).
Laue (2019) argues for an equivalence between both methods, claiming that automatic forward-mode differentiation and symbolic differentiation are similar in the sense that, when computing derivatives, they perform the same operations.

Numerical differentiation.
This kind of differentiation is preferred when we can use the derivative formula in [29] with the analytical form of the function f itself. With numerical differentiation, the partial derivative of a function f : R^n → R at x = a (also called a finite difference approximation) is the limit ∂f/∂x_i (a) = lim_{h→0} [f(a + h e_i) − f(a)]/h, with e_i ∈ R^n the i-th unit basis vector and h a small step size. This computation is very useful to approximate the gradient ∇f = (∂f/∂x_1, ..., ∂f/∂x_n). There are various forms of numerical differentiation in the literature, such as stable numerical differentiation, regularized numerical differentiation, and noisy numerical differentiation (Cullum, 1971; Ramm & Smirnova, 2001; Anderssen & Bloomfield, 1974; Chartrand, 2011; Knowles & Renka, 2014; Mboup et al., 2009; Lu & Pereverzev, 2006). In the next section, we introduce a novel differential machine learning framework based on an empirical differentiation procedure, which is a generalization of interpolation, for the prediction of y^test when we know x^test. This form of closest-neighborhood data association is one limitation of the DBS optimizer, since it simply implies that we are still searching for the right learner.
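A sketch of the finite-difference gradient (we use central differences, a slightly more accurate variant of the forward-difference limit stated above; the names and test function are ours):

```python
import numpy as np

def numerical_gradient(f, a, h=1e-6):
    """Central-difference approximation of the gradient of f at a,
    with truncation error O(h^2) per coordinate."""
    a = np.asarray(a, dtype=float)
    grad = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a); e[i] = 1.0   # i-th unit basis vector e_i
        grad[i] = (f(a + h * e) - f(a - h * e)) / (2 * h)
    return grad

f = lambda v: v[0] ** 2 + 3.0 * v[1]        # true gradient: (2*v0, 3)
g = numerical_gradient(f, [1.5, -2.0])
```

The step h trades truncation error (smaller is better) against floating-point cancellation (larger is better); h around 1e-6 is a common compromise in double precision.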

Interpolation as a Machine learner
You can notice from Table [9] that, in a few iterations, the DBS can overfit the training data. The main question then is: knowing a perfect relation [46] between any training data point and its parameters, (x_j^train, ψ_j^train), where ι is our denoted gate function, how can we find the perfect ψ_i^test for the test data? Or, simply said, is the nearest neighborhood enough to find ψ_i^test and y_i^test for a given x_i^test? This is where Multivariate Interpolation comes into action, as the latter can also be taken as a machine learner, because it can refine the Nearest Neighborhood Train-Test association [43]. To see this, we present another perspective on machine learning problems with the following properties of interpolation.
Linear interpolation usually requires two data points (u_a, v_a) and (u_b, v_b), and at a point u the interpolation equation is given by v = v_a + (v_b − v_a)(u − u_a)/(u_b − u_a). We already know the best linear approximation to ι. It involves the derivative Dι(a): ι(x) ≈ ι(a) + Dι(a) • (x − a), where Dι(a) is the matrix of partial derivatives of ι evaluated in the neighborhood of a, and • is the dot product between Dι(a) and (x − a). This approximation is linear and represents the first-order Taylor expansion. Taylor's theorem: let ι : R^n → R be an m ≥ Ξ ≥ 2 times-differentiable function in a certain vicinity D of the point u_0 = (u_{01}, ..., u_{0n}) ∈ R^n. Then, for any v = (v_1, ..., v_n), the multivariate Taylor expansion of ι around u_0 holds.
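The two-point linear interpolation formula above, as a one-line sketch (the function name is ours):

```python
def lerp(ua, va, ub, vb, u):
    """Linear interpolation through (ua, va) and (ub, vb), evaluated at u."""
    return va + (vb - va) * (u - ua) / (ub - ua)

# halfway between the two data points, the interpolant is their average
mid = lerp(0.0, 1.0, 2.0, 5.0, 1.0)
```

At the endpoints the interpolant reproduces the data exactly, which is the defining property that interpolation-as-a-learner exploits.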

Remark
It is not a coincidence that the Taylor approximation theorem is only defined in a certain vicinity or a given neighborhood set! One of the main contributions of this chapter is to understand that multivariate interpolation using Taylor's theorem is the refined generalization of simple Neighborhood Train-Test association. Notice in equation [33] that if a = x^train and x = x^test, we have ι(x^test) ≈ ι(x^train) when we suppose x^train ≈ x^test. To go beyond that simple approximation, we add more differential terms, as exposed in Proposition [9.1].
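A small numeric check of this remark, with a hypothetical smooth ι: the first-order Taylor correction refines the plain nearest-neighbor (zeroth-order) association:

```python
import numpy as np

iota = lambda v: np.sin(v[0]) + v[1] ** 2            # hypothetical smooth target
grad_iota = lambda v: np.array([np.cos(v[0]), 2 * v[1]])

x_train = np.array([0.5, 1.0])
x_test = np.array([0.55, 1.02])                       # a close neighbor of x_train

order0 = iota(x_train)                                # plain NN association
order1 = order0 + grad_iota(x_train) @ (x_test - x_train)   # + first-order term

err0 = abs(iota(x_test) - order0)
err1 = abs(iota(x_test) - order1)
```

The added differential term shrinks the association error from first order in ‖x_test − x_train‖ to second order, which is precisely the refinement the remark describes.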
Can we overfit with pride? It is important to also understand the case ι : R^q → R^m in [30], and how we expand Dι(ζ) at ζ = x_0^train. When ι is multivariate, we use Proposition [9.1], or at first order [33], to find [31] per component, using the assumption (F_d) that there is an approximately differentiable function ι relating ι(x^test) to two training data points j_1 and j_2, with the precision that the training data point j_2 and the test data point i are taken in a very close neighborhood of j_1.
This is applicable within the clusters of each Potts partition ρ_n. Suppose the coordinates x_{j_2}^train(1), ..., x_{j_2}^train(q) are given, with the precision that the training data point j_2 is taken in a very close neighborhood of j_1 (a potential example is the associated Potts cluster of j_1 given a partition ρ_n). In practice, for x = x^test, some coordinates x^test(k) are useful to indicate that we do not want the partial derivatives to be computed if no differentiation is available for the k-th dimension. Further, we shall reduce the Taylor series in [9.1] to integrate all those constraints, and the final computation will be called the reduced derivative.
Using this technique of Reduced Taylor Series Multivariate Interpolation in [34], together with the search for the nearest neighbor in the train dataset for a new x^test, is what we call the (dist)-Nearest Neighbor-(h)-Taylor Series-Reduced Multivariate Interpolation (dist-NN-(h)-TS-RMI), in reference to previous existing work (Rukundo & Cao, 2012), and it can (possibly) reduce the test error, especially when we overfit. In dist-NN-(h)-TS-RMI, dist stands for the nearest neighbor distance applied [e.g., 43], and h is the order of the computed Taylor approximation in [34]. Assumption [9.2], denoted (F_d), may not be valid, yet the method may still be effective for some datasets; another situation that could arise is that the closest data point j_2 to j_1 in a very small neighborhood set may not exist in the training data. Some experiments are required to confirm in which cases the proposed dist-NN-(h)-TS-RMI model is a good learner.
Taylor's theorem can approximate any (differentiable) function. Two questions of interest may then arise. Q1: What is the statistical distribution of any dist-Nearest Neighbor-(h)-Interpolated-Reduced response variable y, as the parameters of the model are shifted (by interpolation and reduction)? This is a great revolution in Statistical Learning Theory, as we can sufficiently (i.e., h ≥ 2) interpolate any machine learning regression problem with the reduced differentiation form when we know the perfect parameters that fit the training data. By then, the famous overfitting problem will no longer appear to be a hindrance, but an extraordinary gateway to another type of learning method, which is the search for the perfect fit for both the training and the test data, using, e.g., the proposed dist-NN-(h)-TS-RMI model.

Data Augmentation for Empirical Differentiation (DAED)
This section is an attempt to answer question Q2, as the Taylor approximation can truly solve any machine learning problem. We illustrate this fact using data augmentation. To make the assumption (F_d) valid, we need to create more samples from the existing training data.
To understand this intuition, remember that the partial derivative of a function ι(x_1, ..., x_n) in the direction x_i at the point (e_1, ..., e_n) is defined by ∂ι/∂x_i (e_1, ..., e_n) = lim_{δ_a→0} [ι(e_1, ..., e_i + δ_a, ..., e_n) − ι(e_1, ..., e_n)]/δ_a, where δ_a ∈ R has to be a very small real number. Because the close-neighborhood training data point j_2 for j_1 in the differentiation computation [34] is not available in practice in the training data, we can create more data, as shifted copies of existing training points, to compute the derivatives [35] and [34] with almost exact precision. This data augmentation framework, the DAED, transforms the dist-NN-(h)-TS-RMI model in [9.2] into another one: the (dist)-Nearest Neighbor-(h)-Taylor Series-Perfect Multivariate Interpolation (dist-NN-(h)-TS-PMI). We have also found that, in practice, the reduced model dist-NN-(h)-TS-RMI was not precise and was not an effective differentiation approach. In experiments with the Slump dataset, we still found the DBS to be a model that overfits the training data, but the reduced dist-NN-(h)-TS-RMI model could not generalize very well. The Perfect Multivariate Interpolation method, that is, the dist-NN-(h)-TS-PMI model, is recommended.
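A sketch of the DAED idea under the assumptions above (δ_a shifts of each coordinate; the model below stands in for a fitted learner, and all names are ours):

```python
import numpy as np

def daed_augment(x, delta=1e-3):
    """Create the q shifted copies x + delta * e_k of one training point (DAED)."""
    q = x.size
    return np.stack([x + delta * np.eye(q)[k] for k in range(q)])

def empirical_gradient(model, x, delta=1e-3):
    """Forward differences of a fitted model over the augmented points,
    giving an 'empirical' derivative without any analytical expression."""
    base = model(x)
    return np.array([(model(xa) - base) / delta
                     for xa in daed_augment(x, delta)])

model = lambda v: 2.0 * v[0] + v[1] ** 2     # stand-in for a fitted learner
g = empirical_gradient(model, np.array([1.0, 3.0]))
```

Because the augmented neighbor is constructed rather than searched for, the close-neighborhood point j_2 required by (F_d) always exists by design.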

Generalization Method
In the previous section, we saw how data can be augmented. Now, it is time to reveal how we generalize the model in order to make any test data point have its associate neighbor in the training data. The secret is to understand that every augmented data point can also be sufficiently differentiated, as much as we want, to overlap all the possible data space domain: for a training point x_{j_1}^train, we generate shifted copies such as (x_{j_1}^train(1) + s_1 · δ_a, x_{j_1}^train(2), ..., x_{j_1}^train(q)) and (x_{j_1}^train(1), x_{j_1}^train(2) + s_2 · δ_a, ..., x_{j_1}^train(q)), where (s_1, s_2, ..., s_q) is a q-tuple of integers or rationals. The re-training process has to be done progressively to obtain their perfect associated parameters. Last, but not least, each new augmented data point can again be differentiated as in [49], again and again, with no end.
We also found that, in practice, this generalization method is effective for the Slump dataset, but it takes many data augmentation iterations over the training data to reach convergence for the test data.

Experimental evaluation
In this section, we study the performance of the Potts Gibbs-network mixture model, Potts-Gibbs-NM for short, on real datasets. As a baseline, we compare the performance of Potts-Gibbs-NM with that of the classical feedforward neural network and with the multi-layer multi-target regression (MMR) model of Zhen et al. (2017). The performance of the models is measured on datasets from the Mulan Project (Tsoumakas et al., 2011).
As described above, there are five variants of the Potts-Gibbs-NM model: the fully-connected Gibbs network or full Gibbs network (Full-Gibbs, for short), the between-layer sparse Gibbs network (B-Sparse-Gibbs), the sparse shallow Gibbs network (Sparse-Gibbs), the sparse compound symmetry Gibbs network (CS-Sparse-Gibbs), and the compound symmetry Gibbs network (CS-Gibbs). The activation function used for all the neural networks in this study is a smooth version of the Rectified Linear Unit (ReLU), namely log(exp(x) + 1) ≈ max{0, x} (Lee et al., 2019). All the code used in the experiments can be downloaded from the first author's GitHub site (Alahassa, 2020).
To appreciate how we tweak each of our models, you need to pay attention to 13 factors:
• NumberPartitions (NP): an integer for the number of Potts model partitions generated;
• Minimum Potts Cluster Size (MPCS): an integer for the minimum size of each cluster in all partitions generated;
• NHLayer (NH): an integer for the number of neurons on the hidden layer (l_1);
• epoch times (ET): an integer for the number of training epochs;
• learning rate (LR): a real number for the learning rate of the parameters;
• Number EpLogLike (NE): the number of times we simulate λ_w to estimate the expected log-likelihood score inside the KL divergence to optimize.
• batch psize (BS): a real number for the proportion of batch training data;
• Simulate W b Pred (SWbP): after estimation of the variational λ_w, the number of times we simulate W and b from it to get a sample of (W, b) ready to be used for sampling the predicted test data from the model.
• Pred Simulate Ytest (PSY): the number of times we simulate Ytest given [each] W and b from our previous sample of (W, b). That is, for each (W, b), we sample the predicted response data Ytest "Pred Simulate Ytest" times from the model.
• Simulate Proba Partition (SPP): after estimation of the variational λ_w, the number of times we simulate W and b from it to get a sample of (W, b) ready to be used to estimate the acceptance probability of each partition from the model.
• The smoothing parameter δ applied to smooth slightly every prediction [see 6];
• The number of times we backpropagate for the updates ψ_{i,t} = (W^(1)_{i,t}, W^(2)_{i,t}, b^(1)_{i,t}, b^(2)_{i,t}, Σ_{i,t}) of all the model parameters [see 7.1];
• The learning rate vector ε_dbs for all the parameters involved in the DBS optimizer.
The experiments were performed on 5 multivariate multiple regression datasets (Slump, EDM, Jura, Water Quality, and SCPF) taken from the multiple-output benchmark datasets available in the Mulan project website (Tsoumakas et al., 2020). The datasets are shown in Table 1.
Here, (x_i, y_i), the i-th sample in the testing set D_test, is composed of features x_i and ground truth y_i; ŷ_i is the model prediction of y_i; and Ŷ is the average of the adjusted values ŷ over the training set. We take the average RRMSE across all the response dimensions (target variables) within the testing set D_test; a lower aRRMSE indicates better performance. The MMR model has already substantially outperformed the best results from state-of-the-art algorithms on most of these eleven datasets. But one counterintuitive thing we have noticed is that the Relative Root Mean Squared Error (RRMSE) is not a measure of goodness of fit.
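A sketch of the aRRMSE computation as described above (we use the mean of the ground-truth targets as a stand-in for Ŷ, the training-set average of adjusted values; all data are toy values of ours):

```python
import numpy as np

def arrmse(y_true, y_pred, y_train_mean):
    """Average RRMSE across target dimensions (lower is better):
    per target, sqrt( sum (pred - true)^2 / sum (mean - true)^2 )."""
    num = np.sum((y_pred - y_true) ** 2, axis=0)
    den = np.sum((y_train_mean - y_true) ** 2, axis=0)
    return float(np.mean(np.sqrt(num / den)))

y_true = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
y_pred = np.array([[1.1, 2.0], [2.9, 4.2], [5.0, -0.1]])
y_bar = y_true.mean(axis=0)        # stand-in for the training-set average
score = arrmse(y_true, y_pred, y_bar)
```

The denominator normalizes by a constant-mean predictor, which is why a model can score well on aRRMSE without being a good fit in absolute terms, as remarked above.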
The size of the testing data is 20% of the whole dataset each time.

The Results
The MMR Relative Root Mean Square Error (RRMSE) is beatable, but RRMSE is still not the best measure.

Some comments about the model structures

We have designed the sparse and B-sparse Gibbs networks for training with a small number of observations. As a side-effect, our sparse networks are also computationally fast to train. In fact, we observed in practice that, in general, and relative to the fully-connected Gibbs network (given the machine used), the sparse network structures speed up training by at least 2.5× (sparse-Gibbs) and 1.5× (B-sparse-Gibbs), all while doing a single training run (epochs) at a time, especially for the Slump dataset [see 4, 7, 5].
When we look at the sparse network structures, we also observe that the gradients oscillate or vanish less than in the fully-connected and compound symmetry Gibbs networks. Moreover, we can argue (generally speaking) that the sparse structured networks are more effective at reducing the RRMSE than the fully-connected Gibbs network, and more effective than the compound symmetry Gibbs network variants [4, 7, 5]. Of course, this depends on the dataset, and the finding is not consistent across datasets. However, it is impressive that sparse models can match and beat the performance of denser networks with fewer weight parameters. One likely reason for this behavior is that sparse neural network models are known to capture features useful to uncover broader and more general aspects of the data, resulting in better (learning) prediction (Liu, 2020). As many other works have shown (Srinivas et al., 2017), our work shows that sparse learning is possible.
The compound symmetry structure imposed on the precision matrix Ω is conceived so as to simplify the model in terms of parsimony. As opposed to the fully-connected Gibbs network, for which the unstructured precision matrix specifies no patterns in the weight spatial connections (that is, the precision matrix is completely general), the compound symmetry structure specifies that weights on the same layer have homogeneous (or nearly equal) covariances, and homogeneous (or nearly equal) variances. In practice, this saves a lot in terms of the number of parameters, leading to substantially faster training than the fully-connected network in some cases. Obviously, these results cannot be over-generalized either.
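To make the parsimony argument concrete, here is a small sketch (ours, assuming NumPy) of a compound symmetry precision block: two free parameters, a common diagonal term a and a common off-diagonal term b, regardless of the block dimension d:

```python
import numpy as np

def cs_precision(d, a, b):
    """Compound symmetry block: a on the diagonal, b everywhere else.

    Positive definite when a - b > 0 and a + (d - 1) * b > 0.
    """
    return b * np.ones((d, d)) + (a - b) * np.eye(d)

d = 50
Omega = cs_precision(d, a=2.0, b=0.3)
assert np.all(np.linalg.eigvalsh(Omega) > 0)  # a valid precision matrix

# Free parameters: 2 for the CS block versus d*(d+1)//2 unstructured.
print(2, "vs", d * (d + 1) // 2)
```

For d = 50 weights, the unstructured symmetric block needs 1275 free parameters against 2 for the compound symmetry block, which is the source of the training-time savings noted above.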
There is an open door for research about conceiving good Bayesian tests (Mulder & Fox, 2013), or other model selection mechanisms, to choose the most suitable model among the shallow Gibbs networks for a given dataset. To reach this convergence, we have identified the need to modify this optimizer with a slight correction. Mainly:
So each training datum has its own learning rate, which is here set as ε_{ŷ_est,(i,t)}. This is valuable for each test datum as follows: for the k-th test datum x^test_k, the changes f_{ψ_{i,t}}, L_{i,t}, u_{i,t} are taken from the j-th training datum y_j which verifies the association criterion, where the operation Mean(u) for a vector u is taken over all dimensions of u. In our experiments with the Slump dataset, we have chosen ε^test_{ŷ_est,(i,t)} to be twice as large as ε^train_{ŷ_est,(i,t)}. In equation [45], ∂MSE(y − ŷ_est)/∂ŷ_est has to be approximated.
When the training error was getting lower while the test error was not decreasing, we doubled ε^test_{ŷ_est,(i,t)} again compared to ε^train_{ŷ_est,(i,t)} specifically. The convergence property of the DBS optimizer in equation [45] for the training data is not surprising, due to the convergence properties of gradient descent [Lee et al. (2016); Ruder (2016)] for the Mean Squared Error MSE(ŷ_est) = ||y − ŷ_est||²_F, which is convex in ŷ_est. However, because of the approximation [44], this convergence is not reached with the same speed for the test data, but the model still does converge. In that case, to reduce test errors to zero, a last ingredient needs to be applied, the Universal Approximation Theorem: Proposition 12.1 (Guilhoto (2018)) A Neural Network with a single hidden layer and sufficiently many nodes is capable of approximating any continuous function.
Intuitively, it means that we need to increase the number of neurons within the hidden layer to reach a good prediction for the test data. This would also be valid if the model were a multi-layer architecture. To understand this intuition, look at equation [45], and note that when convergence is reached for the training data, we have ∂MSE(y − ŷ_est)/∂ŷ_est = 2(y_i − ŷ_est,(i,t)) = 0, and equation [45] for the test data x^test reduces accordingly. We simply then need to find the right (neural) machine that will associate x^test to y^test with perfection, very much like an encoder or neural cryptography machine [Ruttor et al. (2007), Ruttor et al. (2004), Volná (2000), Kanter & Kinzel (2003)]. To obtain this machine ψ^test_i, which has to encode as much information as possible to be efficient, we need to extract it from its associated training data in equation [43],(6) and, to make it robust, the Universal Approximation Theorem requires the hidden layer to be high dimensional,(7) i.e. an appropriate number of neurons on the hidden layer. This last requirement transforms the DBS optimizer into another DBS optimizer [the combination of the Universal Approximation Theorem and DBS optimization] with an optimal number l_1 of neurons on the hidden layer [see 2] of the shallow net, an adapted number ζ of DBS updates, and an optimal DBS learning rate ε_dbs, called the (l_1, ζ, ε_dbs)−DBS, which, combined with the findings of section [9], i.e. the (dist)-Nearest Neighbor-(h)-Taylor Series-Perfect Multivariate Interpolation (dist-NN-(h)-TS-PMI) presented in [10], is the Perfect fit(8) (or the Perfect learning) for the Shallow Gibbs Network, summarized in equation [47].
6. This equation can be improved too for better Train-Test association.
7. Where the model can also be multi-layer as well.
8. Chapter [9] is added to reinforce the (l_1, ζ, ε_dbs)−DBS model.
where MSE_Train and MSE_Test are the Mean Squared Errors of the train and test data respectively, dist_opt is the optimal distance for the search of the nearest neighbor in the training dataset for each test datum x^test_i, and h_opt is the optimal order of the Taylor approximation for the Perfect Multivariate Interpolation (dist-NN-(h)-TS-PMI) model once the (l_1, ζ, ε_dbs)−DBS has overfitted the training dataset. l_1,opt and ζ_opt are respectively the optimal number of hidden neurons and the optimal number of DBS updates. ε_dbs integrates simultaneously the DBS learning rate vector for all the model parameters, the DBS learning rate for the training data, and the DBS learning rate for the test data; ε_dbs,opt is the optimal one. We have performed two more experiments to confirm this fact, summarized in Table [9]. Table [9] may be treated as over-fitting, but the advantage of this finding is a confirmation of the convergence (error → 0) with the (l_1, ζ, ε_dbs)−DBS, where we may apply a simple optimization [Grid Search or Golden Section Search method] to find the perfect parametrization of (l_1, ζ) and the required learning rate ε_dbs.
We have also found in practice that when l_1 and ζ are too large, the model may diverge. So it is important to find the right ratio ζ/l_1 (with an appropriate DBS learning rate).
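To fix ideas, here is a toy sketch of the second backpropagation stage: per-observation gradient updates of the parameters under the squared error, with a linear predictor standing in for the shallow Gibbs net (the names and the choice of a linear f_ψ are our illustration, not the paper's exact algorithm):

```python
import numpy as np

def dbs_refine(W, b, X, Y, eps=1e-2, n_updates=100):
    """Second backpropagation stage (sketch): per-observation updates of
    psi = (W, b) under MSE(y - y_hat), with DBS learning rate eps."""
    W, b = W.copy(), b.copy()
    for _ in range(n_updates):
        for x_i, y_i in zip(X, Y):
            y_hat = W @ x_i + b
            grad_yhat = -2.0 * (y_i - y_hat)      # d MSE / d y_hat
            W -= eps * np.outer(grad_yhat, x_i)   # chain rule through f_psi
            b -= eps * grad_yhat
    return W, b

rng = np.random.default_rng(0)
W0 = np.array([[1.0, -2.0, 0.5], [0.3, 0.7, -1.1]])
X = rng.normal(size=(20, 3))
Y = X @ W0.T                                      # noiseless targets
W_init = W0 + 0.5 * rng.normal(size=W0.shape)     # rough first-stage estimate
W_fit, b_fit = dbs_refine(W_init, np.zeros(2), X, Y)
print(np.mean((Y - X @ W_fit.T - b_fit) ** 2))    # training error driven near 0
```

With a learning rate this small the per-observation updates are stable, while too large an eps (or too many updates relative to the model size) reproduces the divergence noted above.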
When there is enough training data available, another way to increase the convergence speed for the test data is to modify the criterion used in optimization [43], and use a distance dist for which each test datum x^test_k is ensured to find an associate x^train_i in the training data within distance ε, where ε is a very small number.
[Experimental settings: NE = 2, BS = 0.8, SWbP = PSY = SPP = 5, DBS learning rate = 10e−2, and no smoother δ is applied; test-data DBS learning rate = 2 × (10e−2); (*) the DBS learning rate of the test data is twice the learning rate of the training data.]
With the machine being used, increasing the training times (ET = epoch times) for the neural networks in each cluster of the Shallow Gibbs model is costly in CPUs [Figure 2]. The available CPUs are saturated in less than 20 minutes of execution. To solve this problem, we have applied the Reusable Process Pool Executor, a framework that combines threading and multiprocessing primitives for robust concurrent futures (McCullagh (2017)). We present in chapter ?? a framework to build a new structured Potts clustered Gibbs Multivariate Regression model, which is, properly speaking, a Random Gibbs Neural Network Forest (Barber & Bishop (1997), Barber & Bishop (1998)), obtained by combining structured precision matrices, Potts Neural Network Regression (SPNNR), variational learning, and backpropagation. You may call the model by many appellations: Shallow Gibbs Network, Random Gibbs Network Forest, Shallow Potts Neural Network, the Potts-Gibbs-NM model, etc. The model has five variants: the fully-connected Gibbs network or full Gibbs network (Full-Gibbs, for short), the between-layer sparse Gibbs network (B-Sparse-Gibbs), the sparse shallow Gibbs network (Sparse-Gibbs), the sparse compound symmetry Gibbs network (CS-Sparse-Gibbs), and the compound symmetry Gibbs network (CS-Gibbs). For literature references to similar sparse-structured models, see the work of Ionescu et al.
(2019), with the exception of the compound symmetry Gibbs network, which is a particular innovation of our research. We can even say that one main innovation of this research is the introduction of a compound symmetry precision matrix on the weights of our Bayesian shallow neural networks. The effectiveness of our results will definitely increase interest in these types of matrices.
The neural network (our base learner) weights are Gaussian Markov Random Field distributed. To offer flexibility in terms of model structure, the precision matrix has been modified to infer three variants of analysis: sparse analysis, compound symmetry analysis, and the fully-connected framework. These models have the property of adapting themselves easily to the data in a few iterations during the learning process. The model is mounted in three stages: a set of data y_n = {y_1, ..., y_n} ⊂ R^p with associated covariates; each y_i following a Gaussian distribution p(y | x, ψ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp{−(1/2)(y − f_ψ(x))^T Σ^{−1} (y − f_ψ(x))}, with covariance Σ and mean f_ψ(x), which specifically represents a neural network function of the covariate x. The weights and the biases of the model, vectorized as ψ = (b^(1), W^(1), b^(2), W^(2)), are high dimensional and their size can be increased at will. But keep in mind that the goal of the Shallow Gibbs Network is to build a simplified model with few neurons and a better mean squared error in a very short training time (reduced number of epochs ≤ 10).
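For concreteness, the Gaussian likelihood above can be evaluated as in the following sketch (ours, assuming NumPy; tanh is an assumed activation, since the paper's choice is not restated here):

```python
import numpy as np

def shallow_net(x, psi):
    """f_psi(x) for psi = (b1, W1, b2, W2), one hidden layer."""
    b1, W1, b2, W2 = psi
    return W2 @ np.tanh(W1 @ x + b1) + b2

def log_gaussian_lik(y, x, psi, Sigma):
    """log p(y | x, psi, Sigma) for the density given in the text."""
    r = y - shallow_net(x, psi)                  # residual y - f_psi(x)
    p = y.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

# When the residual is zero and Sigma = I, the log-density is -p/2 * log(2*pi).
p, l1, q = 2, 4, 3
psi = (np.zeros(l1), np.zeros((l1, q)), np.array([0.5, -1.0]), np.zeros((p, l1)))
print(log_gaussian_lik(np.array([0.5, -1.0]), np.ones(q), psi, np.eye(p)))  # -log(2*pi)
```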
First, we define the Kullback-Leibler divergence of the model variational distribution [see 5] that will help to simulate the variational parameters. This approach is very rich in terms of properties and convergence (da Rocha et al. (2011), Wiegerinck & Kappen (2000), Challis & Barber (2013)), even though it suffers from drawbacks such as maintaining the positive definiteness of the precision/covariance matrices during gradient updates [see 5.2]. Gradient updates have good advantages, such as fast approximations in a few iterative steps, especially if you tune the learning rate appropriately. When you apply an approximation to your parameters or hyper-parameters during the updates, you step toward the Iterative Projected Gradient (IPG) updates approach (Cruz et al. (2011)), which also converges if the projection space is an optimal subspace. As the case of PSD (Positive Semi-Definite) approximation is common in the literature, we can cite the paper "Approximation by matrices positive semidefinite on a subspace" by Hayden & Wells (1988), which was a great help in building our model.
Needless to say, the double backpropagation scheme [see 7.1] is truly a requirement to reach an outstanding result. The secret behind it is that the second backpropagation system constitutes another neural network in its own right. In other words, given a set of initial predictions for the data (x, y) and estimated or random values for ψ = (b^(1), W^(1), b^(2), W^(2)) and Σ, it naturally reduces the model error by backpropagating under a suitable learning rate schedule. Our experiments and results were not only efficient in comparison to the MMR model of Zhen et al. (2017), but also impressive against the Murua & Quintana (2017) model.
The five (5) models developed in this framework include a clustered neural network regression via a Potts model. Monte Carlo simulation is too limited to run posterior simulation for these models, and variational inference was the last resort. This novel representation (framework) also aims to capture more complex patterns in small datasets with high dimensional features. We have increased relative prediction power compared to simple prediction, and offered the novice an adapted regression model for heterogeneous data that can surpass any Multilayer Feed-Forward Neural Network (FFNN) and may drive the next revolution of neural networks. And finally, we have found for the Shallow Gibbs Network model [2] that the perfect learning configuration is the dist-NN-(h)-TS-PMI-(l_1, ζ)−DBS, which is a combination of the Universal Approximation Theorem and the DBS optimization, all coupled with the (dist)-Nearest Neighbor-(h)-Taylor Series-Perfect Multivariate Interpolation (dist-NN-(h)-TS-PMI). It indicates that, with an optimal number l_1 of neurons on the hidden layer, an adapted number ζ of DBS updates, an optimal distance dist_opt in the search for the nearest neighbor in the training dataset for each test datum x^test_i, and an optimal order h_opt of the Taylor approximation for the Perfect Multivariate Interpolation (dist-NN-(h)-TS-PMI) model once the DBS has overfitted the training dataset, the train and the test error converge to zero (0). This finding is a great revolution in all fields and subfields of Statistical Learning Theory, from now to forever. The generalization power of this model is infinite, as the secret is to understand that every augmented datum can also be sufficiently differentiated, as much as we want, to overlap all the possible data space domain: x^aug-train_{j_1,(s_1, s_2, ..., s_q)}, where (s_1, s_2, ..., s_q) ∈ N^q is a q-tuple of integers or rationals, x^aug-train_{j_1} is our augmented datum in the training set, and δ_a ∈ R is a very small number.
The re-training process has to be done progressively to obtain the perfect associated parameters [see 11].

A Generalized Double Back-Propagation Scheme (GDBS) for any parametric model
We propose an effective Generalized Double Back-Propagation scheme for any parametric model, augmented with a differential and local neighborhood machine learning framework for almost sure convergence. As an extension of section [7.1], we propose a general double back-propagation scheme (GDBS) using the Mean Squared Error (MSE), for any (parametric) model with parameter ψ, as follows:
1. Using the equation (in 50), apply any suitable optimization framework to obtain a general estimate ψ̂_0 of ψ for the whole model (in a first step).
2. Use equation 50 again to backpropagate the MSE(y_i − ŷ_i) = ||y_i − ŷ_est,i||² to update ψ̂_0 per observation as follows: ψ̂_{t+1,i} = ψ̂_{t,i} − ε_{ψ,t} ∂MSE(y_i − ŷ_est,i)/∂ψ, where ε_{ψ,t} is the learning rate schedule [see C.1] for this gradient update of ψ at step t, and ψ̂_{t,i} is the value of ψ at iteration t for observation i.
So each training datum has its own learning rate, here set as ε_{ŷ_est,(i,t)}. This is valuable for each test datum as follows: for the k-th test datum x^test_k, the changes f_{ψ_{i,t}} are taken from the j-th training datum y_j which verifies the association criterion, where the operation Mean(u) for a vector u is taken over all dimensions of u. The criterion used in optimization [43] can be modified for a distance dist for which each test datum x^test_k is ensured to find an associate x^train_i in the training data within distance ε, where ε is a very small number. This presented framework will be called the (ζ, ε_dbs)−GDBS and, augmented with the Data Augmentation for Empirical Differentiation (DAED) framework in section 10, shall be called the dist-NN-(h)-TS-PMI-(l_1, ζ, ε_dbs)−GDBS. When the model is truly differentiable, with assumption (F_d) being valid, the learning with this model is almost surely perfect. This is summarized in one equation [57].
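The two GDBS steps can be sketched for an arbitrary parametric model as follows (our illustration; gradients are taken numerically so that any model f(ψ, x) plugs in, and the starting estimate ψ̂_0 stands for the output of step 1):

```python
import numpy as np

def num_grad(loss, psi, h=1e-6):
    """Central-difference gradient, so any parametric model can be used."""
    g = np.zeros_like(psi)
    for k in range(psi.size):
        e = np.zeros_like(psi)
        e[k] = h
        g[k] = (loss(psi + e) - loss(psi - e)) / (2 * h)
    return g

def gdbs_step2(f, psi0, X, Y, eps=1e-2, n_updates=100):
    """Step 2: backpropagate the per-observation MSE to refine psi0."""
    psi = psi0.astype(float).copy()
    for _ in range(n_updates):
        for x_i, y_i in zip(X, Y):
            loss_i = lambda p: float((y_i - f(p, x_i)) ** 2)
            psi = psi - eps * num_grad(loss_i, psi)
    return psi

# Toy parametric model y = a*x + b with true (a, b) = (2, 1);
# a deliberately poor psi0 stands in for the global step-1 estimate.
f = lambda p, x: p[0] * x + p[1]
X = np.linspace(-1.0, 1.0, 10)
Y = 2.0 * X + 1.0
psi = gdbs_step2(f, np.array([0.0, 0.0]), X, Y)
print(psi)  # close to [2.0, 1.0]
```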

The Infinite Zelda Stochastic Game.
This is a stochastic game framework we have derived as a direct application of the Potts Shrinkage model coupled with the Shallow Potts Gibbs models. We all know that the next generation of games will come from artificial intelligence (AI) networks (Hsu (2004), Hassabis (2017), DeCoste (1997), Granter et al. (2017)). The core idea of this part is inspired by Bowling & Veloso (2002), who described a scalable learning algorithm for stochastic games, and mainly by the equivalence between the Potts model and percolation (Essam (1979), Ding et al. (2012), Kemppainen & Smirnov (2019)). We present here the Infinite Zelda Game (IZG), which relies on the learning process of the Shallow Gibbs Network (SGN) model.
The random partitions can be seen as random graphs. Each partition from the random bond Potts models used for its simulation presents some connected components and isolated elements. With a given subset of data {(x_i, y_i)}_{1≤i≤N}, the graph generated at each step is equivalent to a network graph built from pair interactions between the neighbors. Regardless of the connectivity structure, the graph has v^N states if each data point is characterized by v possible states (or spins). Each of the v states is represented graphically by a color. At each step (a new distinct partition generated from a previous one), y_i can move from state v_1 to v_2, thanks to the clustering process, and its estimate from the Shallow Gibbs Network (SGN) is denoted ŷ_i. Two possibilities arise:
i. The estimate ŷ_i is closer (in distance metric) to the real y_i than its assigned cluster S_i's estimated mean (1/s_i) Σ_{j∈S_i} y_j, with |S_i| = s_i. This possibility is called pos_1.
ii. The estimate ŷ_i is not closer (in distance metric) to y_i than its assigned cluster S_i's mean (1/s_i) Σ_{j∈S_i} y_j, with |S_i| = s_i. This possibility is called pos_2.
We set each available dataset D_s to represent a Mansion (the training part). In each Mansion, the player has to maximize his coin collection to be eligible to move on to another Mansion. Each generated partition proposes a conditional estimate ŷ_i for y_i ∈ Y = {y_i, 1 ≤ i ≤ q} through the Shallow Gibbs Network (SGN). The player has to find, by mouse or finger click in the network graph, the colored points in position pos_2. Obviously, the isolated points (in white) are always in position pos_2.
To facilitate the game, we always declare the number b of data points in position pos_2, and ask the player to find them in b_e trials, with b_e = b + 5. Because the partitions are generated by Swendsen-Wang cuts (Barbu & Zhu (2003)), the game is set for the player to accumulate coins as he detects the right squares in position pos_2, in few trials, and within a given recorded time t_e.
The Goal of the Infinite Zelda Game (IZG): "Win or Learn More". The goal of the Infinite Zelda Game (IZG) is to maximize your coins through the mansions. The more you win, the quicker you can move to another mansion. One more trick has to be done each time before generating a partition (a new network graph): the player has to choose the number of dist-NN-(h)-TS-PMI-(l_1, ζ, ε_dbs)−DBS updates for the Shallow Gibbs Network (SGN), characterizing his power level. By default, the power is set to 5, and he has to buy more coins to set more power.
The more dist-NN-(h)-TS-PMI-(l_1, ζ, ε_dbs)−DBS updates you set, the more you may win at each partition configuration, because it reduces the number of data points (y_i) in position pos_2 under a Potts Shrinkage constraint ≥ m ≥ 5: the more power level you set, the more you reduce the number of positions pos_2 to find. The Infinite Zelda Game (IZG) for the Shallow Gibbs Network (SGN) can be summarized in the WoLM principle ("Win or Learn More"), which encourages convergence.
Every symmetric positive definite matrix A can be factored as A = ZZ^T, where Z is a lower triangular matrix with positive diagonal elements; Z is called the Cholesky factor of A, and Z is unique.

Appendix A. Other Experiments Results with the Shallow Gibbs Models
Application to Gaussian distribution sampling. The following example is from Williams & Rasmussen (2006). For a Gaussian variable y ∼ N(η, κ_0), the multivariate normal distribution has a joint probability density given by p(y | η, κ_0) = (2π)^{−d/2} |κ_0|^{−1/2} exp{−(1/2)(y − η)^T κ_0^{−1} (y − η)}, where η ∈ R^d is the mean vector and κ_0 ∈ M_d(R) is the (symmetric, positive definite) covariance matrix. To get samples of y from this distribution, apply the following steps:
• Compute the Cholesky decomposition of the covariance matrix κ_0: that is, find a lower triangular matrix Z ∈ M_d(R) such that κ_0 = ZZ^T. Matrix Z will be useful in a further step.
• Generate independent samples v ∼ N(0, I).
• Compute y = η + Zv. The variable y = η + Zv has a multivariate normal distribution since it is a linear combination of independent normally distributed variables; moreover, its mean is η and its covariance is ZZ^T = κ_0.

Appendix C. The learning rate
Definition C.1 (The learning rate schedule). The function ε(s) : N → R is called the learning rate schedule.
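The three sampling steps of the Gaussian example in Appendix A can be sketched as follows (assuming NumPy; the specific η and κ_0 are our toy values):

```python
import numpy as np

rng = np.random.default_rng(42)
eta = np.array([1.0, -2.0])                    # mean vector
kappa0 = np.array([[2.0, 0.6], [0.6, 1.0]])    # SPD covariance matrix

Z = np.linalg.cholesky(kappa0)                 # step 1: kappa0 = Z @ Z.T
v = rng.standard_normal((2, 100_000))          # step 2: v ~ N(0, I)
y = eta[:, None] + Z @ v                       # step 3: y ~ N(eta, kappa0)

print(np.allclose(y.mean(axis=1), eta, atol=0.02))   # empirical mean ~ eta
print(np.allclose(np.cov(y), kappa0, atol=0.05))     # empirical cov ~ kappa0
```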

Appendix D. Stochastic gradient
The stochastic gradient method can be either ascent or descent. We will illustrate the standard properties of this algorithm using the descent configuration, because both (ascent & descent) have similar characteristics (or identical features).
Stochastic gradient descent (often shortened as SGD) is a stochastic approximation of the gradient descent method for minimizing an objective function that is written as a sum of differentiable functions. The term stochastic here refers to the fact that we do not know the gradient exactly, but rather know a noisy estimate of it instead.
This paragraph may require further reading about the Evidence Lower Bound (ELBO) in section 5, where more details have been proposed. Following Bottou (2012), the stochastic gradient descent algorithm replaces the gradient by an estimate Z(λ_n; ξ_n) of the gradient at the optimized parameter λ_n, emphasizing the stochastic nature through the random vector ξ_n. A class of possibilities is given by Z(λ_n; ξ_n) = (1/n_t) Σ_{i∈S_t} ∇C_i(λ_n), where S_t ⊂ {1, ..., n_o} and n_t = |S_t| gives the number of observations on which to base the estimate of the gradient. In our case, we have set n_t ≤ 2, enabling high-speed computation but requiring many iterations (≥ 2). We will consider the standard assumptions on {ε_n} (notably Σ_n ε_n = ∞ and Σ_n ε_n² < ∞). Theorem D.1 (Quasimartingale convergence theorem (Robbins & Siegmund, 1971) and (Fisk, 1965)) If (X_n)_{n=1}^∞ is a positive stochastic process whose positive expected increments are summable, i.e. Σ_n E[(E[X_{n+1} | F_n] − X_n)^+] < ∞, then X_n → X_∞ almost surely on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P), with F_n = σ(X_m | m ≤ n). Now, let C : R^k → R be differentiable. How do solutions s : R → R^k to the gradient-flow ordinary differential equation ds/dt = −∇C(s(t)) behave? The stochastic discretization s_{n+1} = s_n − ε_n H_n(s_n), where H_n is a noisy estimate of the gradient with E[H_n(s_n) | F_n] = ∇C(s_n), uses the Forward Euler discretization method (Villatoro & Ramos, 1999).
Step 1. Define the Lyapunov sequence h_n = ||s_n − x*||², where x* is the minimum of C. h_n is not guaranteed to be decreasing. The main idea from Bottou (1998) is: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n.
Step 2. Consider the variations of h_n: with s_{n+1} = s_n − ε_n H_n(s_n),
h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩ = −2ε_n ⟨s_n − x*, H_n(s_n)⟩ + ε_n² ||H_n(s_n)||².
So E[h_{n+1} − h_n | F_n] = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² E[||H_n(s_n)||² | F_n].
Step 3. Show that h_n converges almost surely. Assuming E[||H_n(x)||²] ≤ A + B||x − x*||², we get:
h_{n+1} − h_n ≤ −2ε_n ⟨s_n − x*, H_n(s_n)⟩ + ε_n² (A + B h_n), i.e. h_{n+1} − (1 + ε_n² B) h_n ≤ −2ε_n ⟨s_n − x*, H_n(s_n)⟩ + ε_n² A.
Condition (2) simply states that the opposite of the gradient, −∇_x C(x), always points towards the minimum x*. This is also a convexity-type criterion, ∀ε > 0, inf_{||x − x*||² > ε} ⟨x − x*, ∇C(x)⟩ > 0, which ensures that the inner-product contribution −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ is always negative, so that E[h_{n+1} − (1 + ε_n² B) h_n | F_n] ≤ ε_n² A. Introduce the sequence µ_n = Π_{i=1}^{n−1} 1/(1 + ε_i² B) → µ_∞ as n → ∞, and h'_n = µ_n h_n. We get that (h'_n)_{n=1}^∞ converges a.s.
Step 4. Show that h_n must converge to 0. From the previous calculations, E[h_{n+1} − (1 + ε_n² B) h_n | F_n] = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² A. (h'_n)_{n=1}^∞ converges, so the sequence on the left-hand side above is summable a.s. Because Σ_{n=1}^∞ ε_n² < ∞, the right term (ε_n² A) is summable a.s., so the remaining term is also summable a.s.: Σ_{n=1}^∞ ε_n ⟨s_n − x*, ∇C(s_n)⟩ < ∞ almost surely, and ε_n ⟨s_n − x*, ∇C(s_n)⟩ → 0 almost surely. We can conclude that (h_n)_{n=1}^∞ converges to zero, and this also forces ⟨s_n − x*, ∇C(s_n)⟩ → 0 almost surely.
We can reach this conclusion because we know that h_n converges. Reasoning by contradiction: assume that h_n = ||s_n − x*||² converges to a value greater than zero and therefore, after a certain time, remains greater than some ε > 0. Assumption (2) then implies that ⟨s_n − x*, ∇C(s_n)⟩ remains greater than a strictly positive quantity. Since Σ_n ε_n = ∞, this would cause the sum Σ_{n=1}^∞ ε_n ⟨s_n − x*, ∇C(s_n)⟩ to diverge; but this is not the case (Σ_{n=1}^∞ ε_n ⟨s_n − x*, ∇C(s_n)⟩ < ∞), so we can conclude that h_n converges to zero. We then get s_n → x* (and simultaneously ⟨s_n − x*, ∇C(s_n)⟩ → 0), which ends the proof.
Application to our stochastic gradient algorithm: If C is the ELBO function [Section 5] averaged across a data set, the true gradient is of the form ∇C(λ_n) = (1/n_o) Σ_{i=1}^{n_o} ∇ELBO_i(λ_n), and an approximation is formed by subsampling: ∇̂C(λ_n) = (1/n_t) Σ_{i∈S_t} ∇ELBO_i(λ_n), with S_t ∼ Uniform(subsets of size n_t). The only difference between the Stochastic Gradient Ascent (SGA) algorithm and Stochastic Gradient Descent (SGD) is that we want to maximize the ELBO function. For that purpose, we simply reformulate maximizing the ELBO as minimizing its negative, to apply the convergence theorem. In our case: i. The learning rate vector α_t = (α¹_t, α²_t, α³_t) defined in section 7.1 (α¹_t for the weights, α²_t for the biases, and α³_t for the covariance matrix Σ) is set to be decreasing, so as to make conditions [(A−1), (A−2), and (A−3)] hold in our experiments. ii. The Evidence Lower Bound (ELBO) is continuous and differentiable at every point, and is bounded by log p(y) (Yang, 2017). So in Theorem [D.1], condition (3) always holds for the ELBO. But it is not the case for conditions (1) and (2), because of the non-convexity of the ELBO. In their seminal paper from 1951, Robbins and Monro showed that the stochastic optimization will converge to a local optimum in our case (Robbins & Monro (1951)). So, it requires a good initialization.
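A toy sketch of the subsampled ascent just described (our stand-in objective: each ELBO_i term is replaced by a simple concave surrogate −(y_i − λ)², and α_t = 1/t is decreasing so that conditions of type (A−1)/(A−2) hold):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, size=200)

def grad_term(lam, y_i):
    """Per-observation gradient of the surrogate -(y_i - lam)^2."""
    return 2.0 * (y_i - lam)

lam, n_t = 0.0, 2            # mini-batches of size n_t <= 2, as in the text
for t in range(1, 5001):
    S_t = rng.choice(len(data), size=n_t, replace=False)   # S_t ~ Uniform
    g = np.mean([grad_term(lam, data[i]) for i in S_t])    # subsampled gradient
    lam += (1.0 / t) * g     # ascent step with decreasing rate alpha_t = 1/t

print(lam)  # converges near the maximizer (the data mean, about 3)
```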
In practice, the quality of the approximation depends on the variance of the estimator Z(λ_n; ξ_n) (Johnson & Zhang, 2013). The main advantage of this algorithm is that one can even estimate the model with a cluster of small size, and still get good estimates. The Robbins-Siegmund theorem (Robbins & Siegmund, 1971) provides the means to establish almost sure convergence under conditions including (A−2), also in cases where the loss function is non-smooth (Saad, 1998).
The size of the subset used to estimate the gradient can be considered in the same way we think of sample size in a simple estimation problem. Large mini-batch sizes yield reliable gradient estimates, reducing the variance of the parameter updates. Small mini-batches, by comparison, are cheap to compute.
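This trade-off is easy to verify empirically (our sketch; the per-observation gradients are simulated as unit-variance noise):

```python
import numpy as np

rng = np.random.default_rng(7)
per_obs_grads = rng.normal(size=10_000)   # stand-in per-observation gradients

def batch_estimate_var(n_t, reps=2000):
    """Variance of the mini-batch mean gradient over many random batches."""
    idx = rng.integers(0, per_obs_grads.size, size=(reps, n_t))
    return per_obs_grads[idx].mean(axis=1).var()

v_small, v_big = batch_estimate_var(2), batch_estimate_var(32)
print(v_small > v_big)   # variance shrinks roughly like 1/n_t
```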