Effortless Bayesian Deep Learning through Laplace Redux
JuliaCon 2022
Patrick Altmeyer
Don’t put all your 🥚 in one 🧺.
[…] parameters correspond to a diverse variety of compelling explanations for the data. (Wilson 2020)
\(\theta\) is a random variable. Shouldn’t we treat it that way?
\[ p(y|x,\mathcal{D}) = \int p(y|x,\theta)p(\theta|\mathcal{D})d\theta \qquad(1)\]
Intractable!
In practice we typically rely on a plugin approximation (Murphy 2022).
\[ p(y|x,\mathcal{D}) = \int p(y|x,\theta)p(\theta|\mathcal{D})d\theta \approx p(y|x,\hat\theta) \qquad(2)\]
Yes, “plugin” is literal … can we do better?
Yes, we can!
MCMC (see Turing)
Variational Inference (Blundell et al. 2015)
Monte Carlo Dropout (Gal and Ghahramani 2016)
Deep Ensembles (Lakshminarayanan, Pritzel, and Blundell 2016)
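To make Equation 1 concrete: given draws from the weight posterior (produced by any of the methods above), the integral is approximated by a simple Monte Carlo average. A minimal sketch in Julia, where `predict_with` is an illustrative stand-in for a forward pass of whatever model we use (here just a single logistic unit):

```julia
using LinearAlgebra, Statistics

σ(z) = 1 / (1 + exp(-z))                # logistic link

# Stand-in for p(y = 1 | x, θ): a single logistic unit for illustration;
# for a neural network this would be a forward pass with weights θ.
predict_with(θ, x) = σ(dot(θ, x))

# Monte Carlo approximation of the posterior predictive (Equation 1):
# p(y | x, D) ≈ (1/S) ∑ₛ p(y | x, θₛ)  for draws θₛ ~ p(θ | D)
posterior_predictive(θ_samples, x) = mean(predict_with(θ, x) for θ in θ_samples)

# Plugin approximation (Equation 2): collapse the posterior to a point θ̂.
plugin_predictive(θ_hat, x) = predict_with(θ_hat, x)
```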
We first need to estimate the weight posterior \(p(\theta|\mathcal{D})\) …
Idea 💡: Taylor approximation at the mode.
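Spelled out: a second-order Taylor expansion of the log posterior around the mode \(\hat\theta\) (where the gradient vanishes) yields a Gaussian approximation,

\[ \log p(\theta|\mathcal{D}) \approx \log p(\hat\theta|\mathcal{D}) - \frac{1}{2} (\theta-\hat\theta)^\mathsf{T}\mathbf{H}(\theta-\hat\theta) \quad\Longrightarrow\quad p(\theta|\mathcal{D}) \approx \mathcal{N}\left(\theta \,|\, \hat\theta, \mathbf{H}^{-1}\right) \]

where \(\mathbf{H} = - \nabla_{\theta}\nabla_{\theta}^\mathsf{T} \log p(\theta|\mathcal{D}) \big|_{\hat\theta}\) is the Hessian of the negative log posterior at the mode.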
Now we can rely on Monte Carlo sampling or the Probit Approximation to compute the posterior predictive (classification).
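For binary classification, the Laplace posterior induces a Gaussian over the logit, \(f(x) \sim \mathcal{N}(\mu_*, \sigma^2_*)\). The probit approximation then collapses the remaining integral into a rescaled sigmoid (Monte Carlo instead samples logits and averages the sigmoids):

\[ p(y=1|x,\mathcal{D}) \approx \sigma\left( \frac{\mu_*}{\sqrt{1 + \pi\sigma^2_*/8}} \right) \]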
LaplaceRedux.jl - a small package 📦
What started out as my first coding project in Julia …
… has turned into a small package 📦 with great potential.
LaplaceRedux.jl and another blog post.
We assume a Gaussian prior for our weights …
\[ p(\theta) = \mathcal{N} \left( \theta \,|\, \mathbf{0}, \lambda^{-1} \mathbf{I} \right)=\mathcal{N} \left( \theta \,|\, \mathbf{0}, \mathbf{H}_0^{-1} \right) \qquad(3)\]
… which corresponds to logit binary crossentropy loss with weight decay:
\[ \ell(\theta)= - \sum_{n}^N [y_n \log \mu_n + (1-y_n)\log (1-\mu_n)] + \frac{1}{2} (\theta-\theta_0)^\mathsf{T}\mathbf{H}_0(\theta-\theta_0) \qquad(4)\]
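The weight-decay term is just the negative log prior from Equation 3:

\[ -\log p(\theta) = \frac{\lambda}{2}\theta^\mathsf{T}\theta + \text{const} = \frac{1}{2}(\theta-\theta_0)^\mathsf{T}\mathbf{H}_0(\theta-\theta_0) + \text{const}, \qquad \theta_0=\mathbf{0},\ \mathbf{H}_0=\lambda\mathbf{I} \]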
For Logistic Regression we have the Hessian in closed form (p. 338 in Murphy (2022)):
\[ \nabla_{\theta}\nabla_{\theta}^\mathsf{T}\ell(\theta) = \sum_{n}^N(\mu_n(1-\mu_n)\mathbf{x}_n)\mathbf{x}_n^\mathsf{T} + \mathbf{H}_0 \qquad(5)\]
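To make Equations 3 to 5 concrete, here is a small, self-contained Julia sketch (not the package code): it evaluates the closed-form Hessian at a given MAP estimate `θ_map` and inverts it to get the Laplace posterior covariance. `X` (one row per observation) and the prior precision `λ` are illustrative inputs:

```julia
using LinearAlgebra

σ(z) = 1 / (1 + exp(-z))

# Hessian of the regularized loss (Equation 5) evaluated at θ:
#   ∇∇ℓ(θ) = ∑ₙ μₙ(1 - μₙ) xₙ xₙᵀ + H₀   with   H₀ = λI   (Equation 3)
# Note: for logistic regression the Hessian does not depend on the labels.
function loss_hessian(θ, X; λ = 0.5)
    H = Matrix(λ * I(length(θ)))              # start from H₀ = λI
    for n in 1:size(X, 1)
        xₙ = X[n, :]
        μₙ = σ(dot(xₙ, θ))
        H += μₙ * (1 - μₙ) * xₙ * xₙ'
    end
    return H
end

# Laplace approximation of the weight posterior at the mode:
#   p(θ|D) ≈ N(θ_map, Σ)   with   Σ = H(θ_map)⁻¹
laplace_covariance(θ_map, X; λ = 0.5) = inv(loss_hessian(θ_map, X; λ = λ))
```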
An actual MLP …
Low prior uncertainty \(\rightarrow\) posterior dominated by prior. High prior uncertainty \(\rightarrow\) posterior approaches MLE.
We’ve really been using linearized neural networks …
Applying the GGN approximation […] turns the underlying probabilistic model locally from a BNN into a GLM […] Because we have effectively done inference in the GGN-linearized model, we should instead predict using these modified features. — Immer, Korzepa, and Bauer (2020)
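Concretely, the linearization replaces the network with its first-order Taylor expansion in the weights, turning the predictive model into a GLM with fixed features given by the Jacobian \(\mathcal{J}(x)\):

\[ f(x;\theta) \approx f(x;\hat\theta) + \mathcal{J}(x)(\theta - \hat\theta), \qquad \mathcal{J}(x) = \nabla_{\theta} f(x;\theta)\big|_{\theta=\hat\theta} \]

With \(\theta \sim \mathcal{N}(\hat\theta, \Sigma)\) the logit is then Gaussian with mean \(f(x;\hat\theta)\) and variance \(\mathcal{J}(x)\,\Sigma\,\mathcal{J}(x)^\mathsf{T}\), which feeds directly into the probit approximation above.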
Learn about Laplace Redux by implementing it in Julia.
Turn code into a small package.
Submit to JuliaCon 2022 and share the idea.
Package is bare-bones at this point and needs a lot of work.