This repository contains experiments based on the paper by Y. K. Foong et al., 2018. The authors of the paper present results on Variational Inference in Bayesian neural networks for the fundamental problem of uncertainty quantification. They expose shortcomings of the two most popular approximate inference families: Mean Field Variational Inference and Monte Carlo Dropout.
Here you can find a brief description of the experiments implemented in this repository.
We start with our synthetic 1D regression dataset:

The random seed was fixed for reproducibility. Unfortunately, the authors do not describe the regression function or the data generation process, so we made these choices ourselves. The task is to recover the unknown regression function and its uncertainty using 4 methods: Gaussian Processes (GP), Hamiltonian Monte Carlo (HMC), and Variational Inference with 2 approximate families: Mean Field Variational Inference (MFVI) and Monte Carlo Dropout (MCDO).
In this experiment we fit a GP using the GPflow library. We trained the GP with maxiter = 100 using a Matern52 kernel with known variance = 0.01 and lengthscales = 0.3. One of the most natural properties of the GP posterior is the increase in uncertainty (standard deviation) in the regions between training data points.
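A minimal GPflow sketch of this baseline, assuming `X_train` and `Y_train` are `(N, 1)` float64 arrays holding the 1D dataset:

```python
import numpy as np
import gpflow

# Matern52 kernel with the variance and lengthscales quoted above.
kernel = gpflow.kernels.Matern52(variance=0.01, lengthscales=0.3)
# Treat the kernel parameters as known (fixed), as described above.
gpflow.utilities.set_trainable(kernel, False)

model = gpflow.models.GPR(data=(X_train, Y_train), kernel=kernel)

# Fit the remaining hyperparameters (the likelihood noise) with L-BFGS.
gpflow.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, options=dict(maxiter=100)
)

# Predictive mean and variance on a dense grid; the std grows between the clusters.
X_test = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)
mean, var = model.predict_y(X_test)
```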
For all experiments we used a multi-layer fully-connected ReLU network with 50 hidden units in each hidden layer. We assume that the conditional distribution of the target is Gaussian, $y \mid x \sim \mathcal{N}\left(f(x), \sigma^2\right)$, where the noise scale $\sigma$ is constant for all observations and its value is provided as ground truth. The prior mean is set to zero for all parameters, and the prior standard deviation of the biases is set to one. Suppose that layer $l$ has $N_l$ inputs and $K_l$ outputs. For each layer $l$ we used $\sigma_l / \sqrt{N_l}$ as the prior standard deviation of each weight, where the array $(\sigma_1, \dots, \sigma_L)$ contains one prior scale per depth of the BNN. We will describe our choice of $\sigma_l$ for each experiment; the original paper uses a single fixed value. Following [Tomczak et al., 2018], we initialize the bias means to zero and the bias standard deviations to one; the weight standard deviations are initialized to a fixed value and the weight means are initialized as independent samples drawn separately for each layer $l$.
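A minimal Pyro sketch of a one-hidden-layer BNN with this kind of prior; the concrete `sigma_w` and `sigma_noise` values below are placeholders, not the exact ones used in our experiments:

```python
import torch
import pyro
import pyro.distributions as dist

def bnn_model(x, y=None, hidden=50, sigma_w=1.0, sigma_noise=0.1):
    n_in = x.shape[1]
    # Hidden layer: weight prior std scales as sigma_w / sqrt(fan_in), bias prior std is 1.
    w1 = pyro.sample("w1", dist.Normal(torch.zeros(n_in, hidden),
                                       sigma_w / n_in ** 0.5).to_event(2))
    b1 = pyro.sample("b1", dist.Normal(torch.zeros(hidden), 1.0).to_event(1))
    h = torch.relu(x @ w1 + b1)
    # Output layer with the same prior scaling.
    w2 = pyro.sample("w2", dist.Normal(torch.zeros(hidden, 1),
                                       sigma_w / hidden ** 0.5).to_event(2))
    b2 = pyro.sample("b2", dist.Normal(torch.zeros(1), 1.0).to_event(1))
    mean = (h @ w2 + b2).squeeze(-1)
    # Gaussian likelihood with constant, known noise scale.
    with pyro.plate("data", x.shape[0]):
        pyro.sample("obs", dist.Normal(mean, sigma_noise), obs=y)
    return mean
```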
We use a BNN with 1 hidden layer. We run NUTS with 5 parallel MCMC chains, drawing 300 samples from each chain for the posterior estimate after 300 warmup samples. The resulting prediction is based on an ensemble of these 1500 sets of weights generated by NUTS. In Pyro we set the random seed with pyro.set_rng_seed(1) before BNN initialization. We compare our result with a simple deterministic neural network of the same architecture, trained with the Adam optimizer, MSE loss and num epochs = 1000. We see that the deterministic network tends to fit the data worse than the Bayesian one, and the Bayesian setting gives smoother predictions. Results are shown in the 2 figures: the top figure for the NUTS method and the bottom figure for the deterministic neural network. It can be seen that the uncertainty is higher in the region between the two clusters.
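A sketch of this NUTS setup in Pyro, reusing the hypothetical `bnn_model` from the snippet above and assumed tensor names `x_train`, `y_train`, `x_test`:

```python
import pyro
from pyro.infer import MCMC, NUTS, Predictive

pyro.set_rng_seed(1)

nuts_kernel = NUTS(bnn_model)
mcmc = MCMC(nuts_kernel, num_samples=300, warmup_steps=300, num_chains=5)
mcmc.run(x_train, y_train)

posterior_samples = mcmc.get_samples()            # 5 chains x 300 samples = 1500 weight sets
predictive = Predictive(bnn_model, posterior_samples)
preds = predictive(x_test)["obs"]                 # ensemble predictions on an assumed test grid
```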
In this experiment we train a 1-layer BNN with the same setup using the MFVI approximation family with the ELBO loss. We estimate the ELBO with only one sample (num particles = 1), as we discovered that this speeds up convergence while also reducing the computation time per sample. We trained it using the SVI class from the Pyro library with the Adam optimizer, num epochs = 30000 and a batch size equal to the whole dataset size. We set the random seed with pyro.set_rng_seed(1) before training. First, we show results for different choices of the prior scale (from top to bottom). We can see that the plain approximator is sensitive to the choice of prior scale.
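The training loop just described can be sketched as follows; we use Pyro's `AutoDiagonalNormal` as a stand-in for the mean-field Gaussian guide, and the learning rate is an assumed value:

```python
import pyro
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam

pyro.set_rng_seed(1)

guide = AutoDiagonalNormal(bnn_model)            # fully factorized Gaussian over the weights
svi = SVI(bnn_model, guide, Adam({"lr": 1e-3}),  # learning rate is an assumed value
          loss=Trace_ELBO(num_particles=1))      # single-sample ELBO estimate

for epoch in range(30000):                       # full-batch training
    loss = svi.step(x_train, y_train)
```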
We see that the optimization process can fail for a very small prior weight standard deviation: the loss curve is smooth and has no distinct steps, which indicates that only the uncertainty is being fitted, not the mean prediction of the source data. The first picture shows that the neural network cannot describe the data well, although the optimization process has converged. We show the loss curves for these two cases (from top to bottom).
We emphasize that the Local Reparametrization Trick [Kingma et al., 2015] was used in the original paper. It is believed to simplify training due to smaller covariances between the gradients within one batch and, more importantly, it makes computations more efficient. We implemented this method from scratch. We demonstrate our results for different choices of the prior scale (from top to bottom) with the same setup as for the custom model.
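A minimal PyTorch sketch of a linear layer with the local reparameterization trick; the softplus parameterization and the initialization constants are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReparamLinear(nn.Module):
    """Samples pre-activations instead of weights, reducing gradient variance in a batch."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(in_features, out_features) * 0.1)
        self.w_rho = nn.Parameter(torch.full((in_features, out_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)           # element-wise positive stds
        b_sigma = F.softplus(self.b_rho)
        act_mu = x @ self.w_mu + self.b_mu         # mean of the pre-activations
        act_var = x.pow(2) @ w_sigma.pow(2) + b_sigma.pow(2)
        eps = torch.randn_like(act_mu)
        return act_mu + act_var.sqrt() * eps       # sample the activations directly
```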
We see that the Local Reparametrization Trick makes training far less sensitive to the choice of prior scale. Even though it can make the optimization process more robust, it definitely gives us less control. We also compare deeper models: we consider the same architecture for the custom model and for the Local Reparametrization Trick, but with 2 layers. The results are presented in the following figures: the top figure corresponds to the Local Reparametrization Trick and the bottom figure to the custom model.
Even though it is easier for the deeper models to fit the data, there is no significant change in the uncertainty estimation. We emphasize that the usual notion of stacking layers to boost the model's capacity does not apply here, so we should keep looking for other approximation techniques.
Our synthetic 2D classification dataset, which we treat as a regression problem, is the following:

We trained a 2-layer BNN with the same NUTS setup as for the 1D regression task and ended up with the posterior depicted in the following figures. Unfortunately, we did not obtain good results for this case. The overall scale of the dataset is the same as in the regression task, so it seems possible for the Bayesian model to fit the data given correct hyperparameters, just not for this particular set of hyperparameters.

We now present our synthetic 2D regression dataset. Consider two clusters of points, with 100 points in each cluster drawn from a normal distribution around its center with a fixed standard deviation. These points are the input variables for our model. The target is simply the evaluation of a known function at these points. Our objective is the uncertainty (standard deviation or variance) predicted by the model on an evaluation grid.
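A sketch of how such a dataset can be generated; the cluster centers, spread and target function below are placeholder assumptions, not the exact values from our experiments:

```python
import numpy as np

rng = np.random.RandomState(42)
centers = np.array([[-1.0, -1.0], [1.0, 1.0]])      # assumed cluster centers
std = 0.1                                            # assumed cluster spread

# 100 input points per cluster and a placeholder target function evaluated at them.
X = np.vstack([c + std * rng.randn(100, 2) for c in centers])
y = np.sin(X[:, 0]) + np.cos(X[:, 1])

# Uncertainty is then evaluated on a dense grid covering and surrounding both clusters.
grid = np.stack(np.meshgrid(np.linspace(-2, 2, 100),
                            np.linspace(-2, 2, 100)), axis=-1).reshape(-1, 2)
```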
In these experiments we used our own implementation of BNNs with the MFVI and MCDO approximation families, based only on the PyTorch framework. We implemented both the BNNs and the losses from scratch.
This model is analogous to the original one, with the only distinction that the conditional variance is unknown and non-constant (it was constant in the previous examples). Namely, we assume that the conditional distribution of the target is given by $y \mid x \sim \mathcal{N}\left(\mu(x), \sigma^2(x)\right)$, i.e. the variance is predicted by the network. In this model the uncertainty is measured by the predicted $\sigma(x)$ and not by the spread of the sampled mean predictions as in the previous cases. In this case the BNN has 2 outputs: the first is the mean $\mu(x)$ of the normal distribution and the second is a raw parameter that determines $\sigma(x)$. We will describe how this output is connected with $\sigma(x)$ in the section Custom ELBO loss.
We used the following formulas for the forward method in BNNs when we sample the weights of the current layer $l$:

$$W_l = M_l + \Sigma_l \odot E_l, \qquad b_l = m_l + s_l \odot e_l,$$

where
- $M_l$ is a matrix of size $N_l \times K_l$ of learned means for the elements of the weight matrix $W_l$;
- $P_l$ is a matrix of learned parameters of the same size that determines the standard deviations for the weight matrix $W_l$ (the matrix $\Sigma_l$ is obtained element-wise from the matrix $P_l$);
- $E_l$ is a matrix of the same size whose elements are all sampled independently from $\mathcal{N}(0, 1)$.

The same notation is used for the biases:
- $m_l$ is a vector of size $K_l$ of learned means for the elements of the bias vector $b_l$;
- $p_l$ is a vector of learned parameters of the same size that determines the standard deviations for the bias vector $b_l$ (the vector $s_l$ is obtained element-wise from the vector $p_l$);
- $e_l$ is a vector of size $K_l$ whose elements are all sampled independently from $\mathcal{N}(0, 1)$.
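A minimal PyTorch sketch of one such layer; the softplus mapping from the raw parameters to standard deviations and the initialization constants are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer whose weights are sampled as W = M + softplus(P) * E on every forward pass."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.M = nn.Parameter(torch.randn(in_features, out_features) * 0.1)   # weight means
        self.P = nn.Parameter(torch.full((in_features, out_features), -3.0))  # weight std params
        self.m = nn.Parameter(torch.zeros(out_features))                      # bias means
        self.p = nn.Parameter(torch.full((out_features,), -3.0))              # bias std params

    def forward(self, x):
        W = self.M + F.softplus(self.P) * torch.randn_like(self.P)  # sample weights
        b = self.m + F.softplus(self.p) * torch.randn_like(self.p)  # sample biases
        return x @ W + b
```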
During testing, for the model with constant conditional variance, we compute the predictive standard deviation of the BNN output as a Monte Carlo estimate over 128 weight samples, following [Y. K. Foong et al, 2018]. For the model with unknown conditional variance (also during testing) we compute the predictive standard deviation as a Monte Carlo estimate over 128 samples that also accounts for the predicted variance. For testing, the BNN mean output is likewise computed as a Monte Carlo estimate over 128 samples. During training these objectives were estimated using 32 Monte Carlo samples, following [Y. K. Foong et al, 2018].
We implemented an ELBO loss with mini-batch computation. The ELBO

$$\mathcal{L} = \mathbb{E}_{q(w)}\big[\log p(Y \mid X, w)\big] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big)$$

can be rewritten for a mini-batch $B$ as

$$\mathcal{L} \approx \frac{N}{|B|}\sum_{i \in B} \mathbb{E}_{q(w)}\big[\log p(y_i \mid x_i, w)\big] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big),$$

where $N$ is the size of the train dataset.
The expectation of the log-density part of the ELBO loss was estimated using 32 Monte Carlo samples during training, following [Y. K. Foong et al, 2018]. For variance-prediction models we used a trick for numerical stability: when computing the Monte Carlo estimate of the log-density, we obtain the standard deviation prediction from the second BNN output through a positive element-wise transformation with a small constant added to it; the same constant was used in all experiments.
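A sketch of the resulting mini-batch loss (the negative of the batched ELBO estimate); `kl_divergence(model)` and the two-output head are assumptions about our from-scratch implementation:

```python
import torch
import torch.nn.functional as F

def elbo_loss(model, kl_divergence, x_batch, y_batch, n_train, n_mc=32, eps=1e-6):
    """Negative batched ELBO: MC estimate of the log-likelihood rescaled by N/|B| minus (-KL)."""
    log_liks = []
    for _ in range(n_mc):
        out = model(x_batch)                      # fresh weight sample on every forward call
        mean, rho = out[:, 0], out[:, 1]          # two outputs: mean and raw std parameter
        std = F.softplus(rho) + eps               # positive std with a numerical-stability floor
        log_liks.append(torch.distributions.Normal(mean, std).log_prob(y_batch).sum())
    exp_log_lik = torch.stack(log_liks).mean()
    # Rescale the mini-batch log-likelihood to the full dataset and add the KL penalty.
    return -(n_train / x_batch.shape[0]) * exp_log_lik + kl_divergence(model)
```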
We trained BNNs on the 2D regression dataset described above for 1000 epochs with the Adam optimizer. We present results for BNNs with 1 and 2 layers in 2 modes: with constant conditional variance in the data model and with unknown conditional variance in the data model. We used the std estimation described in the section Details of BNNs implementation from scratch. First, we show results for 1-layer BNNs: the top figure corresponds to the model with constant conditional variance and the bottom figure to the model with unknown conditional variance.
The same results for 2 layers BNNs are presented in the following figures:
We see that for MFVI with constant conditional variance the in-between-clusters uncertainty is lower. However, the uncertainty function is not convex everywhere: convexity is preserved only along some line segments, as Theorem 1 in [Y. K. Foong et al, 2018] claims. We also see that for the 2-hidden-layer network the convexity is preserved over a smaller area than for the 1-hidden-layer network. This gives us an intuition that increasing the number of hidden layers may reduce the convexity of the uncertainty function.
It was shown in [Gal & Ghahramani, 2016] that maximizing the ELBO with the MCDO family is equivalent to minimizing

$$\mathcal{L}_{\text{dropout}} = \frac{1}{N} \sum_{i=1}^{N} \lVert y_i - \hat{y}_i \rVert^2 + \lambda \sum_{l=1}^{L} \left( \lVert W_l \rVert_2^2 + \lVert b_l \rVert_2^2 \right),$$

where $y$ is the vector of all target values in the training dataset, $X$ is the matrix of input variables, $\hat{y}$ is the vector of expectations of the BNN predictions, $L$ is the number of fully-connected layers, $W_l$ and $b_l$ are the weights and biases of the $l$-th layer, and $\lambda$ is a properly chosen weight-decay coefficient. It was also shown that, in order to treat Dropout as Bayesian inference, $\lambda$ should be chosen by the formula

$$\lambda = \frac{p\, l^2 \sigma^2}{2 N},$$

where $p$ is the dropout probability, $l^2$ is the reciprocal of the prior variance for the weights in the first fully-connected layer, $N$ is the size of the training data and $\sigma^2$ is the conditional variance of $y \mid x$. Following [Y. K. Foong et al, 2018], we used the dropout probability from that paper; for $\sigma^2$ we used the ground-truth conditional variance, and for $l$ we follow the notation of [Gal, 2016].
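A sketch of the MCDO setup in PyTorch; the dropout probability and the numeric weight-decay value below are placeholders for the $\lambda$ obtained from the formula above:

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    """One-hidden-layer ReLU network with dropout kept active at test time."""

    def __init__(self, in_features, hidden=50, p=0.05):   # p is a placeholder value
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

model = MCDropoutNet(in_features=2)
weight_decay = 1e-4                    # placeholder for lambda = p * l^2 * sigma^2 / (2N)
optimizer = torch.optim.Adam(model.parameters(), weight_decay=weight_decay)
criterion = nn.MSELoss()

# At test time keep dropout on (model.train()) and average many stochastic forward passes
# to obtain the predictive mean and standard deviation.
```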
In this experiment we train BNNs on the 2D regression dataset for 1000 epochs with the Adam optimizer, using the MCDO approximation family with the MCDO loss discussed above. We used the std estimation described in the section Details of BNNs implementation from scratch. We show results for BNNs with 1 and 2 layers, from top to bottom.
We can observe that convexity is preserved in both cases.
Now we show results for the regression task from the section Regression 2D dataset using Gaussian Processes from sklearn.gaussian_process. We used GaussianProcessRegressor with a composite two-component kernel and the n_restarts_optimizer=9 parameter.
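A sketch of this sklearn setup; the exact kernel composition and its parameters from our run are not reproduced here, so `ConstantKernel * RBF` is an assumption, and `X`, `y`, `grid` are the assumed dataset and evaluation-grid arrays from the dataset sketch above:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Composite kernel; hyperparameters are refined by the optimizer restarts.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gpr.fit(X, y)

# Predictive std on the evaluation grid; its level lines are shown in the figures.
mean, std = gpr.predict(grid, return_std=True)
```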
We see that the level lines of uncertainty are not convex and display the desirable property: the further away we are from the training points, the less confident we are in the prediction. There are no problems with in-between uncertainty or convexity in general.
In this section we test the hypothesis that the poor uncertainty prediction is caused by an insufficiently expressive neural network architecture. For this task we try to fit BNNs to given functions that play the role of the mean and variance functions. We consider a mean function and a variance function defined on a segment. We fit different networks using the following loss function:

Our train dataset consists of points sampled as in [Y. K. Foong et al, 2018], together with the values of these functions at those points. The points were randomly sampled from a uniform grid of 1000 points on the segment.
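Under the assumption that this loss penalizes the squared deviation of the BNN's Monte Carlo predictive mean and standard deviation from the target functions, a sketch could look like:

```python
import torch

def mean_variance_loss(model, x, mu_target, sigma_target, n_mc=32):
    """Fit the predictive mean/std of a stochastic model to target functions mu(x), sigma(x)."""
    preds = torch.stack([model(x).squeeze(-1) for _ in range(n_mc)])  # (n_mc, N), fresh weights each pass
    pred_mean = preds.mean(dim=0)
    pred_std = preds.std(dim=0)
    return ((pred_mean - mu_target) ** 2).mean() + ((pred_std - sigma_target) ** 2).mean()
```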
We train BNNs for 6000 epochs with the Adam optimizer, using the MFVI approximation family with the loss from the section Mean/variance functions and loss. We present results for BNNs with 1 and 2 layers in the mode with constant conditional variance. We used the std and mean estimations described in the section Details of BNNs implementation from scratch. We show results for 1-layer and 2-layer BNNs: the top 2 figures correspond to the BNN with 1 layer and the bottom 2 figures to the BNN with 2 layers. The horizontal axis shows the indexes of the points in the grid on the segment.
We see that the 1-layer MFVI BNN predicts the mean relatively well, but not the variance. This means that a one-layer network architecture is not powerful enough. However, for the 2-layer MFVI BNN both the mean and the variance predictions are quite good. This confirms Theorem 3 in [Y. K. Foong et al, 2018].
The same conclusion can be drawn for the MCDO approximation family. We also trained MCDO BNNs for 6000 epochs with the Adam optimizer and the loss from the section Mean/variance functions and loss. We used the std and mean estimations described in the section Details of BNNs implementation from scratch for the mode with constant conditional variance. We show results for 1-layer and 2-layer BNNs: the top 2 figures correspond to the BNN with 1 layer and the bottom 2 figures to the BNN with 2 layers. The horizontal axis shows the indexes of the points in the grid on the segment.
We have run the experiments on Linux. The versions are given in brackets. The following packages are used in the implementation:
- PyTorch (1.4.0)
- NumPy (1.17.3)
- scikit-learn (0.22.1)
- matplotlib (3.1.2)
- tqdm (4.39.0)
- Pyro (1.3.1)
- GPflow (2.0.4)
- TensorFlow (2.1.0) as dependency for GPflow
You can use pip or conda to install them.
All the experiments can be found in the underlying notebooks:
Notebook | Description |
---|---|
notebooks/develop.ipynb | HMC, Local Reparametrization Trick, prior tuning: experiments with HMC, Local Reparametrization Trick and prior tuning for MFVI BNNs. |
notebooks/Experiments.ipynb | 2D regression, MCDO and MFVI from scratch: experiments with our own implementation of losses and training BNNs using PyTorch. |
notebooks/BNN_start.ipynb | MNIST, GPflow and MVFI BNNs using Pyro: experiments with MNIST dataset, 1D regression task with GPflow and MFVI BNNs using Pyro primitives and pyro.nn.Module. |
For convenience, we have also implemented a small framework, located in bnn-vi/bnn-vi, bnn-vi/notebooks/api and bnn-vi/notebooks/Models_and_losses.py.
At the moment we are Skoltech Data Science MSc students (2019-2021).
- Artemenkov Aleksandr
- Karpikov Igor
- Selikhanovych Daniil