bnn-vi

This repository contains experiments based on the paper by Y. K. Foong et al., 2018. The authors presented results on Variational Inference in Bayesian neural networks for the fundamental problem of uncertainty quantification, and demonstrated problems with the most popular approximate inference families: Mean Field Variational Inference and Monte Carlo Dropout.

What's inside?

Here you can find a brief description of the experiments implemented in this repository.

Regression 1D dataset

Regression dataset

We start from our synthetic 1D regression dataset:

The random seed was fixed for both the data generation and the model. Unfortunately, the authors do not write anything about the regression function or the data generation process, so we made these choices in our experiments on our own. The task is to recover the unknown regression function and its uncertainty using 4 methods: Gaussian Processes (GP), Hamiltonian Monte Carlo (HMC), and Variational Inference with 2 approximate families: Mean Field Variational Inference (MFVI) and Monte Carlo Dropout (MCDO).

Baseline 1: Gaussian Processes

Gaussian Processes

In this experiment we fit a GP using the GPflow library. We trained the GP with maxiter = 100 and the following parameters: a Matern52 kernel with known variance (variance = 0.01) and lengthscales = 0.3. One of the most natural properties of the GP is the increase in uncertainty (standard deviation) in the regions between training data points.
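A minimal sketch of this baseline, assuming GPflow 2.x; the training data below are placeholders standing in for the 1D dataset described above:

```python
import gpflow
import numpy as np

X = np.random.rand(20, 1)                               # placeholder inputs
Y = np.sin(6 * X) + 0.1 * np.random.randn(20, 1)        # placeholder targets
X_test = np.linspace(0, 1, 200).reshape(-1, 1)

# Matern52 kernel with the hyperparameters quoted above.
kernel = gpflow.kernels.Matern52(variance=0.01, lengthscales=0.3)
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

# L-BFGS optimization with maxiter = 100, as in the text.
gpflow.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, options=dict(maxiter=100))

# Predictive mean and variance on the test grid.
mean, var = model.predict_y(X_test)
```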

Bayesian neural network architecture and setup

For all experiments we used a multi-layer fully-connected ReLU network with 50 hidden units in each hidden layer. We assume that the conditional distribution of the target is Gaussian, $y \mid x \sim \mathcal{N}\big(f(x), \sigma^2\big)$, where the noise level $\sigma$ is constant for all observations and is the value provided as ground truth. The prior mean is set to zero for all parameters, and the prior standard deviation of the biases is set to one. For a layer $l$ with $N_l$ inputs, the prior standard deviation of each weight is $\sigma_{w,l}/\sqrt{N_l}$, where the array $\{\sigma_{w,l}\}$ holds one value per layer of the BNN; our choice of these values is described separately for each experiment. The original paper uses a fixed value of $\sigma_{w,l}$. Following [Tomczak et al., 2018], we initialize the variational bias means to zero and their standard deviations to one; the variational weight standard deviations are initialized to a fixed value, and the weight means of layer $l$ are drawn as independent samples from a layer-dependent initialization distribution.
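As an illustration (not the repository's exact code), a 1-hidden-layer BNN with these priors can be written in Pyro roughly as follows; sigma_w and sigma_noise are placeholder values, and the [Tomczak et al., 2018] initialization is omitted:

```python
import torch
import torch.nn as nn
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class BNN(PyroModule):
    def __init__(self, in_dim=1, hidden=50, sigma_w=1.0, sigma_noise=0.1):
        super().__init__()
        # Priors: zero mean everywhere, bias std 1, weight std sigma_w / sqrt(fan_in).
        self.fc1 = PyroModule[nn.Linear](in_dim, hidden)
        self.fc1.weight = PyroSample(
            dist.Normal(0., sigma_w / in_dim ** 0.5).expand([hidden, in_dim]).to_event(2))
        self.fc1.bias = PyroSample(dist.Normal(0., 1.).expand([hidden]).to_event(1))
        self.out = PyroModule[nn.Linear](hidden, 1)
        self.out.weight = PyroSample(
            dist.Normal(0., sigma_w / hidden ** 0.5).expand([1, hidden]).to_event(2))
        self.out.bias = PyroSample(dist.Normal(0., 1.).expand([1]).to_event(1))
        self.sigma_noise = sigma_noise

    def forward(self, x, y=None):
        h = torch.relu(self.fc1(x))
        mean = self.out(h).squeeze(-1)
        # Gaussian likelihood with constant, known noise level.
        with pyro.plate("data", x.shape[0]):
            pyro.sample("obs", dist.Normal(mean, self.sigma_noise), obs=y)
        return mean
```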

Baseline 2: Hamiltonian Monte Carlo

Hamiltonian Monte Carlo

Deterministic neural network

We use a BNN with 1 hidden layer. We run NUTS with 5 parallel MCMC chains, 300 samples from each chain for the distribution estimation and 300 samples for warmup. The resulting prediction is based on an ensemble of these 1500 NUTS-generated sets of weights. In Pyro we set the random seed with pyro.set_rng_seed(1) before the BNN initialization. We compare our result with training a simple deterministic neural network with the same architecture. For this NN we used the Adam optimizer, MSE loss and num epochs = 1000. We see that the deterministic network tends to fit the data worse than the Bayesian one, and the Bayesian setting gives smoother results. Results are shown in the 2 figures: the top figure for the NUTS method and the bottom figure for the deterministic neural network. It can be seen that the uncertainty is higher in the region between the two clusters.
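A Pyro sketch of this NUTS setup, reusing the hypothetical BNN model sketched earlier; the data tensors are placeholders for the 1D dataset above:

```python
import torch
import pyro
from pyro.infer import MCMC, NUTS

pyro.set_rng_seed(1)
x_train = torch.randn(100, 1)        # placeholders; the real 1D dataset is described above
y_train = torch.randn(100)

model = BNN()                        # the 1-hidden-layer Pyro sketch shown earlier
kernel = NUTS(model)
mcmc = MCMC(kernel, num_samples=300, warmup_steps=300, num_chains=5)
mcmc.run(x_train, y_train)
posterior = mcmc.get_samples()       # 5 chains x 300 samples = 1500 weight sets
```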

Variational inference: MFVI

Custom model

In this experiment we train a 1-layer BNN with the same setup using the MFVI approximation family with the ELBO loss. We estimate the ELBO with only one sample (num particles = 1), as we discovered that this speeds up convergence while also reducing the computation time per sample. We trained it using the SVI class of the Pyro library with the Adam optimizer, for num epochs = 30000 and a batch size equal to the whole dataset size. We set the random seed with pyro.set_rng_seed(1) before the training process. First, we show results for different prior choices (prior weight standard deviations 0.1, 1 and 10), from top to bottom. We can see that the plain approximator is sensitive to the prior scale.
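A minimal Pyro sketch of this training loop, reusing the hypothetical BNN model above; the learning rate is a placeholder since it is not stated in the text:

```python
import torch
import pyro
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam

pyro.set_rng_seed(1)
x_train = torch.randn(100, 1)                  # placeholders for the 1D dataset above
y_train = torch.randn(100)

model = BNN()                                  # the hypothetical 1-hidden-layer sketch above
guide = AutoDiagonalNormal(model)              # mean-field Gaussian approximation
svi = SVI(model, guide, Adam({"lr": 1e-3}),    # the learning rate is a placeholder
          loss=Trace_ELBO(num_particles=1))

for epoch in range(30000):                     # batch size = the whole dataset
    svi.step(x_train, y_train)
```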

Prior weight standard deviation 0.1

Prior weight standard deviation 1

Prior weight standard deviation 10

We see that the optimization process can be unsuccessful for a very small prior weight standard deviation: the loss curve is smooth and has no step-like drops, which indicates that the model only fits the uncertainty and not the mean of the source data. The first picture shows that the neural network cannot describe the data well, although the optimization process has converged. We show the loss curves for the prior weight standard deviations 0.1 and 10, from top to bottom.

Loss for prior weight standard deviation 0.1

Loss for prior weight standard deviation 10

Model with Local Reparametrization Trick

We emphasize that the Local Reparametrization Trick [Kingma et al., 2015] was used in the original paper. It is believed to simplify the training process due to smaller covariances between the gradients within one batch and, more importantly, it makes computations more efficient. We implemented this method from scratch. We demonstrate our results for different prior choices (from top to bottom) with the same setup as for the custom model.
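A minimal PyTorch sketch of a mean-field linear layer with the local reparameterization trick; the softplus parameterization and the initialization constants are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReparamLinear(nn.Module):
    """Mean-field Gaussian linear layer using the local reparameterization
    trick [Kingma et al., 2015]: pre-activations are sampled instead of weights."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.randn(in_features, out_features) * 0.1)
        self.weight_rho = nn.Parameter(torch.full((in_features, out_features), -5.0))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_rho = nn.Parameter(torch.full((out_features,), -5.0))

    def forward(self, x):
        w_sigma = F.softplus(self.weight_rho)
        b_sigma = F.softplus(self.bias_rho)
        # Exact Gaussian over pre-activations; only the activations are sampled.
        act_mu = x @ self.weight_mu + self.bias_mu
        act_var = (x ** 2) @ w_sigma ** 2 + b_sigma ** 2
        eps = torch.randn_like(act_mu)
        return act_mu + act_var.sqrt() * eps
```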

Prior weight standard deviation 0.1 with trick

Prior weight standard deviation 1 with trick

Prior weight standard deviation 10 with trick

We see that the Local Reparametrization Trick makes the result largely independent of the prior scale selection. Even though it can make the optimization process more robust, it definitely gives us less control. We also compare deeper models: consider the same architecture for the custom model and for the Local Reparametrization Trick, but with 2 layers. The results are presented in the following figures: the top figure corresponds to the Local Reparametrization Trick with prior weight standard deviation 1, and the bottom figure corresponds to the custom model with prior weight standard deviation 10.

Prior weight standard deviation 1 with trick for 2 layers

Prior weight standard deviation 10 for 2 layers

Even though it is easier for these deeper models to fit the data, there is no significant change in the uncertainty estimation. We emphasize that the usual notion of stacking layers to boost the model's complexity does not apply here, so we should keep looking for other approximation techniques.

Classification 2D dataset and HMC

Our synthetic 2D classification dataset via regression is the following:

We trained a 2-layer BNN with the same NUTS setup as for the 1D regression task. We end up with the posterior depicted in the following figures:

Classification HMC pool

Classification HMC contourf

Unfortunately, we did not obtain good results in this case. The overall scale of the dataset is the same as in the regression task, so it seems possible for the Bayesian model to fit the data given correct hyperparameters, but not with this set of hyperparameters.

Regression 2D dataset

We present our synthetic 2D regression dataset. Consider two clusters of 100 points each, drawn from normal distributions centered at two fixed points with a common standard deviation. These points are the input variables for our model. The target is simply the evaluation of a fixed function at these points. Our objective is the uncertainty (standard deviation or variance) predicted by the model over the evaluation set (a grid covering the input region).
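For illustration only, a dataset of this shape could be generated as follows; the cluster centers, the noise scale and the target function below are placeholders, not the values actually used:

```python
import torch

torch.manual_seed(0)
# Placeholders: the true centers, noise scale and target function are described above
# only qualitatively, so the concrete values here are illustrative.
center_1 = torch.tensor([-1.0, -1.0])
center_2 = torch.tensor([1.0, 1.0])
noise_std = 0.1
X = torch.cat([center_1 + noise_std * torch.randn(100, 2),
               center_2 + noise_std * torch.randn(100, 2)])
target_fn = lambda x: (x ** 2).sum(dim=-1)     # placeholder target function
y = target_fn(X)
```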

Variance prediction from model and losses from scratch

In these experiments we used our own implementation of BNNs with the MFVI and MCDO approximation families, based only on the PyTorch framework. We had to implement both the BNNs and the losses from scratch.

Model with unknown conditional variance

This model is analogous to the original one, with the only distinction that the conditional variance is unknown and non-constant (in the previous examples it was a fixed $\sigma^2$). Namely, we assume that the conditional distribution of the target is $y \mid x \sim \mathcal{N}\big(\mu_\theta(x), \sigma_\theta^2(x)\big)$, i.e. the variance is predicted by the network. In this model the uncertainty is measured by the predicted $\sigma_\theta(x)$ and not by the spread of the mean predictions as in the previous cases. The BNN therefore has 2 outputs: the first is the mean of the normal distribution and the second is a raw value $\rho(x)$ from which $\sigma_\theta(x)$ is obtained. We describe how $\rho(x)$ is connected with $\sigma_\theta(x)$ in the section ELBO loss from scratch.

Details of BNNs implementation from scratch

We used the following formulas in the forward method of our BNNs when sampling the weights of the current layer $l$:

$$W_l = M_l + \sigma(P_l) \odot E_l, \qquad b_l = m_l + \sigma(p_l) \odot e_l, \qquad \text{output} = x\,W_l + b_l,$$

where $M_l$ is the matrix of learned means for the elements of the weight matrix $W_l$; $P_l$ is a matrix of learned parameters of the same size that determines the standard deviations of $W_l$ (the matrix $\sigma(P_l)$ is obtained element-wise from $P_l$); and $E_l$ is a matrix of the same size whose elements are sampled independently from a standard normal distribution. The same notation is used for the biases: $m_l$ is the vector of learned means for the bias vector $b_l$; $p_l$ is a vector of learned parameters that determines its standard deviations (the vector $\sigma(p_l)$ is obtained element-wise from $p_l$); and $e_l$ is a vector of the same size whose elements are sampled independently from a standard normal distribution.
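For reference, a PyTorch sketch of such a layer that samples the weights directly according to the formulas above; the softplus link for $\sigma(\cdot)$ and the initialization values are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Gaussian linear layer that samples the weights directly."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.M = nn.Parameter(torch.randn(in_features, out_features) / in_features ** 0.5)
        self.P = nn.Parameter(torch.full((in_features, out_features), -5.0))
        self.m = nn.Parameter(torch.zeros(out_features))
        self.p = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        W = self.M + F.softplus(self.P) * torch.randn_like(self.M)   # W = M + sigma(P) * E
        b = self.m + F.softplus(self.p) * torch.randn_like(self.m)   # b = m + sigma(p) * e
        return x @ W + b
```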

During testing, for the model with constant conditional variance we estimate the standard deviation of the BNN output by Monte Carlo, using 128 samples of the network predictions, following [Y. K. Foong et al., 2018]. For the model with unknown conditional variance (also during testing) we estimate the standard deviation using 128 Monte Carlo samples of the predicted $\sigma_\theta(x)$. For testing we also use the mean BNN output, estimated by Monte Carlo with 128 samples. During training these objectives were estimated with 32 Monte Carlo samples, again following [Y. K. Foong et al., 2018].
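A sketch of the test-time Monte Carlo estimate for the constant-variance model, assuming the model re-samples its weights on every forward call (as in the layer sketch above):

```python
import torch

@torch.no_grad()
def predictive_mean_std(model, x, n_samples=128):
    # Monte Carlo estimate over weight samples.
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```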

ELBO loss from scratch

We implemented the ELBO loss with mini-batch computation:

$$\mathrm{ELBO}(q) = \mathbb{E}_{q(w)}\big[\log p(\mathbf{y} \mid X, w)\big] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big),$$

which for a mini-batch $\{(x_i, y_i)\}_{i=1}^{B}$ can be rewritten (in expectation) as

$$\mathrm{ELBO}(q) \approx \frac{N}{B} \sum_{i=1}^{B} \mathbb{E}_{q(w)}\big[\log p(y_i \mid x_i, w)\big] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big),$$

where $N$ is the size of the train dataset.

The expectation of the log-density term in the ELBO was estimated with 32 Monte Carlo samples during training, following [Y. K. Foong et al., 2018]. For variance-prediction models we used a trick for numerical stability: when computing the Monte Carlo estimate of the log-density, the predicted standard deviation is obtained from the second BNN output $\rho(x)$ through a softplus-style transform with a small additive constant, used in all experiments; for large values of $\rho(x)$ we rely on the linear regime of the transform to avoid overflow.
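A sketch of how such a mini-batch ELBO estimate can be written in PyTorch; the softplus-plus-epsilon transform and the kl_divergence() helper (assumed to sum the per-layer KL terms of the mean-field posterior against the prior) are our assumptions:

```python
import torch.nn.functional as F
import torch.distributions as D

def elbo_loss(model, x_batch, y_batch, n_train, n_mc=32, eps=1e-6):
    """Negative mini-batch ELBO: -(N/B * E_q[log p(y|x,w)] - KL(q||p)).
    model(x) is assumed to return (mean, rho) and to re-sample its weights
    on every call; sigma = softplus(rho) + eps is our assumed transform."""
    log_lik = 0.0
    for _ in range(n_mc):                      # 32 MC samples, as in the text
        mean, rho = model(x_batch)
        sigma = F.softplus(rho) + eps          # numerically stable positive std
        log_lik = log_lik + D.Normal(mean, sigma).log_prob(y_batch).sum()
    log_lik = log_lik / n_mc
    kl = model.kl_divergence()                 # hypothetical helper: sum of layer-wise KL terms
    return -(n_train / x_batch.shape[0] * log_lik - kl)
```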

MFVI using ELBO loss

We trained BNNs on the 2D regression dataset above for 1000 epochs with the Adam optimizer. We present results for BNNs with 1 and 2 layers in 2 modes: with constant conditional variance in the data model and with unknown conditional variance in the data model. We used the std estimation described in the section Details of BNNs implementation from scratch. First, we show results for 1-layer BNNs: the top figure corresponds to the model with constant conditional variance and the bottom figure to the model with unknown conditional variance.

1 HL BNN with constant conditional variance

1 HL BNN with unknown conditional variance

The same results for 2 layers BNNs are presented in the following figures:

2 HL BNN with constant conditional variance

2 HL BNN with unknown conditional variance

We see that for MFVI with constant conditional variance the in-between-clusters uncertainty is lower. However, we cannot see that the uncertainty function is convex: convexity is preserved only on some line segments, as Theorem 1 in [Y. K. Foong et al., 2018] claims. We also see that for the 2-hidden-layer network the convexity is preserved over a smaller area compared to the 1-hidden-layer network. This gives us an intuition that increasing the number of hidden layers may reduce the convexity of the uncertainty function.

MCDO loss from scratch

It was shown in [Gal & Ghahramani, 2016] that maximizing the ELBO with the MCDO family is equivalent to minimizing

$$\mathcal{L}_{\text{MCDO}} = \frac{1}{N}\,\lVert \mathbf{y} - \hat{\mathbf{y}}(X) \rVert_2^2 + \lambda \sum_{k=1}^{L}\Big(\lVert W_k \rVert_2^2 + \lVert b_k \rVert_2^2\Big),$$

where $\mathbf{y}$ is the vector of all target values in the training dataset, $X$ is the matrix of input variables, $\hat{\mathbf{y}}(X)$ is the vector of expectations of the BNN predictions, $L$ is the number of fully-connected layers, $W_k$ and $b_k$ are the weights and biases of the $k$-th layer, and $\lambda$ is a properly chosen weight-decay coefficient. It was also shown that, in order to treat Dropout as Bayesian inference, $\lambda$ should be chosen by the formula

$$\lambda = \frac{p\, l^2}{2 N \tau},$$

where $p$ is the dropout probability, $l^2$ is the reciprocal of the prior variance of the weights in the first fully-connected layer, $N$ is the size of the training data and $1/\tau$ is the conditional variance of $y$ given $x$. The concrete values of $p$, $l^2$ and $1/\tau$ follow [Y. K. Foong et al., 2018] and the notation of [Gal, 2016].
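A PyTorch sketch of this objective under the stated formula for the weight-decay coefficient; the helper names and the use of plain MSE for the data term are our assumptions:

```python
import torch.nn as nn

def mcdo_loss(model, x, y, p, inv_prior_var, n_train, cond_var):
    """MSE plus L2 penalty with weight decay lambda = p * l^2 / (2 * N * tau),
    where l^2 = inv_prior_var and 1/tau = cond_var; the helper names are ours."""
    tau = 1.0 / cond_var                               # model precision
    lam = p * inv_prior_var / (2.0 * n_train * tau)
    mse = ((model(x) - y) ** 2).mean()                 # dropout stays active during training
    l2 = sum((m.weight ** 2).sum() + (m.bias ** 2).sum()
             for m in model.modules() if isinstance(m, nn.Linear))
    return mse + lam * l2
```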

MCDO results

In this experiment we train BNNs on the 2D regression dataset for 1000 epochs with the Adam optimizer, using the MCDO approximation family with the MCDO loss discussed above. We used the std estimation described in the section Details of BNNs implementation from scratch. We show results for BNNs with 1 and 2 layers, from top to bottom.

1 HL MCDO BNN

2 HL MCDO BNN

We can observe that convexity is preserved in both cases.

Baseline for regression 2D dataset: Gaussian Processes

Now we show results for the regression task from the section Regression 2D dataset using Gaussian Processes from sklearn.gaussian_process. We used GaussianProcessRegressor with a composite kernel built from two base kernels and the n_restarts_optimizer=9 parameter.
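A sketch of this baseline with scikit-learn; the kernel shown (ConstantKernel * RBF) and its hyperparameters are placeholders, since the exact kernel parameters are not reproduced above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

X_train = np.random.randn(200, 2)               # placeholder 2D inputs
y_train = (X_train ** 2).sum(axis=1)            # placeholder targets
X_grid = np.random.randn(400, 2)                # placeholder evaluation points

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)   # placeholder kernel
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gpr.fit(X_train, y_train)
mean, std = gpr.predict(X_grid, return_std=True)
```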

Gaussian Processes 2D

We see that the level lines of uncertainty are not convex and display the desirable property: the further away we are from the training points, the less confident we are in the prediction. There are no problems with in-between uncertainty or convexity in general.

Learning BNNs for given mean and variance functions

In this section we test whether the poor uncertainty prediction is due to an improper (insufficiently expressive) neural network architecture or not. For this task we try to fit BNNs to prescribed functions that play the role of the mean and variance functions.

Mean/variance functions and loss

We consider a prescribed mean function and a prescribed standard deviation function on a fixed segment, and we fit different networks with a loss that matches the BNN predictive mean and variance to these target functions.
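The exact loss is not reproduced above; one natural reading, sketched here, is a squared-error match between the Monte Carlo estimates of the BNN predictive mean and standard deviation and the prescribed target functions:

```python
import torch

def matching_loss(model, x_grid, target_mean, target_std, n_mc=32):
    """Squared-error match of the BNN predictive mean/std (MC estimates over
    weight samples) to the prescribed target functions; the exact loss used
    in the repository may differ."""
    preds = torch.stack([model(x_grid) for _ in range(n_mc)])
    pred_mean, pred_std = preds.mean(dim=0), preds.std(dim=0)
    return ((pred_mean - target_mean) ** 2).mean() + ((pred_std - target_std) ** 2).mean()
```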

Train dataset

Our train dataset consists of points (the number follows [Y. K. Foong et al., 2018]) together with the values of the target functions at these points. The points were randomly sampled from a uniform grid of 1000 points on the segment.

MFVI training

We train BNNs for 6000 epochs with the Adam optimizer, using the MFVI approximation family with the loss from the section Mean/variance functions and loss. We present results for BNNs with 1 and 2 layers in the mode with constant conditional variance. We used the std and mean estimation described in the section Details of BNNs implementation from scratch. We show results for 1-layer and 2-layer BNNs: the top 2 figures correspond to the BNN with 1 layer and the bottom 2 figures to the BNN with 2 layers. The horizontal axis shows the indices of the grid points on the segment.

1 HL MFVI BNN mean prediction

1 HL MFVI BNN var prediction

2 HL MFVI BNN mean prediction

2 HL MFVI BNN var prediction

We see that the 1-layer MFVI BNN predicts the mean relatively well, but not the variance. This means that the one-layer network architecture is not powerful enough. However, for the 2-layer MFVI BNN both the mean and the variance predictions are quite good. This is consistent with Theorem 3 in [Y. K. Foong et al., 2018].

MCDO training

The same conclusion can be drawn for the MCDO approximation family. We also trained MCDO BNNs for 6000 epochs with the Adam optimizer and the loss from the section Mean/variance functions and loss. We used the std and mean estimation described in the section Details of BNNs implementation from scratch, in the mode with constant conditional variance. We show results for 1-layer and 2-layer BNNs: the top 2 figures correspond to the BNN with 1 layer and the bottom 2 figures to the BNN with 2 layers. The horizontal axis shows the indices of the grid points on the segment.

1 HL MCDO BNN mean prediction

1 HL MCDO BNN var prediction

2 HL MCDO BNN mean prediction

2 HL MCDO BNN var prediction

Requirements

We have run the experiments on Linux. The versions are given in brackets. The following packages are used in the implementation:

You can use pip or conda to install them.

Contents

All the experiments can be found in the underlying notebooks:

  • notebooks/develop.ipynb: HMC, Local Reparametrization Trick, prior tuning. Experiments with HMC, the Local Reparametrization Trick and prior tuning for MFVI BNNs.
  • notebooks/Experiments.ipynb: 2D regression, MCDO and MFVI from scratch. Experiments with our own implementation of losses and training of BNNs using PyTorch.
  • notebooks/BNN_start.ipynb: MNIST, GPflow and MFVI BNNs using Pyro. Experiments with the MNIST dataset, the 1D regression task with GPflow, and MFVI BNNs using Pyro primitives and pyro.nn.Module.

For convenience, we have also implemented a framework, located in bnn-vi/bnn-vi, bnn-vi/notebooks/api and bnn-vi/notebooks/Models_and_losses.py.

Our team

We are Skoltech Data Science MSc students (2019-2021).

  • Artemenkov Aleksandr
  • Karpikov Igor
  • Selikhanovych Daniil

About

This repository contains experiments from the papers https://arxiv.org/pdf/1909.00719.pdf and https://arxiv.org/pdf/1806.00667.pdf.
