Modeling states (ground or thermal) in computational physics requires calculating the partition function, an expression that scales exponentially with the number of constituents and is thus generally intractable. As a workaround, physicists commonly use probabilistic sampling methods, many of which are based on a Markov chain (MC). A major drawback of MC-based sampling is that the equilibration time required to generate uncorrelated samples can be very long.
Restricted Boltzmann Machines (RBMs) are a class of generative models with many appealing properties as tools for statistical physics, yet they too are burdened by an MC-like procedure called Gibbs sampling.
In this post, we discuss an alternative that can be used for state modeling: Neural Autoregressive Distribution Estimators (NADEs). The NADE is a generative model inspired by the RBM architecture, but unlike the RBM it does not employ an MC-based sampling method. Models for which the partition function need never be calculated, yet whose probability distribution can be sampled directly, are called autoregressive.
The probability of a sample's occurrence, as modelled by an RBM, requires the calculation of the partition function, which is intractable. Recall that for an RBM,
$$ Z = \sum_{\mathbf{h} \in \mathcal{H}_{\mathbf{h}}} \sum_{\mathbf{v} \in \mathcal{H}_{\mathbf{v}}} e^{-E(\mathbf{v},\mathbf{h})}, $$
and
$$ p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h} \in \mathcal{H}_{\mathbf{h}}} e^{-E(\mathbf{v},\mathbf{h})}, $$
where $E(\mathbf{v},\mathbf{h}) = -\mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h} - \mathbf{h}^{\top}\mathbf{W}\mathbf{v}$ is the RBM energy, $\mathcal{H}_{\mathbf{h}}$ and $\mathcal{H}_{\mathbf{v}}$ denote the sets of all hidden and visible configurations, and $\mathbf{b}$, $\mathbf{c}$, and $\mathbf{W}$ are the visible biases, hidden biases, and weights, respectively.
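To see the intractability concretely, here is a minimal NumPy sketch (with tiny, hypothetical sizes) that evaluates $Z$ by brute force; the sums run over all $2^{N} \cdot 2^{n_h}$ configurations, which is exactly what becomes impossible at realistic system sizes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, n_h = 4, 3                           # tiny, hypothetical sizes
W = rng.normal(size=(n_h, N))           # weights
b = rng.normal(size=N)                  # visible biases
c = rng.normal(size=n_h)                # hidden biases

def energy(v, h):
    """RBM energy E(v, h) = -b.v - c.h - h.W.v."""
    return -(b @ v + c @ h + h @ W @ v)

# Brute-force partition function: exponential in the number of units.
Z = sum(
    np.exp(-energy(np.array(v), np.array(h)))
    for v in itertools.product([0, 1], repeat=N)
    for h in itertools.product([0, 1], repeat=n_h)
)
print(Z)
```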
If we can write the probability distribution defined by the RBM as a product of tractable conditionals, like that of autoregressive models, then we can bypass calculating $Z$ altogether:
$$ p(\mathbf{v}) = \prod_{i} p(v_i \vert \mathbf{v}_{<i}) = \prod_{i} \frac{p(v_i, \mathbf{v}_{<i})}{p(\mathbf{v}_{<i})}. $$
However, neither $p(v_i, \mathbf{v}_{<i})$ nor $p(\mathbf{v}_{<i})$ is tractable. If we can approximate both quantities, then there might be instances where the above expression is tractable and we've made the RBM autoregressive.
Consider a mean-field approach for the approximation (recall that a mean-field approximation simply treats the variables as independent, e.g. $p(a,b) \approx p(a)p(b)$): approximate $p(v_i \vert \mathbf{v}_{<i})$ by finding a tractable approximation $q(v_i, \mathbf{v}_{>i}, \mathbf{h} \vert \mathbf{v}_{<i}) \approx p(v_i, \mathbf{v}_{>i}, \mathbf{h} \vert \mathbf{v}_{<i})$ such that $q(v_i \vert \mathbf{v}_{<i})$ is easily obtainable. In our mean-field approximation for $p(v_i, \mathbf{v}_{>i}, \mathbf{h} \vert \mathbf{v}_{<i})$,
$$ \begin{aligned} q(v_i, \mathbf{v}_{>i}, \mathbf{h} \vert \mathbf{v}_{<i}) &= q(v_i \vert \mathbf{v}_{<i}) \, q(\mathbf{v}_{>i} \vert \mathbf{v}_{<i}) \, q(\mathbf{h} \vert \mathbf{v}_{<i}) \\ &= q(v_i \vert \mathbf{v}_{<i}) \prod_{j > i} q(v_j \vert \mathbf{v}_{<i}) \prod_k q(h_k \vert \mathbf{v}_{<i}). \end{aligned} $$
Noting that the visible units are binary, each factor $q(v_j \vert \mathbf{v}_{<i})$ can be written as a Bernoulli distribution with mean $\mu_j(i)$. We may assume a similar form for the hidden-unit factors $q(h_k \vert \mathbf{v}_{<i})$, with means $\tau_k(i)$. Therefore,
$$ q(v_i, \mathbf{v}_{>i}, \mathbf{h} \vert \mathbf{v}_{<i}) = \mu_i(i)^{v_i} (1-\mu_i(i))^{1-v_i} \prod_{j > i} \mu_j(i)^{v_j} (1-\mu_j(i))^{1-v_j} \prod_k \tau_k(i)^{h_k} (1-\tau_k(i))^{1-h_k}, $$
with
$$ \begin{aligned} \mu_j(i) &= q(v_j = 1 \vert \mathbf{v}_{<i}) \approx p(v_j = 1 \vert \mathbf{v}_{<i}), \\ \tau_k(i) &= q(h_k = 1 \vert \mathbf{v}_{<i}) \approx p(h_k = 1 \vert \mathbf{v}_{<i}). \end{aligned} $$
But what are $\mu_j(i)$ and $\tau_k(i)$? Minimizing the KL divergence between $q$ and $p$ yields a set of self-consistent mean-field equations,
$$ \begin{aligned} \tau_k(i) &= \mathrm{sigm}\Big( c_k + \sum_{j < i} W_{kj} v_j + \sum_{j \geq i} W_{kj} \mu_j(i) \Big), \\ \mu_j(i) &= \mathrm{sigm}\Big( b_j + \sum_k W_{kj} \tau_k(i) \Big) \quad \text{for } j \geq i. \end{aligned} $$
Note that these expressions depend on their counterparts (i.e. the $\mu_j(i)$ depend on the $\tau_k(i)$ and vice versa), so they must be solved self-consistently: initialize the $\mu_j(i)$, then iterate the two updates until they converge.
Unfortunately, in making the RBM (approximately) autoregressive we also encounter computational bottlenecks. To determine a single conditional $p(v_i = 1 \vert \mathbf{v}_{<i}) \approx \mu_i(i)$, the mean-field equations must be iterated to convergence, and this has to be repeated for every site $i$ of every sample, which is far too expensive in practice.
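To make the bottleneck concrete, here is a minimal NumPy sketch (an illustration of the scheme above under the stated conventions, not code from any particular library) that iterates the two updates to estimate one conditional:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_conditional(v, i, W, b, c, n_iters=50):
    """Approximate p(v_i = 1 | v_{<i}) for a binary RBM by iterating the
    mean-field updates. W: (n_h, N) weights, b: (N,) visible biases,
    c: (n_h,) hidden biases, v: (N,) binary configuration."""
    N = len(b)
    mu = np.zeros(N)
    mu[:i] = v[:i]                              # clamp the observed visibles
    for _ in range(n_iters):
        tau = sigm(c + W @ mu)                  # hidden means given visible means
        mu[i:] = sigm(b[i:] + tau @ W[:, i:])   # unobserved visible means
    return mu[i]                                # ~ p(v_i = 1 | v_{<i})
```

Evaluating the full $p(\mathbf{v})$ this way means rerunning this inner loop once per site, for every configuration of interest.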
Let's build on the mean-field idea presented in the previous section. It's at this point, however, that the physical ties to the RBM vanish: the NADE architecture is merely "inspired by" this mean-field approximation of an RBM.
For one iteration with the $\mu_j(i)$ initialized to zero, the updates reduce to
$$ \begin{aligned} \tau_k(i) &= \mathrm{sigm}\Big( c_k + \sum_{j < i} W_{kj} v_j \Big), \\ \mu_i(i) &= \mathrm{sigm}\Big( b_i + \sum_k W_{ki} \tau_k(i) \Big). \end{aligned} $$
If you stare at this long enough, you'll realize that this is simply many feed-forward networks, each with one hidden layer and with weights shared across the networks! There are $N$ such networks, one for each conditional $p(v_i = 1 \vert \mathbf{v}_{<i})$.
It turns out to be computationally beneficial to train a separate set of weights connecting the hidden layers to the outputs of the networks, rather than reusing the shared weight matrix. Each of the $N$ networks then takes the form
$$ \begin{aligned} \mathrm{input} &= \mathbf{v}_{<i}, \\ \mathrm{activation} &= \mathbf{h}_i = \mathrm{sigm}(\mathbf{c} + \mathbf{W}_{\cdot, <i} \mathbf{v}_{<i}), \\ \mathrm{output} &= p(v_i = 1 \vert \mathbf{v}_{<i}) = \mathrm{sigm}(b_i + \mathbf{U}_{i,\cdot} \mathbf{h}_i). \end{aligned} $$
There is a lot of parameter reuse across the $N$ networks (a quick shape-checking sketch follows this list):
- For $N$ sites, there are $N$ hidden layers, each comprising $n_h$ neurons.
- $\mathbf{W}$ (an $n_h \times N$ matrix) and $\mathbf{c}$ are shared by all $N$ networks.
- $\mathbf{U}$ is $N \times n_h$.
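To make these shapes concrete, here is a minimal sketch (illustrative names and signature, assuming the conventions above) of evaluating one of the $N$ conditionals from the shared parameters:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditional(v, i, W, c, U, b):
    """The i-th 'network': p(v_i = 1 | v_{<i}).
    W: (n_h, N), c: (n_h,), U: (N, n_h), b: (N,), v: (N,) binary."""
    h_i = sigm(c + W[:, :i] @ v[:i])    # hidden layer sees only v_{<i}
    return sigm(b[i] + U[i] @ h_i)      # output for site i
```

Evaluated naively like this, each conditional costs $O(i \, n_h)$; the forward pass below reuses a running pre-activation $\mathbf{a}$ so that the full $p(\mathbf{v})$ costs $O(N n_h)$ instead of $O(N^2 n_h)$.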
Training the NADE boils down to minimizing the negative log-likelihood of the parameters given the training set.
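Written out using the autoregressive factorization from earlier, for a training set $\mathcal{D}$ the objective is
$$ \mathrm{NLL} = -\sum_{\mathbf{v} \in \mathcal{D}} \ln p(\mathbf{v}) = -\sum_{\mathbf{v} \in \mathcal{D}} \sum_{i=1}^{N} \ln p(v_i \vert \mathbf{v}_{<i}). $$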
The autoregressive probability, $p(\mathbf{v})$, is computed with the following forward pass.
$$ \begin{aligned} &p(\mathbf{v}) = 1 \\ &\mathbf{a} = \mathbf{c} \\ &\text{for i in 1:N} \\ &\qquad \mathbf{h}_i \leftarrow \mathrm{sigm}(\mathbf{a}) \\ &\qquad p(v_i = 1 \vert \mathbf{v}_{<i}) \leftarrow \mathrm{sigm}(b_i + \mathbf{U}_{i,\cdot}\mathbf{h}_i) \\ &\qquad p(\mathbf{v}) \leftarrow p(\mathbf{v}) \times p(v_i = 1 \vert \mathbf{v}_{<i})^{v_i} \times \left( 1 - p(v_i = 1 \vert \mathbf{v}_{<i}) \right)^{1-v_i} \\ &\qquad \mathbf{a} \leftarrow \mathbf{a} + \mathbf{W}_{\cdot,i} v_i \\ &\text{return } p(\mathbf{v}) \end{aligned} $$
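As a minimal NumPy sketch (not the interface of any particular library), the forward pass above might look like the following; it works in log space for numerical stability and also returns the per-site conditionals and hidden activations, which the gradient pass below reuses.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def nade_forward(v, W, c, U, b):
    """Forward pass of the pseudocode above.
    v: (N,) binary configuration; W: (n_h, N); c: (n_h,); U: (N, n_h); b: (N,).
    Returns log p(v), the conditionals p(v_i = 1 | v_{<i}), and the hidden
    activations h_i."""
    N = len(v)
    a = np.array(c, dtype=float)        # running pre-activation c + W_{.,<i} v_{<i}
    log_p = 0.0
    conds = np.zeros(N)
    hs = np.zeros((N, len(c)))
    for i in range(N):
        h_i = sigm(a)
        p_i = sigm(b[i] + U[i] @ h_i)   # p(v_i = 1 | v_{<i})
        log_p += v[i] * np.log(p_i) + (1 - v[i]) * np.log(1.0 - p_i)
        a += W[:, i] * v[i]             # fold v_i into the running activation
        conds[i], hs[i] = p_i, h_i
    return log_p, conds, hs
```

Sampling is the same loop run generatively: at step $i$ draw $v_i \sim \mathrm{Bernoulli}(p_i)$, feed it into the running activation, and move on; no Markov chain is needed.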
Now, we can calculate gradients of the NLL w.r.t. the NADE parameters ($\mathbf{b}$, $\mathbf{c}$, $\mathbf{W}$, and $\mathbf{U}$) by backpropagating through the forward pass:
$$ \begin{aligned} &\delta \mathbf{a} = 0 \\ &\delta \mathbf{c} = 0 \\ &\text{for i in N:-1:1} \\ &\qquad \delta b_i \leftarrow p(v_i = 1 \vert \mathbf{v}_{<i}) - v_i \\ &\qquad \delta \mathbf{U}_{i, \cdot} \leftarrow \left( p(v_i = 1 \vert \mathbf{v}_{<i}) - v_i \right) \mathbf{h}_{i} \\ &\qquad \delta \mathbf{h}_i \leftarrow \left( p(v_i = 1 \vert \mathbf{v}_{<i}) - v_i \right) \mathbf{U}_{i, \cdot} \\ &\qquad \delta \mathbf{c} \leftarrow \delta \mathbf{c} + \delta \mathbf{h}_i \odot \mathbf{h}_i \odot (1 - \mathbf{h}_i) \\ &\qquad \delta \mathbf{W}_{\cdot,i} \leftarrow \delta \mathbf{a} \, v_i \\ &\qquad \delta \mathbf{a} \leftarrow \delta \mathbf{a} + \delta \mathbf{h}_i \odot \mathbf{h}_i \odot (1 - \mathbf{h}_i) \\ &\text{return } \delta \mathbf{b}, \delta \mathbf{c}, \delta \mathbf{W}, \delta \mathbf{U} \end{aligned} $$
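The same pass in NumPy, as a hedged sketch mirroring the pseudocode (it consumes the conditionals and hidden activations returned by `nade_forward` above); the loop runs backwards so that $\delta \mathbf{a}$ has accumulated the contributions of all later sites before $\delta \mathbf{W}_{\cdot,i}$ is read off.

```python
import numpy as np

def nade_gradients(v, conds, hs, W, U):
    """Gradients of the per-sample NLL, -log p(v), w.r.t. b, c, W, U.
    v: (N,) binary sample; conds, hs: conditionals and hidden activations
    from nade_forward; W: (n_h, N); U: (N, n_h)."""
    N, n_h = hs.shape
    db, dc = np.zeros(N), np.zeros(n_h)
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    da = np.zeros(n_h)
    for i in reversed(range(N)):        # backwards: da accumulates sites > i
        err = conds[i] - v[i]           # d(NLL)/d(output pre-activation)
        db[i] = err
        dU[i] = err * hs[i]
        dh_i = err * U[i]
        dc += dh_i * hs[i] * (1.0 - hs[i])
        dW[:, i] = da * v[i]            # W_{.,i} only feeds hidden layers j > i
        da += dh_i * hs[i] * (1.0 - hs[i])
    return db, dc, dW, dU
```

A quick sanity check for a sketch like this is to compare each returned array against finite differences of $-\ln p(\mathbf{v})$ computed with `nade_forward`.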
I have open-source code for using NADEs to do quantum state reconstruction. It is relatively new and is regularly updated with more functionality. Go check it out here.