This note is for the summer school of Machine Learning and statistical physics held at YNU, 2023.
Written by Shigang Ou ([email protected])
This note is under rapid change, and may contain misinformation or redundant content, including some extended information not mentioned in the summer school
- What is Neural Network?
- Unsupervised Learning
- Statistical Physics of Learning
- Cavity Method and 2-layer Perceptron
- Quantum Many-body Physics Introduction
- Spin Glass: Theory and applications
- Connections between ML and SP
- Statistical Physics and Statistical Inference are the same thing
- Information Compression
- Why NNs?
- What I cannot Create, I do not Understand
- Quantum Supremacy
- Statistical Physics of Learning
- Curie-Weiss Model
- Network Phase Diagram
- Implicit vs Explicit Generative Models
- Statistical Physics or Statistics?
- Formal Definition of Statistical Physics
- Learning Machine
- When to use linear and when non-linear?
- Occam's Razor
- NN dynamics and its weight spectrum
- Other Model-free methods
- Dynamic Systems: A Prior
- Reservoir Computing
- Systematic Prediction based on Periodic Orbits
- Monte Carlo Methods
- Quantum Annealing
- Dynamics of cell state transition
- Human population behavior and propagation dynamics
- Spiking Neural Networks
- Probabilistic inference reformulated as tensor networks
- Variational Autoregressive Network
- Machine learning and chaos
- Definition and Characterization of Chaos
- Prediction of Chaos
- AI for Physics
- Power operations on sparse matrices
- Learning nonequilibrium statistical mechanics and dynamical phase transitions
- Machine Learning, Statistical Physics, and Complex Systems
- El Niño prediction
- Variational Bayesian Method
- Closing remarks
2023-07-19, Huang: "Statistical physics of Neural Network"
Network Name | Developer Name | Year | Introduction |
---|---|---|---|
Multilayer Perceptron (MLP) | Frank Rosenblatt | 1958 | A simple and widely used type of neural network that consists of an input layer, one or more hidden layers, and an output layer of artificial neurons. |
Hopfield Network | John Hopfield | 1982 | A type of recurrent neural network that can store and retrieve patterns as stable states. It consists of a single layer of fully connected neurons with symmetric weights and binary activations. |
Recurrent Neural Network (RNN) | David Rumelhart et al. | 1986 | A type of neural network that can process sequential data such as text, speech, and time series. It consists of a hidden layer of neurons that have recurrent connections to themselves, forming a loop. |
Convolutional Neural Network (CNN) | Yann LeCun | 1989 | A type of feedforward neural network that can process high-dimensional data such as images, videos, and speech. It consists of multiple layers of neurons that perform convolution operations on the input data, followed by pooling layers and fully connected layers. |
Spiking Neural Network (SNN) | Eugene Izhikevich et al. | 2003 | A type of neural network that mimics the behavior of biological neurons more closely than other types. It consists of spiking neurons that communicate with each other using discrete pulses or spikes. |
This list can go on and on, along with the history of the winters and springs of AI. But how can we understand neural networks in a more general way?
Some well-established theories from this history can still be used today.
MLP is defined as:
where
WHERE IS MY BIAS?
Assume a layer with input $x$, weight $W$, bias $b$, and activation $\sigma$. A typical layer is defined as
$$ y = \sigma(Wx + b) $$
Append a constant 1 to $x$ and append $b$ as an extra column of $W$ to get the augmented input $\tilde{x}$ and weight $\tilde{W}$:
$$ \tilde{x} = (x, 1)^\top, \qquad \tilde{W} = (W, b) $$
Rewriting the output of the layer using $\tilde{x}$ and $\tilde{W}$,
$$ y = \sigma(\tilde{W}\tilde{x}) $$
which has the same output as before, but with no explicit bias term.
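A quick numpy check of this bias-absorption trick (a toy sketch; the shapes and the tanh activation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weight matrix
b = rng.normal(size=4)        # bias vector
x = rng.normal(size=3)        # input vector

sigma = np.tanh               # any elementwise activation works here

y_bias = sigma(W @ x + b)                 # layer with explicit bias
W_tilde = np.hstack([W, b[:, None]])      # append b as an extra column of W
x_tilde = np.append(x, 1.0)               # append constant 1 to the input
y_aug = sigma(W_tilde @ x_tilde)          # same layer, bias absorbed

print(np.allclose(y_bias, y_aug))         # True
```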
This function is unreasonably effective in many tasks, but why?
Over-parameterization is the key.
https://arxiv.org/pdf/2008.06786.pdf: Triple Descent and a Multi-Scale Theory of Generalization
Traditional idea: more parameters, more overfitting.
This is not the full story:
$p$ is the number of parameters, $m$ is the number of training samples. Explained in 2019 by Deep Double Descent: Where Bigger Models and More Data Hurt:
Neural Networks and Physical Systems with Emergent Collective Computational Abilities:
Hopfield network is defined as:
where
This definition mimics the biological neurons like Fig.1, anything cool?
Computational ability that can generalize, categorize, correct errors, and recognize familiarity.
If we want to store a binary vector
Here
$2V-1 \in \{-1, +1\}$ is the binary-to-bipolar transformation. Neurons in the same state reinforce each other; neurons in different states inhibit each other.
Here weight matrix is fixed, unlike the MLP.
To retrieve a stored memory from an initial state
The update rule is asynchronous and random? Why not synchronous and deterministic?
The update rule is not important, the energy function is.
We can define the energy function of the network as:
The first term is the interaction energy between neurons, and the second term is the external energy of the neurons.
It's proved that the energy function is a Lyapunov function of the Hopfield network dynamics, which means that the energy function will decrease or remain constant at each step, until it reaches a minimum value that corresponds to a stable state of the network.
Proof:
Do the stable states actually correspond to the stored memories?
Not necessarily. The stable states of the Hopfield network are the local minima of the energy function, which may not correspond to the stored memories. The Ising model and spin-glass theory provide the answer for different
$T_{ij}$.
This is the MC simulation result; local minima, i.e. noise, exist. How can we reduce them?
This paper finds that clipping the weights increases the noise.
- Generalize: Restore the stored memories from a corrupted version of the memories in a consistent way.
- Categorize: Classify the input into one of the stored memories.
- Correct errors: Correct the corrupted memories to the stored memories.
- Recognize familiarity: Guide the initial state to the nearest stored memory (Hamming distance).
What cool application can be made out of this? In principle: MNIST digit recognition, image denoising, and even generative models.
See:
Btw, Restricted Boltzmann Machine (RBM) is a special case of Hopfield network.
Hands-on session!
For coding convenience, we can define the Hopfield network as:
Hopfield network is a Graph
where
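A minimal hands-on sketch (my own illustration, not the official session code): store a few bipolar patterns with the Hebbian rule $W_{ij}=\frac{1}{N}\sum_\mu \xi_i^\mu \xi_j^\mu$ and retrieve a corrupted one by asynchronous updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5                              # neurons, stored patterns
xi = rng.choice([-1, 1], size=(P, N))      # random bipolar patterns

W = (xi.T @ xi) / N                        # Hebbian rule
np.fill_diagonal(W, 0)                     # no self-connections

def energy(s, W):
    return -0.5 * s @ W @ s

def retrieve(s, W, sweeps=10):
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(s)):  # asynchronous, random update order
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# corrupt a stored pattern and let the network clean it up
s0 = xi[0].copy()
flip = rng.choice(N, size=15, replace=False)
s0[flip] *= -1
s_final = retrieve(s0, W)
print("overlap with stored pattern:", (s_final @ xi[0]) / N)
print("energy corrupted / retrieved:", energy(s0, W), energy(s_final, W))
```

The energy after retrieval should be no higher than before, consistent with the Lyapunov argument above.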
<!-- In Progress -->
The Curie-Weiss model is a mean-field model of ferromagnetism, named after Pierre Curie and Pierre Weiss (early 1900s). Every spin interacts equally with all the others, so each spin effectively feels the average (mean) field produced by the rest. Above the Curie temperature the magnetization vanishes and the susceptibility follows the Curie-Weiss law (magnetization proportional to the applied field); below it the model predicts a spontaneous magnetization.
Energy:
This contains only 1 parameter,
$J$ .
The solution is
where
This equation has two solutions:
1 parameter saves 2 data points.
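A small numerical sketch, assuming the standard mean-field self-consistency equation $m=\tanh(\beta J m)$ (the exact expressions from the slides are not reproduced in these notes):

```python
import numpy as np

def magnetization(beta_J, m0=0.9, iters=1000):
    """Solve m = tanh(beta*J*m) by fixed-point iteration."""
    m = m0
    for _ in range(iters):
        m = np.tanh(beta_J * m)
    return m

for beta_J in [0.5, 0.9, 1.0, 1.1, 2.0]:
    print(beta_J, magnetization(beta_J))
# below beta*J = 1 the only solution is m = 0;
# above it two symmetric nonzero solutions +/- m appear.
```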
Taken from Dreaming Neural Networks: Forgetting spurious memories and reinforcing pure ones
To learn from data, the network weights are updated to minimize a loss,
which is derived from maximum likelihood estimation.
To minimize it, we take derivative of it with respect to
where
This is the inverse Ising problem, because we want to find
$J_{ij}$ from $x_i$, i.e. design a Hamiltonian that fits the data.
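A toy sketch of this inverse Ising / moment-matching idea, using exact enumeration so no MCMC is needed (all sizes, rates, and data below are made up for illustration):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 5
J_true = rng.normal(scale=0.5, size=(N, N)); J_true = (J_true + J_true.T) / 2
np.fill_diagonal(J_true, 0)

states = np.array(list(product([-1, 1], repeat=N)))   # all 2^N configurations

def model_correlations(J):
    E = -0.5 * np.einsum("si,ij,sj->s", states, J, states)
    p = np.exp(-E); p /= p.sum()
    return np.einsum("s,si,sj->ij", p, states, states)

# "data" correlations generated from the true couplings
C_data = model_correlations(J_true)

# gradient ascent on the log-likelihood: dJ ~ <x_i x_j>_data - <x_i x_j>_model
J = np.zeros((N, N))
for step in range(2000):
    J += 0.1 * (C_data - model_correlations(J))
    np.fill_diagonal(J, 0)

print("max error in recovered couplings:", np.abs(J - J_true).max())
```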
Phase diagram is so cool, it has been applied in NN like here:
Taken from Phase diagram for two-layer ReLU neural networks and infinite-width limit
The above model only considers two body interaction, but in reality, there's higher order interaction. (Not discussed yet)
Consider Ising model with hidden variable:
where
But we can only observe
The loss is
This is the same as the above.
Similar to the above, we take derivative of it with respect to
This is slow using MCMC. Hinton proposed a faster way: the RBM, which is a bipartite graph where the hidden variables are conditionally independent given the visible ones, and the visible variables are conditionally independent given the hidden ones.
Hinton proposed Contrastive Divergence (CD) algorithm to solve this problem.
Taken from Correlation Between Eigenvalue Spectra and Dynamics of Neural Networks
The main contributions of the paper are:
- It provides a clear criterion to distinguish between the chaos phase and the memory phase in asymmetric neural networks with associative memories, based on the eigenvalue spectra of the synaptic matrices.
- It reveals a novel phenomenon of eigenvalue splitting in the memory phase, and shows that the number and positions of the split eigenvalues are related to the number and stability of the memory attractors.
- It develops a mean-field theory to derive analytical expressions for the eigenvalue spectra and the dynamical properties of the neural network, and verifies them with numerical simulations.
A new proposal:
find different phases and order parameters of NN. Integrate existing works.
The NN can be MLP, Hopfield, RNN, CNN, SNN, etc.
One work relating to the grokking behavior in NNs:
Taken from Towards Understanding Grokking: An Effective Theory of Representation Learning
but it does not offer any meaningful insight into the phase diagram.
This so-called phase diagram/phase transition is an abuse of physics terminology and offers limited theoretical insight. For more discussion, see the OpenReview page of this paper.
A reviewer claims that: There have been a couple of papers that use tools from physics and provide phase diagrams for understanding the generalization of neural networks:
- Generalisation error in learning with random features and the hidden manifold model by Gerace et al
- Multi-scale Feature Learning Dynamics: Insights for Double Descent by Pezeshki et al
- The Gaussian equivalence of generative models for learning with two-layer neural networks by Goldt et al
- Statistical mechanics for neural networks with continuous-time dynamics by Kuhn et al
Here the last paper applies Noether's theorem to NNs, yielding predictions for their training dynamics.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC346238/pdf/pnas00447-0135.pdf: Neural Networks and Physical Systems with Emergent Collective Computational Abilities
set
for example, if we set
and we can use gradient descent to minimize it $$ W_{ij}'=W_{ij} + \eta (\lang x_ix_j \rang_{x\thicksim D} - \lang x_ix_j \rang_{x \thicksim p}) $$
What makes the algorithm costly is the generation of samples from the model distribution $p$ (the second average), which requires MCMC.
In practice, when we train the energy function, we often write it in the following (marginal) form
$$
E(x)=-\sum_{i}a_ix_i - \sum_j\ln{(1 + e^{\sum_{i}W_{ij}x_i + b_j})}
$$
which is derived from the following hypotheses
$$
p(x)=\frac{1}{Z}\sum_h e^{-E(x,h)}
\quad\text{(hidden variables)}\\
E(x,h)=-\sum_i a_ix_i -\sum_j b_j h_j - \sum_{i,j}x_iW_{ij}h_j
\quad\text{(energy parameterization)}\\
p(h|x)=\prod_j p(h_j|x)\quad\text{(bipartite graph)}\\
p(x|h)=\prod_i p(x_i|h)\quad\text{(bipartite graph)}
$$
The gradient in general would be $$ \frac{\partial \mathcal{L}}{\partial \theta} = \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{x\sim D,\, h\sim p(h|x)} - \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{x \sim p(x),\, h \sim p(h|x)} $$
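A minimal CD-1 sketch for a binary RBM, as a rough illustration of the gradient above (this is not Hinton's reference implementation; shapes, data, and hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, lr = 20, 10, 0.05
W = 0.01 * rng.normal(size=(n_v, n_h))
a = np.zeros(n_v)                      # visible biases
b = np.zeros(n_h)                      # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# toy binary data: noisy copies of two template vectors
templates = rng.integers(0, 2, size=(2, n_v)).astype(float)
data = templates[rng.integers(0, 2, size=500)]
data = np.abs(data - (rng.random(data.shape) < 0.05))  # flip 5% of the bits

for epoch in range(50):
    for v0 in data:
        ph0 = sigmoid(v0 @ W + b)          # positive phase
        h0 = sample(ph0)
        pv1 = sigmoid(h0 @ W.T + a)        # one Gibbs step (CD-1)
        v1 = sample(pv1)
        ph1 = sigmoid(v1 @ W + b)
        # contrastive divergence update: data statistics minus reconstruction statistics
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        a += lr * (v0 - v1)
        b += lr * (ph0 - ph1)

recon = sigmoid(sample(sigmoid(data @ W + b)) @ W.T + a)
print("reconstruction error:", np.mean((data - recon) ** 2))
```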
input equals output.
Variational Autoencoder, which works as follows:
- encode the input into a distribution $q(z|x)$
- sample $z$ from $q(z|x)$
- decode $z$ into a distribution $p(x|z)$
- sample $x$ from $p(x|z)$
- minimize the negative log likelihood of $x$ w.r.t. $q(z|x)$ and $p(x|z)$
set
$$ \min_{G} \max_{D} \mathcal{L}(D,G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] + \mathbb{E}_{z\sim p_{z}(z)}[\log{(1-D(G(z)))}] $$
where
Contrastive learning is a general framework for unsupervised learning. The idea is to learn a representation that makes the data distribution and the noise distribution distinguishable.
Jacot et al. 2018
Magnetic impurities in a non-magnetic metal (e.g. AuFe): the spin magnetic moments point in random directions, with long-range disorder.
Glass state is a non-equilibrium, non-crystalline condensed state of matter that exhibits a glass transition when heated towards the liquid state.
Edwards-Anderson model: Ising spin glass.
Sherrington-Kirkpatrick model: fully connected (mean-field) Ising spin glass.
Goal: compute free energy
- write out $Z$: $Z=\sum_{\{s_i\}}e^{-\beta H(s_1,\dots,s_N)}$
- use the replica identity: $\overline{\ln{Z}}=\lim_{n\to 0}\frac{\overline{Z^n}-1}{n}$
- expand it over $n$ replicas: $Z^n=\sum_{\{s_i^a\}}e^{-\beta H(s_1^1,\dots,s_N^1)}\dots e^{-\beta H(s_1^n,\dots,s_N^n)}$
- average over the quenched (Gaussian) disorder: the disorder average of the product of Boltzmann factors couples different replicas, producing terms quadratic in the replica overlaps
- linearize the quadratic terms with a Hubbard-Stratonovich transformation (introducing the overlap order parameters as Gaussian auxiliary variables) and integrate out the spins
- evaluate the remaining integral over the order parameters with the Laplace (saddle-point) approximation in the limit $N\to\infty$
The replica method is a technique to deal with quenched disorder in statistical physics, such as in spin glass models. The idea is to compute the average of the logarithm of the partition function by introducing replicas of the system and taking the limit of zero replicas. The partition function of n replicas can be written as:
where
The replica method assumes that there is a unique analytic function that interpolates the values of $\overline{Z^n}$ at the positive integers $n$, so that the limit $n\to 0$ can be taken.
The average free energy per spin can then be obtained by:
where
The main difficulty in applying the replica method is to find a suitable representation for the order parameter that describes the correlations between replicas. This order parameter is usually an $n\times n$ overlap matrix $Q_{ab}$ between pairs of replicas.
The matrix
The simplest assumption for the matrix
where
However, it turns out that the RS Ansatz is not valid for some systems, such as the Sherrington-Kirkpatrick (SK) model of spin glasses. In these systems, there are many metastable states that are separated by large energy barriers, and different replicas can explore different regions of phase space. This leads to a more complicated structure for the matrix
Replica symmetry breaking (RSB) is a way to generalize the RS Ansatz by allowing for different values of overlaps between replicas, depending on how they are grouped or clustered. The most general form of RSB is called full RSB, and it involves an infinite hierarchy of breaking levels. However, for some systems, such as the SK model, a simpler form of RSB is sufficient, called one-step RSB (1-RSB).
In 1-RSB, one divides the $n$ replicas into $n/m$ groups of $m$ replicas each,
where
The 1-RSB Ansatz captures the idea that there are clusters of states that have a higher overlap within them than between them. The parameter
Using the 1-RSB Ansatz, one can compute the free energy per spin as:
where
The self-consistency equations for
A general procedure of the replica method is as follows:
(taken from Boazbarak)
The crazy part of the replica method is that we want to calculate
$\log Z$, but we end up evaluating a different quantity, and the two magically agree! (Rigorously established only for the SK model; no proof yet for other models.)
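A quick numerical illustration of the replica identity $\overline{\ln Z}=\lim_{n\to 0}(\overline{Z^n}-1)/n$ on a tiny SK-like system; this only sanity-checks the identity itself, not the full replica calculation:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, beta, n_disorder = 6, 1.0, 2000
states = np.array(list(product([-1, 1], repeat=N)))

def logZ(J):
    E = -0.5 * np.einsum("si,ij,sj->s", states, J, states)
    return np.log(np.exp(-beta * E).sum())

logZs = []
for _ in range(n_disorder):
    J = rng.normal(scale=1/np.sqrt(N), size=(N, N))
    J = (J + J.T) / 2
    np.fill_diagonal(J, 0)
    logZs.append(logZ(J))
logZs = np.array(logZs)

print("quenched average E[ln Z]:", logZs.mean())
for n in [1.0, 0.1, 0.01]:
    # E[Z^n] = E[exp(n ln Z)]; the identity says (E[Z^n]-1)/n -> E[ln Z] as n -> 0
    print(f"(E[Z^{n}] - 1)/{n}:", (np.exp(n * logZs).mean() - 1) / n)
```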
The cavity method is a technique to compute the average of a function of a random variable in a large system, by using a recursive equation that describes the correlations between the variable and its neighbors. It is used in statistical physics to compute the free energy of spin glass models, and in computer science to analyze the performance of message passing algorithms on random graphs.
ML algorithms don't end up in the isolated solutions that dominate in number, but settle in the large (although seemingly rare) dense basins of attraction.
Subdominant Dense Clusters Allow for Simple Learning and High Computational Performance in Neural Networks with Discrete Synapses.
Quantum many-body physics is the study of the behavior of systems made of many interacting particles, where quantum mechanics plays an essential role. It is used to describe a wide range of physical phenomena, from the behavior of electrons in metals and semiconductors, to the superfluidity of liquid helium, the Bose-Einstein condensation of ultracold atoms, and the superconductivity of certain materials at low temperatures.
The quantum many-body problem is the problem of finding the ground state of a quantum system with many interacting particles. It is one of the most challenging problems in physics, due to the exponential complexity of the Hilbert space of the system. The difficulty of the problem depends on the dimensionality of the system, the type of interactions, and the symmetries of the Hamiltonian.
Define
where
The ground state of the system is the state with the lowest energy, and it is the state that the system will tend to occupy at zero temperature.
In one dimension, the quantum many-body problem can be solved exactly using the Bethe ansatz. The Bethe ansatz is a method to find the eigenstates of a system with an integrable Hamiltonian, by writing them as a linear combination of plane waves. It was first introduced by Hans Bethe in 1931 to solve the one-dimensional Heisenberg spin chain, and it was later generalized to other models such as the Hubbard model of interacting electrons.
A quantum many-body system is a system with many interacting particles, where quantum mechanics plays an essential role.
- Iron-based superconductors
- Manganese oxides (manganites)
competing order: different order parameters compete with each other.
Macroscopic quantum phenomena: superconductivity, superfluidity, Bose-Einstein condensation.
Nobel Prize related: 1972, 2003, 2016, 2017, 2019
Boson: superfluidity, Bose-Einstein condensation
Fermion: superconductivity
Exponential complexity of the Hilbert space of the system:
The superconductivity can be defined as:
During the writing of this note, LK-99, a material claimed to be a superconductor, is attracting public attention.
Can theory really guide the discovery of novel materials? So far the answer seems to be no.
Experiment facts:
- $T_c$ is the critical temperature of the superconducting phase transition.
- Specific heat jump at $T_c$.
- Isotope effect: $T_c \propto M^{-\alpha}$, where $M$ is the mass of the atom and $\alpha \sim 0.5$.
Theory explaining the isotope effect: BCS theory.
BCS theory basics:
- Cooper pair: bound state of two electrons.
- Cooper pair condensation: the ground state of the system is a condensate of Cooper pairs.
- BCS wavefunction: $\ket{\Psi} = \prod_{k>0} (u_k + v_k c_k^\dagger c_{-k}^\dagger) \ket{0}$, where $c_k^\dagger$ is the creation operator of an electron with momentum $k$, and $u_k$ and $v_k$ are the coefficients of the wavefunction.
- BCS Hamiltonian: $H = \sum_k \xi_k c_k^\dagger c_k + \sum_{k,k'} V_{kk'} c_k^\dagger c_{-k}^\dagger c_{-k'} c_{k'}$, where $\xi_k$ is the kinetic energy of an electron with momentum $k$, and $V_{kk'}$ is the interaction potential between electrons with momenta $k$ and $k'$.
- BCS ground state: $\ket{\Psi} = \prod_{k>0} (u_k + v_k c_k^\dagger c_{-k}^\dagger) \ket{0}$, where $u_k$ and $v_k$ are the coefficients of the wavefunction.
- BCS gap equation: $\Delta_k = \sum_{k'} V_{kk'} \frac{\Delta_{k'}}{2E_{k'}} \tanh\left(\frac{\beta E_{k'}}{2}\right)$, where $\Delta_k$ is the gap parameter, $E_k = \sqrt{\xi_k^2 + \Delta_k^2}$ is the quasiparticle energy, and $\beta = 1/k_B T$ is the inverse temperature.
- The critical temperature is given by $k_B T_c = 1.14\,\omega_D\, e^{-1/\lambda}$, where $\omega_D$ is the Debye frequency and $\lambda$ is the coupling constant.
It explains the isotope effect as follows:
- The mass of the atom affects the interaction potential between electrons.
- The interaction potential affects the gap parameter.
- The gap parameter affects the critical temperature.
Three key points:
- Electrons are attracted to each other due to electron-phonon interaction.
- Electrons form Cooper pairs.
- Cooper pairs condense into a superconducting state.
- Definition: A Cooper pair is a pair of electrons with opposite spins and wave vectors $\pm \frac{\hbar}{2}\mathbf{k}_F$, where $\mathbf{k}_F$ is the Fermi wave vector, that are bound together by an attractive interaction mediated by phonons.
- Definition: The binding energy of a Cooper pair is the difference between the energy of two free electrons and the energy of a bound pair.
- Theorem: Under the assumptions of a weak electron-phonon coupling and a screened Coulomb interaction, the binding energy of a Cooper pair is given by:
where
P. W. Anderson: Imagine yourself in a room full of people, each of whom is wearing a pair of roller skates. You are not wearing roller skates. You want to move across the room, but you cannot walk. What do you do? You grab onto the hand of the person nearest you, and the two of you move together across the room. Then you let go, grab onto the hand of the next person, and so on. In this way, you can move across the room without ever putting on a pair of roller skates. This is how electrons move through a superconductor.
Quantum Simulation using cold atom in optical lattice: Nature 452, 854 (2008)
define:
where
The ground state of the system is the state with the lowest energy, and it is the state that the system will tend to occupy at zero temperature.
The ground state of the system can be either a superfluid or a Mott insulator, depending on the values of the parameters
The superfluid phase is characterized by the presence of long-range phase coherence, which means that the phase of the wavefunction is the same at all sites. This allows the bosons to move freely through the lattice, and it leads to a finite superfluid density.
The Mott insulator phase is characterized by the absence of long-range phase coherence, which means that the phase of the wavefunction is different at different sites. This prevents the bosons from moving freely through the lattice, and it leads to a zero superfluid density.
Solve for the ground state of the system using the Gutzwiller ansatz, which is a variational ansatz that assumes that the ground state is a product state of the form:
where
The Gutzwiller ansatz is exact for the Mott insulator phase, and it gives a good approximation for the superfluid phase.
element will interact.
2023-07-23 09:11:11 Pan Zhang, SP, ML, QC
ML is "essentially" fitting the joint probability distribution of the data.
Simulation methods in one field can be used in the other fields.
Does nature essentially compute?
Is the universe a computer?
Computation is defined as the process of transforming information from one form to another. It is a fundamental process that occurs in nature, and it is the basis of all physical phenomena.
Can we mimic nature's computation?
Can we harness nature's computation?
Why there's different phase and phase transition?
Free energy: Consider all possible configurations.
At low T, liquid phase is possible, but with vanishing probability.
Nature chooses the phase with the lowest free energy.
Will nature fail?
Super-cooled liquid: a liquid that persists below the freezing point, even though the solid phase has lower free energy; the system is stuck in a metastable state.
It might rest at some local minimum.
OK, so sometimes there is no solid, but a super-cooled liquid or a glass instead. (This sounds insane, but it is reasonable, and it resolved part of a puzzle for me.)
More is different: exponential complexity space, but only a few phases.
Statistical mechanics from theoretical computer science perspective: computational complexity.
Task | computational complexity |
---|---|
Partition function | #P-complete |
Free energy | #P-complete |
Observable | NP-complete |
Sampling | NP-complete |
Optimization | NP-complete |
Energy of one configuration | P |
Complexity ladders: P, NP, #P, PSPACE, BQP, BPP, ...
Complexity class | Problem | Example |
---|---|---|
P | Polynomial | Matrix multiplication |
NP | Non-deterministic polynomial | Traveling salesman problem |
#P | Counting problem | Partition function |
PSPACE | Polynomial space | Quantum circuit simulation |
Given a data set
Learn: given data $x$ and labels $y$, learn $p(x,y)$
Usage:
- discriminative task: $p(y|x)=p(x,y)/p(x)$
- generative task: $p(x|y)=p(x,y)/p(y)$
This seems to be a trivial Bayesian estimation, but problems arise when we deal with high-dimensional distributions (i.e.,
To deal with this, we introduce some models that give a prior distribution function
This seems to be a trivial practice, as it makes no difference from just using NNs to represent
$p(x,y)$
The loss function is the difference between
we may simplify it as $$ \mathbb{KL}(\pi\|p)=\left\langle \ln{\frac{\pi}{p}} \right\rangle_\pi =\langle \ln{\pi} \rangle_\pi - \langle \ln{p} \rangle_\pi $$
in most cases, we treat
this is the negative log likelihood (what's this?); to minimize it is to maximize the likelihood.
Another way to understand:
Given data
Define it as a partition function:
where
If we given
Given
In principle, we can't restore
A phase diagram of the problem:
Taken from Krzakala et al. 2012
How is this diagram related to physics?
Why phase diagram plays central role in physics?
There are many methods more parameter-efficient than NNs, but NNs are the most suitable for GPUs.
魑魅魍魉 (chi mei wang liang: demons and monsters of every kind)
The difficulty of Generative > Discriminative
Quantum Mechanics want to find
$p(x)=\frac{|\Psi(x)|^2}{\int dx |\Psi(x)|^2}$
Statistical Physics want to find
$p(x)=\frac{e^{-\beta E(x)}}{\int dx e^{-\beta E(x)}}$
Parameterization of
- Forward problem: Given energy, find minimum energy state.
- Inverse problem: Given state, find energy that generates it.
Trivial case for RBM.
Generalization ability matters. How does this overcome the problem from physics?
Seemingly no solution yet.
Physicists' attack on Learning and Inference: An Introduction to Learning and Generalisation
GPU is all you need: Current Boom in ML is due to GPU. NNs are not the best, but the most suitable for GPU.
Success of ML = Data + Algorithms + NNs + GPU
Given that computing hardware is so important, what are the current solutions to the failure of Moore's law?
- Quantum Computing
"Nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical, and by golly it's a wonderful problem, because it doesn't look so easy." - Richard Feynman
- Neuromorphic Computing
"The brain is a computer made of meat." - Marvin Minsky
- Optical Computing
"The future of computing is light." - John Hennessy
- DNA Computing
"DNA is like a computer program but far, far more advanced than any software ever created." - Bill Gates
The above are all forms of analog computing, which has inherent problems with noise and precision. But the future of AGI might be analog computing.
Quantum Supremacy is the demonstration of a quantum computer performing a calculation that is beyond the reach of the most powerful supercomputers today.
Ising model with complex-number interaction.
Google: 10,000 years(Summit)-> 200 seconds(Sycamore)
The quantum state in the random circuit sampling problem is a superposition of all possible states, and it is given by:
where
We apply random single-qubit gates to the initial state, and we measure the final state in the computational basis. The probability of obtaining a given outcome
where
In the paper Quantum supremacy using a programmable superconducting processor, the circuit has 53 qubits; the classical computation cost can be estimated as:
in storage and computation.
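A back-of-the-envelope version of that estimate (assuming one complex amplitude per basis state, at 8 or 16 bytes each):

```python
n_qubits = 53
amplitudes = 2 ** n_qubits                 # number of complex amplitudes
for bytes_per_amp in (8, 16):              # complex64 vs complex128
    petabytes = amplitudes * bytes_per_amp / 1e15
    print(f"{bytes_per_amp} bytes/amplitude -> {petabytes:.0f} PB")
# roughly 72 PB (complex64) or 144 PB (complex128) just to store the state vector
```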
Tensornetwork: solve the problem on a hardware.
How can the Ising model be used for learning?
The Ising model can be defined as:
where
The energy function must be polynomial-time computable; otherwise, this model won't work. (But why?)
It satisfies the Boltzmann distribution:
where
Why the Boltzmann distribution?
Consider
We want to maximize the entropy:
subject to the constraints:
which means that these expectations equal the sample averages.
Using Lagrange multiplier, we have:
$$ \mathcal{L} = -\sum_{s} p(s) \ln{p(s)} + \alpha \Big(\sum_{s} p(s) - 1\Big)
- \sum_{i} \beta_i \Big(\sum_{s} s_i p(s) - \lang s_i \rang\Big)
- \sum_{i<j} \beta_{ij} \Big(\sum_{s} s_i s_j p(s) - \lang s_i s_j \rang\Big) $$
where
Solve for
which gives:
where
The Lagrange multipliers can then be solved for as:
where
That's why we use Boltzmann distribution.
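For completeness, here is the missing stationarity step (a standard maximum-entropy argument, using the Lagrangian above):
$$ \frac{\partial \mathcal{L}}{\partial p(s)} = -\ln p(s) - 1 + \alpha - \sum_i \beta_i s_i - \sum_{i<j}\beta_{ij} s_i s_j = 0 $$
which gives the Boltzmann form
$$ p(s) = \frac{1}{Z}\exp\Big(-\sum_i \beta_i s_i - \sum_{i<j}\beta_{ij} s_i s_j\Big), \qquad Z = \sum_{s} \exp\Big(-\sum_i \beta_i s_i - \sum_{i<j}\beta_{ij} s_i s_j\Big) $$
i.e. an Ising-type energy whose fields and couplings are the Lagrange multipliers, fixed by matching $\langle s_i\rangle$ and $\langle s_i s_j\rangle$.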
The Curie-Weiss model is a mean-field model of ferromagnetism, named after Pierre Curie and Pierre Weiss (early 1900s). Every spin interacts equally with all the others, so each spin effectively feels the average (mean) field produced by the rest. Above the Curie temperature the magnetization vanishes and the susceptibility follows the Curie-Weiss law (magnetization proportional to the applied field); below it the model predicts a spontaneous magnetization.
Energy:
This contains only 1 parameter,
$J$ .
The solution is
where
This equation has two solutions:
1 parameter saves 2 data points.
<!-- why? -->
Taken from Dreaming Neural Networks: Forgetting spurious memories and reinforcing pure ones
To learn from data, the network weights are updated to minimize a loss,
which is derived from maximum likelihood estimation.
To minimize it, we take derivative of it with respect to
where
This is the inverse Ising problem, because we want to find
$J_{ij}$ from $x_i$, i.e. design a Hamiltonian that fits the data.
Phase diagram is so cool, it has been applied in NN like here:
Taken from Phase diagram for two-layer ReLU neural networks and infinite-width limit
The above model only considers two body interaction, but in reality, there's higher order interaction. (Not discussed yet)
Consider Ising model with hidden variable:
where
But we can only observe
The loss is
This is the same as the above.
Similar to the above, we take derivative of it with respect to
This is slow using MCMC. Hinton proposed a faster way: the RBM, which is a bipartite graph where the hidden variables are conditionally independent given the visible ones, and the visible variables are conditionally independent given the hidden ones.
Hinton proposed Contrastive Divergence (CD) algorithm to solve this problem.
Given a parameterized distribution
Is RBM explicit or implicit?
Since
Is Flow model explicit or implicit?
Since
Is VAE explicit or implicit?
Since
Is Autoregressive model explicit or implicit?
Since
GPT uses autoregressive model, so it is an explicit generative model.
Should the parameter be shared within the model?
Hong Zhao, XMU, 2023-07-24 09:00:21
"Statistical Physics is not of much use in Machine Learning, Statistical Physics maximize the entropy, but ML minimize it."
Serious?
Main topics in Statistical Physics:
- Boltzmann distribution
- Maxwell-Boltzmann distribution
- Darwin-Fowler distribution
- Gibbs distribution
Statistical physics is not yet as fully developed as classical mechanics; there are no clear boundaries of application between these theories.
No physics theory is capable of describing most of the world we actually see. That's why ML rises.
Flowing water, clouds in the sky, the blowing wind: everything around us is in a state far from equilibrium.
Statistics have the following topics:
- Random variable
- Probability distribution
Given data
(Micro-canonical ensemble) The probability of a given microstate
where
(Canonical ensemble) The probability of a given microstate
where
(Grand canonical ensemble) The probability of a given microstate
where
(Bose-Einstein distribution) The probability of a given microstate
where
(Fermi-Dirac distribution) The probability of a given microstate
where
Now we have a distribution over
$\Gamma$, but what we see is an average over $t$ (time). How to resolve this? Ergodicity: time average = ensemble average
Ergodicity hypothesis: the average of a physical quantity is equal to the ensemble average.
where
Problem related to this hypothesis:
- Poincare recurrence theorem: the system will eventually return to a state arbitrarily close to its initial state.
- Loschmidt's paradox: the time-reversed dynamics of a system is not the same as the original dynamics of the system.
- Boltzmann's H-theorem: the entropy of an isolated system will increase over time.
Difference from statistics:
- Data generated by Hamiltonian dynamics.
- A prior distribution is given by the Hamiltonian.
- Do not assume ergodicity: Boltzmann invented the $\Gamma$ space but did not know how to proceed; the "Boltzmann school" (which throws away the ergodicity hypothesis) assumes dynamical mixing instead. (No proof yet.)
- Works strictly only when $N\to\infty$.
- Can only be applied to little systems.
Why does statistical physics have no mechanics in it? (i.e. no dynamical equation)
Liouville's theorem: the phase space volume is conserved under Hamiltonian dynamics.
But it cannot provide a proof of ergodicity or of the equipartition theorem.
Consider Noisy-Dynamics?
"Random School" developed by Einstein, Smoluchowski, Langevin, Fokker-Planck, Kolmogorov, etc.
Boltzmann equation:
where
If the output can feed back into the system, then it is a dynamical system.
Algorithm:
- 1986: Empirical Risk Minimization (ERM): $\min_{\theta} \sum_{i=1}^m \mathcal{L}(y_i, f(x_i;\theta))$
- 1990: Structural Risk Minimization (SRM): $\min_{\theta} \sum_{i=1}^m \mathcal{L}(y_i, f(x_i;\theta)) + \lambda \Omega(\theta),\ \Omega(\theta) = \frac{1}{2} \theta^T \theta$
- 2000: Design Risk Minimization (DRM): a committee votes on the output of each model.
Can all data be made linearly separable in a high enough dimension?
A linear transformation won't increase the dimension spanned by the data, but a non-linear one can.
Do not multiply entities beyond necessity. (如无必要,勿增实体)
Can we identify the shape of a drumhead given the sound we hear?
Math proof: No
Learning Machine does not obey Occam's Razor.
Taken from Correlation Between Eigenvalue Spectra and Dynamics of Neural Networks
The main contributions of the paper are:
- It provides a clear criterion to distinguish between the chaos phase and the memory phase in asymmetric neural networks with associative memories, based on the eigenvalue spectra of the synaptic matrices.
- It reveals a novel phenomenon of eigenvalue splitting in the memory phase, and shows that the number and positions of the split eigenvalues are related to the number and stability of the memory attractors.
- It develops a mean-field theory to derive analytical expressions for the eigenvalue spectra and the dynamical properties of the neural network, and verifies them with numerical simulations.
- Phase Space Reconstruction
- Reservoir Computing Approach
- Long Short-Term Memory (LSTM)
- Time Delay Dynamical Learning
Which one is the best?
No comparison found in the literature.
Lorenz system:
where
It exhibits chaotic behavior for the classic parameter values $\sigma=10$, $\rho=28$, $\beta=8/3$.
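A minimal integration sketch, assuming the standard Lorenz equations $\dot x=\sigma(y-x)$, $\dot y=x(\rho-z)-y$, $\dot z=xy-\beta z$:

```python
import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0/3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_eval = np.linspace(0, 50, 10000)
sol = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-9)

plt.plot(sol.y[0], sol.y[2], lw=0.3)   # the familiar butterfly in the x-z plane
plt.xlabel("x"); plt.ylabel("z")
plt.show()
```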
Phase space reconstruction (delay embedding):
$$ \mathbf{x}_i = (x_i, x_{i-\tau}, x_{i-2\tau}, \dots, x_{i-(m-1)\tau}) $$
where
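A small sketch of such a delay embedding for a scalar series (the delay $\tau$ and embedding dimension $m$ below are arbitrary choices):

```python
import numpy as np

def delay_embed(x, m, tau):
    """Stack delayed copies of a 1-D series into m-dimensional vectors."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(m)])

x = np.sin(np.linspace(0, 40 * np.pi, 4000))   # toy scalar observable
X = delay_embed(x, m=3, tau=25)
print(X.shape)   # (n_points, 3): each row is one reconstructed phase-space point
```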
Benchmarks are the key to a solid foundation.
Does phase space reconstruction apply to problems with no model, only data?
Yes, because it's a model-free method. It can predict the future of the system.
Reservoir computing is a machine learning framework for training recurrent neural networks. The echo state network was introduced by Herbert Jaeger in 2001; the closely related liquid state machine was proposed by Wolfgang Maass around the same time.
Theory:
where
Training procedure:
- Initialize $\mathbf{W}^\text{in}$ and $\mathbf{W}$ randomly.
- For each training sample $\mathbf{u}(t)$, compute $\mathbf{r}(t)$ using the above equation.
- Compute $\mathbf{W}^\text{out}$ using ridge regression: $\mathbf{W}^\text{out} = \mathbf{Y} \mathbf{R}^T (\mathbf{R} \mathbf{R}^T + \lambda \mathbf{I})^{-1}$, where $\mathbf{Y}$ is the target output matrix, $\mathbf{R}$ is the state matrix, $\lambda$ is the regularization parameter, and $\mathbf{I}$ is the identity matrix.
The target output matrix
where
- Compute $\mathbf{W}^\text{in}$ and $\mathbf{W}$ using backpropagation through time: $\frac{\partial E}{\partial \mathbf{W}^\text{in}} = \frac{\partial E}{\partial \mathbf{Y}} \frac{\partial \mathbf{Y}}{\partial \mathbf{W}^\text{in}}$, and $\frac{\partial E}{\partial \mathbf{W}} = \frac{\partial E}{\partial \mathbf{Y}} \frac{\partial \mathbf{Y}}{\partial \mathbf{W}}$, where $E$ is the error function, $\mathbf{Y}$ is the target output matrix, $\mathbf{W}^\text{in}$ is the input weight matrix, and $\mathbf{W}$ is the recurrent weight matrix. (In the standard ESN setup this step is optional; usually only $\mathbf{W}^\text{out}$ is trained.)
The error function
where
- Repeat the above steps until convergence.
- Use $\mathbf{W}^\text{in}$, $\mathbf{W}$, and $\mathbf{W}^\text{out}$ to predict the output.
What is the ridge regression?
Ridge regression is a regularized version of linear regression. It is used to prevent overfitting in linear regression.
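A minimal echo-state-network sketch in plain numpy, consistent with the procedure above (only the ridge readout is trained; all hyperparameters and the toy task are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, leak, lam = 1, 200, 0.5, 1e-6

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius to 0.9

def run_reservoir(u_seq):
    r = np.zeros(n_res)
    states = []
    for u in u_seq:
        pre = W_in @ np.atleast_1d(u) + W @ r
        r = (1 - leak) * r + leak * np.tanh(pre)   # leaky-integrator update
        states.append(r.copy())
    return np.array(states)

# teach the reservoir to predict the next value of a sine wave
u = np.sin(np.linspace(0, 20 * np.pi, 2000))
R = run_reservoir(u[:-1])                    # collected reservoir states, shape (T-1, n_res)
Y = u[1:]                                    # one-step-ahead targets
W_out = Y @ R @ np.linalg.inv(R.T @ R + lam * np.eye(n_res))   # ridge readout

pred = R @ W_out
print("train MSE:", np.mean((pred - Y) ** 2))
```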
Jiao Wang, XMU, 2023-07-24 13:01:59
Formulation of quantum annealing
Quantum annealing is a metaheuristic for finding the global minimum of a given objective function over a given set of candidate solutions (candidate states), by a process using quantum fluctuations. Quantum annealing is used mainly for problems where the search space is discrete (combinatorial optimization problems) with many local minima; such as finding the ground state of a spin glass.
Transverse field Ising model:
where
evolution of the system is given by:
where
density of defects
How do we obtain this relationship
$\rho(t_s)$?
Kibble-Zurek mechanism:
The Kibble-Zurek mechanism (KZM) describes the non-equilibrium dynamics of a system undergoing a continuous phase transition. It was proposed by T. W. B. Kibble in 1976 and independently by Wojciech Zurek in 1985. The KZM is based on the idea that when a system is driven through a continuous phase transition at a finite rate, it cannot remain in equilibrium arbitrarily close to the critical point: its dynamics effectively freeze out. This allows one to estimate the density of topological defects in the system as a function of the rate of the transition.
where
the theory is
where
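The standard Kibble-Zurek scaling form (quoted from the general theory, since the lecture's exact expression is not reproduced in these notes): for a linear quench of duration $t_s$ across a critical point with correlation-length exponent $\nu$ and dynamical exponent $z$, the defect density scales as
$$ \rho(t_s) \sim t_s^{-d\nu/(1+z\nu)} $$
in $d$ spatial dimensions; for the 1D transverse-field Ising chain ($d=1$, $\nu=z=1$) this gives $\rho \sim t_s^{-1/2}$.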
This theory has no free experimental parameters, yet it agrees well with the experimental data.
This is the first case showing that quantum annealing, using 5300 qubits, can be used to address a problem that is not efficiently solvable by classical computers.
Q: If the long-range interaction is included, will it be better?
A: Maybe; no verification yet.
Q: They reached
Wikipedia is doing a great job in democratizing knowledge.
Jianhua Xing, University of Pittsburgh
Mathematical considerations:
Why does it look like this?
$\bold{A}(\vec{z})$ is the deterministic part, $\xi(\vec{z},t)$ is the stochastic part.
where
The problem became how to find
where
We parameterize
where
Is this form necessary as prior knowledge?
In practice, you can use other bases too.
How is the noise handled?
Here we use white noise, given enough data, we can learn it too.
where
We give the change of
where
This is a Fokker-Planck equation.
A specific ansatz for the dynamical equation is:
Have you considered the anisotropy (各向异性) of the cell?
No, we work in the phase space of cell.
Qingyan Hu, Southern University of Science and Technology, Center for Complex Flows and Soft Matter Research, Department of Statistics and Data Science
Information spreading dynamics contain three key elements:
- Information feature $\alpha_i$
- Network structure $\gamma$
- User attribute $f_i$
We model the information spreading dynamics as:
$$
\beta_i = \alpha_i x (1-\gamma)^{f_i(x)}
$$
where
where
Classical theory: social reinforcement
where
Has your theory considered the inhomogeneity of different users?
No,
$\alpha_i$ is the same for all users here, but it could be made different.
Here we use
where
We simplify the model as:
where
The SIR model is a compartmental model that describes the dynamics of disease spreading in a population. It is a system of ordinary differential equations that models the number of susceptible, infected, and recovered individuals in a population over time.
where
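A minimal integration sketch, assuming the standard SIR form $\dot S=-\beta SI$, $\dot I=\beta SI-\gamma I$, $\dot R=\gamma I$ (parameter values below are arbitrary):

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta=0.3, gamma=0.1):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

sol = solve_ivp(sir, (0, 160), [0.99, 0.01, 0.0], t_eval=np.linspace(0, 160, 400))
S, I, R = sol.y
print("peak infected fraction:", I.max())
print("final epidemic size:", R[-1])
```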
In covid, the quarantine
where
The critical quarantine
Have you considered the spatial factor in disease spreading?
No, but it can be included.
Integrate-and-fire model:
where
The membrane potential
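A minimal leaky integrate-and-fire sketch (assuming the common form $\tau_m \dot V = -(V-V_\text{rest}) + R I(t)$ with a threshold and reset; all parameter values below are arbitrary):

```python
import numpy as np

dt, T = 1e-4, 0.5                      # time step [s], total duration [s]
tau_m, R = 20e-3, 1e7                  # membrane time constant [s], resistance [ohm]
V_rest, V_th, V_reset = -70e-3, -54e-3, -70e-3
I = 2e-9                               # constant input current [A]

steps = int(T / dt)
V = np.full(steps, V_rest)
spikes = []
for t in range(1, steps):
    dV = (-(V[t-1] - V_rest) + R * I) * dt / tau_m
    V[t] = V[t-1] + dV
    if V[t] >= V_th:                   # threshold crossed: emit spike and reset
        spikes.append(t * dt)
        V[t] = V_reset

print("number of spikes:", len(spikes))
print("firing rate [Hz]:", len(spikes) / T)
```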
Spiking neural networks claim to have low power usage and to be more robust to noise.
The future of computation: neuromorphic computing.
Training SNNs is hard because the spiking nonlinearity is not differentiable, so backpropagation cannot be used directly.
Mimicking brain activity is enough for learning; we don't need to know how the brain works.
Training methods:
Method | Pros | Cons |
---|---|---|
ANN-SNN conversion | Easy to implement | Low accuracy |
Approximate gradient | High accuracy | High complexity |
Neuro synaptic dynamics | High accuracy | Low performance |
They apply STDP to train the network.
STDP is a biological process that adjusts the strength of connections between neurons in the brain. The process adjusts the connection strengths based on the relative timing of the input signals to pairs of neurons. The STDP process partially explains the activity-dependent development of nervous systems, especially with regard to long-term potentiation and synaptic plasticity.
where
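A sketch of the standard pair-based STDP window, $\Delta w = A_+ e^{-\Delta t/\tau_+}$ for pre-before-post and $-A_- e^{\Delta t/\tau_-}$ otherwise (the lecture's exact parameters are not reproduced; these are placeholders):

```python
import numpy as np

A_plus, A_minus = 0.01, 0.012        # potentiation / depression amplitudes
tau_plus, tau_minus = 20e-3, 20e-3   # time constants [s]

def stdp_dw(delta_t):
    """Weight change for delta_t = t_post - t_pre."""
    if delta_t > 0:    # pre fires before post: potentiate
        return A_plus * np.exp(-delta_t / tau_plus)
    else:              # post fires before pre: depress
        return -A_minus * np.exp(delta_t / tau_minus)

for dt in [-40e-3, -10e-3, 5e-3, 30e-3]:
    print(f"delta_t = {dt*1e3:+.0f} ms -> dw = {stdp_dw(dt):+.4f}")
```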
They proposed a new method to train SNN called DA-STDP, which is based on the dopamine reward signal.
DA-STDP is based on the following equation:
where
the SNN is implemented in hardware
Jin-Guo Liu, HKUST
Probabilistic inference is the task of computing the posterior distribution of the hidden variables given the observed variables.
where
The joint distribution of the hidden variables and the observed variables is given by:
where
Why does the joint distribution take this form?
This is the definition of the joint distribution.
Graph probabilistic model: the joint distribution is given by:
$$ p(\vec{h},\vec{v}) = \frac{1}{Z} \prod_{i=1}^n \phi_i(\vec{h}_i,\vec{v}_i) \prod_{j=1}^m \phi_j(\vec{h}_j) $$
where
Where is the graph here?
The graph is the structure of the joint distribution.
Probabilistic Model:
<!-- See TensorInference.jl -->
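A tiny illustration of how such a factorized joint distribution becomes a tensor-network contraction (a 1D Ising chain contracted with `np.einsum`; this is plain numpy, not the TensorInference.jl API):

```python
import numpy as np

beta, L = 0.5, 10
spins = np.array([-1, 1])

# pairwise factor phi(s_i, s_{i+1}) = exp(beta * s_i * s_{i+1}) as a 2x2 tensor
phi = np.exp(beta * np.outer(spins, spins))

# contract the chain: Z = sum over all spins of the product of pairwise factors
Z = np.ones(2)
for _ in range(L - 1):
    Z = np.einsum("i,ij->j", Z, phi)
Z = Z.sum()

# exact result for an open 1D Ising chain: Z = 2 * (2 cosh(beta))^(L-1)
print(Z, 2 * (2 * np.cosh(beta)) ** (L - 1))
```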
Pan Zhang , ITP, 2023-07-28 14:03:12
Variational methods in statistical mechanics
Mean field assumptions:
where
Bethe ansatz
where
See lecture note of Yin Tang.
Question:
Given
where
Predict the final
The simplest example is the logistic map
For stability,
Simple code to replicate it
import numpy as np
import matplotlib.pyplot as plt

def logistic(a):
    """Iterate x_{n+1} = a * x_n * (1 - x_n) and keep the last 100 values."""
    x = [0.3]
    for i in range(400):
        x.append(a * x[-1] * (1 - x[-1]))
    return x[-100:]          # discard the transient, keep the attractor

# sweep the control parameter a and plot the attractor: the bifurcation diagram
for a in np.linspace(2.0, 4.0, 1000):
    x = logistic(a)
    plt.plot([a] * len(x), x, "c.", markersize=0.1)
plt.xlabel("r")
plt.ylabel("x_f")
plt.show()
Another example is the Lorenz system:
where
Fractional ordinary differential equation (FODE):
where
For example, the fractional derivative of order
where
What is the physical meaning of the fractional derivative?
It is the memory of the system. The fractional derivative of order
$\alpha$ is the memory of the system of length$\alpha$ .
- Sensitivity to initial conditions
This is the property that the system is sensitive to initial conditions.
Mathematical definition:
Lyapunov exponent:
where
Example: logistic map with
If
- Topological mixing
This is the property that the system will eventually reach any state in the phase space.
Mathematical definition:
where
- Dense periodic orbits
This is the property that the system has infinite periodic orbits.
Mathematical definition:
where
- Sensitive to initial conditions
Two questions in chaos study, under the condition that the system dynamics is unknown (model-free):
- predict chaos evolution
- infer the bifurcation diagram
Essence of statistical learning: i.i.d. assumption
Chaos synchronization: given a coupled chaotic oscillator, we want to synchronize the two oscillators.
Oscillator 1:
Oscillator 2:
where
How is this related to diffusion?
The coupling term is proportional to the difference between the two oscillators, which is similar to the diffusion term in the diffusion equation:
$\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}$ .
- complete synchronization
which means that the two oscillators have the same state variables when the coupling strength is greater than the critical coupling strength.
- phase synchronization
where
How is the phase
$\theta$ defined?
$\theta$ is the angle of the oscillator in the phase space. For example, for the logistic map, $\theta$ is the angle of the point $(x_n, x_{n+1})$ in the phase space.
- Generalized synchronization
where
Science is good, but engineering will make it great.
Why is chaos synchronization important?
Real systems are coupled, and chaos synchronization is a way to model the coupling.
Can we predict the synchronization point
One method: apply reservoir computing.
Producing predictions for an unknown non-linear system is in principle difficult. Here we review some methods to predict chaotic systems based on data/models, trying to grasp the essence of chaos.
Question:
Given
where
Predict the final
The simplest example is the logistic map
For stability,
Simple code to replicate it
import numpy as np
import matplotlib.pyplot as plt

def logistic(a):
    """Iterate x_{n+1} = a * x_n * (1 - x_n) and keep the last 100 values."""
    x = [0.3]
    for i in range(400):
        x.append(a * x[-1] * (1 - x[-1]))
    return x[-100:]          # discard the transient, keep the attractor

# sweep the control parameter a and plot the attractor: the bifurcation diagram
for a in np.linspace(2.0, 4.0, 1000):
    x = logistic(a)
    plt.plot([a] * len(x), x, "c.", markersize=0.1)
plt.xlabel("r")
plt.ylabel("x_f")
plt.show()
Another example is the Lorenz system:
where
- Sensitivity to initial conditions
This is the property that the system is sensitive to initial conditions.
Mathematical definition:
Lyapunov exponent:
where
Example: logistic map with
If
- Topological mixing
This is the property that the system will eventually reach any state in the phase space.
Mathematical definition:
where
- Dense periodic orbits
This is the property that the system has infinite periodic orbits.
Mathematical definition:
where
- Sensitive to initial conditions
Two questions in chaos study, under the condition that the system dynamics is unknown(model-free):
- predict chaos evolution
- infer bifurcation diagram
A paper in 2001 used reservoir computing to predict chaos evolution, which is a model-free method. Here we reproduce it.
Chaos synchronization: given a coupled chaotic oscillator, we want to synchronize the two oscillators.
Oscillator 1:
Oscillator 2:
where
How is this related to diffusion?
The coupling term is proportional to the difference between the two oscillators, which is similar to the diffusion term in the diffusion equation:
$\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}$ .
- complete synchronization
which means that the two oscillators have the same state variables when the coupling strength is greater than the critical coupling strength.
- phase synchronization
where
How is the phase
$\theta$ defined?
$\theta$ is the angle of the oscillator in the phase space. For example, for the logistic map, $\theta$ is the angle of the point $(x_n, x_{n+1})$ in the phase space.
- Generalized synchronization
where
Science is good, but engineering will make it great.
Why is chaos synchronization important?
Real systems are coupled, and chaos synchronization is a way to model the coupling.
Can we predict the synchronization point
One data pair looks like this:
We can visualize the data by plotting the observations of the first system against the observations of the second system for each epsilon value.
This displays the dynamic coupling between the two systems. This can be seen as a visualization of the phase space of the coupled system.
Another visualization looks at the final state difference (
which indicates a larger bias when
Weird, right? We expect the system to synchronize!
To better understand the synchronization, we plot the correlation coefficient between
This shows that the synchronization is not perfect, but the correlation is approaching -1 when
Anyway, we can use the data to train a model to predict the synchronization point
A reservoir can be abstractly thought of as a dynamical system
The community created a package to create reservoirs and apply it to dynamical system prediction.
Here we review one of its notebooks to see how it works.
- Data format must be (time, features)
e.g. (1000, 3) for 3 features and 1000 time steps.
- If multiple sequences are provided, the data format must be (sequences, time, features)
e.g. (10, 1000, 3) for 10 sequences of 3 features and 1000 time steps.
- Reservoir: neurons randomly connected to their inputs and to themselves, not trainable, randomly initialized under some constraints.
  e.g. 100 neurons, 0.3 spectral radius, 0.5 connectivity, 0.1 input scaling, 0.1 leak rate.
- Readout: a decoder with a single layer of neurons, trainable with a linear regression.
  No backpropagation needed! See ridge regression.
- Feedback: the readout neurons can be connected back to the reservoir, to tame the reservoir dynamics. (Optional.)
tame: make the reservoir dynamics more predictable.
- State vector: the reservoir state at time $t$ is the concatenation of the reservoir neuron states at time $t$.
  For every time step, the reservoir state is a vector of length $N$, where $N$ is the number of neurons in the reservoir: $x_t = [x_{t,1}, x_{t,2}, ..., x_{t,N}]$, where $x_{t,i} = f(W_{in}u_t + Wx_{t-1})_i$.
  This $x_t$ will be stored for later use. It has the shape (time, neurons). For example, if the reservoir has 30 neurons and the data has 100 time steps, then the state matrix has the shape (100, 30).
ESN: Echo State Network, a type of reservoir computing.
Now we practice building a reservoir.
from reservoirpy.nodes import Reservoir, Ridge
reservoir = Reservoir(100, lr=0.5, sr=0.9)
ridge = Ridge(ridge=1e-7)
esn_model = reservoir >> ridge
This cute `>>` is a pipeline operator that chains the reservoir node into the ridge readout, producing a single ESN model; it is syntactic sugar for composing the two nodes (the readout is then fitted on the reservoir states).
Let's first train on a Lorenz system.
Visualize the predictions separately.
This is the prediction of
To see if we can predict long term evolution, we can predict
RC-ESN is reported to be state of the art in short-time weather prediction compared with ANN and RNN-LSTM; that's why it is actively researched.
Yet recent weather-forecasting methods are based on 3D-CNN, RNN-LSTM, and other models that are not based on RC-ESN.
Jie Ren, TJU, 2023-07-31 09:11:25
Why physics in AI?
Symmetry, geometry, and conservation laws are the key to understand the world.
Physics-inspired: being inspired by birds is fine, but understanding hydrodynamics is the key.
Neural networks are already very good, but we should go beyond merely imitating the human brain.
Hopfield is still alive and well.
AI implemented on physical systems is the future of computing.
AI algorithms may be inspired by physics: statistical machine learning inspired by wave and diffusion systems.
Protein folding is an NP-hard problem.
Yet nature is the most performant computer.
Well, that's why analog computing was proposed, and then delayed due to its inaccuracy.
Is today the best time to bring back analog computing?
Matrix Optics:
A wave can be transformed by an optical setup, described by the Fresnel-Kirchhoff diffraction integral.
where
Not only the amplitude but also the phase of the wave can be transformed by an optical setup.
Will neural networks be better when extended to the complex plane?
Optics is suitable for spiking neural networks using phase transitions and circular-harmonic oscillators.
Given a wave distribution
where
Do a simple discretization:
We can rewrite it in matrix form:
where
We can define the hidden state $h_t = \begin{bmatrix} u_t \\ u_{t-1}\end{bmatrix}$, the weight matrix $W = \begin{bmatrix} 2I - c^2 \Delta t^2 \nabla^2 & -I \\ I & 0\end{bmatrix}$, and the input $x_t = \begin{bmatrix} f_t \\ 0\end{bmatrix}$, then we have
where
How is the training done?
No backpropagation needed. Pseudo-inverse is used to train the network.
Pseudo-inverse is defined as:
where
It's called a pseudo-inverse because the ordinary inverse is not defined when the matrix is not square or not of full rank.
To train an optical neural network, we need to solve the following equation:
The solution using pseudo-inverse is:
This is only linear, due to the difficulty of replicating non-linear activations optically.
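A minimal sketch of this pseudo-inverse training for a linear readout, i.e. solving $W X \approx Y$ in the least-squares sense (shapes and data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 8, 3, 200

X = rng.normal(size=(n_in, n_samples))          # inputs, one sample per column
W_true = rng.normal(size=(n_out, n_in))
Y = W_true @ X + 0.01 * rng.normal(size=(n_out, n_samples))

W = Y @ np.linalg.pinv(X)                       # least-squares solution via pseudo-inverse
print("fit error:", np.linalg.norm(W @ X - Y) / np.linalg.norm(Y))
```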
Grover algorithm is a quantum algorithm that finds a specific element in an unsorted list with high probability.
To find the element
Steps of Grover algorithm:
- Initialize the uniform superposition $\ket{\psi_0} = \frac{1}{\sqrt{N}} \sum_{i=0}^{N-1} \ket{i}$, where $N$ is the number of elements in the list (prepared by applying Hadamard gates to $\ket{0\dots0}$).
- Apply the oracle operator $O$ to get $\ket{\psi_2} = \frac{1}{\sqrt{N}} \sum_{i=0}^{N-1} (-1)^{f(i)} \ket{i}$, where $f(i) = 1$ if $L[i] = x$ and $f(i) = 0$ if $L[i] \ne x$.
- Apply the diffusion operator $D$ (inversion about the mean); this amplifies the amplitude of the marked state.
- Repeat the oracle and diffusion steps about $k \approx \frac{\pi}{4}\sqrt{N}$ times.
- Measure the state to obtain the index $i$ with high probability.
Why does this work, intuitively?
The oracle operator $O$ flips the sign of the amplitude of $\ket{i}$ if $L[i] = x$, and does nothing otherwise. The diffusion operator $D$ reflects all amplitudes about their mean, so the negatively marked amplitude gets pushed above the others.
Together, one oracle-plus-diffusion step increases the amplitude of the marked state $\ket{i}$ with $L[i] = x$ and slightly decreases all the others. After about $\frac{\pi}{4}\sqrt{N}$ iterations the marked state carries most of the probability, so measuring the state returns the desired index $i$ with high probability.
Real intelligence is creation.
The intelligence needed to watch a video << the intelligence needed to create a video.
Implementation of Grover algorithm using optical neural network:
Using surface wave materials to implement the oracle operator
$O$ and the diffusion operator$D$ ?
Dimension reduction is the key to understanding the world; it is the essence of understanding.
Two ingredients of learning:
- Manifold learning
- Diffusion mapping
<!-- To be Continued -->
Given data
where
Gaussian kernel is used here, because it is a smooth function.
We define the diagonal matrix
where
We define the probability matrix
where
Then the Laplacian matrix
where
The equation of motion can be written as:
which is the diffusion equation. Where
What is the difference between row and column normalization?
They represent different physical systems. Row normalization gives diffusion, and column normalization gives heat conduction.
**What makes them different?**
Google PageRank is essentially this idea, but it developed a faster algorithm. (Well, is left and right different?)
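A compact sketch of the diffusion-map construction described above (Gaussian kernel, row normalization, eigendecomposition; the bandwidth and toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: a noisy circle embedded in 2-D
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.05 * rng.normal(size=(300, 2))

eps = 0.1
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-D2 / eps)                                  # Gaussian kernel matrix
P = K / K.sum(axis=1, keepdims=True)                   # row normalization: Markov matrix

vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
# the first nontrivial eigenvectors give the diffusion-map coordinates
psi1, psi2 = vecs[:, order[1]].real, vecs[:, order[2]].real
print("top eigenvalues:", vals.real[order[:4]])
```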
Let W be the set of all web pages, each of which can be reached from a root page through a finite number of hops.
Define G as the $n\times n$ connectivity matrix, with $g_{ij}=1$ if page i has a link pointing to page j, and 0 otherwise.
It describes the topology of the internet and is a huge but very sparse matrix.
In fact, the number of nonzero entries in the matrix equals the number of hyperlinks on the World Wide Web.
A user on a page may follow one of its links (with probability p), or jump at random to any page, which gives the transition probability matrix.
First define $r_i=\sum_j g_{ij}$, the number of pages j that page i links out to.
Then we obtain the transition probability matrix W.
Note that the matrix W is the connectivity matrix rescaled by its row sums; its entries give the probability of jumping from one page to another. In practice one usually takes $p=0.85$; considering $n=4\times 10^9$, every entry here is very small.
We thus obtain a Markov transition probability matrix whose entries lie strictly between 0 and 1 and whose rows each sum exactly to 1 (i.e. one necessarily leaves each page).
Ranking the pages then simply means computing the stationary distribution and scoring pages by their probabilities, i.e. solving the equation
whose solution $x$ satisfies $\sum x_i=1$.
In practice, one feasible scheme is to start from some state and repeatedly compute $x=Ax$ until the norm of the difference between two successive iterates falls below a chosen threshold.
Google's actual implementation does not involve matrix powers at all, because the effect of $A^k$ can be computed by updating the hyperlink weights.
Computing PageRank in Matlab exploits the specific structure of the Markov matrix; the following approach preserves the sparsity of the matrix, writing it as
where the matrix $D$ is defined by $d_{ii}=1/r_i$ if $r_i\ne0$, and $d_{ii}=0$ otherwise,
and $e$ is the n-dimensional vector with all components equal to 1.
The equation can then be written as
where $\gamma = z^T x$.
Although we do not know the value of $\gamma$, once we solve $(I-pGD)x=e$ for $x$, the constraint $\sum x_i=1$ directly determines the proper solution.
The problem therefore reduces to solving
Google implements
$\alpha\approx 0.85$ to stop the chain from getting trapped. But the problem we are dealing with here requires $n$ eigenvalues and eigenvectors, and works on a much smaller system.
Anyway, what is left is to compute the final distribution of the chain.
Find solution for
Ergodicity? It depends on the connectivity of the graph.
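A small power-iteration sketch for this stationary distribution, on a made-up 4-page link matrix with damping $p=0.85$:

```python
import numpy as np

# g[i, j] = 1 if page i links to page j (hypothetical tiny web graph)
g = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

p, n = 0.85, g.shape[0]
r = g.sum(axis=1, keepdims=True)              # out-degree of each page
W = p * g / r + (1 - p) / n                   # follow a link w.p. p, else jump anywhere

x = np.full(n, 1 / n)                         # start from the uniform distribution
for _ in range(100):
    x_new = x @ W                             # one step of the Markov chain
    if np.abs(x_new - x).sum() < 1e-12:
        break
    x = x_new

print("PageRank scores:", x / x.sum())
```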
This method claims to reach a new representation of the data, and can define a new distance
$\rho$ between the data points.
<!-- Under construction -->
Fokker-Planck equation:
where
Langevin equation:
where
What if the noise is not Gaussian? The equivalence between the Fokker-Planck equation and the Langevin equation:
Given
we have
Then
As
reducing to the Fokker-Planck equation.
Topological phonon Hall effect is the phenomenon that the phonon Hall conductance is quantized in a topological insulator, so the heat can only be conducted in one direction.
Theory:
where
What is the physical meaning of the stiffness matrix?
It is the matrix of the spring constants.
We have
$$ J = \frac{\hbar}{4V}\sum_{\sigma, \sigma', \bold{k}} \frac{\omega_{\sigma',\bold{k}}}{\omega_{\sigma,\bold{k}}}\, \bold{a}_{\sigma, \bold{k}} \left(\frac{\partial \Omega^2}{\partial \bold{k}} + [A,\Omega^2]\right)_{\sigma, \sigma'} a_{\sigma', \bold{k}} $$
where
The normal velocity
How theory classify different phases?
By the topological invariant, e.g. Chern number.
Problems arise when we are dealing with amorphous topological phonon structures, because the topological invariant is not well defined.
Why classification of different phase important?
Previous diffusion model, we define
but for heat conduction, we define
What is the difference between the two?
A recommendation system is a system that recommends items to users based on their preferences. If we denote the preference of user
where
Consider a book selling situation, here item is book, user is reader, and feature is genre.
The central challenge in recommendation system:
After some tedious derivation, our goal is still to find the eigenvalues and eigenvectors. It applies a finite number of iterations to do this (with truncation).
Yin Tang, BNU, 2023-07-31 13:04:59
Dynamical systems can be roughly divided into deterministic and stochastic.
Stochastic dynamics (analytical, on blackboard)
- Markovian, discrete/continuous state
Machine learning discrete-state stochastic dynamics (coding)
- Chemical master eq (Gillespie algorithm, VAN)
- Nonequilibrium statistical physics (Langevin equation, Fokker-Planck equation)
Machine learning continuous-state stochastic dynamics:
- Flow model to Langevin dynamics
State | Discrete | Continuous |
---|---|---|
Trajectory view | Recursion equation (e.g. random walk) | Stochastic differential equation (e.g. Ornstein-Uhlenbeck process) |
Distribution view | Master equation (e.g. random walk, birth-death, chemical master equation) | Fokker-Planck equation (e.g. diffusion equation) |
Brownian motion:
where
Its statistical properties are:
where
Now we have
Given a random walk on a line with step size 1 and equal probability of moving left or right,
The final distribution of the data is
where
Distribution view: random walk.
The change in probability over the time interval
Take the continuous space-time limit and introduce the diffusion constant:
where
Can we derive the distribution from this equation?
Solve the equation:
- Fourier transform: $P(x,t) = \int_{-\infty}^\infty \frac{dk}{2\pi} \tilde{P}(k,t) e^{ikx}$
- Then we have: $\frac{\partial \tilde{P}}{\partial t} = -Dk^2 \tilde{P}$
- The solution is: $\tilde{P}(k,t) = \tilde{P}(k,0) e^{-Dk^2 t}$
- Inverse Fourier transform: $P(x,t) = \int_{-\infty}^\infty \frac{dk}{2\pi} \tilde{P}(k,0) e^{-Dk^2 t} e^{ikx}$
- For a walker starting at the origin ($\tilde{P}(k,0)=1$), the solution is: $P(x,t) = \frac{1}{\sqrt{4\pi Dt}} e^{-\frac{x^2}{4Dt}}$
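A quick numerical check of this Gaussian solution against a direct random-walk simulation (the walker number and step count are illustrative; unit steps at unit rate give $D = 1/2$):

```python
import numpy as np
import matplotlib.pyplot as plt

n_walkers, n_steps = 10_000, 400
D = 0.5                                  # unit steps at unit rate => D = 1/2
rng = np.random.default_rng(0)

# Each walker takes n_steps steps of +1 or -1 with equal probability.
steps = rng.choice([-1, 1], size=(n_walkers, n_steps))
x_final = steps.sum(axis=1)

# Compare the histogram with P(x,t) = exp(-x^2/(4Dt)) / sqrt(4 pi D t) at t = n_steps.
t = n_steps
xs = np.linspace(-4 * np.sqrt(2 * D * t), 4 * np.sqrt(2 * D * t), 400)
plt.hist(x_final, bins=60, density=True, alpha=0.5, label="random walk")
plt.plot(xs, np.exp(-xs**2 / (4 * D * t)) / np.sqrt(4 * np.pi * D * t),
         label="diffusion solution")
plt.xlabel("x"); plt.legend(); plt.show()
```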
Here the exponent contains
$4Dt$, while what we commonly quote is $2Dt$; there is no contradiction, since the Gaussian above has variance $\langle x^2\rangle = 2Dt$. More generally, in $d$ dimensions
$$ \langle r^2 \rangle = 2 d D t, $$
so the mean-squared displacement grows as $2Dt$ for 1-D diffusion,
$4Dt$ for 2-D diffusion, and
$6Dt$ for 3-D diffusion.
If you got 7 solutions to a problem, you finally understand it. -- Feynman
A Random Walk Down Wall Street: The Time-Tested Strategy for Successful Investing. -- Burton G. Malkiel
An experiment gives
To walk for around
This is too slow.
If we put food to the side, it will have a drift velocity
where
Random walks are everywhere, e.g. complex networks, search problems, complex systems, and anomalous diffusion.
Every stochastic process can be mapped to a random walk.
Seriously?
The chemical master equation is a stochastic model for chemical reactions. It is stochastic, with fluctuations, and involves many chemical species rather than a single "walker".
One species, one reaction:
Continuous: Reaction Rate Equation
which is the rate of change of the number of molecules.
Discrete: Chemical Master Equation
where
The discrete description is more accurate, but it is hard to solve for large
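For the single reaction $A \xrightarrow{k} B$ (spelled out in detail later in this section), the two levels of description are

$$ \frac{dx}{dt} = -k x \quad\Rightarrow\quad x(t) = x(0)\, e^{-kt} \qquad \text{(reaction rate equation)}, $$
$$ \frac{dP_t(n)}{dt} = k\,(n+1)\,P_t(n+1) - k\,n\,P_t(n) \qquad \text{(chemical master equation)}. $$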
Define
We have the equation
where
How to solve it?
Steady state:
We can use a generating function to solve it:
- Define $G(z) = \sum_{n=0}^N z^n P_t(n)$
- Then we have $G'(z) = \sum_{n=0}^N n z^{n-1} P_t(n)$
- And $G''(z) = \sum_{n=0}^N n(n-1) z^{n-2} P_t(n)$
- So we have $G''(z) = B(z) G(z) + F(z) G(z)$
- That is, $G''(z) = (B(z) + F(z)) G(z)$
- Then $G''(z) = (k_1 z + k_2) G(z)$
- i.e. $G''(z) - k_1 z G(z) - k_2 G(z) = 0$
- whose solution is of the form $G(z) = A e^{r_1 z} + B e^{r_2 z}$
Solving the equation, we have
This is only for stationary state, how about the time evolution?
Omitted.
The Catalan number is the number of ways to put
We can solve its recurrence relation using a generating function:
- Define $C(z) = \sum_{n=0}^\infty C_n z^n$
- The recurrence $C_{n+1} = \sum_{i=0}^{n} C_i C_{n-i}$ (with $C_0 = 1$) gives $C(z) = 1 + z\, C(z)^2$
- That is, $z\, C(z)^2 - C(z) + 1 = 0$
- So $C(z) = \frac{1 \pm \sqrt{1 - 4z}}{2z}$
- Choosing the branch that stays finite at $z = 0$: $C(z) = \frac{1 - \sqrt{1 - 4z}}{2z}$
- Expanding $\sqrt{1-4z}$ with the binomial series $\sum_{n} \binom{1/2}{n} (-4z)^n$ and comparing coefficients gives
- $C_n = \frac{1}{n+1} \binom{2n}{n}$
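A quick numerical check of the closed form against the recurrence (purely illustrative):

```python
from math import comb

# Catalan numbers from the recurrence C_{n+1} = sum_i C_i * C_{n-i}.
C = [1]
for n in range(10):
    C.append(sum(C[i] * C[n - i] for i in range(n + 1)))

# Compare with the closed form C_n = binom(2n, n) / (n + 1).
closed = [comb(2 * n, n) // (n + 1) for n in range(11)]
print(C == closed)   # True
```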
The chemical master equation is a mathematical model describing how the probability distribution of molecule numbers in a stochastic chemical reaction system changes over time. It can be used to study reaction systems in which molecule numbers are small or strongly affected by random fluctuations, for example reactions inside biological cells.
In the continuous case, the reaction rate equation is an ordinary differential equation. It assumes that the molecule number is a continuous variable, expressible as a concentration or molar amount, and describes the rate of change of the molecule number with time, which depends on the reaction rate constant and on the molecule number itself.
The equation $\frac{dx}{dt} = -kx$ describes a first-order reaction $A \xrightarrow{k} B$, where $x$ is the number of $A$ molecules and $k$ is the reaction rate constant. It says that the consumption rate of $A$ molecules is proportional to the number of $A$ molecules, with proportionality constant $k$. Its solution is $x(t) = x(0)\, e^{-kt}$,
where $x(0)$ is the number of $A$ molecules at the initial time. The solution shows that the number of $A$ molecules decays exponentially with time.
In the discrete case, the chemical master equation assumes that the molecule number is a discrete variable taking only integer values. It is a set of coupled differential equations describing the rate of change of the probability distribution over these discrete states, governed by the transition probabilities: the probability that, within a very small time interval, the system jumps from one state to another.
The equation $\frac{dP_t(n)}{dt} = k(n+1)P_t(n+1) - knP_t(n)$ also describes the first-order reaction $A \xrightarrow{k} B$, where $n$ is the number of $A$ molecules and $P_t(n)$ is the probability that the system is in state $n$ at time $t$. It says that the rate of change of the probability of state $n$ equals the difference between the flux from state $n+1$ into state $n$ and the flux from state $n$ into state $n-1$. The rate from state $n+1$ to state $n$ is $k(n+1)P_t(n+1)$, since each $A$ molecule reacts with probability $k\Delta t$ within $\Delta t$ and there are $n+1$ such molecules; the rate from state $n$ to state $n-1$ is $knP_t(n)$ for the same reason. This equation can be solved with a recursion or with a generating function.
Putting it diplomatically: times have changed, and people's ability to do analytic calculations is not what it used to be.
Solving the equation:
We use the method of characteristics (characteristic line method), which is based on the idea that the solution of the PDE is constant along the characteristic lines, i.e. the integral curves of the vector field $(a(x,y),\, b(x,y))$.
An intuitive explanation: along a characteristic curve the PDE reduces to an ordinary differential equation for $u$, so the value of $u$ can be propagated along these curves from the boundary data.
# Define the PDE of the form a(x,y)u_x + b(x,y)u_y = c(x,y)u + d(x,y)
# Define the boundary conditions for u
# Define the domain and the grid size
# Initialize the grid and the solution matrix
# Loop over the grid points
# Check if the point is on the boundary
# If yes, use the boundary condition
# If no, use the characteristic line method
# Find the slope of the characteristic curve using dy/dx = b(x,y)/a(x,y)
# Find the ODE along the characteristic curve using du/dx = (d(x,y) - c(x,y)u)/a(x,y)
# Find the previous point on the characteristic curve using backward difference
# Interpolate the value of u at the previous point using linear interpolation
# Solve the ODE along the characteristic curve using a numerical method
# Store the value of u at the current point in the solution matrix
# End loop
# Plot or output the solution
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the functions a, b, c, d
def a(x,y):
return 1
def b(x,y):
return 1
def c(x,y):
return 0
def d(x,y):
return 0
# Define the boundary conditions for u
def u_left(y):
return np.sin(y)
def u_right(y):
return np.cos(y)
# Define the domain and the grid size
x_min = 0
x_max = 1
y_min = 0
y_max = np.pi
n_x = 11 # number of grid points in x direction
n_y = 11 # number of grid points in y direction
h_x = (x_max - x_min) / (n_x - 1) # grid spacing in x direction
h_y = (y_max - y_min) / (n_y - 1) # grid spacing in y direction
# Initialize the grid and the solution matrix
x = np.linspace(x_min, x_max, n_x)
y = np.linspace(y_min, y_max, n_y)
u = np.zeros((n_x, n_y))
# Define a function that computes the slope of the characteristic curve
def slope(x,y):
return b(x,y) / a(x,y)
# Define a function that computes the ODE along the characteristic curve
def ode(x,y,u):
return (d(x,y) - c(x,y) * u) / a(x,y)
# Define a numerical method to solve the ODE (Euler's method)
def euler(x,y,u,h):
return u + h * ode(x,y,u)
# Loop over the grid points
for i in range(n_x):
for j in range(n_y):
# Check if the point is on the left or right boundary
if i == 0:
# Use the left boundary condition
u[i,j] = u_left(y[j])
elif i == n_x - 1:
# Use the right boundary condition
u[i,j] = u_right(y[j])
else:
# Use the characteristic line method
# Find the previous point on the characteristic curve
x_prev = x[i-1]
y_prev = y[j] - h_x * slope(x_prev, y[j])
# Interpolate the value of u at the previous point
u_prev = np.interp(y_prev, y, u[i-1,:])
# Solve the ODE along the characteristic curve
u[i,j] = euler(x_prev, y_prev, u_prev, h_x)
# Plot or output the solution
plt.contourf(x, y, u.T)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Solution of PDE by characteristic line method')
plt.colorbar()
plt.show()
The resulting contour plot is the numerical solution of this PDE (here $u_x + u_y = 0$, given the choices $a=b=1$, $c=d=0$ above) with the boundary conditions $u(0,y)=\sin y$ and $u(1,y)=\cos y$.
Gillespie algorithm is a stochastic simulation algorithm for chemical master equation. It is a Monte Carlo method that simulates the time evolution of a chemical system.
A pseudo-code of the algorithm:
# Define the system with the initial number of molecules and the reaction rate constants
# Initialize the time to zero
# Loop until the end time or condition is reached
# Calculate the total propensity function by summing all the reaction propensities
# Generate two random numbers from a uniform distribution between 0 and 1
# Use one random number to determine the time interval until the next reaction event
# Use another random number to determine which reaction will occur next
# Update the system by increasing the time by the time interval and changing the number of molecules according to the chosen reaction
# End loop
# Plot or output the results
This algorithm is based on the following theory:
- The time interval until the next reaction event is an exponential random variable with rate parameter equal to the total propensity: $\tau \sim \text{Exp}(\alpha_0)$
- The probability that the next event is reaction $R_i$ is proportional to its propensity: $\text{Pr}(R_i) = \frac{\alpha_i}{\alpha_0}$
- Once a reaction is chosen, the molecule numbers are updated deterministically by that reaction's stoichiometry; Poisson-distributed reaction counts, $X \sim \text{Poisson}(\alpha_i \tau)$, appear only in the approximate tau-leaping variant of the algorithm.
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
# Define the system with two reactions: A -> B and B -> A
# The initial number of molecules for A and B are 10 and 0, respectively
# The reaction rate constants for A -> B and B -> A are 1.0 and 0.5, respectively
x_A = 10 # number of A molecules
x_B = 0 # number of B molecules
k_1 = 1.0 # rate constant for A -> B
k_2 = 0.5 # rate constant for B -> A
# Initialize the time to zero
t = 0
# Create empty lists to store the time and molecule values
t_list = []
x_A_list = []
x_B_list = []
# Loop until the end time of 10 is reached
while t < 10:
# Calculate the total propensity function
a_total = k_1 * x_A + k_2 * x_B
# Generate two random numbers from a uniform distribution between 0 and 1
r_1 = np.random.uniform(0,1)
r_2 = np.random.uniform(0,1)
# Use one random number to determine the time interval until the next reaction event
tau = (1 / a_total) * np.log(1 / r_1)
# Use another random number to determine which reaction will occur next
if r_2 < (k_1 * x_A) / a_total:
# A -> B occurs
x_A -= 1 # decrease A by 1
x_B += 1 # increase B by 1
else:
# B -> A occurs
x_A += 1 # increase A by 1
x_B -= 1 # decrease B by 1
# Update the time by adding the time interval
t += tau
# Append the time and molecule values to the lists
t_list.append(t)
x_A_list.append(x_A)
x_B_list.append(x_B)
# Plot or output the results
plt.plot(t_list, x_A_list, label="A")
plt.plot(t_list, x_B_list, label="B")
plt.xlabel("Time")
plt.ylabel("Number of molecules")
plt.title("Gillespie algorithm for a simple chemical system")
plt.legend()
plt.show()
Guarantee of this algorithm:
- Theorem: two random variables can simulate the random-walk (jump) process. (?)
Take-home message:
- The mean-squared displacement grows as
$2dDt$, where $d=1,2,3$ is the dimension and $D$ is the diffusion coefficient.
We have the equation
Then we can derive the final state of the system with the matrix
The chemical transformation matrix
A numeric example:
We have
The joint probability
We can propose a parameterization of the joint probability
where
The dimension of the matrix
The loss function is the Kullback-Leibler divergence between the joint probability at time
Why not implement the time
$t$ as an embedding? Tried; not promising, due to the negative-probability problem at long time steps.
Is this size big enough? Given
$N$ and $M$, the distribution has dimension $N^M$, where $\bold{n}$ indicates the number state of the system, e.g.
$\ket{1,0,0}$ means there is one molecule of species 1 and no molecules of species 2 and 3.
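A sketch of the autoregressive (VAN-style) parameterization being described, in my notation rather than the lecturer's:

$$ \hat{P}_\theta(\mathbf{n}) = \prod_{i=1}^{M} \hat{p}_\theta\!\left(n_i \mid n_1, \dots, n_{i-1}\right), $$

so the $N^M$-dimensional distribution is represented by $M$ conditional distributions over $N$ values each, and the network is trained by minimizing the KL divergence between the parameterized distribution at the next time step and the distribution obtained by applying the chemical transformation (transition) matrix to the current one.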
Ying Tang, International Academic Center of Complex Systems, Beijing Normal University
Non-equilibrium statistical mechanics is the study of the behavior of systems that are not in thermodynamic equilibrium. Most systems found in nature are not in thermodynamic equilibrium because they are not in stationary states, and are continuously and discontinuously subject to flux of matter and energy to and from other systems and to and from their environment.
where
where
Kinetically constrained models (KCMs) are lattice models of interacting particles that are subject to constraints on their dynamics. They are used to model the dynamics of glassy systems.
Why can KCMs be used to model the dynamics of glassy systems?
They provide a const
FA (Fredrickson-Andersen) model, South-East model
where
How to determine the way to flip the spin?
This can be done in the following way: flip the spin with probability $c$ if the site is occupied, and with probability $1-c$ if the site is unoccupied.
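A minimal Monte Carlo sketch of the one-dimensional FA model with this flip rule; the kinetic constraint (a spin may only flip if at least one nearest neighbour is excited) is the standard FA definition and is assumed here rather than taken from the notes, and the parameter values are illustrative:

```python
import numpy as np

L, c, sweeps = 100, 0.3, 1000
rng = np.random.default_rng(1)
s = (rng.random(L) < c).astype(int)     # initial configuration, density ~ c

for _ in range(sweeps * L):
    i = rng.integers(L)
    # Kinetic constraint: the spin may flip only if a nearest neighbour is excited.
    if s[(i - 1) % L] == 1 or s[(i + 1) % L] == 1:
        # Flip rule exactly as stated above: prob. c if occupied, 1 - c if unoccupied.
        p_flip = c if s[i] == 1 else 1 - c
        if rng.random() < p_flip:
            s[i] = 1 - s[i]

print("final density of occupied sites:", s.mean())
```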
Difficulty in applying this method to 3D systems:
- The number of states is too large.
ML can be used to estimate the dynamic partition function:
where
How to estimate the dynamic partition function?
We may use an autoregressive model to estimate the dynamical partition function:
where
Consider the trajectory ensemble
The dynamical partition function is the moment generating function with the counting field
Here the counting field
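The formula itself is missing from these notes; the standard form used in the KCM / large-deviation literature, which I assume is what was presented, is

$$ Z(s, t) = \sum_{\omega} P[\omega]\, e^{-s\, K[\omega]}, $$

where the sum runs over trajectories $\omega$ of length $t$, $K[\omega]$ is the dynamical activity (number of configuration changes), and $s$ is the counting field conjugate to it.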
This is the only work (done by the lecturer) that uses NNs to observe something others have not.
Stochastic differential equation:
Fokker-Planck equation:
where
How to solve it?
path integral approach
This is the Langevin equation of the Ornstein-Uhlenbeck process, and it can be solved analytically:
where
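The solution is not reproduced in the notes; for reference, the standard Ornstein-Uhlenbeck equation, its stationary distribution, and its stationary correlation function (my notation) are

$$ \dot{x} = -\theta x + \sqrt{2D}\,\xi(t), \qquad P_{\mathrm{st}}(x) = \sqrt{\frac{\theta}{2\pi D}}\, e^{-\theta x^2/(2D)}, \qquad \langle x(t)\,x(0)\rangle_{\mathrm{st}} = \frac{D}{\theta}\, e^{-\theta|t|}. $$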
How is this connected to stochastic gradient descent?
SGD works as follows:
$$x_{n+1} = x_n - \eta \nabla f(x_n) + \sqrt{2\eta D}\, \xi_n$$
In the continuum limit, we have
$$ \dot{x} = -\nabla f(x) + \sqrt{2D}\, \xi(t)$$
Use flow model to solve Fokker-Planck equation: not introduced in detail. (work in progress)
Xiaosong Chen, YNU, 2023-08-01 13:08:31
Apply it to some toy models, to see if this black-box model is explainable.
This model exhibits a phase transition from a non-percolating phase to a percolating phase.
Universality class: near the critical point, the physical properties of the system are universal, i.e. physical quantities follow power laws.
Definition of a universality class: all models within one universality class share the same critical exponents.
Critical phenomena of percolation:
- Order parameter $P(p)$: the probability that a site belongs to the infinite cluster
- Correlation length $\xi(p)$: the typical distance between two sites in the same cluster; the closely related quantity $\chi(p)$ is the mean cluster size
Scale invariance: near the critical point, the physical properties of the system are scale invariant, i.e. physical quantities follow power laws.
Supervised ML of the percolation phase transition: the input is a 2-D lattice configuration, and the output is the probability that it percolates.
Different percolation models:
- Bond percolation: each bond is occupied with probability $p$.
- Site percolation: each site is occupied with probability $p$.
Difference: bond percolation is more difficult to simulate.
Findings:
- The same network can learn different percolation models within the same universality class.
- Different universality classes require different networks.
- ML of critical behavior should respect universality classes.
What the hell is this all about, though, since it provides no theoretical insight or rigorous proof of these claims?
A completely data-driven approach.
Open the blackbox, what can we do?
- visualize the weight matrix
- statistical analysis of weights and output
- connect the weights to the patterns/rules/laws in the problem.
The key is to represent, not solve.
Play around with the dataset.
Weather differs from region to region; have you considered the differences?
This is global data, not limited to one region.
Have you considered long-term prediction?
The data is the challenge: weather records only cover the past few decades.
Well, weather prediction is now being done with CNNs.
Derivation of ELBO using KL divergence:
where
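The derivation itself did not survive in these notes; the standard identity (which is presumably what was shown) is

$$ \log p(x) = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right]}_{\text{ELBO}} + D_{\mathrm{KL}}\!\big(q(z)\,\|\,p(z \mid x)\big) \;\geq\; \mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right], $$

since the KL divergence is non-negative; maximizing the ELBO therefore both tightens the bound on $\log p(x)$ and drives $q(z)$ toward the true posterior.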
Fluctuation theorem is a theorem in statistical mechanics that describes the probability distribution of the time-averaged irreversible entropy production of a system that is arbitrarily far from equilibrium.
or
where
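The formula is missing here; one standard statement of the detailed fluctuation theorem, assumed rather than transcribed, is

$$ \frac{P(\Delta S_{\mathrm{tot}} = +A)}{P(\Delta S_{\mathrm{tot}} = -A)} = e^{A/k_B}, $$

i.e. trajectories that produce entropy are exponentially more likely than those that consume the same amount.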
How can we interpolate between two images using a generative model?
An intuitive way: make them noisy again and interpolate in the noise?
Guided diffusion: ?
How does a GAN interpolate?
Three-color ink in a non-Newtonian liquid.
Periodic table of elements + space group = crystal structure
- Optimization
Find material structures with high free energy and high thermoelectric performance.
Boids algorithm: Birds as agents with the following rule:
- Separation: avoid crowding neighbors (short range repulsion)
- Alignment: steer towards average heading of neighbors
- Cohesion: steer towards average position of neighbors (long range attraction)
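A minimal numpy sketch of these three rules; the weights, radii, and time step are illustrative choices, not values from the talk:

```python
import numpy as np

# Minimal Boids update: N agents with positions and velocities in 2-D.
N, dt = 50, 0.1
rng = np.random.default_rng(0)
pos = rng.uniform(0, 10, size=(N, 2))
vel = rng.normal(0, 1, size=(N, 2))

def boids_step(pos, vel, r_sep=1.0, r_nbr=3.0, w_sep=0.05, w_ali=0.05, w_coh=0.01):
    new_vel = vel.copy()
    for i in range(N):
        d = np.linalg.norm(pos - pos[i], axis=1)
        close = (d < r_sep) & (d > 0)          # neighbors that are too close
        nbrs = (d < r_nbr) & (d > 0)           # neighbors within interaction range
        if close.any():
            # Separation: move away from crowding neighbors (short-range repulsion).
            new_vel[i] += w_sep * (pos[i] - pos[close]).sum(axis=0)
        if nbrs.any():
            # Alignment: steer toward the average heading of neighbors.
            new_vel[i] += w_ali * (vel[nbrs].mean(axis=0) - vel[i])
            # Cohesion: steer toward the average position of neighbors (long-range attraction).
            new_vel[i] += w_coh * (pos[nbrs].mean(axis=0) - pos[i])
    return pos + dt * new_vel, new_vel

for _ in range(100):
    pos, vel = boids_step(pos, vel)
print("mean speed after 100 steps:", np.linalg.norm(vel, axis=1).mean())
```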
The problem with new material design lies in the ability to actually grow that material.
Synthesis is another process that can be reformulated as an AI problem.
Automation is the key.
A new material that does not absorb solar radiation and only exchanges radiation with the cold universe background, which lowers its temperature.
Crowdsourcing platforms: similar to the ones used for protein design and protein folding.
"Go for the mess".
Find a promising direction.
Iteration: iterating is the key to building something great. Instead of starting everything from scratch, we had better build on top of existing attempts.
Some quotes from Sam Altman (co-founder and CEO of OpenAI):
- Working hard pays off most when you are young: effort compounds like interest, and the benefits keep growing over time. You also have fewer obligations when young, so take the time to find the way of working hard that suits you best, ideally doing interesting and valuable things with people you enjoy being around.
- It is fine to pick low-hanging fruit, but also try something hard: once you make even a little progress on an important problem, people who want to help you will keep showing up.
- Bravely pursue what you love. You will probably fail and taste the pain of rejection, but once you succeed, you will find things go more smoothly than you imagined. Keep reminding yourself that optimism maximizes your luck.