Neural Quantum States: representing the wave function with a neural network
In this article we will look at an unusual application of neural networks in general, and restricted Boltzmann machines in particular, to two hard problems of quantum mechanics: finding the ground-state energy and approximating the wave function of a many-body system.
One could say that this is a free and simplified retelling of the article [2], published in Science in 2017, and of some follow-up works. I could not find a popular-science account of this work in Russian (and found only one in English), although it seemed very interesting to me.
Minimal necessary concepts from quantum mechanics and deep learning
I want to note right away that these definitions are extremely simplified. I give them for those to whom the subject described here is a dark forest. :)
A state is simply a set of physical quantities that describe a system. For example, for an electron flying in space, these will be its coordinates and momentum, and for a crystal lattice it is a set of spins of atoms located at its sites.
The wave function of the system is a complex-valued function of the system's state: a black box that accepts, for example, a set of spins and returns a complex number. The main property of the wave function that matters for us is that its squared modulus equals the probability of the state:

$$|\Psi(s)|^2 = P(s)$$
It is natural, then, that the squared modulus of the wave function must be normalized to one (and this, too, is one of the significant problems).
The Hilbert space (for our purposes this definition is enough) is the space of all possible states of the system. For example, for a system of 40 spins, each of which can take the values +1 or -1, the Hilbert space consists of all 2^40 (roughly 10^12) possible spin configurations. For coordinates that can take any real value, the dimension of the Hilbert space is infinite. It is this enormous size of the Hilbert space for any real system that is the main obstacle to solving the equations analytically: along the way, integrals / sums over the entire Hilbert space appear, and they cannot be computed head-on. A curious fact: over the entire lifetime of the Universe a system can visit only a tiny fraction of all the states that make up its Hilbert space. This is well illustrated by a picture from the article about tensor networks [1], which schematically depicts the entire Hilbert space and the states reachable in a time polynomial in the characteristic complexity of the system (the number of bodies, particles, spins, etc.)
A restricted Boltzmann machine (RBM), to put it in fancy terms, is an undirected probabilistic graphical model whose restriction is that nodes of the same layer are not connected to each other, so they are conditionally independent given the other layer. In simple terms, it is a neural network with an input layer and one hidden layer. The output values of the neurons in the hidden layer can be 0 or 1. The difference from an ordinary neural network is that the outputs of the hidden-layer neurons are random variables, sampled with a probability equal to the value of the activation function:

$$p(h_i = 1 \mid v) = \sigma\Big(b_i + \sum_j W_{ij} v_j\Big),$$

where $\sigma$ is the sigmoid activation function, $b_i$ is the bias of the i-th hidden neuron, $W$ are the weights of the network, and $v$ is the visible layer. Restricted Boltzmann machines belong to the so-called "energy-based models", since the probability of a particular state of the machine can be expressed through the energy of that state:

$$E(v, h) = -\sum_j a_j v_j - \sum_i b_i h_i - \sum_{i,j} h_i W_{ij} v_j,$$
where v and h are the visible and hidden layers, a and b are the biases of the visible and hidden layers, and W are the weights. The probability of a state can then be written as:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z},$$

where Z is the normalization term, also called the partition function (it is needed so that the total probability equals one):

$$Z = \sum_{v, h} e^{-E(v, h)}.$$
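To make the notation concrete, here is a tiny numpy sketch of these formulas; the sizes and values are made up for illustration, and this is not code from the article or from NetKet:

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
a = rng.normal(scale=0.1, size=n_visible)              # biases of the visible layer
b = rng.normal(scale=0.1, size=n_hidden)               # biases of the hidden layer
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))  # weights

v = rng.integers(0, 2, size=n_visible)                 # a visible configuration of 0/1
h = rng.integers(0, 2, size=n_hidden)                  # a hidden configuration of 0/1

def energy(v, h):
    # E(v, h) = -a.v - b.h - h.W.v
    return -a @ v - b @ h - h @ (W @ v)

def p_hidden_given_visible(v):
    # p(h_i = 1 | v) = sigmoid(b_i + sum_j W_ij v_j)
    return 1.0 / (1.0 + np.exp(-(b + W @ v)))

print(energy(v, h), p_hidden_given_visible(v))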
Introduction
Today there is an opinion among deep learning specialists that restricted Boltzmann machines (hereinafter RBMs) are an outdated concept that is practically inapplicable to real-world tasks. However, in 2017 an article [2] was published in Science that showed a very effective use of RBMs for problems of quantum mechanics.
The authors noticed two important facts that may seem obvious, but no one had thought of putting them together before:
An RBM is a neural network which, according to Cybenko's universal approximation theorem, can in theory approximate any function with arbitrarily high accuracy (there are quite a few caveats, but we can skip them here).
An RBM is a system in which the probability of each state is a function of the input (the visible layer) and of the weights and biases of the network.
Well, then the authors said: let our system be fully described by a wave function that is the square root of the probability given by the RBM, with the RBM inputs being the characteristics of our system's state (coordinates, spins, etc.):

$$\Psi(s) = \sqrt{P(s)} = \sqrt{\frac{1}{Z}\sum_{h} e^{-E(s, h)}},$$

where s are the characteristics of the state (for example, the spins), h are the outputs of the hidden layer of the RBM, E is the energy of the RBM, and Z is the normalization constant (the partition function).
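A useful detail not spelled out above: since the hidden units enter the energy independently of each other, the sum over h can be taken analytically, so the (unnormalized) amplitude is cheap to evaluate. A minimal sketch in the notation above, where a, b, W are the RBM biases and weights (this helper is my illustration, not code from the article):

import numpy as np

def psi_unnormalized(s, a, b, W):
    # sum_h exp(-E(s, h)) = exp(a.s) * prod_i (1 + exp(b_i + (W s)_i)),
    # and the wave function is the square root of the RBM probability,
    # so up to the 1/sqrt(Z) normalization:
    theta = b + W @ s
    return np.sqrt(np.exp(a @ s) * np.prod(1.0 + np.exp(theta)))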
And that's it, the Science article is ready; only a few small details remain. For example, one has to deal with the partition function, which cannot be computed because of the gigantic size of the Hilbert space. Cybenko's theorem also tells us that a neural network can approximate any function, but it says nothing about how to find a suitable set of weights and biases. And, as usual, this is where the fun begins.
Model training
Now there are quite a few modifications of the original approach, but I will consider only the approach from the original article [2].
Task
In our case, the learning task is as follows: find an approximation of the wave function that makes the state with the minimum energy the most probable one. Intuitively this makes sense: the wave function gives us the probability of a state, and the eigenvalue of the Hamiltonian (the energy operator, or even simpler, the energy; this understanding is enough for this article) on the wave function is that energy. Simple.
In reality we will be optimizing another quantity, the so-called local energy, whose expectation value is always greater than or equal to the ground-state energy:

$$E_{loc}(s) = \sum_{s'} H_{s s'} \frac{\Psi(s')}{\Psi(s)},$$

where s is our state, s' runs over all possible states of the Hilbert space (in reality we will only consider a small subset of them), and $H_{s s'}$ is the matrix element of the Hamiltonian. It depends strongly on the specific Hamiltonian; for example, for the Ising model it is simply $-\sum_i s_i s_{i+1}$ when $s' = s$, and zero in all other cases. It is not worth dwelling on this now; what matters is that these matrix elements are known for the various popular Hamiltonians.
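To make the definition concrete, here is a small sketch of computing the local energy. It assumes a function connected_elements(s) that yields the states s' with nonzero matrix elements together with those elements, and a wave-function callable psi; both are hypothetical placeholders, not part of the article or of NetKet:

def local_energy(s, psi, connected_elements):
    # E_loc(s) = sum_{s'} H_{s s'} * psi(s') / psi(s);
    # only the states connected to s by the Hamiltonian contribute,
    # which is what makes the sum tractable for local Hamiltonians.
    return sum(h_elem * psi(s_prime) / psi(s)
               for s_prime, h_elem in connected_elements(s))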
Optimization process
Sampling
An important part of the approach from the original article is the sampling procedure. A modified variant of the Metropolis-Hastings algorithm is used. The essence is as follows:
We start from a random state.
We flip the sign of a randomly chosen spin (for coordinates the moves are different, but they exist too).
With probability $\min\left(1,\; \left|\frac{\Psi(s_{new})}{\Psi(s_{old})}\right|^2\right)$ we move to the new state.
Repeat N times.
As a result, we obtain a set of random states sampled according to the distribution given by our wave function. From them we can calculate the local energy in each state and the expected value of the energy $\langle E \rangle \approx \frac{1}{N}\sum_{i=1}^{N} E_{loc}(s_i)$.
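For illustration, here is a minimal sketch of such a spin-flip Metropolis sampler; psi is an arbitrary wave-function callable and the parameters are made up, neither comes from the article:

import numpy as np

def metropolis_sample(psi, n_spins, n_steps, rng=None):
    # Samples spin configurations with probability proportional to |psi(s)|^2.
    if rng is None:
        rng = np.random.default_rng()
    s = rng.choice([-1, 1], size=n_spins)      # random initial state
    samples = []
    for _ in range(n_steps):
        s_new = s.copy()
        s_new[rng.integers(n_spins)] *= -1     # flip one randomly chosen spin
        # accept with probability min(1, |psi(s_new) / psi(s)|^2)
        if rng.random() < abs(psi(s_new) / psi(s)) ** 2:
            s = s_new
        samples.append(s.copy())
    return samples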
It can be shown that the estimate of the gradient of the energy (more precisely, of the expectation value of the Hamiltonian) has a simple closed form, derived below.
Derivation
The derivation comes from the lectures G. Carleo gave in 2017 at the Advanced School on Quantum Science and Quantum Technology; the recordings are available on YouTube.
Denote:

$$O_k(s) = \frac{\partial \ln \Psi(s)}{\partial W_k},$$

where $W_k$ is the k-th parameter of the network (a weight or a bias). Then:

$$\nabla_k \langle H \rangle = \langle E_{loc}\, O_k^{*} \rangle - \langle E_{loc} \rangle \langle O_k^{*} \rangle,$$

where the averages are taken over the states sampled from $|\Psi(s)|^2$.
Then we simply solve the optimization problem iteratively (a minimal sketch in code follows right after this list):
We sample states from our RBM.
We calculate the local energy in each state.
We estimate the gradient.
We update the RBM weights.
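Below is a minimal sketch of one iteration of this loop in the spirit of the formulas above. Here metropolis_sample is the sampler sketched earlier, local_energy is assumed to be already bound to the wave function and the Hamiltonian, log_psi_grad(s) returns the vector of derivatives O_k(s), and the learning rate is made up; this is a plain gradient step, an illustration of the idea rather than the exact scheme of the original work:

import numpy as np

def training_step(params, psi, log_psi_grad, metropolis_sample, local_energy,
                  n_spins=40, n_samples=1000, learning_rate=0.01):
    # One iteration: sample, estimate <E_loc> and the gradient, make a gradient step.
    samples = metropolis_sample(psi, n_spins, n_samples)
    e_loc = np.array([local_energy(s) for s in samples])   # E_loc(s) for every sample
    o = np.array([log_psi_grad(s) for s in samples])       # O_k(s) = d ln psi(s) / d W_k
    # gradient estimate: <E_loc * conj(O_k)> - <E_loc> * <conj(O_k)>
    grad = ((e_loc[:, None] * np.conj(o)).mean(axis=0)
            - e_loc.mean() * np.conj(o).mean(axis=0))
    new_params = params - learning_rate * np.real(grad)    # real-valued parameters assumed
    return new_params, np.real(e_loc.mean())               # updated weights and energy estimate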
As a result, the energy gradient tends to zero, the energy value decreases, and so does the number of unique new states in the Metropolis-Hastings process, because when sampling from the true ground-state wave function we will almost always get the ground state. Intuitively this seems logical.
In the original work, for small systems, ground-state energies were obtained that are very close to the exact values obtained analytically. A comparison was made with known approaches to finding the ground-state energy, and NQS won, especially given its relatively low computational complexity compared with those methods.
NetKet: a library from the authors of the approach
One of the authors of the original article [2], together with his team, has developed the excellent NetKet library [3], which contains a very well optimized (in my opinion) C++ core and a Python API that works with high-level abstractions.
The library can be installed via pip. Windows 10 users will have to use the Windows Subsystem for Linux.
Let us consider working with the library on the example of a chain of 40 spins taking the values ±1/2. We will use the Heisenberg model, which takes into account the interactions between neighboring spins.
NetKet has excellent documentation that lets you quickly figure out what to do and how. There are many built-in models (spins, bosons, the Ising and Heisenberg models, etc.), as well as the ability to describe a model entirely yourself.
Graph description
All models are represented as graphs. For our chain, the built-in model Hypercube with one dimension and periodic boundary conditions will suit:
import netket as nk

graph = nk.graph.Hypercube(length=40, n_dim=1, pbc=True)
Description of Hilbert space
Our Hilbert space is very simple - all spins can take values either +1/2 or -1/2. For this case, the built-in model for spins is suitable:
hilbert = nk.hilbert.Spin(graph=graph, s=0.5)
Hamiltonian Description
As I already wrote, in our case the Hamiltonian is the Heisenberg Hamiltonian, for which there is a built-in operator:
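In the NetKet API used throughout this article (version 2.x) this looks roughly like the line below; the call is a sketch and may differ between NetKet versions:

hamiltonian = nk.operator.Heisenberg(hilbert=hilbert)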
Machine description
The machine itself is a restricted Boltzmann machine for spins; a sketch of its construction is given just below. Here alpha is the density of neurons in the hidden layer: for 40 visible neurons and alpha = 4 there will be 160 hidden ones. There is another way of specifying this, directly by their number. The second command initializes the weights randomly from a normal distribution with standard deviation sigma; in our case sigma equals 0.01.
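A sketch of these two commands, again assuming the NetKet 2.x API (the seed value is arbitrary):

machine = nk.machine.RbmSpin(hilbert=hilbert, alpha=4)
machine.init_random_parameters(seed=1234, sigma=0.01)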
Sampler
A sampler is an object that returns samples from the distribution that the wave function defines on the Hilbert space. We will use the Metropolis-Hastings algorithm described above, modified for our task:
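A sketch of the sampler construction under the same NetKet 2.x assumption; MetropolisExchange swaps the values of two neighboring spins instead of flipping a single one, which preserves the total magnetization (the exact sampler and arguments in the original script may have been different):

sampler = nk.sampler.MetropolisExchange(machine=machine)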
To be perfectly precise, the sampler is a trickier algorithm than the one I described above. Here we simultaneously check as many as 12 options in parallel to select the next point. But the principle is generally the same.
Optimizer
Here we describe the optimizer that will be used to update the model weights. From personal experience with neural networks in more "familiar" areas, the best and most reliable option is good old stochastic gradient descent with momentum (well described here):
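A sketch under the same NetKet 2.x assumption; the learning rate and momentum values are illustrative, and if the Momentum optimizer or its argument names differ in your NetKet version, plain nk.optimizer.Sgd(learning_rate=0.01) can be used in the same place:

optimizer = nk.optimizer.Momentum(learning_rate=0.01, beta=0.9)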
NetKet supports both unsupervised training (our case) and supervised training (for example, the so-called "quantum tomography", but that is a topic for a separate article). All that remains is to describe the training driver, and that's it:
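A sketch of the variational Monte Carlo driver, again assuming the NetKet 2.x API; n_samples and the output prefix are illustrative, and the number of iterations matches the 1000 training epochs discussed below:

vmc = nk.variational.Vmc(hamiltonian=hamiltonian, sampler=sampler,
                         optimizer=optimizer, n_samples=1000)
vmc.run(output_prefix="heisenberg_40", n_iter=1000)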
The variational Monte Carlo driver specifies how we estimate the gradient of the function we are optimizing. n_samples is the size of the sample from our distribution that the sampler returns.
The library is built using OpenMPI, and the script will need to be run like this: mpirun -n 12 python Main.py (12 is the number of cores).
The results I received are as follows:
On the left is the plot of the energy versus training epoch, on the right the variance of the energy versus training epoch. It can be seen that 1000 epochs are clearly excessive; 300 would have been enough. Overall it works very nicely and converges quickly.
Literature
1. Orús R. A practical introduction to tensor networks: Matrix product states and projected entangled pair states // Annals of Physics. - 2014. - Vol. 349. - P. 117-158.
2. Carleo G., Troyer M. Solving the quantum many-body problem with artificial neural networks // Science. - 2017. - Vol. 355. - No. 6325. - P. 602-606.
3. NetKet: https://www.netket.org/