The enormous capabilities of neural networks are sometimes matched by their unpredictability. Now mathematicians are beginning to understand how the form of a neural network shapes what it can do.

When we design a skyscraper, we expect it to meet its specifications in the end: that the tower will support a given weight and withstand an earthquake of a given strength.
Yet one of the most important technologies of the modern world we, in effect, design blindly. We play with different schemes and different settings, but until we run the system for the first time, we have no real idea what it can do or where it will fail.
That technology is the neural network, which underlies today’s most advanced artificial intelligence systems. Neural networks are steadily moving into the most basic areas of society: they determine what we learn about the world from the news feeds of social networks, they help doctors make diagnoses, and they even influence whether an offender is sent to prison.
At the same time, “the best description of what we know is to say that we know practically nothing about how neural networks actually work, or what a theory describing them should look like,” said Boris Hanin, a mathematician at Texas A&M University and a visiting scientist at Facebook AI Research who studies neural networks.
He compares the situation to the development of another revolutionary technology: the steam engine. At first, steam engines were good only for pumping water. Then they powered locomotives, which is roughly the level neural networks have reached today. Later, scientists and mathematicians developed the theory of thermodynamics, which explained exactly what was going on inside engines of any kind. In the end, that knowledge took us to space.
“At first there were excellent engineering achievements, then excellent trains, and then it took theoretical understanding to get from there to rockets,” said Hanin.
Within the growing community of neural network developers, a small group of mathematically minded researchers is trying to build a theory of neural networks, one that would explain how they work and guarantee that a network of a given configuration will be able to carry out given tasks.
The work is still at a very early stage, but over the past year researchers have published several papers detailing the relationship between the form and the functioning of neural networks. The work takes networks all the way down to their foundations. It shows that long before you can certify that neural networks can drive cars, you need to prove that they can multiply numbers.
The best brain recipe
Neural networks aim to imitate the human brain, and one way to describe the brain’s work is to say that it merges small abstractions into larger ones. From this point of view, the complexity of a thought is measured by the number of small abstractions underlying it, and by the number of times low-level abstractions are combined into higher-level ones, as in tasks such as learning to tell dogs from birds.
“If a person is learning to recognize a dog, they learn to recognize something shaggy on four legs,” said Maithra Raghu, a computer science graduate student at Cornell University and a member of the Google Brain team. “Ideally, we would like our neural networks to do something similar.”
Abstraction comes naturally to the human brain; neural networks have to work for it. Like the brain, neural networks are built from building blocks called “neurons,” connected to one another in various ways. But the neurons of a neural network, though modeled on the brain’s, do not try to imitate them completely. Each neuron can represent an attribute, or a combination of attributes, that the network considers at a given level of abstraction.
Engineers can choose among many ways of combining these neurons. They need to decide how many layers of neurons the network should have (that is, its “depth”). Consider, for example, a neural network that recognizes images. The image enters the first layer of the system. In the next layer, the network might have neurons that simply detect edges in the image. The layer after that combines lines into curves. The next one combines curves into shapes and textures, and the last one processes the shapes and textures to decide what it is looking at: a woolly mammoth!
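To make this layer-by-layer picture concrete, here is a minimal sketch in PyTorch (a framework chosen only for illustration; the article does not name one). The layer sizes and the ten output classes are invented for the example, and the comments map each layer onto the edges, curves and shapes progression described above.

```python
# Illustrative sketch: each layer builds on the output of the previous one.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # layer 1: simple edge-like filters
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # layer 2: edges combined into curves
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # layer 3: curves combined into shapes and textures
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # summarize the whole image
    nn.Flatten(),
    nn.Linear(64, 10),                            # final layer: pick one of 10 hypothetical labels
)

fake_image = torch.randn(1, 3, 64, 64)            # one random 64x64 RGB "image"
print(classifier(fake_image).shape)               # torch.Size([1, 10])
```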
“The idea is that each layer combines several aspects of the previous one. A circle is a curve in many places, and a curve is a line in many places,” says David Rolnick, a mathematician at the University of Pennsylvania.
Engineers also have to choose the “width” of each layer, which corresponds to the number of different features the network considers at each level of abstraction. In image recognition, the width of a layer would correspond to the number of types of lines, curves or shapes the network considers at that level.
Beyond the network’s depth and width, there are also choices about how to connect neurons within and between layers, and how much weight to give each connection.
Given a specific task to perform, how do you know which neural network architecture will perform it best? There are some fairly general rules of thumb. For image-recognition problems, programmers typically use “convolutional” neural networks, in which the same pattern of connections between layers repeats from layer to layer. For natural language processing, such as speech recognition or language generation, programmers have found that “recurrent” neural networks work best; in these, neurons can connect to neurons beyond the adjacent layers.
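As a rough illustration of those two rules of thumb, the sketch below (again PyTorch, with made-up tensor sizes) builds one convolutional layer, which reuses the same small filter across an entire image, and one recurrent layer, which carries information along the steps of a sequence.

```python
import torch
import torch.nn as nn

conv_layer = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)  # same filter slid across the image
rnn_layer = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)  # state carried across time steps

images = torch.randn(4, 3, 28, 28)       # a batch of 4 small RGB images
tokens = torch.randn(4, 20, 32)          # a batch of 4 sequences, 20 steps, 32 features each

print(conv_layer(images).shape)          # torch.Size([4, 8, 26, 26])
output, (h, c) = rnn_layer(tokens)
print(output.shape)                      # torch.Size([4, 20, 64])
```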
Beyond these general principles, however, programmers mostly have to rely on experimental evidence: they run 1,000 different neural networks and simply see which one handles the task best.
“In practice, these choices are often made through trial and error,” said Hanin. “It’s a rather difficult approach, because there are infinitely many choices, and no one knows which one will be the best.”
A better option would be to rely less on trial and error and more on an advance understanding of what a given neural network architecture can deliver. Several recently published papers have pushed the field in that direction.
“This work is aimed at creating something like a recipe book for designing the right neural network. If you know what you want to achieve with it, you can find the right recipe,” Rolnick said.
Lassoing a red sheep
One of the earliest theoretical guarantees about neural network architecture appeared three decades ago. In 1989, a computer scientist proved that if a neural network has only a single computational layer, but that layer is allowed an unlimited number of neurons with unlimited connections between them, the network can perform any task.
It was a sweeping, general statement that turned out to be fairly intuitive and not especially useful. It is like saying that if you can identify an unlimited number of lines in an image, then you can distinguish all objects using only one layer. That may be true in principle, but just try putting it into practice.
Today, researchers call such wide, flat networks “expressive,” because in theory they can capture a richer set of connections between possible inputs (such as an image) and outputs (such as a description of that image). At the same time, these networks are extremely difficult to train, meaning it is almost impossible to get them to actually produce those outputs, and they demand more computing power than any computer has.
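Here is a minimal sketch of what that 1989-style guarantee looks like in the simplest possible case: a single, very wide hidden layer trained to approximate a one-dimensional function, sin(x). The layer width, activation and training settings are illustrative assumptions, not taken from the original proof.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 512).unsqueeze(1)   # inputs
y = torch.sin(x)                              # the function to approximate

# One computational layer, made very wide ("expressive" but hard to scale).
wide_net = nn.Sequential(nn.Linear(1, 1024), nn.Tanh(), nn.Linear(1024, 1))
optimizer = torch.optim.Adam(wide_net.parameters(), lr=1e-3)

for step in range(2000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(wide_net(x), y)
    loss.backward()
    optimizer.step()

print(f"final fit error: {loss.item():.5f}")  # small error: one wide layer suffices here
```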
Recently, researchers have been trying to understand how far neural networks can be pushed in the opposite direction, by making them narrower (fewer neurons per layer) and deeper (more layers). Maybe you only need to recognize 100 different lines, but with the connections needed to turn those 100 lines into 50 curves, which can then be combined into 10 different shapes, you get all the building blocks required to recognize most objects.
In work they completed last year, Rolnick and Max Tegmark of MIT proved that by increasing depth and decreasing width, you can perform the same functions with exponentially fewer neurons. They showed that if the situation you are modeling has 100 input variables, you can get the same reliability using either 2^100 neurons in one layer or just 2^10 neurons spread over two layers. They found that there is power in taking small pieces and combining them at higher levels of abstraction, rather than trying to capture all levels of abstraction at once.
“The notion of depth in a neural network is connected with the idea that you can express something complicated by doing many simple things in sequence,” said Rolnick. “It’s like an assembly line.”
Rolnick and Tegmark demonstrated the usefulness of depth by asking neural networks to perform a simple task: multiplying polynomial functions. (These are equations with variables raised to natural-number powers, for example y = x^3 + 1.) They trained the networks by showing them examples of polynomials and their products, then asked them to compute the products of polynomials they had not seen before. The deeper neural networks learned the task with far fewer neurons than the shallower ones.
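The experiment below captures the flavor of that result without reproducing the Rolnick-Tegmark setup: instead of multiplying polynomials, it trains a shallow network (one wide hidden layer) and a deep network (three narrow hidden layers) to multiply two numbers. All sizes and training settings are invented for the illustration, and exact outcomes will vary from run to run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
inputs = torch.rand(4096, 2) * 2 - 1                    # pairs (a, b) drawn from [-1, 1]
targets = (inputs[:, 0] * inputs[:, 1]).unsqueeze(1)    # the product a * b

shallow = nn.Sequential(nn.Linear(2, 128), nn.Tanh(), nn.Linear(128, 1))   # 128 neurons, one layer
deep = nn.Sequential(                                                      # 48 neurons over three layers
    nn.Linear(2, 16), nn.Tanh(),
    nn.Linear(16, 16), nn.Tanh(),
    nn.Linear(16, 16), nn.Tanh(),
    nn.Linear(16, 1),
)

def train(net, steps=3000):
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(net(inputs), targets)
        loss.backward()
        optimizer.step()
    return loss.item()

print("shallow error:", train(shallow))
print("deep error:   ", train(deep))   # often comparable or better, despite far fewer neurons
```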
And although multiplication is not about to turn the world upside down, Rolnick says the paper made an important point: “If a shallow neural network cannot even multiply, you shouldn’t trust it with anything else.”
Other researchers are probing the question of minimum sufficient width. At the end of September, Jesse Johnson, formerly a mathematician at Oklahoma State University and now a researcher at the pharmaceutical company Sanofi, proved that at a certain point no amount of depth can compensate for a lack of width.
To see why, imagine sheep in a field, but make them punk-rock sheep: each one’s wool has been dyed one of several colors. The neural network’s task is to draw a border around all the sheep of the same color. In essence, this task is similar to image classification: the network has a set of images (which it represents as points in a higher-dimensional space) and needs to group the similar ones together.
Johnson proved that a neural network will fail at this task if the width of its layers is less than or equal to the number of inputs. Each of our sheep can be described by two inputs: the x and y coordinates of its position in the field. The network then labels each sheep with a color and draws a border around sheep of the same color. In this case, solving the problem requires at least three neurons per layer.
More specifically, Johnson showed that if the width is too small relative to the number of input variables, the network will be unable to draw closed loops, and a closed loop is exactly what it would need to draw if, for example, all the red sheep were gathered in the middle of the pasture. “If none of the layers is wider than the number of input dimensions, there are certain shapes the function will never be able to create, no matter how many layers you add,” Johnson said.
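The sketch below is an experiment in the spirit of Johnson’s result, not a restatement of his proof: the “red sheep” are points inside a circle on a two-dimensional field, and a network whose hidden layers are only two neurons wide (no wider than the input) is compared with one whose layers are three neurons wide. The architecture, the training settings and the remarks about typical outcomes in the comments are assumptions made for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
points = torch.rand(2000, 2) * 4 - 2                              # sheep positions (x, y) on the field
labels = ((points ** 2).sum(dim=1) < 1.0).float().unsqueeze(1)    # "red" if inside the unit circle

def make_net(width, depth=4):
    layers, size = [], 2
    for _ in range(depth):
        layers += [nn.Linear(size, width), nn.ReLU()]
        size = width
    layers.append(nn.Linear(size, 1))
    return nn.Sequential(*layers)

def accuracy_after_training(net, steps=3000):
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(net(points), labels)
        loss.backward()
        optimizer.step()
    predictions = (net(points) > 0).float()
    return (predictions == labels).float().mean().item()

print("width 2:", accuracy_after_training(make_net(width=2)))  # typically stuck near the majority-class rate
print("width 3:", accuracy_after_training(make_net(width=3)))  # wide enough to carve out the closed region
```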
Studies like these are beginning to build the rudiments of a theory of neural networks. For now, researchers can make only the simplest claims about the relationship between architecture and functionality, and those claims are few compared with the number of tasks neural networks are being asked to solve.
So while the theory of neural networks will not change how they are designed any time soon, the blueprints for a new theory of how computers learn are being drawn up, and its consequences may prove even more momentous than humanity’s first steps into space.