With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
- John von Neumann
The idea of "differentiable programming" is very popular in the machine learning world. For many it is not clear whether the term represents a real shift in how researchers understand machine learning, or is just (yet) another rebranding of "deep learning". This post explains what differentiable programming (or ∂P) brings to the machine learning table.
Most importantly, differentiable programming is a shift in the opposite direction from deep learning: away from increasingly heavily parameterized models and toward simpler ones that make more use of the structure of the problem.
Skipping past the wall of introductory text, we will find out what automatic differentiation is, and we will even fire a trebuchet!
Differentiability is the basic idea that makes deep learning so successful. Where a brute-force search over even a few hundred model parameters would be far too expensive, gradients let us take a pseudo-random walk around the interesting parts of parameter space and find a good set. Such a seemingly naive algorithm generalizes surprisingly well, and while it is far from obvious that everything we care about should be differentiable (say, working with sequences in language translation), it usually turns out to be, given a little ingenuity.
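To make "follow the gradient" concrete, here is a minimal sketch in Julia using Zygote.jl's gradient function (which also appears below); the toy objective, step size, and step count are all invented for illustration:

using Zygote

# A toy objective with its maximum at x = 3.
f(x) = -(x - 3)^2

# Repeatedly step uphill along the gradient.
function ascend(x; steps = 100, η = 0.1)
    for _ in 1:steps
        dx, = gradient(f, x)   # gradient returns a tuple, one entry per argument
        x += η * dx
    end
    x
end

ascend(0.0)  # ≈ 3.0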
What about biological neurons and y = σ(W·x + b)? There is nothing special about that formula; it is just a simple and flexible example of a highly parameterized nonlinear function. In fact, it is probably the worst such function in most cases. A single layer of a neural network can, in principle, classify images of cats, but only via the relatively uninteresting trick of acting as a lookup table. It works, sure! But the fine print warns that you may need more parameters than there are atoms in the universe. To actually make this thing work, you need to encode the structure of the problem into the model; this is where it starts to look much more like traditional programming.
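For the record, that formula is only a couple of lines of Julia; a sketch, where the names σ and layer are purely illustrative:

# A single "neural" layer is just a small parameterized function: y = σ.(W*x .+ b)
σ(z) = 1 / (1 + exp(-z))
layer(W, b, x) = σ.(W * x .+ b)

W, b = randn(5, 10), randn(5)
layer(W, b, randn(10))   # maps a 10-vector to a 5-vector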
For example, ConvNets have a huge advantage over the perceptron because they work with image kernels, which are known to exploit translational invariance. A face is a face whether it appears in the top-left corner of the image or in the center; where a perceptron would have to learn that fact separately for each case, a kernel can respond to any part of the image at once. It is difficult to analyze convolutional networks in statistical terms, but it is much easier to treat them as an automatically tuned version of what image-processing experts used to write by hand. The image kernel is the first and simplest differentiable program.
ML toolboxes increasingly support algorithmic differentiation (AD), which lets us differentiate models containing loops, branching, and recursion, or indeed any program built from a set of differentiable mathematical primitives. This has led to more complex architectures: NLP models increasingly resemble classical grammar parsers with stack-augmented models, and you can even differentiate an analogue of a Turing machine or a programming-language interpreter.
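As a quick sketch of what AD over real programs looks like, Zygote.jl in Julia will differentiate straight through a loop; pow here is a hypothetical toy function, not anything from a library:

using Zygote

# A program, not a formula: x^n computed with a plain loop.
function pow(x, n)
    r = one(x)
    for _ in 1:n
        r *= x
    end
    r
end

gradient(x -> pow(x, 3), 2.0)  # (12.0,), matching d/dx x^3 = 3x^2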
The final step taken by differentiable programming is to stop treating matrix multiplication, convolutions, and RNNs as the fundamental building blocks of deep learning, and to see them as mere special cases. We can apply deep learning methods to any parameterized, differentiable function. Functions as complex as physics simulators or ray tracers can be differentiated and optimized too. Even quantum computing can fit into this framework.
Scientists have long used mechanistic models that sit between explicit programming and machine learning. Differential equations with free parameters, as used in physics, epidemiology, or pharmacodynamics, are equivalent, terminology aside, to neural networks. They are simply more constrained in which functions they can represent, which makes it easier to get a good result.
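As a sketch of that equivalence, here is a one-parameter logistic-growth model written as a plain differentiable Julia program; the Euler integrator is hand-rolled to keep the example self-contained, and the "observed" value 0.8 is made up:

using Zygote

# du/dt = r·u·(1 - u): a mechanistic model whose free parameter r acts like a network weight.
function simulate(r; u = 0.1, dt = 0.1, steps = 50)
    for _ in 1:steps
        u += dt * r * u * (1 - u)   # one Euler step
    end
    u
end

# Fit r so the simulated endpoint matches an observed value.
loss(r) = (simulate(r) - 0.8)^2
gradient(loss, 0.5)   # gradient with respect to the physical parameter r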
Indeed, the real breakthrough is this: pervasive differentiability means all these methods snap together like Lego bricks. Instead of always writing new ML programs from scratch, we can incorporate existing ones, for example by embedding physics engines inside deep-learning-based robotics models. Where modern reinforcement-learning algorithms have to build a detailed model of the external world from nothing but a reward signal (which sounds like brute force), we can instead simply drop in detailed, precise knowledge of physical systems before learning even begins.
Even the most mature areas of deep learning are not left out; after the convolution kernel, the next step for image models is a differentiable ray tracer. 3D rendering contains a great deal of structural knowledge about how scenes map to pixels, and this too can go into our melting pot. Say a model makes decisions in a simulated environment that is rendered as pixels, which the model uses as input. In principle, we can now make the whole loop differentiable, letting us directly see the influence of the environment on the model's decisions and vice versa. This could greatly increase the power of realistic simulated environments for training models such as self-driving cars.
As in science, hybrid models can be more efficient and offer a middle ground in the trade-offs between deep learning and explicit programming. For example, a UAV flight-path planner might have a neural network component that can only make small corrective changes to a reliable explicit program, keeping its overall behavior analyzable while still adapting to empirical data. This is also good for interpretability: the parameters of mechanistic models and simulations usually have clear physical interpretations, so if a model estimates parameters internally, it makes a clear statement about what it thinks is happening externally.
If this is all so wonderful, why hasn't everyone already dropped what they were doing and rushed off to differentiate things? Unfortunately, the limitations of existing frameworks make it difficult to build models of such complexity, and impossible to reuse the wealth of knowledge embedded in existing scientific code. The need to re-implement physics engines from scratch in a very restricted modeling language turns a ten-line script into a multi-year research project. But advances in language and compiler technology, especially automatic differentiation, are bringing us closer to the holy grail: "just differentiate my game engine, please."
Differentiable programming applies deep learning methods to complex existing programs, taking advantage of the vast amount of knowledge embedded in them: deep learning, statistics, programming, and science, everything that models the world around us (and whatever can be shoved into a particle accelerator). It will improve current models and let ML be applied in areas where its current constraints, whether interpretability or computational and data requirements, make it impractical on its own.
Next, we show how ∂P handles some simple but classic control problems for which we would usually use black-box reinforcement learning (RL). ∂P models not only learn far more effective control strategies, they also learn them orders of magnitude faster. The code is available to explore; in most cases it trains in a few seconds on any laptop.
Differentiation underlies almost every step of deep learning: for a function y = f(x) we use the gradient to figure out how a change in x will affect y. Despite its mathematical pedigree, the gradient is actually a very general and intuitive concept. Forget the formulas you had to stare at in school; let's do something more fun, like hurling stuff.
When we fire things with a trebuchet, our x (the input) is a setting (say, the counterweight mass or the release angle), and y is the distance the projectile travels before landing. If you are trying to aim, the gradient tells you something very useful: whether to increase or decrease a given parameter. To maximize distance, just follow the gradient.
Good, but how do we get the gradient? With a clever trick called algorithmic differentiation, which lets you differentiate not only the simple formulas you learned at school, but programs of any complexity, such as our Trebuchet simulator. As a result, we can take a simple simulator written in Julia with the DiffEq (DifferentialEquations.jl) package, with no deep learning in mind, and get gradients for it in a single function call.
# What you did in school:
gradient(x -> 3x^2 + 2x + 1, 5)  # (32,)

# Something a little more advanced:
gradient((wind, angle, weight) -> Trebuchet.shoot(wind, angle, weight), -2, 45, 200)
# (4.02, -0.99, 0.051)
We need to aim it at a target, using gradients to fine-tune the release angle; this sort of thing is common under the name parameter estimation, and we have covered similar examples before. We can make the task more interesting by going meta: instead of aiming the trebuchet at a single target, we optimize a neural network that can aim it at any target. Here is how it works: the neural network takes two inputs, the target distance in meters and the current wind speed. The network outputs the required settings (the counterweight mass and the release angle), which are fed into the simulator, which computes the distance achieved. We then compare with our target and backpropagate along the whole chain to adjust the weights of the network. Our "dataset" is a randomly sampled set of targets and wind speeds.
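Here is a minimal sketch of that training loop in Julia with Flux.jl; the network size, the ranges for targets and winds, and the loss are illustrative guesses, and only Trebuchet.shoot(wind, angle, weight) is taken from the snippet above:

using Flux, Zygote

# Two inputs (target distance, wind speed) in; two settings (angle, counterweight mass) out.
model = Chain(Dense(2, 16, tanh), Dense(16, 2))
ps = Flux.params(model)

function hit_distance(target, wind)
    angle, weight = model([target, wind])
    Trebuchet.shoot(wind, angle, weight)   # the differentiable simulator is part of the loss
end

loss(target, wind) = (hit_distance(target, wind) - target)^2

opt = ADAM()
for _ in 1:10_000
    target, wind = 60 + 40*rand(), 5*randn()   # our "dataset": random targets and winds
    gs = gradient(() -> loss(target, wind), ps)
    Flux.Optimise.update!(opt, ps, gs)
end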
A nice feature of this simple model is that training is fast, because we have expressed exactly what we want from the model in a fully differentiable way. Initially it looks like this:
After about five minutes of training (on a single CPU core of my laptop), it looks like this:
Let's try to throw off its aim by increasing the wind speed:
It deviated by 16 cm, or about 0.3%. And what about aiming directly? That is easy to do with gradient descent, given that we have gradients. However, it is a slow iterative process, taking about 100 ms each time. By contrast, a single run of the neural network takes 5 µs (twenty thousand times faster) at a small cost in accuracy. This "approximate function inversion via gradients" trick is very general and works not only for dynamical systems but also, for example, for fast style transfer.
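Here is what aiming directly might look like; a sketch, where the fixed counterweight mass, step size, and iteration count are all invented for illustration:

using Zygote

# Aim by gradient descent on the release angle alone, holding the counterweight at 200.
function aim(target, wind; angle = 45.0, η = 0.1, steps = 100)
    for _ in 1:steps
        dangle, = gradient(a -> (Trebuchet.shoot(wind, a, 200.0) - target)^2, angle)
        angle -= η * dangle
    end
    angle   # roughly 100 ms of iterative work, versus one 5 µs network call
end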
This is the simplest possible control problem, which we use mainly for illustration. But we can apply the same methods, in more advanced ways, to classical RL problems.
A more recognizable control problem is CartPole, the "hello world" of reinforcement learning. The task is to learn to balance an upright pole by nudging its base left or right. Our setup is broadly similar to the trebuchet case: the Julia implementation lets us treat the reward produced by the environment directly as a loss. ∂P lets us switch seamlessly from a simple model to an RL-style one.
An insightful reader may notice a snag. The action space for CartPole, a push to the left or to the right, is discrete and therefore not differentiable. We work around this by introducing a differentiable discretization, defined so that the forward pass maps x to sign(x) while the backward pass treats the function as the identity.
In other words, we force the gradient to behave as if f were the identity function. Given how much the mathematical idea of differentiability is already stretched in ML, it is perhaps not surprising that we can simply cheat here; for training, all we need is a signal to steer our pseudo-random walk through parameter space, and the rest is detail. The results speak for themselves: where RL methods need to train for hundreds of episodes before solving the problem, ∂P models need only about 5 episodes to win conclusively.
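In Zygote this cheat is a two-liner; a sketch, where the name discretize is ours:

using Zygote: @adjoint

# Forward pass: collapse to ±1 (push left or push right).
discretize(x) = x > 0 ? one(x) : -one(x)

# Backward pass: pretend discretize was the identity, so gradient signal still flows.
@adjoint discretize(x) = discretize(x), ȳ -> (ȳ,)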
An important challenge for RL is handling delayed reward, when an action only pays off several steps down the line. When the environment is differentiable, ∂P lets us train the agent by backpropagation through time (BPTT), just as in a recurrent network! In this case the state of the environment becomes the "hidden state" that changes between time steps.
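Schematically, the training loop looks just like a recurrent one; in this sketch, env_step (one differentiable physics step returning the new state and a reward) and policy are assumed, hypothetical functions:

# Unroll the environment for T steps and treat accumulated negative reward as the loss;
# backpropagating through this loop is exactly backpropagation through time.
function episode_loss(policy, state; T = 100)
    total = 0.0
    for _ in 1:T
        action = policy(state)                   # the agent acts on the current state
        state, reward = env_step(state, action)  # the environment is the "hidden state"
        total -= reward
    end
    total
end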
To demonstrate this technique, consider the pendulum model, where the task is to swing a pendulum up until it stands vertically and then hold it in that unstable equilibrium. This is hard for RL models; after about 20 episodes of training the problem is solved, but the path to the solution is often visibly suboptimal. In contrast, BPTT can beat the RL leaderboard in a single training episode. It is instructive to watch that episode unfold; at the beginning of the recording the strategy is random, and the model improves over time. The pace of learning is almost alarming.
The model copes well with any starting angle and has something close to the optimal strategy. On a restart, the model looks like this:
This is just the beginning; the real wins will come from applying ∂P to environments that are too hard for RL to work with at all, where rich simulations and models already exist (as in much of engineering and the natural sciences), and where interpretability matters (as in medicine).
A limitation of these toy models is that they equate the simulated training environment with the test environment; the real world, of course, is not differentiable. In a more realistic setup, the simulation gives us a rough template of behavior that is then refined with data. That data informs, say, the simulated effect of wind, which in turn improves the quality of the gradients the simulator passes to the controller. Models can even become part of the controller's forward pass, letting it refine its predictions without having to learn system dynamics from scratch. Exploring these new architectures will make for exciting future work.
The basic idea is that differentiable programming, where we simply write an arbitrary numerical program and optimize it with gradients, is a powerful way to build models and architectures beyond deep learning, especially when we have a large library of differentiable programs at hand. The models described here are just previews, but we hope they give a sense of how these ideas can be applied in more realistic settings.
Just as functional programming means reasoning about and expressing algorithms with functional patterns, differentiable programming means expressing algorithms with differentiable patterns. The deep learning community has already developed many such design patterns, for example for handling control problems or sequential and tree-structured data. As the field matures, many more will be invented, and next to the resulting programs even the most advanced of today's deep learning architectures will probably look crude and backward.
Source: https://habr.com/ru/post/459562/