
How Neural Networks Helped Graphics

In 1943, American neurophysiologists McCulloch and Pitts developed a computational model of a neural network, and in 1958 the first working single-layer network recognized some letters. Today, neural networks are used for almost anything: forecasting exchange rates, diagnosing diseases, autopilots, and rendering graphics in computer games. This article is about the last of these.

Evgeny Tumanov is a Deep Learning engineer at NVIDIA. Based on his talk at the HighLoad++ conference, we prepared this story about using Machine Learning and Deep Learning in graphics. Machine learning is not limited to NLP, computer vision, recommendation systems, and search tasks. Even if you are not very familiar with this area, you may be able to apply the lessons from this article in your own field or industry.


The story consists of three parts. We will review the tasks in graphics that are solved with machine learning, derive the main idea, and describe a case of applying that idea to a specific task: the rendering of clouds.

Supervised DL/ML in graphics


Let us examine two groups of tasks. First, let's briefly outline them.

The first group we will conditionally call "Real world or render engine": the ground truth comes either from recordings of the real world or from a render engine. It includes creating believable animations, post-processing rendered images, slow motion via frame interpolation, and generation of materials.


The second group of tasks we will conditionally call the "Heavy algorithm". It includes tasks such as rendering complex objects, for example clouds, and physical simulations: water, smoke.

Our goal is to understand the fundamental difference between the two groups. Let's consider the tasks in more detail.

Creating believable animations: locomotion, facial animation


In the past few years, many papers have appeared in which researchers propose new ways to generate beautiful animation. Paying artists for this work is expensive, and replacing them with an algorithm would benefit everyone. Some time ago at NVIDIA, we worked on a project on facial animation of game characters: synchronizing the hero's face with the audio track of speech. We tried to "bring the face to life" so that every point on it moved, above all the lips, because they are the most difficult part of the animation. Having an artist do this manually is expensive and slow. What are the options for solving this problem and building a dataset for it?

The first option is to detect vowel sounds: the mouth opens on vowels and closes on consonants. This is a simple algorithm, but too simple; in games we want higher quality. The second option is to have people read various texts and record their faces, then match the sounds they pronounce with their facial expressions. This is a good idea, and we did exactly that in a joint project with Remedy Entertainment. The only difference is that in a game we show not a video but a 3D model built from points. To collect a dataset, you need to understand how specific points on the face move. We took actors, asked them to read texts with different intonations, filmed them with very good cameras from several angles, then reconstructed a 3D model of the face in each frame and trained to predict the positions of the facial points from the sound.

Post-processing rendered images: supersampling, anti-aliasing


Consider a case from a particular game: we have an engine that generates images at different resolutions. We want to render the image at 1000×500 pixels but show the player 2000×1000, which looks nicer. How do we build a dataset for this task?

First, render images in high resolution, then downscale them, and then train the system to translate the image from low resolution back to high.
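As a hedged illustration (the function name and the factor-of-two downscale are mine, matching the resolutions above, not code from the project), building such a training pair might look like:

```python
import tensorflow as tf

def make_training_pair(hr_image):
    """Turn one high-resolution render into an (input, target) pair:
    the bicubically downscaled image is the input, the original render
    is the target the network must reconstruct."""
    hr = tf.convert_to_tensor(hr_image, dtype=tf.float32)  # (H, W, C)
    h, w = hr.shape[0], hr.shape[1]
    lr = tf.image.resize(hr, size=(h // 2, w // 2), method="bicubic")
    return lr, hr
```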

Slow motion: frame interpolation


We have a video, and we want to insert frames in the middle, that is, to interpolate frames. The idea is obvious: shoot real video with a high frame rate, remove the intermediate frames, and train the network to predict what was removed.
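A minimal sketch of that dataset construction (pure Python; the helper name is illustrative):

```python
def make_interpolation_triplets(frames):
    """From a high-frame-rate clip, build ((prev, next) -> middle) examples:
    the network sees frames i and i+2 and must predict the removed frame i+1."""
    triplets = []
    for i in range(len(frames) - 2):
        triplets.append(((frames[i], frames[i + 2]), frames[i + 1]))
    return triplets
```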

Generation of materials


We will not dwell on the generation of materials. Its essence is that we photograph, for example, a piece of wood under several lighting angles, and then interpolate its appearance from other angles.

We have reviewed the first group of tasks. The second is fundamentally different. We will talk about rendering complex objects, such as clouds, later; for now, let's deal with physical simulations.

Physical simulations of water and smoke


Imagine a pool containing moving solid objects. We want to predict the motion of the fluid particles. The particles are at certain positions at time t, and at time t + Δt we want to obtain their new positions. For each particle, we call a neural network and get an answer: where will it be in the next frame?
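As a sketch of that loop (the feature set and the model are placeholders; a real simulator would also feed neighborhood information into the network):

```python
import numpy as np

def step_particles(positions, velocities, model, dt):
    """Advance all particles one frame by querying a trained model.
    `model` maps per-particle features to a predicted displacement."""
    features = np.concatenate([positions, velocities], axis=1)  # (N, 6)
    displacement = model.predict(features)                      # (N, 3)
    return positions + displacement * dt
```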

To solve the problem, we use the Navier-Stokes equations, which describe the motion of a fluid. For a plausible, physically correct simulation of water, we have to solve the equations, or an approximation to them. This can be done with computational methods, of which many have been invented over the past 50 years: SPH, FLIP, or Position Based Fluids.
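For reference, the incompressible Navier-Stokes equations that these solvers approximate are:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} + \mathbf{g}, \qquad \nabla \cdot \mathbf{u} = 0,$$

where $\mathbf{u}$ is the fluid velocity field, $p$ the pressure, $\rho$ the density, $\nu$ the kinematic viscosity, and $\mathbf{g}$ external forces such as gravity.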

How the first group of tasks differs from the second


In the first group, the ground truth for the algorithm comes from outside: a recording from real life, as in the case of faces, or output from the engine, as with rendered images. In the second group of tasks, we obtain the ground truth with methods of computational mathematics. From this division an idea grows.

The main idea


We have a computationally complex task that is slow and hard to solve with classical computational methods. To solve it faster, perhaps even losing a little in quality, we need to:

  • find the time-consuming spot in the task, the place where the code runs longest;
  • see what that piece of code produces;
  • try to predict the result of that code with a neural network or any other machine learning algorithm (see the sketch after this list).
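As a minimal, purely illustrative sketch of this pattern (the solver and dataset here are stand-ins, not the cloud renderer):

```python
import time

def expensive_solver(x):
    """Stand-in for the profiled hot spot that takes hours in real life."""
    time.sleep(0.01)
    return 2.0 * x

# Log the hot spot's inputs and outputs to build a training dataset ...
dataset = [(0.1 * i, expensive_solver(0.1 * i)) for i in range(100)]

# ... train any regressor on `dataset`, then swap the call:
# result = surrogate_model(x)   # instead of result = expensive_solver(x)
```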

This general methodology, the main idea, is a recipe for finding an application for machine learning. What should you do to make this idea useful? There is no exact answer: use creativity, look at your own work, and search. I work in graphics and am not so familiar with other areas, but I can imagine that in academia (physics, chemistry, robotics) you can definitely find a use. If you solve a complex physical equation in your production, you may also find an application for this idea. To make the idea concrete, let's consider a specific case.

Cloud Rendering Task


We worked on this project at NVIDIA about six months ago: the task was to draw a physically correct cloud, represented as a density of liquid droplets in space.

A cloud is a physically complex object: a suspension of liquid droplets that cannot be modeled as a solid object.

You cannot simply apply a texture to a cloud and render it, because the water droplets are arranged in a geometrically complex way in 3D space and are complex in themselves: they hardly absorb light, but scatter it, and anisotropically, differently in different directions.

If you look at a water droplet lit by the sun, and the vector from your eye to the droplet and the vector from the droplet to the sun are nearly parallel, you will see a large peak of light intensity. This explains a physical phenomenon everyone has seen: in sunny weather, one of the cloud's edges is very bright, almost white. We are looking at an edge of the cloud where the view vector and the vector from that edge to the sun are almost parallel.


A cloud is a physically complex object, and rendering it with the classical algorithm is very time-consuming; we will discuss the classical algorithm a bit later. Depending on the parameters, the process can take hours or even days. Imagine that you are an artist working on a film with special effects. You have a difficult scene with lighting you want to play with. You draw one cloud topology, you don't like it, and you want to redraw it and immediately see the result. It is important to get a response to a parameter change as quickly as possible. That is the problem, so let's try to speed the process up.

Classic solution


To render the cloud, we need to solve a complex equation.
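The equation itself appeared on a slide that is not reproduced here; a standard form of the radiative transfer equation it refers to, consistent with the description below, is:

$$L(\mathbf{x}, \omega) = T(\mathbf{x}, \mathbf{x}_b)\, L_b(\mathbf{x}_b, \omega) \;+\; \int_0^{b} T(\mathbf{x}, \mathbf{x}_t)\, \sigma_s(\mathbf{x}_t) \underbrace{\int_{S^2} p(\omega, \omega')\, L(\mathbf{x}_t, \omega')\, d\omega'}_{I_1}\, dt,$$

where $T(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\int \sigma_t\, ds\right)$ is the transmittance between two points, $\sigma_s$ and $\sigma_t$ are the scattering and extinction coefficients, $p$ is the phase function, and $\mathbf{x}_b$ is the point where the ray exits the cloud.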


The equation looks harsh, but let's understand its physical meaning. Consider a ray shot from the camera that pierces through the cloud. How does light arrive at the camera along this direction? First, light can reach the point where the ray exits the cloud, and then propagate along this ray inside the cloud toward the camera.

The integral term of the equation corresponds to the second mechanism, "propagation of light along the direction". Its physical meaning is as follows.

Consider the segment of the ray inside the cloud, from the entry point to the exit point. The integration is carried out over this segment, and at each point on it we consider the so-called indirect light energy L(x, ω); the meaning of the inner integral I₁ is the indirect illumination at the point. It arises because the droplets re-scatter sunlight in different ways, so a huge number of indirect rays from the surrounding droplets arrive at the point. I₁ is an integral over the sphere of directions around a point on the ray. In the classical algorithm, it is computed with the Monte Carlo method.
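A hedged sketch of such a Monte Carlo estimator over the sphere of directions (the radiance and phase-function callbacks are placeholders; in the real renderer the radiance itself is estimated recursively):

```python
import numpy as np

def mc_indirect_light(x, omega, radiance, phase, n_samples=64):
    """Estimate I1 = integral over the sphere of p(omega, w') * L(x, w') dw'
    by averaging uniformly sampled directions, each weighted by 1/pdf."""
    estimate = 0.0
    for _ in range(n_samples):
        w = np.random.normal(size=3)
        w /= np.linalg.norm(w)        # uniform direction on the unit sphere
        pdf = 1.0 / (4.0 * np.pi)     # uniform pdf over the sphere
        estimate += phase(omega, w) * radiance(x, w) / pdf
    return estimate / n_samples
```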

Classic algorithm


We will not analyze how the Monte Carlo estimate of I₁ is computed, because it is complicated and not that important here. Suffice it to say that it is the longest and hardest part of the whole algorithm.

We connect neural networks


The recipe for applying neural networks to this problem follows from the main idea and the description of the classical algorithm. The hardest part is computing the Monte Carlo estimate. It yields a number, the indirect illumination at a point, and this is exactly what we want to predict.


We have settled on the output; now let's work out the input: what information determines the magnitude of the indirect light at a point? It is light re-scattered by the multitude of water droplets surrounding the point. The amount of light is strongly influenced by the topology of the density around the point, the direction to the light source, and the direction to the camera.


To construct the input to the neural network, we describe the local density. This can be done in different ways; we relied on the paper Deep Scattering: Rendering Atmospheric Clouds with Radiance-Predicting Neural Networks, Kallweit et al. 2017, and drew many ideas from it.

In short, the method of locally representing the density around a point looks like this.
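The paper's exact sampling pattern is more elaborate, but a hedged sketch of the idea (nested grids whose spacing doubles at each level; names and parameters here are illustrative) might look like:

```python
import numpy as np

def local_density_descriptor(density, point, levels=10, size=5, base_step=0.1):
    """Sample the density field on `levels` nested grids centred at `point`.
    The grid spacing doubles at every level, so the description is fine
    near the point and coarse far away from it."""
    tensors = []
    for level in range(levels):
        step = base_step * (2.0 ** level)
        offsets = (np.arange(size) - size // 2) * step
        gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
        grid = np.stack([gx, gy, gz], axis=-1) + point  # (size, size, size, 3)
        tensors.append(density(grid))  # density() is a vectorized field callback
    return tensors
```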


This approach gives the most detailed description of a small region: the closer to the point, the more detailed the description. We have settled on the output and input of the network; it remains to understand how to train it.

We train


Generate 100 different clouds with different topologies. Render them with the classical algorithm, record what the algorithm computes at the very line where it integrates with the Monte Carlo method, and record the corresponding properties of the point. This yields a dataset to train on.


What to train, or network architecture


The network architecture for this task is not the most crucial point; if something here is unclear, don't worry, it is not the most important thing I wanted to convey. We used the following architecture: for each point there are 10 tensors, each computed on a grid of progressively larger scale. Each of these tensors goes into its corresponding block.


A fully connected layer without activation is just multiplication by a matrix. We add the output of the previous residual block to the result of the matrix multiplication, and only then apply the activation.
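As a hedged sketch of one such block in Keras (the ReLU activation and the layer width are my illustrative choices, not confirmed details of the talk):

```python
import tensorflow as tf

def residual_block(x, prev, units):
    """One block as described above: a dense layer without activation,
    plus the previous block's output, and only then the nonlinearity."""
    h = tf.keras.layers.Dense(units, activation=None)(x)
    h = tf.keras.layers.Add()([h, prev])
    return tf.keras.layers.Activation("relu")(h)
```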


We take a point, compute the values on each of the grids, feed the resulting tensors into the corresponding residual blocks, and run inference of the neural network (the production mode of the network). We did this and verified that we get pictures of clouds.

Results


The first observation is that we got what we wanted: a neural network call is faster than the Monte Carlo estimate, which is already good.

But there is another observation about the training results: convergence in the number of samples. What does this mean?


When rendering an image, we cut it into small tiles: squares of, say, 16×16 pixels. Consider one tile without loss of generality. When we render this tile, we emit many rays per camera pixel and add a little noise to the rays so that they differ slightly. These rays are called anti-aliasing samples and were invented to reduce noise in the final image.
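A minimal sketch of that sub-pixel jitter (names and the sample count are illustrative):

```python
import numpy as np

def antialiasing_samples(px, py, n_samples=8):
    """Each of the n_samples rays through pixel (px, py) gets a slightly
    different sub-pixel offset, so their averaged result is less noisy."""
    jitter = np.random.uniform(0.0, 1.0, size=(n_samples, 2))
    return np.array([px, py], dtype=float) + jitter
```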


There are also samples that correspond to connections with light sources. They appear when we connect a point with a light source, for example the sun. This case is easy, because the sun's rays fall on the earth parallel to each other. The sky as a light source is much more complicated, because it is represented as an infinitely distant sphere whose color is a function of direction. If the vector points straight up into the sky, the color is blue; the lower it points, the lighter it becomes. At the bottom of the sphere there is usually a neutral color imitating the ground: green or brown.

When we connect a point with the sky to understand how much light arrives at it, we always emit several rays so that the answer converges to the truth. We release more than one ray to get a better estimate. This is why the whole rendering pipeline requires so many samples.

When we trained the neural network, we noticed that it learns a much more averaged solution. With the number of samples fixed, the classical algorithm produces the noisier left column of the comparison image, while the network produces the right one. This does not mean that the original method is bad: we simply converge faster. As the number of samples grows, the original method gets closer and closer to what we obtain.

The main result we wanted is the increase in rendering speed. For a specific cloud at a specific resolution with specific sample settings, the pictures produced by the network and by the classical method are almost identical, but we obtain the network's picture 800 times faster.


Implementation


There is an open-source 3D modeling program, Blender, that implements the classical algorithm. We did not write the algorithm ourselves but used this program: we ran the training data collection in Blender, recording everything the algorithm needed. Production was done in the program as well: we trained the network in TensorFlow, converted it to C++ using TensorRT, and integrated the TensorRT network into Blender, since its code is open.
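The project's own conversion pipeline is not public here; as one hedged example of such a conversion step today, TensorFlow's TF-TRT converter turns a trained SavedModel into one whose supported subgraphs run on TensorRT:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a trained SavedModel so supported subgraphs execute via TensorRT.
converter = trt.TrtGraphConverterV2(input_saved_model_dir="saved_model")
converter.convert()
converter.save("saved_model_trt")
```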

Since we did everything for Blender, our solution has all the features of the program: we can render any scenes and many clouds. Clouds in our solution are defined by creating a cube inside which we specify a density function, in the way typical of 3D programs. We optimized this process by caching the density: if a user wants to draw the same cloud in a heap of different scene setups (under different lighting, with different objects in the scene), they do not need to recompute the cloud's density every time. You can see the result in the video.

Finally, I will repeat once more the main idea I wanted to convey: if in your work you spend a long time diligently computing something with a classical computational algorithm, and this does not suit you, find the hardest spot in the code and replace it with a neural network; perhaps this will help you.

Neural networks and artificial intelligence is one of the new topics to be discussed at Saint HighLoad++ 2019 in April. We have already received several proposals on this topic, and if you have substantial experience, not necessarily with neural networks, submit a talk proposal before March 1. We will be glad to see you among our speakers.

To stay informed about how the program takes shape and which talks are accepted, subscribe to the newsletter. In it we publish only thematic collections of talks, digests of articles, and new videos.

Source: https://habr.com/ru/post/441260/

