CPU vs GPU. Distance field

Hello to all. I already once wrote about Distance Field and gave the implementation of a "heuristic" code that gives a good speed: "Honest glow and speed . "

Why is it needed?

DField can be applied:

To significantly improve the quality of fonts
For effects like contour burning. I gave one of the effects in my previous article.
For the “metaballs” effect, but in 2D and for any complex shapes. (maybe I'll ever give an example of the implementation of this effect)
And at the moment I need DField for high-quality smoothing of corners and removal of small details.

And if in the first two cases we can calculate the DField in advance, then for other effects we need to calculate it in real time.
The article will consider the most popular, I would say the classic Chamfer distance (CDA) with a bunch of pictures explaining its principle of operation, as well as a two-pass algorithm on the GPU.
Both algorithms are implemented in FPC demonstration programs .

Chamfer Classic on CPU

Today I would like to draw attention to the classic way of calculating DField. You can play here (source code in git). The algorithm has an unstable name: chamfer distance algoritm . ~~This method has already described the RPG in his topic .~~ ~~However, there is no explanation why and how this code works.~~ ~~Why there are errors, what they are, and how to reduce them.~~
upd. The RPG has a modification of the CDA algorithm that successfully fights errors.

Under the hood at chamfer

So, we have a monochrome image from which we want to build a Distance Field. This is a white dot on a black background:

')
Let's start building a DField. To do this, first fill our future image + infinity for voids, and zeros for the object. Something like this:

Then we start to run along the image (from left to right and from top to bottom), and for each pixel to check its neighbors. Take the value in the adjacent pixel, add to this the distance to the neighbor. For pixels on the right and left, this distance is 1. For pixels on the diagonal, sqrt (1 * 1 + 1 * 1) = sqrt (2). Let's write in our pixel the minimum value between the current one and the one just calculated by the neighbor. After we do this for all neighbors - go to the next pixel. We got this picture:

As we ran from left to right and from top to bottom, the distances accumulated from previous calculations, and the pixels marked with red arrows automatically counted "true." Now, if we run in the opposite direction (from right to left, from bottom to top), then the missing part of the map will be filled.

So, the main idea is to use the already calculated early distance. To do this, we do not need to check all the neighbors. It is enough to check the pixels for which we have already run:

Red - pixels of neighbors for the first pass are marked (right, down). Blue - for the second (left, up).

The first pass will give us a beautiful trail in one direction:

The second will give to the other, and since we will record only the minimum value in both cases, then in sum it will look like this:

O (2 * W * H * 5) algorithm complexity

Now let's talk about accuracy. There are no problems on the current image, but if you try to build a contour (as in this article ), then the result will not be the most plausible. Let's look at the distance% 16 contour:

Red arrows, I noted the pronounced octagon image. Why is this happening? Everything is simple, because our result accumulates, and it accumulates incorrectly:

Our distance will be calculated by the red line: 1 + sqrt (2), while it will be correctly calculated by the blue line: sqrt (2 * 2 + 1 * 1) = sqrt (3). If we take in the calculations not only neighbors, but also neighbors neighbors:

The result, of course, will be better. But computational complexity increases from O (2 * W * H * 5) to O (2 * W * H * 13). But there is good news, we can throw out some of the neighbors, without affecting the result:

The fact is that the discarded neighbors have a multiple distance and lie on the same beam. So we can throw them away, because the value from the nearest ones will correctly accumulate. I will call the matrix of neighbors a kernel. In the first case, we had a 3x3 core. Now the core is 5x5. Let's look at the result:

Our octogonality has become noticeably “rounder”. (By the way, in Photoshop for layers there is an Outer glow effect with the precise parameter. I have the result exactly the same after the 5x5 core passes. It looks like they use this algorithm for this effect.)
Difficulty for optimized 5x5 = O (2 * W * H * 9)
You can continue to continue to increase and increase the size of the core, the quality will increase, but we will not be able to overcome one unpleasant effect. Here is the image for the kernel 13 * 13:

On the image there are stepped gradients. They are still associated with the notorious error, and in order to completely get rid of them we would have to create a kernel the size of two image widths, since a diagonal with sides (Many; 1) can cover only a core of 2 * Many + 1. (various modifications of CDA struggle with these errors, one of the options is to store the vector in a pixel to the nearest one, but I will not touch them in the article).

So summarily, the advantages of the algorithm

Unlimited Distance Field (we understand what it is when we deal with the algorithm on the GPU).
Linear complexity, and high speed for small cores.
Ease of implementation.

Minuses

Poor accuracy (especially for small cores)
Dependence on previous calculations (no SIMD to you)

GPU into battle

Unfortunately, I have not found any popular algorithms that fall on the multi-threaded GPU architecture. Either my hands are wrong, or Google is squeezed. Therefore, I will share with you my "bicycle". Fortunately it is simple. You can play here , the source is in git. The game will require a video card that supports OpenGL3 and Rect textures.

So, let's say it is important for us to calculate Distance Field within a radius of 40 pixels. The first pass, we consider only the vertical distance from 40 neighbors above and from 40 neighbors below for each pixel. We write down the minimum value (if there are no filled neighbors - we write down the maximum value, 40). We get this picture:

After this, we make the second passage horizontally. Similarly, 40 neighbors left, 40 neighbors right. Knowing the distance to the neighbor, + vertical distance from the neighbor - we easily consider the hypotenuse:

In the result we write the minimum hypotenuse. After the second pass, we are waiting for a beautiful and well-constructed picture:

The complexity of the algorithm is O (2 * W * H * (2 * D + 1)), where W and H are the dimensions of the image, and D is the distance of the distance field. The distance field we get is normalized (in the image there will be no values greater than D).
The accuracy of the calculations is excellent, because errors do not accumulate:

In the annex to the article there are three modes: GL_R8, GL_R16, GL_R32F. If you turn on GL_R8 and twist distortion, then you can see jumping errors. The point here is this. For an intermediate vertical DField, the storage texture is 8 bits per pixel. Since I normalize the distance to the range [0; 1], errors arise. It is necessary either to keep the distance not normalized (but then for an eight-bit texture it will be limited to 255 pixels), or to increase the bit depth of the texture. For the texture of R16, these errors are smaller, and for R32F these errors are generally “out of place”, since This is a float texture. Consider this if you want to implement a similar distance field.

So, the pros:

Absolute accuracy
Perfectly fits SIMD.
Ease of implementation
The data immediately lie on the GPU!

Minuses

Limited distance. We need to know what distance we need in advance.
The complexity of the algorithm depends on the distance.
Data on the GPU (this may require additional time to get if we plan to continue working with this data on the CPU)

Findings?

I have a GF450 in my home computer. For the Hello world and DField = 500 pixels image, my GPU manages to do 20 frames per second, which is approximately 50ms per frame. All this with excellent quality and simple transparent code. On a 3x3 CPU, the distance field is calculated to be approximately 100ms. 5x5 about 200ms. Even if I accelerate, optimizing for the CPU 4 times (which is very cool), then I will get the quality noticeably worse in the same time. Use the GPU if the algorithms allow it.

CPU vs GPU. Distance field

Why is it needed?

Chamfer Classic on CPU

Under the hood at chamfer

So summarily, the advantages of the algorithm

Minuses

GPU into battle

So, the pros:

Minuses

Findings?

Related Links

More articles: