H.264 is a video compression standard, and it is everywhere: it is used to compress video on the Internet, on Blu-ray discs, in phones, surveillance cameras, and drones. Practically everything uses H.264 these days.
H.264 is a remarkable piece of engineering. It is the result of more than 30 years of work with a single goal: reducing the bandwidth required to transmit high-quality video.
From a technical point of view it is very interesting. This article gives a surface-level look at how some of its compression mechanisms work; I will try not to bore you with details. It is also worth noting that most of the techniques described below apply to video compression in general, not only to H.264.
Why compress anything at all?
Uncompressed video is a sequence of two-dimensional arrays containing information about the pixels of each frame. It is thus a three-dimensional (two spatial dimensions and one temporal) array of bytes. Each pixel is encoded with three bytes, one for each of the three primary colors (red, green, and blue).
1080p @ 60 Hz = 1920 x 1080 x 60 x 3 bytes/s => ~370 MB/s of data.
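This back-of-the-envelope calculation, and the Blu-ray figure below, can be checked in a few lines of Python:

```python
# Raw data rate of uncompressed 1080p60 RGB video, 3 bytes per pixel.
width, height, fps, bytes_per_pixel = 1920, 1080, 60, 3

bytes_per_second = width * height * fps * bytes_per_pixel
print(f"{bytes_per_second / 1e6:.0f} MB/s")    # 373 MB/s

# How long would a 50 GB Blu-ray disc last at that rate?
print(f"{50e9 / bytes_per_second:.0f} s")      # 134 s, just over 2 minutes
```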
That would be almost impossible to work with. A 50 GB Blu-ray disc could hold only about two minutes of video. Copying would not be easy either; even an SSD would struggle to write such a stream from memory to disk.
Therefore, yes, compression is necessary.
Why H.264?
We will definitely answer that question. But first, let me show you something. Take a look at the Apple homepage:

I saved that image, and here are two files for comparison:
Um... what? The file sizes look mixed up.
No, the sizes are right. The H.264 video, 300 frames long, is 175 KB. A single frame of that video saved as PNG is 1015 KB.
We seem to store 300 times more data in the video, yet get a file five times smaller. That makes H.264 roughly 1500 times more efficient than PNG.
How is that possible? What's the trick?
Lots of tricks! H.264 uses all the tricks you could guess at (and plenty you couldn't). Let's go through the main ones.
Get rid of excess weight.
Imagine you are preparing a car for a race and need to make it faster. What do you do first? You shed weight. Say the car weighs one ton. You start throwing out everything unnecessary... the back seat? Pff, toss it. The subwoofer? We can live without music. Air conditioning? Not needed. The transmission? Into the tras... wait, we still need that.
This way you get rid of everything but the necessary.
This approach of discarding unneeded parts is called lossy compression. H.264 encodes with loss: it discards the less significant parts and keeps the important ones.
PNG encodes losslessly. All information is preserved pixel by pixel, so the original image can be recreated exactly from a PNG-encoded file.
Important parts? How can an algorithm determine which parts of a frame are important?
There are a few obvious ways to crop an image. Perhaps the top-right quarter of the picture is useless; then we could cut that corner off and fit into ¾ of the original weight. The car now weighs 750 kg. Or we could trim a strip of some width around the perimeter, since the important stuff is always in the middle. Yes, we could, but H.264 does none of that.
What does H.264 actually do?
H.264, like all lossy compression algorithms, reduces detail. Below is a comparison of an image before and after the fine detail was discarded.

See how the holes in the MacBook Pro's speaker grille disappeared in the compressed image? If you don't zoom in, you may not even notice. The image on the right weighs only 7% of the original, and that is without compression in the traditional sense. Imagine a car weighing just 70 kg!
7%, wow! How do you get rid of detail like that?
For a start, a little math.
Information entropy
Now for the fun part! If you've taken a computer science course, you may remember the concept of information entropy. Information entropy is the number of bits needed to represent some data. Note that this is not the size of the data itself; it is the minimum number of bits that must be used to represent all of its information.
For example, if our data is the outcome of a single coin flip, the entropy is 1 bit. For two coin flips, 2 bits are needed.
Now suppose the coin is very strange: it was flipped 10 times and came up heads every time. How would you tell someone about this? You would hardly say "HHHHHHHHHH"; you would say "10 flips, all heads". Boom! You have just compressed the information! Easy. I just saved you hours of tedious recitation. This is, of course, a huge simplification, but you transformed the data into a shorter representation with the same information content. That is, you reduced redundancy. The information entropy of the data did not suffer; you merely changed its representation. This method is called entropy coding, and it works for any kind of data.
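H.264's actual entropy coders (CAVLC and CABAC) are far more sophisticated, but the coin-flip idea can be sketched with a simple run-length encoder; the `run_length_encode` helper here is purely illustrative:

```python
from itertools import groupby

def run_length_encode(symbols):
    """Collapse runs of repeated symbols into (symbol, count) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(symbols)]

flips = "HHHHHHHHHH"              # ten heads in a row
print(run_length_encode(flips))   # [('H', 10)], i.e. "10 flips, all heads"
```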
The frequency domain
Now that we have dealt with information entropy, let's move on to transforming the data itself. Data can be represented in various base alphabets. With binary code that alphabet is 0 and 1; with hexadecimal it consists of 16 characters. There is a one-to-one correspondence between these systems, so converting one to the other is easy. Still with me? Let's go on.
Now imagine that data which changes in space or time can be represented in a completely different coordinate system. Take, for example, the brightness of an image, and instead of a coordinate system with x and y, use a frequency system. The axes then carry the frequencies freqX and freqY; such a representation is called the frequency domain. And there is a theorem stating that any data can be represented in this system without loss, given sufficiently high freqX and freqY.
OK, but what are freqX and freqY?
freqX and freqY are just another basis for the coordinate system. Just as you can switch from binary to hexadecimal, you can switch from X/Y to freqX/freqY. Below is a transition from one system to the other.

The fine mesh of the MacBook Pro's grille contains high-frequency information and sits in the high-frequency region. Fine details are high frequency, while smooth changes in color and brightness are low frequency. Everything in between falls in between.
In this representation, low-frequency details are near the center of the image and high-frequency ones are in the corners.
So far so good, but why do we need this?
Because now you can take the image in its frequency-domain representation and cut off the corners, in other words, apply a mask, thereby reducing the detail. If you then convert the image back to the usual form, you will find it similar to the original but with less detail. This manipulation saves space, and by choosing the right mask you control how much detail the image keeps.
Below is a familiar laptop, but now with circular masks applied to it.

The percentages indicate the information entropy relative to the original image. Without zooming in, the difference is barely noticeable even at 2%! The car now weighs 20 kg!
This is how we shed the weight. This lossy compression step is called quantization.
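The circular-mask trick described above can be sketched with NumPy. This is a toy stand-in, using a 2D FFT rather than the block-wise DCT that real codecs use, and `low_pass` is a hypothetical helper, not anything from H.264:

```python
import numpy as np

def low_pass(image, keep_radius):
    """Keep only frequency components within keep_radius of the center.

    A toy stand-in for quantization: detail is discarded, the gist survives.
    """
    freq = np.fft.fftshift(np.fft.fft2(image))  # move low frequencies to center
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= keep_radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(freq * mask)))

# A perfectly smooth image is all low frequency, so the mask changes nothing:
flat = np.full((64, 64), 5.0)
print(np.allclose(low_pass(flat, keep_radius=8), flat))  # True
```

A detailed image, by contrast, loses its fine structure under the same mask while keeping its overall appearance.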
Impressive! What other techniques are there?
Color processing
The human eye is not very good at distinguishing similar shades of color. It can easily pick out small differences in brightness, but not in color. So there must be a way to discard some color information and save even more space.
In television, RGB colors are converted to YCbCr, where Y is the luminance component (essentially the brightness of a black-and-white image) and Cb and Cr are the chroma components. RGB and YCbCr are equivalent in information content.
Why complicate things, then? Isn't RGB enough?
In the era of black-and-white TV there was only the Y component. When color TV appeared, engineers faced the problem of transmitting a color RGB image alongside the existing black-and-white signal. So instead of transmitting two separate signals, it was decided to encode the color into the Cb and Cr components and transmit them together with Y; color TV sets then convert the luminance and chroma components back into the familiar RGB.
But here's the trick: the luminance component is encoded at full resolution, while the chroma components are encoded at only a quarter of it. This can be neglected because the eye and brain do not distinguish shades well. This way the image size is cut in half with minimal visible difference. Two times! The car now weighs 10 kg!
This technique of encoding images at reduced color resolution is called chroma subsampling. It has been used worldwide for a long time, and not only in H.264.
These are the most significant lossy size-reduction techniques: we got rid of most of the fine detail and cut the color information in half.
Can we go even further?
Yes. Trimming the picture is only the first step. Until now we have analyzed a single frame. It is time to look at compression over time, where we work with groups of frames.
Motion compensation
H.264 is a standard that can compensate for motion.
Motion compensation? What is it?
Imagine you are watching a tennis match. The camera is fixed, shooting from a single angle, and the only thing moving is the ball. How would you encode this? You would do the usual thing, right? A three-dimensional array of pixels: two spatial coordinates and one frame axis in time?
But why? Most of the image is the same. The court, the net, the spectators do not change; the only thing moving is the ball. What if we defined a single image of the background and one image of a ball moving across it? Wouldn't that save a lot of space? You see where I'm going with this, don't you? Motion compensation?
And that is exactly what H.264 does. It splits the image into macroblocks, usually 16x16 pixels, which are used to track motion. One frame stays static; it is usually called the I-frame [intra frame] and contains the whole picture. Subsequent frames can be either P-frames [predicted] or B-frames [bi-directionally predicted]. In P-frames a motion vector is encoded for each macroblock based on previous frames, so the decoder must work through those previous frames, taking the last I-frame of the video and successively applying the changes frame by frame until it reaches the current one.
B-frames are even more interesting: their prediction runs in both directions, based on frames both before and after them. Now you can see why the video at the beginning of the article weighs so little: it is just 3 I-frames with macroblocks being shuffled between them.
With this technique, only the differences described by motion vectors are encoded, which gives a high degree of compression for any video with motion.
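A toy exhaustive block-matching search, the simplest way to find a motion vector; `best_motion_vector` is an illustrative helper, not H.264's actual (much smarter, sub-pixel) search:

```python
import numpy as np

def best_motion_vector(ref, cur, top, left, size=4, search=2):
    """Find where a block of `cur` came from in `ref`.

    Returns the (dy, dx) shift minimizing the sum of absolute differences.
    """
    block = cur[top:top + size, left:left + size]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y + size, x:x + size] - block).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best

# A frame whose content shifted one pixel to the right between ref and cur:
ref = np.zeros((8, 8)); ref[2:6, 2:6] = 1.0
cur = np.roll(ref, 1, axis=1)
print(best_motion_vector(ref, cur, 2, 3))  # (0, -1): block came from one pixel left
```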
We have now covered both spatial and temporal compression. Quantization reduced the data size many times over, chroma subsampling halved what remained, and motion compensation leaves only 3 stored frames out of the 300 in the original video.
Looks impressive. Now what?
Now we finish the job with traditional lossless entropy coding. Why not?
Entropy coding
After the lossy compression stages, the I-frames still contain redundant data. The motion vectors of the macroblocks in the P- and B-frames carry a lot of identical information, since the blocks often move identically, as can be seen in the video at the beginning.
Such redundancy is eliminated by entropy coding. And we don't need to worry about the data itself, since this is standard lossless compression: everything can be restored.
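As a stand-in for H.264's actual entropy coders (CAVLC and CABAC), generic DEFLATE from Python's `zlib` shows how well redundant motion vectors compress losslessly; the byte layout of the vectors here is invented for illustration:

```python
import zlib

# 100 macroblocks moving identically: their motion vectors are pure redundancy.
vectors = bytes([1, 0]) * 100                   # (dx=1, dy=0) repeated 100 times

compressed = zlib.compress(vectors, 9)
assert zlib.decompress(compressed) == vectors   # lossless: fully restorable
print(len(vectors), len(compressed))            # 200 bytes shrink dramatically
```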
And that's it! H.264 is built on the techniques described above. These are its standard tricks.
Great! But I'm curious how much our car weighs now.
The original video was shot at the non-standard resolution of 1232x1154. Doing the math:
5 s @ 60 fps = 1232 x 1154 x 60 x 3 x 5 bytes => ~1.2 GB
Compressed video => 175 KB
If we map that ratio onto our one-ton car, the car now weighs 0.14 kg. 140 grams!
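The final numbers can be checked in a few lines:

```python
# 5 seconds of uncompressed 1232x1154 @ 60 fps RGB vs. the 175 KB H.264 file.
raw_bytes = 1232 * 1154 * 60 * 3 * 5
compressed_bytes = 175 * 1024

ratio = raw_bytes / compressed_bytes
print(f"raw: {raw_bytes / 1e9:.2f} GB")            # 1.28 GB
print(f"compression ratio: {ratio:.0f}x")          # 7140x
print(f"car weight: {1000 / ratio * 1000:.0f} g")  # 140 g
```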
Yes, it's magic! Of course, I have presented a very simplified view of decades of research in this area. If you want to learn more, the Wikipedia page on H.264 is quite informative.