
In this post we will discuss video coding "on an industrial scale" using the H.264 hardware encoder integrated into modern Intel processors, and the experience that our company, Inventos, gained in the process of creating and optimizing a media server for streaming video processing.
Introduction
So, the task was to develop a media server, a kind of "Swiss-army knife" for all occasions, able to do the following:
- Encode/transcode audio and video in almost all known formats and protocols, including HLS, HDS, RTMP, MP4, etc.;
- Capture signals from SDI and DVB;
- Balance and scale the distribution and encoding servers;
- Describe the encoder configuration in an embedded language;
- Provide various modules for audio normalization, amplification, video deinterlacing, etc.
This product, code-named "streambuilder", is the backend of the Webcaster.pro media platform. It is written in C++ using the libavcodec library, which implements many well-known codecs such as H.264, MPEG-4, etc. The solution allows you to quickly deploy a content delivery infrastructure of any complexity and is flexible to configure.
Our media server, or more precisely its configuration language, lets you express all the stages of audio/video processing as a directed acyclic graph. The root vertex is the signal source, and the remaining vertices are blocks that perform operations on the data. With configuration files we can describe graphs that satisfy almost any need. In a greatly simplified form, such a graph looks as follows:
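To make the idea concrete, here is a minimal sketch (the names are hypothetical, not the actual streambuilder code) of such a processing graph: each vertex is a processing block, edges point downstream, and a topological sort gives the order in which blocks must run.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// One vertex of the processing graph: a named block plus its downstream edges.
struct Node {
    std::string name;
    std::vector<int> outputs;  // indices of downstream blocks
};

// Kahn's algorithm: returns the order in which blocks must run so that
// every block executes only after all of its inputs have produced data.
std::vector<int> executionOrder(const std::vector<Node>& graph) {
    std::vector<int> indegree(graph.size(), 0);
    for (const Node& n : graph)
        for (int out : n.outputs) ++indegree[out];

    std::queue<int> ready;
    for (std::size_t i = 0; i < graph.size(); ++i)
        if (indegree[i] == 0) ready.push(static_cast<int>(i));

    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.front(); ready.pop();
        order.push_back(v);
        for (int out : graph[v].outputs)
            if (--indegree[out] == 0) ready.push(out);
    }
    return order;  // shorter than graph.size() would mean a cycle
}
```

A simple source → decode → resize → encode chain is the degenerate case; the real configuration language can fan one decoded stream out to several encoders at once.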

Although the libavcodec code is well optimized, it is designed to run on CPUs, which contain a finite number of execution units (x86-based chips, for example). Increasing the number of cores only partly solves the problem: it is expensive, and the cores always have plenty of other work to do besides video encoding. So the logical next step was to try using graphics accelerators for this task.
Now a little about video encoding itself. As you know, video compression is built on many algorithms from different branches of mathematics: Fourier analysis, wavelets, matrix and vector operations, probabilistic algorithms, and so on. One thing unites them: they all operate on video data, which is nothing but vectors and matrices. We will not go into the details of compression in specific codecs, but it should be understood that such tasks are extremely expensive in terms of memory and, especially, CPU time, particularly when it comes to industrial-scale encoding.

Until recently, encoding was performed on central processors. A CPU is characterized first of all by a limited number of execution units, and even though modern processors have many cores, that number remains finite. Of course, processor manufacturers add SIMD features to their chips (for example, Intel's SSE/AVX instruction sets), which let a single instruction process several values of the same type, but for tasks like video compression resources are still scarce, especially in light of new broadcast standards (HD, 4K, etc.). GPUs, in turn, have a large number of execution units that can execute a limited set of instructions, which makes them ideally suited for processing uniform data with parallel algorithms. In addition, many GPU manufacturers add extra instructions to their accelerators to speed up multimedia processing.
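To make the "uniform data, simple operations" point concrete: one of the core building blocks of motion estimation in codecs such as H.264 is the sum of absolute differences (SAD) between two pixel blocks. The scalar version below processes every pixel independently with the same operation, which is exactly why SIMD units and GPUs accelerate it so well (a sketch; real encoders use hand-tuned SSE/AVX kernels for this).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Sum of absolute differences between two w×h pixel blocks.
// Every pixel is handled independently by the same operation, so
// SSE/AVX (or a GPU) can process many pixels per instruction.
int sad(const uint8_t* a, const uint8_t* b, int w, int h, int stride) {
    int sum = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            sum += std::abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}
```

An encoder evaluates this for thousands of candidate block positions per frame, which is where much of the CPU time goes.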
Intel solution
As an experiment, we decided to try to speed up video encoding using the Intel HD Graphics co-processor built into modern Intel CPUs. Intel kindly provides its Media SDK for encoding, decoding, resizing and other video processing tasks. To our great joy, this SDK is now available for Linux, which is essential for industrial use; it was this Linux support that got us interested in the solution. Our Intel colleagues were in turn interested in the results of putting the SDK to practical industrial use. I should note that throughout the entire development period Intel employees helped us a lot, answered our questions (of which there were many at first) and gave genuinely valuable advice.
The Media SDK ships with good documentation and examples for almost every occasion. These examples greatly simplified the integration process; without them, it must be said, it would not have been trivial. The essence of the integration was to replace the most demanding software modules (encoding, decoding, resizing) with corresponding modules that use the hardware capabilities of Intel HD Graphics.
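A sketch of the integration approach (the interface and class names are hypothetical, not the actual streambuilder code): the rest of the pipeline talks to an abstract encoder block, and the configuration decides whether the libavcodec-based or the Media-SDK-based implementation is plugged into the graph.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Abstract encoder block: the rest of the pipeline sees only this interface.
struct VideoEncoder {
    virtual ~VideoEncoder() = default;
    virtual std::string backend() const = 0;
    virtual std::vector<uint8_t> encode(const std::vector<uint8_t>& yuvFrame) = 0;
};

// Software path (in the real server: a libavcodec-based encoder).
struct SoftwareEncoder : VideoEncoder {
    std::string backend() const override { return "libavcodec"; }
    std::vector<uint8_t> encode(const std::vector<uint8_t>& f) override {
        return f;  // stub: a real implementation would drive libavcodec here
    }
};

// Hardware path (in the real server: MFXVideoENCODE_* calls from the Media SDK).
struct HardwareEncoder : VideoEncoder {
    std::string backend() const override { return "intel-mediasdk"; }
    std::vector<uint8_t> encode(const std::vector<uint8_t>& f) override {
        return f;  // stub: a real implementation would drive the Media SDK here
    }
};

// The configuration decides which module is plugged into the processing graph.
std::unique_ptr<VideoEncoder> makeEncoder(bool useHardware) {
    if (useHardware) return std::unique_ptr<VideoEncoder>(new HardwareEncoder());
    return std::unique_ptr<VideoEncoder>(new SoftwareEncoder());
}
```

With this shape, swapping the backend touches only the factory, not the rest of the graph.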
Tests
The configuration of our test bench: a Core i7-3770 at 3.4 GHz, 3 GB of memory, Intel Media SDK for Linux Servers version 16.1.0.10043. Various media files were used as sources and the results were averaged.
Encoding raw video to H.264:

| | ffmpeg | Intel Media SDK |
|---|---|---|
| 720×608, raw → H.264, 30 s, 3000 kbit/s | 1.2 s | 0.24 s |
Transcoding using ffmpeg's demuxer/muxer (audio packets were copied through without transcoding):

| | ffmpeg | Intel Media SDK |
|---|---|---|
| 720×608, AAC + H.264, 30 s, 3000 kbit/s | 1.6 s | 0.32 s |
streambuilder
| | plain streambuilder | with Intel Media SDK |
|---|---|---|
| 1920×800 + resize to 1280×720, H.264 + AAC, 3000 kbit/s, 12 min | 5.4 min | 2.3 min |
Now let's look at the results. To begin with, we built the samples that came with the kit. When encoding raw video to H.264, the sample program outperformed the ffmpeg H.264 encoder with similar parameters by a factor of 4-5, after correcting for the time spent on disk read/write operations (so the actual result is slightly higher). ffmpeg, meanwhile, loaded all CPU cores to 100%. The video decoder and hardware image resizing showed similar results. We also built a transcoding sample that uses ffmpeg's demuxer/muxer. This sample uses a special memory type for frames: they are passed from decoder to encoder directly, bypassing the copy of data from SDK memory into YUV format in system memory. Here the Intel Media SDK showed on average 5 times the performance of the equivalent transcoding with ffmpeg.
Our media server together with the hardware encoding modules showed a 2-3x performance increase. This was the expected result, since most of the potential gain was lost in memory-copy operations: gprof showed that 80% of CPU time was spent working with memory, which caused delays when feeding frames to the hardware encoder/decoder. We plan to close this gap in the near future by using the Intel SDK's memory structures directly when exchanging packets between modules, and we expect a significant performance boost from that.
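To illustrate where that time goes: hardware surfaces are typically stored as NV12 (a Y plane plus an interleaved UV plane), while software modules commonly exchange planar YUV 4:2:0 (I420) frames, so every frame crossing the hardware/software boundary pays for a full repack like the one below, touching every byte of the frame. This is a simplified sketch under that assumption; the real exchange also involves locking SDK-managed surfaces.

```cpp
#include <cassert>
#include <cstdint>

// Repack an NV12 frame (Y plane + interleaved UV plane) into planar I420
// (separate Y, U, V planes). This touches every byte of the frame, which is
// why doing it per frame at the decoder/encoder boundary dominates a profile.
void nv12ToI420(const uint8_t* y, const uint8_t* uv, int w, int h,
                uint8_t* dstY, uint8_t* dstU, uint8_t* dstV) {
    for (int i = 0; i < w * h; ++i) dstY[i] = y[i];
    int chroma = (w / 2) * (h / 2);  // 4:2:0 chroma is quarter resolution
    for (int i = 0; i < chroma; ++i) {
        dstU[i] = uv[2 * i];      // U samples sit at even offsets
        dstV[i] = uv[2 * i + 1];  // V samples sit at odd offsets
    }
}
```

At 1080p and 25 fps this is roughly 75 MB/s of pure copying per stream, in each direction, which is consistent with memory work dominating the profile.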
So, the advantages of Intel's solution can be summarized as follows.
- The hardware H.264 encoder handled encoding parameters more predictably than the software one (libavcodec). For example, the target bitrate was maintained more accurately during encoding, whereas with the ffmpeg API this requires long and painful parameter tuning;
- Almost no CPU is used during encoding. In our tests the CPU was busy only with transcoding the audio stream and with memory operations inside the pipeline; the total load was around 25% (roughly one core out of four).
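Checking how well a target bitrate is maintained is simple arithmetic (the numbers below are made up for illustration): divide the encoded stream size by its duration and compare against the target.

```cpp
#include <cassert>
#include <cmath>

// Achieved bitrate in kbit/s given total encoded bytes and stream duration.
double achievedKbps(double totalBytes, double seconds) {
    return totalBytes * 8.0 / 1000.0 / seconds;
}

// Relative deviation from the target bitrate, e.g. 0.05 means 5% off target.
double bitrateError(double totalBytes, double seconds, double targetKbps) {
    return std::fabs(achievedKbps(totalBytes, seconds) - targetKbps) / targetKbps;
}
```

For example, a 30-second stream of 11,812,500 bytes against a 3000 kbit/s target comes out at 3150 kbit/s, a 5% overshoot.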
Among the minuses of this solution we can note a slightly higher entry threshold and the SDK's binding to specific Linux kernel versions, but this did not seem critical to us.
What's next
In conclusion, I can note that the entire product infrastructure can be moved to the new API in about one man-month (excluding testing). Moreover, we used the old API: the new version of the SDK adds many goodies, such as an analogue of two-pass encoding over a 100-frame window, computation of the DTS of encoded packets, improved video quality, and more. For the tests we used an ordinary desktop computer with a fairly powerful CPU. In the near future we plan to try the new version of the API and run tests on server hardware. We also plan to run extended tests across various parameter sets to get a more complete picture, and we will share the results as soon as we can.
UPD
Here are two videos: the first was encoded with ffmpeg, the second with the Intel Media SDK. In both cases almost all parameters are at their defaults, and the bitrate is 8000 kbit/s:
webcaster.pro/video1.mp4
webcaster.pro/video2.mp4