
The current state of media codecs can be summed up in a few words: simple solutions have been exhausted. Every year the material to be encoded becomes more demanding, and the quality requirements keep rising. Under these conditions, when a brute-force approach no longer pays off, optimizing encoding and playback for specific platforms using their most advanced features becomes particularly important. We show what such optimization can achieve using the promising H.265 codec as an example. The target platform is an Intel server solution, the Xeon processor.
Brief Description of H.265 / HEVC
H.265 / HEVC (High Efficiency Video Coding) is the latest video codec standard, developed jointly by ITU-T (the standardization sector of the International Telecommunication Union) and ISO / IEC. The goal of the standard is to increase compression efficiency and reduce data loss. Compared with the previous standard, H.264 / AVC, H.265 / HEVC achieves roughly twice the compression at equal subjective image quality. This lets video providers stream high-quality video with less network load.
Note the main functional innovations applied in H.265:
- Special features for random access and bitstream splicing. In H.264 / MPEG-4 AVC, a bitstream must always begin with an IDR access unit, whereas HEVC also allows decoding to start at other random access points.
- The image is divided into coding tree units (CTUs), each of which contains luma and chroma coding tree blocks (CTBs). Previous video coding standards used a fixed 16 × 16 array of luma samples. HEVC supports CTBs of various sizes, which the encoder selects according to its memory and processing-power constraints.
- Each coding block (CB) can be recursively divided into transform blocks (TBs). The partitioning is determined by the residual quadtree. Unlike previous standards, HEVC allows a single TB to span multiple prediction blocks (PBs) in inter-predicted coding units (CUs).
- Directional intra prediction with 33 different directions for transform blocks (TBs) ranging in size from 4 × 4 to 32 × 32. The directions cover a 180-degree range, from the bottom-left diagonal to the top-right diagonal. In addition to the directional modes, HEVC supports planar and DC intra prediction.
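The recursive block partitioning described in the list above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the split rule (`should_split`, based on the spread of luma values), the function names, and the threshold parameter are hypothetical, and a real encoder would decide splits by rate-distortion cost rather than by this heuristic.

```c
#include <stdint.h>

/* Hypothetical split rule (an assumption for illustration only):
   split a block while the spread between its darkest and brightest
   luma sample exceeds a threshold. */
static int should_split(const uint8_t *luma, int stride,
                        int x, int y, int size, int threshold)
{
    uint8_t lo = 255, hi = 0;
    for (int j = y; j < y + size; j++) {
        for (int i = x; i < x + size; i++) {
            uint8_t v = luma[j * stride + i];
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
    }
    return (hi - lo) > threshold;
}

/* Recursively partition a square block into four quadrants (a quadtree)
   and return the number of leaf blocks that result. */
int count_coding_blocks(const uint8_t *luma, int stride,
                        int x, int y, int size,
                        int min_size, int threshold)
{
    if (size > min_size && should_split(luma, stride, x, y, size, threshold)) {
        int half = size / 2;
        return count_coding_blocks(luma, stride, x,        y,        half, min_size, threshold)
             + count_coding_blocks(luma, stride, x + half, y,        half, min_size, threshold)
             + count_coding_blocks(luma, stride, x,        y + half, half, min_size, threshold)
             + count_coding_blocks(luma, stride, x + half, y + half, half, min_size, threshold);
    }
    return 1; /* leaf: this block is coded as a whole */
}
```

With this rule, a completely flat 64 × 64 CTB stays a single block, while detail concentrated in one corner produces small blocks only around that corner and large blocks elsewhere.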
H.265 / HEVC imposes extremely high computational power requirements on both client devices and internal transcoding servers.
HEVC Performance Issues
The existing HEVC Test Model (HM) project implements only the core functionality of the standard, and its actual performance is still far from what a real environment requires. The project has two main disadvantages:
- No parallelization scheme.
- Inefficient vectorization.
Figure 1. HM project profile - parallel threading
Figure 2. HM project profile - resource-intensive code
This HEVC codec consumes 100 times more CPU resources than H.264 on the server side and 10 times more on the client side.
The H.265 / HEVC codec has attracted the attention of many companies and organizations around the world, which have worked on developing it and optimizing its performance. There are several open source projects:
- OpenHEVC (compatible with HM10.0, decoder optimization)
- x265 (compatible with HM, parallelization and vectorization)
To evaluate the performance of the x265 encoder on a platform with Intel® Xeon® processors (E5-2680, 2.7 GHz, 2 × 8 physical cores, codenamed Sandy Bridge), we encoded a 720p video at 24 frames per second. The x265 developers have done a lot of work to parallelize the processing of tasks and data. However, our test showed that the codec uses only 6 cores on a system with 32 logical cores (SMT enabled). The codec is thus far from fully using the resources of modern multi-core platforms.
Figure 3. CPU load in the x265 project
Figure 4. x265 project with Intel® SIMD configuration
The x265 project also uses Intel® SIMD instructions (auto-generated by the compiler), which increases performance by more than 70%. With further tuning of the compiler options, the Intel compiler doubles performance on the IA platform. However, encoder performance is still significantly below what real-time encoding requires, especially for 1080p high-definition video.
Below we show the results that the Chinese company Strongene, with the support of Intel specialists, achieved in optimizing its own H.265 / HEVC codec for various Intel platforms.
HEVC Optimization for Intel® Xeon® Platform
Most of the resource-intensive video and image processing functions consist of intensive computations on blocks of data. Intel® SIMD vector instructions can be used to optimize them. According to the profiling data for the encoder in the Strongene codec, all of the most demanding functions can be manually vectorized with Intel SSE instructions: low-complexity motion-compensated interpolation; integer transform without transposition; Hadamard transform; and calculation of sums of absolute differences (SAD) and sums of squared differences (SSD) with minimal redundant memory access. We added the Intel SSE instructions as intrinsic functions, as shown in Fig. 5.
Figure 5. Example of enabling Intel® SIMD / SSE instructions in the Strongene codec
The Strongene developers rewrote all resource-intensive functions to maximize the encoder's performance gain. Figure 6 shows our profiling data for 1080p HEVC encoding scenarios. It can be seen that 60% of the resource-intensive functions run on Intel SIMD instructions.
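As a rough illustration of what an intrinsic-based rewrite of one such function can look like (this is not the Strongene code from Fig. 5), the sketch below computes a SAD with the SSE2 intrinsic `_mm_sad_epu8` next to a plain scalar reference. The function names, the stride parameters, and the restriction to widths that are multiples of 16 are assumptions made for the example. On compilers that do not define `__SSE2__`, the sketch simply falls back to the scalar loop.

```c
#include <stdint.h>
#include <stdlib.h>   /* abs */
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar reference: sum of absolute differences over a width x height block. */
uint32_t sad_scalar(const uint8_t *a, int a_stride,
                    const uint8_t *b, int b_stride,
                    int width, int height)
{
    uint32_t sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += (uint32_t)abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sum;
}

/* SSE2 version. Assumption: width is a multiple of 16. */
uint32_t sad_sse2(const uint8_t *a, int a_stride,
                  const uint8_t *b, int b_stride,
                  int width, int height)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < height; y++) {
        const uint8_t *pa = a + y * a_stride;
        const uint8_t *pb = b + y * b_stride;
        for (int x = 0; x < width; x += 16) {
            /* Unaligned loads of 16 pixels from each block. */
            __m128i va = _mm_loadu_si128((const __m128i *)(pa + x));
            __m128i vb = _mm_loadu_si128((const __m128i *)(pb + x));
            /* _mm_sad_epu8 yields two partial sums, one per 64-bit lane. */
            acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
        }
    }
    /* Add the upper 64-bit lane to the lower one and extract the result. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (uint32_t)_mm_cvtsi128_si32(acc);
#else
    return sad_scalar(a, a_stride, b, b_stride, width, height);
#endif
}
```

The vector loop handles 16 pixels per `_mm_sad_epu8` call instead of one pixel per iteration, which is where most of the gain in this kind of function comes from.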
Figure 6. Profiling of Strongene encoding functions
Intel AVX2 instructions, which operate on 256-bit integer values, can deliver up to twice the performance of the earlier Intel SSE code, which works with 128-bit values. The Intel AVX2 instruction set is supported by the Intel Xeon (Haswell) platform, whose release began in 2014. To assess the performance of Intel AVX2 intrinsics, we use a common calculation: the sum of absolute differences for a 64 × 64 block.
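A minimal sketch of a 64 × 64 SAD written with AVX2 intrinsics is given here for orientation. It is not the code that was measured, and the function names are hypothetical. It assumes GCC or Clang on x86: the `target("avx2")` attribute and `__builtin_cpu_supports` are extensions of those compilers, and the sketch falls back to a scalar loop on other compilers or on CPUs without AVX2.

```c
#include <stdint.h>
#include <stdlib.h>   /* abs */

/* Scalar reference for one 64 x 64 block. */
uint32_t sad64x64_scalar(const uint8_t *a, int a_stride,
                         const uint8_t *b, int b_stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < 64; y++)
        for (int x = 0; x < 64; x++)
            sum += (uint32_t)abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sum;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SAD_HAVE_AVX2_PATH 1

/* AVX2 path: two 32-byte loads cover one 64-pixel row. */
__attribute__((target("avx2")))
static uint32_t sad64x64_avx2(const uint8_t *a, int a_stride,
                              const uint8_t *b, int b_stride)
{
    __m256i acc = _mm256_setzero_si256();
    for (int y = 0; y < 64; y++) {
        const uint8_t *pa = a + y * a_stride;
        const uint8_t *pb = b + y * b_stride;
        for (int x = 0; x < 64; x += 32) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(pa + x));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(pb + x));
            /* Four 64-bit partial sums per instruction. */
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(va, vb));
        }
    }
    /* Fold the four partial sums: 256 -> 128 -> 64 bits. */
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return (uint32_t)_mm_cvtsi128_si32(s);
}
#endif

/* Dispatcher: use AVX2 when the CPU reports it, otherwise the scalar loop. */
uint32_t sad64x64(const uint8_t *a, int a_stride,
                  const uint8_t *b, int b_stride)
{
#ifdef SAD_HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2"))
        return sad64x64_avx2(a, a_stride, b, b_stride);
#endif
    return sad64x64_scalar(a, a_stride, b, b_stride);
}
```

Each `_mm256_sad_epu8` call processes 32 pixels, twice as many as the 128-bit `_mm_sad_epu8`, so a 64-pixel row needs two loads per block instead of four.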
Table 1. Intel® SSE and Intel® AVX2 implementation results (CPU cycles)

| CPU cycles | Original code | Intel® SSE | Intel® AVX2 |
|---|---|---|---|
| Run 1 | 98877 | 977 | 679 |
| Run 2 | 98463 | 1092 | 690 |
| Run 3 | 98152 | 978 | 679 |
| Run 4 | 98003 | 943 | 679 |
| Run 5 | 98118 | 954 | 678 |
| Average | 98322.6 | 988.8 | 681 |
| Speedup | 1.00 | 99.44 | 144.38 |
As Table 1 shows, the Intel SSE and Intel AVX2 implementations are roughly 100 times faster than the original code, and the Intel AVX2 code gains a further 45% or so over Intel SSE (681 versus 988.8 cycles on average).
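The averages and speedup figures can be rechecked from the per-run cycle counts. The short sketch below does that arithmetic; the array and function names are ours, and the numbers are copied from Table 1.

```c
#include <stddef.h>

/* CPU cycle counts for the five runs, copied from Table 1. */
const double cycles_original[5] = {98877, 98463, 98152, 98003, 98118};
const double cycles_sse[5]      = {977, 1092, 978, 943, 954};
const double cycles_avx2[5]     = {679, 690, 679, 679, 678};

/* Arithmetic mean of n cycle counts. */
double mean_cycles(const double *runs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double)n;
}

/* Speedup = cycles of the baseline / cycles of the optimized version. */
double speedup(double baseline_cycles, double optimized_cycles)
{
    return baseline_cycles / optimized_cycles;
}
```

With these inputs, `speedup(mean_cycles(cycles_original, 5), mean_cycles(cycles_sse, 5))` comes out at about 99.44, the same ratio for AVX2 at about 144.38, and AVX2 relative to SSE at about 1.45.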
As we saw earlier, most existing implementations do not use all the cores of a multi-core platform. Based on the latest multi-core Intel Xeon architecture and the parallel dependencies between CTB-based algorithms, the Strongene developers propose replacing the original OWF and WPP (wavefront parallel processing) methods with a parallel IFW structure, and then building a three-level thread management scheme to ensure that the IFW structure uses all CPU cores to accelerate HEVC encoding.
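To see why wavefront parallelism inside a single frame may leave cores idle, consider the usual WPP dependency model, in which a CTB can start once its left neighbour and the top-right neighbour in the row above are finished. The sketch below models only this standard WPP rule, not Strongene's IFW scheme. It assumes that every CTB takes one time step and that enough threads are available; the function names are hypothetical.

```c
/* Number of CTBs needed to cover `pixels` samples (rounded up). */
int ctbs_along(int pixels, int ctb_size)
{
    return (pixels + ctb_size - 1) / ctb_size;
}

/* Earliest step at which CTB (row, col) can start under WPP:
   each row trails the row above it by two CTBs. */
int wpp_start_step(int row, int col)
{
    return col + 2 * row;
}

/* Steps needed for the whole frame: the last CTB starts at
   wpp_start_step(rows - 1, cols - 1) and takes one more step. */
int wpp_total_steps(int rows, int cols)
{
    return wpp_start_step(rows - 1, cols - 1) + 1;
}

/* Largest number of CTBs that are processed in the same step. */
int wpp_peak_parallelism(int rows, int cols)
{
    int peak = 0;
    int total = wpp_total_steps(rows, cols);
    for (int step = 0; step < total; step++) {
        int active = 0;
        for (int row = 0; row < rows; row++) {
            int col = step - 2 * row; /* the column of this row that runs at `step` */
            if (col >= 0 && col < cols)
                active++;
        }
        if (active > peak)
            peak = active;
    }
    return peak;
}
```

For a 1080p frame with 64 × 64 CTBs (an assumed CTB size, giving 30 × 17 CTBs), this model needs 62 steps instead of 510 sequential ones and never has more than 15 CTBs in flight. That is fewer than the 32 logical cores of the test system mentioned earlier, which suggests why parallelism across frames is attractive.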
Figure 7. Parallel threading and CPU usage in the Strongene encoder
By using the IFW parallel structure at the task level and a full implementation of Intel SIMD instructions at the data level, the Strongene encoder developers achieved a very significant performance improvement on x86 processors for 1080p video, using the computing resources of all cores, as shown in Fig. 8.
Further Tuning with SMT / HT
Also of interest is how codec performance depends on whether simultaneous multithreading (SMT) is enabled. SMT, known on Intel platforms as Hyper-Threading Technology (HT), is widely available on Intel architecture platforms.
Table 2. Strongene HEVC encoding speed on the Intel® Xeon® platform
As the table shows (highlighted in yellow), on the Ivy Bridge platform (Intel Xeon E5-2697 v2 processor) with SMT disabled, HEVC encoding of 1080p video runs in real time!
Having achieved a tremendous increase in performance, we continued to explore the encoding capabilities of Strongene HEVC on the Ivy Bridge platform, focusing on bitrate and quality.
Table 3. Performance comparison of H.264 and H.265 codecs
Table 3 shows that the H.265 / HEVC codec reduces the amount of data by 50% while maintaining the same video quality.
H.265 / HEVC is likely to become the most popular video standard of the next decade. Many applications and multimedia products already support HEVC. In this article, we described a full-featured, real-time, CPU-based HEVC solution for Intel platforms with new IA technologies. Our optimized solution based on Intel processors is deployed at Xunlei, a company that provides video services over the Internet, and will contribute to the widespread adoption of H.265 / HEVC technology.