Introducing the fastest VP8 decoder in the world: ffvp8

Back when I wrote my initial review of VP8, I noticed that the official decoder, libvpx, was rather slow. There is no particular reason why it should be much faster than a good H.264 decoder, but there is no reason for it to be this slow either. So I set out with Ronald Bultje and David Conrad to write a better one for FFmpeg. This decoder was to be community-developed and free from the very beginning, in contrast to the proprietary code dump that libvpx represented. A few weeks ago the decoder became complete enough to be bit-exact with libvpx, which made it the first independent free implementation of a VP8 decoder. Now that we have finished the first round of optimizations, it should be ready for real-world use. I will go into the details of the development process later; first, the main point of this post: the benchmark results.

We tested the decoder on two 1080p clips: Parkjoy, a live-action clip, and the Sintel trailer, a computer-generated clip. Testing was performed as follows:

time ffmpeg -vcodec {libvpx or vp8} -i input -vsync 0 -an -f null -

We used the latest FFmpeg build from SVN at the time of writing; the last revision containing VP8 decoder optimizations was r24471.
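The appendix reports each result in frames per second with a standard error. A minimal sketch of how such a figure could be derived from repeated runs of the command above follows. The helper name fps_with_stderr is hypothetical, and the frame count and run times are made-up illustrative values, not measurements from this post.

```python
import math
import statistics


def fps_with_stderr(frame_count, run_times_s):
    """Return (mean fps, standard error of the mean) over repeated runs.

    frame_count -- number of frames decoded in each run
    run_times_s -- wall-clock decode time of each run, in seconds
    """
    fps = [frame_count / t for t in run_times_s]
    mean = statistics.mean(fps)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    stderr = statistics.stdev(fps) / math.sqrt(len(fps))
    return mean, stderr


# Illustrative values only (assumed, not taken from the benchmark).
mean, stderr = fps_with_stderr(500, [7.30, 7.35, 7.28, 7.33, 7.31])
print(f"{mean:.2f} ± {stderr:.2f} fps")
```

At least two runs are needed, since the sample standard deviation is undefined for a single measurement.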

As these charts show, ffvp8 is much faster than libvpx, especially on 64-bit platforms. Even on Atom processors it is significantly faster, although we have not optimized it specifically for Atom. In many cases this difference in performance decides whether a video plays smoothly or not, especially in modern browsers, whose engines consume a large share of the CPU. Want VP8 video to play faster? New versions of FFmpeg-based players (the well-known VLC and many others) will include ffvp8. Want VP8 video to decode faster in your browser? Talk to its developers and insist that they use ffvp8 instead of libvpx. I expect Chrome to be the first to use ffvp8, since it already uses libavcodec in its video playback subsystem.

Keep in mind that ffvp8 development does not end here; we will continue to improve and speed it up. We still have a queue of optimizations that have not yet been committed to the main development branch.

ffvp8 development

The first task, which David and Ronald took on, was to write the decoder core and make it bit-exact with libvpx. This was not easy given the incomplete official specification. Many parts of the specification were simply wrong and contradicted the libvpx code. It did not help that the official conformance test suite does not even cover all the features the official encoder uses, so we had to start adding our own tests. But I have already complained about the quality of the specification in previous posts, so let's move on to the details.

The next step was to add SIMD code for all the important DSP functions. Most of the CPU load in a VP8 decoder comes from motion compensation and the deblocking filter (which smooths out coding artifacts at block edges; translator's note), just as in H.264. Unlike H.264, however, the deblocking filter relies on saturating arithmetic internally, which costs nothing in a SIMD implementation but is rather expensive in C. Of course, neither is a serious problem: in any serious codec these routines are implemented in SIMD anyway.

I helped Ronald with the x86 SIMD, writing most of the motion compensation, the intra prediction, and part of the inverse transforms. Ronald wrote the rest of the inverse transforms and some of the motion compensation. He also did the hardest part: the deblocking filter. Deblocking filters are always the hard part, because they are different in every codec. Motion compensation, by comparison, does not vary much between codecs: a 6-tap filter is a 6-tap filter, and the difference is usually only in the coefficients.
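To illustrate the point that a 6-tap filter differs between codecs mainly in its coefficients, here is a minimal scalar sketch of half-pixel interpolation. The function name six_tap is hypothetical, and the two coefficient sets are the half-pixel taps as I recall them from the H.264 and VP8 specifications; treat them as assumptions and check the specifications before relying on them. This is not the ffvp8 code.

```python
# Half-pixel taps (assumed from the respective specifications).
H264_HALF_PEL = (1, -5, 20, 20, -5, 1)    # sums to 32  -> shift by 5
VP8_HALF_PEL = (3, -16, 77, 77, -16, 3)   # sums to 128 -> shift by 7


def six_tap(pixels, i, taps, shift):
    """Interpolate the half-pixel sample between pixels[i] and pixels[i + 1].

    Uses the six neighbours pixels[i - 2] .. pixels[i + 3], rounds to
    nearest, and clamps the result to the 8-bit range.
    """
    acc = sum(t * pixels[i - 2 + k] for k, t in enumerate(taps))
    rounded = (acc + (1 << (shift - 1))) >> shift
    return min(max(rounded, 0), 255)


# A hard edge: the half-pixel sample on the edge lands midway.
row = [0, 0, 0, 255, 255, 255]
print(six_tap(row, 2, H264_HALF_PEL, 5), six_tap(row, 2, VP8_HALF_PEL, 7))
```

The same function serves both codecs; only the taps and the normalizing shift change.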

The biggest difficulty in the SIMD deblocking filter was avoiding "unpacking", i.e. widening from 8 bits to 16. Many operations in such filters seem at first to require more than 8 bits of precision. A simple example for x86: abs(a - b), where a and b are unsigned 8-bit integers. The result of a - b needs 9 bits of signed precision (it can be anywhere from -255 to 255), so it does not fit in 8 bits. But the problem can be solved without unpacking: (satsub(a, b) | satsub(b, a)), where satsub computes the saturating difference of two values. If the difference is positive, it is returned; otherwise the result is zero, so a bitwise OR of the two results gives exactly what we need. This takes 4 assembly instructions on x86; unpacking would take at least 10, including the unpack and pack steps themselves.
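The trick can be checked with a small scalar emulation. This is only a sketch of the arithmetic for one byte, with hypothetical helper names; on x86 the saturating byte subtraction corresponds to the PSUBUSB instruction and the OR to POR, applied to a whole vector of bytes at once.

```python
def satsub_u8(a, b):
    """Unsigned 8-bit saturating subtraction: a - b, clamped at zero."""
    return max(a - b, 0)


def absdiff_u8(a, b):
    """abs(a - b) for unsigned bytes without leaving the 8-bit range."""
    # At most one of the two differences is nonzero, so OR merges them.
    return satsub_u8(a, b) | satsub_u8(b, a)


# Exhaustive check over every pair of byte values.
assert all(absdiff_u8(a, b) == abs(a - b)
           for a in range(256) for b in range(256))
```

Every intermediate value stays between 0 and 255, which is why no widening to 16 bits is needed.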

After the SIMD came optimization of the C code, which still accounted for a significant part of the decoding time. One of my biggest optimizations was adding smart prefetching to reduce cache misses. ffvp8 prefetches the reference frames used by the current frame (PREVIOUS, GOLDEN and ALTREF), but only when they are actually used in that frame. This lets us prefetch everything we need without fetching what we will hardly use. libvpx very often encodes frames that almost never (though not literally never) use the GOLDEN or ALTREF frames, so this optimization significantly reduces the time spent on prefetching in many real videos. Beyond that, we made too many optimizations in various parts of the code to list them all, for example David's optimization of the entropy decoder. I would also like to thank Eli Friedman for his invaluable help in benchmarking most of these improvements.

What's next? Altivec (PPC) assembly is almost nonexistent: there are only a few functions from David's motion compensation code. There is no NEON (ARM) assembly at all, and we need it for the decoder to be fast on mobile devices too. All of this will come with time and, as usual, patches are welcome!

Appendix: raw numbers

Here are the numbers that correspond to the graphs above, in frames per second, with standard errors:

Core i7 620QM (1.6GHz), Windows 7, 32-bit:
Parkjoy ffvp8: 44.58 ± 0.44
Parkjoy libvpx: 33.06 ± 0.23
Sintel ffvp8: 74.26 ± 1.18
Sintel libvpx: 56.11 ± 0.96

Core i5 520M (2.4GHz), Linux, 64-bit:
Parkjoy ffvp8: 68.29 ± 0.06
Parkjoy libvpx: 41.06 ± 0.04
Sintel ffvp8: 112.38 ± 0.37
Sintel libvpx: 69.64 ± 0.09

Core 2 T9300 (2.5GHz), Mac OS X 10.6.4, 64-bit:
Parkjoy ffvp8: 54.09 ± 0.02
Parkjoy libvpx: 33.68 ± 0.01
Sintel ffvp8: 87.54 ± 0.03
Sintel libvpx: 52.74 ± 0.04

Core Duo (2GHz), Mac OS X 10.6.4, 32-bit:
Parkjoy ffvp8: 21.31 ± 0.02
Parkjoy libvpx: 17.96 ± 0.00
Sintel ffvp8: 41.24 ± 0.01
Sintel libvpx: 29.65 ± 0.02

Atom N270 (1.6GHz), Linux, 32-bit:
Parkjoy ffvp8: 15.29 ± 0.01
Parkjoy libvpx: 12.46 ± 0.01
Sintel ffvp8: 26.87 ± 0.05
Sintel libvpx: 20.41 ± 0.02
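
For a quicker read of the table, the relative speedup of ffvp8 over libvpx can be computed from the frame rates. The sketch below copies the mean values from the table above (standard errors are ignored) and uses a hypothetical helper named speedup.

```python
# Mean frames per second copied from the table above: (ffvp8, libvpx).
RESULTS = {
    "Core i7 620QM, 32-bit": {"Parkjoy": (44.58, 33.06), "Sintel": (74.26, 56.11)},
    "Core i5 520M, 64-bit": {"Parkjoy": (68.29, 41.06), "Sintel": (112.38, 69.64)},
    "Core 2 T9300, 64-bit": {"Parkjoy": (54.09, 33.68), "Sintel": (87.54, 52.74)},
    "Core Duo, 32-bit": {"Parkjoy": (21.31, 17.96), "Sintel": (41.24, 29.65)},
    "Atom N270, 32-bit": {"Parkjoy": (15.29, 12.46), "Sintel": (26.87, 20.41)},
}


def speedup(ffvp8_fps, libvpx_fps):
    """How many times faster ffvp8 decodes than libvpx."""
    return ffvp8_fps / libvpx_fps


for machine, clips in RESULTS.items():
    for clip, (ff, vpx) in clips.items():
        print(f"{machine:24s} {clip:8s} {speedup(ff, vpx):.2f}x")
```

The ratios come out at roughly 1.2x to 1.4x on the 32-bit machines and roughly 1.6x to 1.7x on the 64-bit ones.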

Translator's notes

Some terms remained a mystery to me. For example, if anyone can suggest the correct Russian translation of "6-tap filter", I would be very grateful.

In the comments under the original post, the author answers some questions from readers, a few of which I thought worth quoting here. These are not direct translations of the questions and answers, but brief summaries of their substance.

Q: Can ffvp8 use improvements made to libvpx?
A: In fact, all the optimizations that looked interesting have already been taken from there. But a simple merge of the source code does not work here, since the architectures of the two decoders are fundamentally different.

Q: Is there a danger that ffvp8 will not be able to stay compatible with the experimental libvpx development branch?
A: That is not a goal, since the experimental branch is currently not intended for real-world use. Even compatibility between the experimental branch and current libvpx is not guaranteed.

Q: Who sponsors FFmpeg development?
A: Nobody sponsors the project as a whole, but some developers are paid to implement features needed by specific customers. As far as the author knows, ffvp8 development was entirely non-commercial.

Q: Is the performance gain due to a single fundamental flaw in libvpx, or simply to many small optimizations here and there?
A: Mostly the latter. But the largest gain comes from the fact that libvpx makes several passes over each frame (as all previous On2 codecs do), while ffvp8 does all its operations in a single pass.

Q: Do you plan to develop your own VP8 encoder in FFmpeg?
A: That is a very big job and, honestly, I doubt it will ever be done. The only "native" encoder in FFmpeg is effectively the mpeg encoder, and there is hardly a way to build a VP8 encoder on the existing framework alone; at any rate, it would not be easy. But of course, if someone wants to try...

Q: But if mpeg is the only native encoder in FFmpeg, how does the library encode not only mpeg but also WMV 7/8, H.261/H.263 and other formats without using other libraries?
A: All of these encoders actually use the internal mpeg encoder with small variations for each format. Keep in mind that an encoder is a large, complex program made of many parts, and the only significant differences between the encoders for the formats listed are the entropy coding and the headers. Both can easily be replaced without changing the rest of the code. That is why FFmpeg has so many "encoders" that are all based on the main mpeg encoder: the differences between these formats are not that significant (they are all MPEG-like formats based on an 8x8 discrete cosine transform), so largely the same code can be used for all of them.
This, by the way, explains the absence of a WMV9 encoder in FFmpeg: that format differs too much from the previous versions to be implemented easily on top of what already exists.

Q: Can ffvp8 also decode VP4, 5, 6 and 7?
A: It can, but only VP4, 5 and 6, since no one has reverse-engineered VP7 yet. VP7 support will most likely appear in the near future now that VP8 has been opened up, since I suspect that VP7 and VP8 are mostly the same.

Q: Where can I get the latest SVN builds of Media Player Classic HomeCinema and ffdshow tryouts to try the new decoder on Windows?
A: xhmikosr.1f0.de

If you have any questions for the author of the post, I am ready to translate them and post them on his blog.

Source: https://habr.com/ru/post/100076/
