
Once More About Fast JPEG on CUDA

In 2012 I published an article on Habr about fast JPEG compression on a video card. Quite a lot of time has passed since then, and I would like to describe, in general terms, the results we have obtained on this topic. I hope many readers will be interested to know what level of performance modern NVIDIA graphics cards can deliver when solving practical problems with CUDA.

When discussing how long the compression algorithm takes, I report the time measured on the video card only, excluding data upload and download. This approach makes sense for complex processing pipelines in which all computation runs on the video card and, for most stages of the overall pipeline, both the input and the output data reside in video memory. There are also ways to copy data to and from the video card concurrently with computation, so discussing performance in terms of the algorithm's running time on the video card is fully justified. All measurements were made with NVIDIA Visual Profiler. The JPEG algorithm is implemented in its baseline variant (Baseline JPEG): 8 bits per channel, standard quantization and Huffman tables, no arithmetic coding, and no progressive mode.

JPEG encoding speed on CUDA has increased significantly since the last publication. On an NVIDIA GeForce GTX 1080, a 24-bit 4K (3840 x 2160) image can now be compressed by a factor of 10 (90% quality, 4:2:0 subsampling) in 0.78 ms, which corresponds to an encoding throughput of about 30 GB/s. For images of 8K and larger, throughput with the same parameters can be almost one and a half times higher still.
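The throughput figure above can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes throughput is counted in uncompressed input bytes with 1 GB = 10^9 bytes.

```python
def encode_throughput_gb_s(width, height, bytes_per_pixel, time_ms):
    """Uncompressed input bytes processed per second, in GB/s (1 GB = 1e9 bytes)."""
    input_bytes = width * height * bytes_per_pixel
    return input_bytes / (time_ms * 1e-3) / 1e9


# 24-bit 4K frame (3 bytes per pixel) encoded in 0.78 ms, as quoted in the text.
uhd_4k = encode_throughput_gb_s(3840, 2160, 3, 0.78)
print(f"{uhd_4k:.1f} GB/s")  # prints "31.9 GB/s"
```

The result is slightly above the rounded "about 30 GB/s" quoted in the text.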

On the one hand, this is thanks to NVIDIA, which keeps producing faster graphics cards. On the other hand, it is the result of our work on parallelizing and optimizing the JPEG algorithm for CUDA. Transferring an uncompressed 4K image from RAM to this video card over PCI-Express x16 (Gen3) takes 2.17 ms. This leads to an interesting result: JPEG compression now runs more than two and a half times faster than copying the uncompressed data over PCI-Express x16. Our decoder is still far behind the encoder: decoding a similar 4K image takes 2.6 ms. We hope to close this gap between the encoder and the decoder in the near future.
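To see how these timings combine when the source image starts in host RAM, here is a rough per-frame model. It is a sketch under stated assumptions, not a measurement: the download time is assumed to be one tenth of the upload time (10x smaller output at the same bus bandwidth), and the overlapped case assumes ideal, steady-state overlap of upload, encoding, and download.

```python
def per_frame_ms(upload_ms, encode_ms, download_ms, overlapped):
    """Steady-state time per frame: slowest stage if stages overlap, else their sum."""
    stages = (upload_ms, encode_ms, download_ms)
    return max(stages) if overlapped else sum(stages)


UPLOAD_MS = 2.17              # uncompressed 4K frame over PCI-Express x16 Gen3 (from the text)
ENCODE_MS = 0.78              # JPEG encoding on the GTX 1080 (from the text)
DOWNLOAD_MS = UPLOAD_MS / 10  # assumption: 10x smaller output, same bus bandwidth

serial = per_frame_ms(UPLOAD_MS, ENCODE_MS, DOWNLOAD_MS, overlapped=False)
pipelined = per_frame_ms(UPLOAD_MS, ENCODE_MS, DOWNLOAD_MS, overlapped=True)
copy_to_encode_ratio = UPLOAD_MS / ENCODE_MS

print(f"serial: {serial:.2f} ms, pipelined: {pipelined:.2f} ms")
print(f"upload / encode ratio: {copy_to_encode_ratio:.2f}")
```

Under this model the upload over the bus, not the encoder, limits the frame rate whenever the uncompressed data has to come from host memory.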
[Figure: CUDA JPEG benchmark]

Such high performance is needed when you have to process large arrays or streams of data. It is not needed for saving a single image to JPEG in a graphics editor, although burst mode is quite popular. Developers of 3D and VR visualization applications are the ones mainly interested in fast encoding and decoding, because it is convenient to store the source data as JPEGs, copy them to the video card, decode them quickly there, and output the result to a monitor or VR headset via OpenGL. This makes high frame rates achievable at high resolutions, for example 100-120 fps for 12-megapixel images. An important condition for fast decoding is a sufficient number of restart markers inside the JPEGs, which allow a high degree of parallelism during decoding. Without restart markers, JPEG decoding speed drops by an order of magnitude or more. Our encoder inserts these markers by default, but they can also be added offline with tools such as jpegtran, after which the JPEGs can be decoded quickly on the video card.
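Whether a given file carries restart markers can be checked by looking for the DRI (Define Restart Interval, 0xFFDD) segment in its header. The sketch below is a minimal illustration, not part of our codec: the function name is hypothetical, it only walks the header segments up to the first SOS marker, and it does not validate malformed files. The two byte strings are synthetic streams built for the demonstration.

```python
import struct


def restart_interval(jpeg: bytes) -> int:
    """Return the restart interval in MCUs from the DRI segment, or 0 if none is set."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    pos = 2
    interval = 0
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xFF:   # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == 0xDA:   # SOS: header segments are over, entropy-coded data follows
            break
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if marker == 0xDD:   # DRI: the payload is a 16-bit restart interval
            (interval,) = struct.unpack(">H", jpeg[pos + 4:pos + 6])
        pos += 2 + length
    return interval


# Synthetic header-only streams: SOI, a dummy APP0 segment, optional DRI, SOS.
app0 = b"\xff\xe0\x00\x04\x00\x00"
with_dri = b"\xff\xd8" + app0 + b"\xff\xdd\x00\x04\x00\x08" + b"\xff\xda\x00\x02"
without_dri = b"\xff\xd8" + app0 + b"\xff\xda\x00\x02"

print(restart_interval(with_dri))     # prints 8
print(restart_interval(without_dri))  # prints 0
```

A result of 0 means the entropy-coded data has no restart markers and can only be decoded sequentially. In that case the file can be rewritten offline, for example with jpegtran's `-restart` option.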

To support modern video cameras, we also implemented a 12-bit JPEG encoder on CUDA, which likewise delivers very high performance. It compresses a 12-bit 4K image at 90% quality with 4:2:0 subsampling on a GeForce GTX 1080 in 1.2 ms. For most video cameras (we work mainly with industrial and high-speed cameras), a 12-bit range covers the basic needs, and the high compression performance makes it possible to solve many problems in real time, even at very high frame rates.

In addition to the JPEG codec, we have implemented CUDA algorithms for demosaicing, resizing, noise suppression, and more. In effect, this is a set of fast parallel algorithms for processing RAW data on a video card. These solutions provide an almost complete image pre-processing pipeline for video cameras, and they run on all NVIDIA video cards, including the mobile Tegra K1 and Tegra X1. In the near future we plan to release a CUDA application built on this functionality for real-time processing of DNG image sequences from Blackmagic Design cameras.

More detailed benchmarks of the JPEG codec, demosaicing, resizing, noise suppression, and the JPEG2000 encoder can be found here.

Source: https://habr.com/ru/post/303306/

