Fast image compression using JPEG for CUDA

Summary: FVJPEG is a fast JPEG encoder for NVIDIA video cards. The speedup comes from parallelizing the algorithm and from implementing and optimizing it with CUDA. In compression speed, the FVJPEG encoder exceeds every existing software and hardware implementation of the Baseline JPEG algorithm that we know of.

Comparisons of lossy image compression algorithms almost always focus on the compression factor and the quality of the resulting image, while compression time is treated as a secondary indicator. For most applications this is justified, but in some situations compression time matters a great deal: for example, when compressing large arrays of images, or when working with equipment that generates huge amounts of data requiring real-time compression. This is exactly the case when compressing image sequences from high-speed video cameras. The data stream from a typical fast camera can reach 625 MB per second (resolution 1280 x 1024, 8 bits, 500 frames per second) and more. Some high-speed cameras stream data through a PCI-Express 2.0 x8 frame grabber into the computer's RAM at 2.4 GB per second. For such streams, parallelizing the data processing algorithm is not up for discussion: it is a requirement by definition. Therefore, the following criteria were formulated for selecting a fast compression algorithm:

the algorithm can be parallelized for both encoding and decoding
lossy compression by a factor of 10-20 at acceptable quality
the computational complexity of the algorithm is as low as possible
the task can be divided into as many subtasks as possible
minimal fast-memory requirements for a single data stream

The JPEG algorithm meets all of these requirements. Other lossy compression algorithms may satisfy them as well, but here we consider only JPEG.
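The camera data rate quoted in the introduction can be checked with simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes that "MB" means 2^20 bytes, which is the reading under which the 625 MB per second figure comes out exactly.

```python
MIB = 1024 * 1024  # assumption: the article's "MB" is 2**20 bytes


def stream_rate_bytes_per_s(width, height, bytes_per_pixel, fps):
    """Raw (uncompressed) data rate of a camera stream in bytes per second."""
    return width * height * bytes_per_pixel * fps


# The "typical fast camera" from the text: 1280 x 1024, 8 bits, 500 frames per second.
rate = stream_rate_bytes_per_s(1280, 1024, 1, 500)
print(rate / MIB)  # 625.0

# The same stream after lossy compression by a factor of 10 and of 20.
print(rate / MIB / 10, rate / MIB / 20)  # 62.5 31.25
```

With compression by a factor of 10-20, the stream shrinks to a few tens of megabytes per second, which is why the encoder rather than the storage becomes the bottleneck.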

To begin with, we look at the benchmarks for lossy Baseline JPEG compression speed, assuming the image is already loaded into main memory and only the compression itself remains. We will not take the common route of comparing our solution against obviously weak competitors; instead, the rivals are the fastest commercial solutions for multi-core CPUs available today:
JPEG encoder from PICTools Photo by Accusoft Pegasus (throughput of 150-250 MB per second, 8 bits, quality setting about 50%)

JPEG encoder from Intel IPP-7.0 (uic_transcoder_con.exe, version 7.0 build 205.85, [7.7.1058.205], name ippjy8-7.0.dll +, date Nov 27 2011, 64-bit; there is no official compression speed data, test results are given below)

Norpix JPEG encoder for high-speed video cameras (throughput of 200-250 MB per second, 8 bits, quality setting unknown)

Unfortunately, these figures cannot be verified, except for the IPP-7 encoder, which reports the encoding time itself, so we simply take the published numbers on trust. There is also no per-stage performance data for a more detailed analysis, so the comparison is limited to overall performance figures at the same compression parameters. The Kakadu JPEG codec could not be found, since that company now releases only a JPEG2000 codec, and the compression speed of libjpeg_simd-6b and libjpeg-turbo turned out to be very low.

Since we are dealing with implementations of the Baseline JPEG standard with fixed quantization and Huffman tables, the quality of the compressed image and the compression factor are uniquely determined by the compression parameters, so PSNR measurements and visual quality assessment are not strictly necessary. Nevertheless, both were performed for all tests.

The results of the best commercial solutions on multi-core CPUs are impressive, which makes it all the more interesting to try to beat them. To do so, we consider compressing images on a video card with the Baseline JPEG algorithm using NVIDIA CUDA. Since the goal is maximum performance, the hardware should be up to the task; the NVIDIA GeForce GTX 580 is a good fit.

While researching this area we found many projects and papers stating that the JPEG algorithm cannot be fully parallelized because of the entropy coding stage. For this reason, hybrid solutions were built in which the discrete cosine transform ran on the video card and the rest of the computation ran on the CPU. In other words, only the one stage that is obviously suited to the GPU was accelerated. This produced a slight increase in encoding speed, but the results could not compete with the best multi-threaded CPU solutions. Unfortunately, we failed to find any information about high-performance JPEG encoders for NVIDIA or ATI video cards.

Consider a problem in which the original uncompressed image data is located in the computer's RAM and needs to be compressed into JPEG. The implementation of the Baseline JPEG algorithm on a video card includes the following stages:

Copying data from RAM to the video card (Host-to-Device transfer)
RGB -> YCbCr conversion (not needed for 8-bit grayscale images)
Splitting the image into 8x8 blocks
Level shift (subtracting 128) for each pixel
Discrete cosine transform (DCT) for each block
Quantization for each block
Zig-zag reordering for each block
Delta coding (DPCM) of the DC coefficient of each block
Run-length encoding (RLE) of the AC coefficients of each block
Huffman coding of the data of each block
Setting restart markers RSTn for groups of blocks
Generating the output file: concatenating the data of the compressed blocks, adding a JFIF header
Copying the JPEG file from the video card to RAM (Device-to-Host transfer)
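The per-block stages in the middle of this list (level shift, DCT, quantization, zig-zag reordering) can be illustrated with a small CPU-side sketch. It is not the FVJPEG implementation, which runs on the GPU and is not published here. The function names are hypothetical, the DCT is the direct textbook formula rather than a fast variant, and the quantization table is assumed to be the example luminance table from Annex K of the JPEG specification, used unscaled.

```python
import math

N = 8

# Example luminance quantization table from Annex K of the JPEG specification.
# Assumption: the article's "default" table is this one, used without scaling.
QTABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

# Precomputed cosines: COS[u][x] = cos((2x + 1) * u * pi / 16).
COS = [[math.cos((2 * x + 1) * u * math.pi / 16) for x in range(N)] for u in range(N)]

# Zig-zag order: walk the anti-diagonals, alternating direction on each one.
ZIGZAG = sorted(
    ((r, c) for r in range(N) for c in range(N)),
    key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
)


def scale(u):
    """Normalization factor of the 8-point DCT."""
    return 1 / math.sqrt(2) if u == 0 else 1.0


def dct_8x8(block):
    """Direct 2D DCT of one 8x8 block (clear, but far from the fastest form)."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += block[x][y] * COS[u][x] * COS[v][y]
            out[u][v] = 0.25 * scale(u) * scale(v) * s
    return out


def encode_block_coefficients(pixels):
    """Level shift, DCT, quantization and zig-zag reordering for one 8x8 block."""
    shifted = [[p - 128 for p in row] for row in pixels]
    coeffs = dct_8x8(shifted)
    quantized = [[round(coeffs[u][v] / QTABLE[u][v]) for v in range(N)] for u in range(N)]
    return [quantized[r][c] for r, c in ZIGZAG]


# A flat block of value 136: only the DC coefficient is non-zero.
flat = [[136] * N for _ in range(N)]
print(encode_block_coefficients(flat)[:4])  # [4, 0, 0, 0]
```

Nothing in `encode_block_coefficients` reads data from any other block, which is what makes these stages a natural fit for one GPU thread block per 8x8 image block.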

This whole scheme has been implemented in CUDA. One of the main ideas of the standard JPEG algorithm is to split the image into 8x8 blocks that are then processed independently with the discrete cosine transform, which is an inherently parallel scheme. How to parallelize the discrete cosine transform was already known, and the published performance results were much better for the GPU than for the CPU. The delta coding stage (DPCM) of the DC coefficients can also be parallelized. Run-length encoding (RLE) and Huffman coding are performed independently on the data of each 8x8 block, so they parallelize as well. Finally, the compressed data from each block has to be written into the output stream, assembled into a JPEG image, and sent to the computer's RAM. So it can be argued that almost the entire JPEG encoding scheme parallelizes successfully and that the algorithm can be implemented entirely on the GPU.
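A minimal sketch of why these later stages also parallelize is given below. The function names are hypothetical. The prefix-sum step for placing variable-length block data in the output is one common technique and is an assumption here, since the article does not say how FVJPEG assembles its output. The sketch also ignores the reset of the DC predictor at restart markers.

```python
def dc_differences(dc_values):
    """DPCM of DC coefficients: each output needs only two inputs,
    so every difference can be computed by an independent thread."""
    return [dc - (dc_values[i - 1] if i > 0 else 0) for i, dc in enumerate(dc_values)]


def run_length_encode_ac(ac):
    """Run-length encode the 63 AC coefficients of one block (zig-zag order).

    Returns (zero_run, value) pairs. (15, 0) stands for a run of 16 zeros (ZRL)
    and (0, 0) is the end-of-block symbol for trailing zeros.
    """
    pairs = []
    run = 0
    for value in ac:
        if value == 0:
            run += 1
            continue
        while run > 15:          # runs longer than 15 zeros are split with ZRL
            pairs.append((15, 0))
            run -= 16
        pairs.append((run, value))
        run = 0
    if run > 0:
        pairs.append((0, 0))     # end of block
    return pairs


def output_offsets(block_sizes_bits):
    """Exclusive prefix sum: where each block's bits start in the output stream.
    Assumption: a scan like this is used to place variable-length block data."""
    offsets = []
    total = 0
    for size in block_sizes_bits:
        offsets.append(total)
        total += size
    return offsets


print(dc_differences([4, 6, 6, 3]))                     # [4, 2, 0, -3]
print(run_length_encode_ac([5, 0, 0, -2] + [0] * 59))   # [(0, 5), (2, -2), (0, 0)]
print(output_offsets([10, 3, 7]))                       # [0, 10, 13]
```

The only step with a dependency between blocks is the prefix sum, and prefix sums have well-known parallel formulations, so the claim that almost the whole encoder parallelizes remains plausible under these assumptions.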

The question of decoding remains. For this purpose the JPEG standard provides restart markers, which make it possible to start decoding from points inside the compressed image rather than only sequentially from the beginning. Unfortunately, most CPU implementations of JPEG compression do not insert these markers, so for such files the first stage of decoding has to be performed sequentially, and "foreign" images will be decoded relatively slowly (Photoshop does insert these markers when encoding). If the markers were set during encoding, the decoder first performs a quick search for all of them, after which decoding can proceed in parallel. In the compression scheme described above, a marker is inserted after a fixed number of 8x8 blocks, so decoding can also be parallelized. However, decoding is beyond the scope of this article.
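The "quick search for all installed markers" can be sketched as follows. In JPEG entropy-coded data, a 0xFF data byte is always followed by a stuffed 0x00, so the byte pair 0xFF 0xD0 through 0xFF 0xD7 can only be a restart marker RST0-RST7. The function name is hypothetical, and the sketch assumes it is given only the entropy-coded scan data, without file headers.

```python
def split_at_restart_markers(scan_data: bytes):
    """Split entropy-coded scan data into segments separated by RSTn markers.

    Each returned segment can be entropy-decoded independently of the others.
    """
    segments = []
    start = 0
    i = 0
    while i + 1 < len(scan_data):
        if scan_data[i] == 0xFF and 0xD0 <= scan_data[i + 1] <= 0xD7:
            segments.append(scan_data[start:i])
            start = i + 2   # skip the two marker bytes
            i += 2
        else:
            i += 1
    segments.append(scan_data[start:])
    return segments


# Two restart markers (RST0, RST1) and one stuffed byte pair (FF 00) that is not a marker.
data = b"\x12\x34\xff\xd0\x56\xff\x00\x78\xff\xd1\x9a"
print(split_at_restart_markers(data))
```

The scan itself is a single linear pass over the compressed bytes, which is cheap compared with Huffman decoding; once the segment boundaries are known, each segment can be handed to a separate thread.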

The following standard conditions were chosen for testing the program:

8-bit test images
quality setting 50-100%
horizontal and vertical image sizes must be multiples of 8
source file size not more than 64 MB

For testing the following computer configuration was used:

ASUS P6T Deluxe V2 motherboard (LGA1366, X58, ATX), Core i7 920 at 2.67 GHz, 6 GB DDR3.
Graphics cards for computing: GeForce GT 240 (cc = 1.2, 96 cores) or GeForce GTX 580 (cc = 2.0, 512 cores).
Operating system Windows 7, 64-bit, CUDA 4.1, driver 286.19.

The 8-bit test images were the commonly used lenna.bmp and boats.bmp, uic_test_image.bmp from IPP, and cathedral.bmp and big_building.bmp, taken from an external source.

The bottom row of the table shows, for each image, how the quality setting (in percent) maps to the compression factor (how many times the file size decreases) when compressing with the default Baseline JPEG quantization and Huffman tables.
[Table 1: test images]

The total running time of the FVJPEG codec under Windows was measured with the QueryPerformanceCounter() function. The execution time of individual functions on the video card was measured with the NVIDIA profiler. The profiler results are needed for a detailed analysis of the speed of each stage of the algorithm, which is of interest mainly to developers. The number of repetitions (an option many programs offer to improve timing accuracy) was set to one in all tests, since repeating the computation on the same data has no practical meaning. The main goal was thus to study the encoders under real operating conditions and to find the limits of their compression performance. For small images the encoding speed varies by 10-20% between runs, so the best result in a series of tests was taken. The tests covered the FVJPEG encoder on the NVIDIA GeForce GT 240 and GeForce GTX 580 video cards and the JPEG encoder from IPP-7.0 on the Core i7 920.

Table 2 shows the measured JPEG compression speed on the NVIDIA GeForce GT 240 video card, in megabytes per second, for various 8-bit images at different quality settings:
[Table 2: compression speed for the NVIDIA GeForce GT 240 graphics card]

Table 3 shows the JPEG compression rate measurement results for the NVIDIA GeForce GTX 580 video card in megabytes per second:
[Table 3: compression speed for the NVIDIA GeForce GTX 580 video card]

It is important that the measured compression speed includes the time taken to copy data to the video card and back. Copying data into the video card's memory is usually one of the longest operations in a GPU implementation of JPEG compression. If we estimate the encoding performance without the transfers, i.e. measure only the computation involved in lossy image compression, then for sufficiently large images at a quality setting of 50% we get a JPEG compression speed of 10 GB per second or more on the GeForce GTX 580. This result is an excellent illustration of how efficient parallel computing on powerful graphics cards can be.

For the comparison with the encoder from Intel IPP-7.0 (update 6), the same set of images and the same quality settings were used as in the tests on NVIDIA video cards. The command line looked like this: uic_transcoder_con.exe -otest.jpg -ilenna.bmp -t1 -q50 -n8, which means: compress the image lenna.bmp into a new image test.jpg using the Baseline JPEG algorithm (-jb), measure the program execution time with high accuracy (-t1), use a quality setting of 50% (-q50), and parallelize over 8 threads (-n8). The -m mode, which repeats the compression procedure, was not used, since there are no practical tasks that require recompressing the same image.

Table 4 shows the performance of the JPEG encoder from IPP-7.0 (compression speed in MB per second / image compression time in milliseconds):
[Table 4: compression speed of the JPEG encoder from IPP-7.0 on the Core i7 920 CPU]

With the test image from IPP-7.0 and the same compression parameters (uic_test_image.bmp, 1280 x 960, 8 bits, quality setting 50%, which corresponds to compression by a factor of 8.1), the GeForce GTX 580 video card reached 2.25 GB per second. That is noticeably better (about 6.7 times faster) than the 332 MB per second of the IPP-7.0 JPEG encoder on the Core i7 920 CPU parallelized over 8 threads. The IPP-7.0 encoder is two times slower even than the rather weak GeForce GT 240 video card, which reaches 680 MB per second with the same parameters.

Graph 1 shows the compression speed for the test image cathedral.bmp (2000 x 3008, 8 bits) on the GeForce GT 240 and GeForce GTX 580 video cards (FVJPEG encoder) and on the Core i7 920 CPU (IPP-7.0 JPEG encoder) at different Baseline JPEG quality settings:
[Graph 1: compression speed of the 8-bit image cathedral.bmp to JPEG on GeForce GT 240, GeForce GTX 580 and CPU Core i7 920]

The NVIDIA profiler was used to measure the duration of each stage of the encoding algorithm. Table 5 shows the Baseline JPEG compression times with the default quantization and Huffman tables at different quality settings on the NVIDIA GeForce GTX 580 video card. The image is cathedral.bmp, resolution 2000 x 3008, 8 bits; the number in brackets after each quality setting shows how many times the image is compressed:
[Table 5: execution time of the main JPEG compression stages on the GeForce GTX 580 video card]

Thus we can see how quickly each stage of the JPEG compression algorithm runs on the video card: loading the source data, discrete cosine transform, run-length encoding, Huffman encoding, and unloading the compressed image. When estimating the entropy coding speed, the contributions of run-length encoding and Huffman coding have to be separated. The size of the data differs from stage to stage of the JPEG algorithm, and this matters when calculating performance. For Huffman coding, dividing the size of the compressed file by the execution time gives a lower bound on the coding speed. To compute the exact speed of each compression stage, one has to measure not only the time but also the size of the input data at each stage.

This demonstrates that very fast JPEG image compression on a video card is possible in principle, and for a fairly wide class of NVIDIA graphics cards, including budget and even mobile ones. The results significantly exceed the JPEG encoding speeds we have seen for the fastest solutions on multi-core CPUs.

Analysis of the execution time of each stage of the JPEG algorithm on the video card gives the following picture: one of the slowest operations is copying data from the computer's RAM to the video card. For large images, upload and download speeds are on the order of 6 GB/s. This limit comes from the bandwidth of the PCI-Express 2.0 bus, and a significant speedup is possible in principle only with the next generation of PCI-Express (moving from Gen2 to Gen3). The image transfer time and the execution time of the discrete cosine transform depend only on the image size, whereas run-length encoding and Huffman coding depend on the image content, the quality setting, and the size of the data at that stage of compression.
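The effect of the bus on overall throughput can be estimated with a rough model. The sketch below is not a measurement: the function name is hypothetical, it assumes the three stages (upload, compute, download) run strictly one after another with no overlap, and it plugs in the article's round figures of 6 GB/s for PCI-Express and 10 GB/s for computation, together with an assumed compression factor of 10.

```python
def end_to_end_throughput(image_bytes, pcie_bytes_per_s, compute_bytes_per_s, compression_factor):
    """Input bytes processed per second when upload, compute and download do not overlap."""
    upload = image_bytes / pcie_bytes_per_s
    compute = image_bytes / compute_bytes_per_s
    download = (image_bytes / compression_factor) / pcie_bytes_per_s
    return image_bytes / (upload + compute + download)


# cathedral.bmp from the article: 2000 x 3008 pixels, 8 bits per pixel.
size = 2000 * 3008
rate = end_to_end_throughput(size, 6e9, 10e9, 10)
print(round(rate / 1e9, 2))  # 3.53 (GB per second, under the assumptions above)
```

Even with 10 GB/s of pure compute, the model tops out at about 3.5 GB/s, and most of the time goes to the upload. This is consistent with the remark that a faster bus, or overlapping copies with computation, is where further gains would come from.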

Image size is an important factor in performance. The larger the image, the higher the compression speed on the video card. Apparently, a small image does not contain enough data to load the video card fully, so part of its computing power sits idle. With relatively large frames (4 megabytes or more) the card is fully loaded and compression performance increases. For small images there is still a way to speed things up: load several frames at once and encode them in parallel. This loads the video card completely and gives maximum performance even for small images, so the compression speed for a series of small frames will be close to the results obtained for large single images. This mode of loading and compressing several images simultaneously has already been tested and will be included in the final version of the FVJPEG codec.
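One simple way to form such groups of small frames is sketched below. It is an illustration only: the function name and the greedy grouping policy are assumptions, and the article does not describe how FVJPEG actually batches frames. The 4-megabyte threshold is the figure mentioned above.

```python
def batch_frames(frame_sizes, min_batch_bytes):
    """Greedily group consecutive frames until each batch reaches min_batch_bytes.

    Returns a list of batches, each a list of frame indices. The last batch
    may be smaller than the threshold.
    """
    batches = []
    current = []
    current_bytes = 0
    for index, size in enumerate(frame_sizes):
        current.append(index)
        current_bytes += size
        if current_bytes >= min_batch_bytes:
            batches.append(current)
            current = []
            current_bytes = 0
    if current:
        batches.append(current)
    return batches


# Ten 1 MB frames grouped so that each upload carries at least 4 MB.
print(batch_frames([1_000_000] * 10, 4_000_000))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each batch would then be copied to the card and encoded as if it were one large image made of independent 8x8 blocks, with the per-frame outputs separated afterwards.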

It is also worth noting that the lossy JPEG compression speeds obtained on the GeForce GTX 580 video card exceed those of all known hardware (FPGA) solutions for the same algorithm. Some of the fastest JPEG compression systems are offered by the following companies:

Barco BA116 JPEG Encoder (High speed baseline DCT-based JPEG color encoder) - a hardware JPEG encoder core for FPGA, performance up to 140 MB/s.

CAST Inc. (JPEG-C Baseline JPEG Compression Codec Core) - hardware JPEG compression for FPGA, performance up to 275 MB/s on Xilinx Virtex-6 and up to 280 MB/s on Altera Stratix IV.

Visengi (JPEG / MJPEG Hardware Compressor IP Core) - a hardware JPEG encoder core for Virtex-5 FPGA, performance up to 405 MB/s.

Alma-Tech (SVE-JPEG-E, SpeedView Enabled JPEG Encoder Megafunction) - a hardware Baseline JPEG encoder core for Altera / Xilinx FPGA, performance up to 500 MB/s.

At trade shows, private companies have reported JPEG encoders on FPGAs with a throughput of up to 680 MB per second (four separate blocks working in parallel at 170 MB per second each), but no details about these solutions could be found.

Comparing the software encoder on the GPU with hardware JPEG encoders on FPGAs, it is worth noting that, apart from the higher performance of the GPU solution, C code for CUDA is much easier to understand than Verilog / VHDL and much easier to modify when building new, more complex image processing and compression systems on top of it. The FPGA clearly has advantages of its own, but here we confine ourselves to the question of computation speed.

For completeness, the shortcomings of parallel computing on video cards should not be forgotten. First, computing on a video card makes sense only for parallelizable algorithms; many sequential algorithms cannot be ported, so CPU software is still needed to implement them. Second, CUDA can be used only on NVIDIA video cards, and the latest features (the Fermi architecture) are available only on the newest models. There is the OpenCL standard, which targets video cards from various manufacturers, but it does not deliver maximum performance on NVIDIA video cards, and in general the universality of such a solution across all existing video card architectures seems questionable. Third, the streaming multiprocessors on video cards have very little fast shared memory, which restricts the algorithms that can be used and their efficiency, and the available (relatively slow) GDDR5 memory currently does not exceed 6 GB per video card, while the RAM of a CPU system may exceed one hundred gigabytes. High performance requires loading the video card fully, which is possible only for parallel algorithms and large amounts of data. Finally, computing on a GPU always requires first copying the data to the video card, which adds latency and increases the computation time. Therefore, when designing a high-performance solution on a video card, the specific features of the algorithm and the ways it can be implemented on particular hardware have to be taken into account.

This article presents a solution for fast compression of 8-bit images with the standard JPEG algorithm on a single video card. Several ways of scaling the performance of this solution remain in reserve: NVIDIA now offers technology for parallelizing such tasks across several video cards, which in principle allows a further multiple increase in compression speed. More powerful graphics cards such as the GeForce GTX 590 should also give better results. The algorithm could potentially be sped up further by overlapping the copy and computation stages. The next, more powerful generation of NVIDIA graphics cards should raise compression performance as well. Thus, parallel computing on video cards opens up ample opportunities for creating fast data processing algorithms, and there is still room for improvement. Further research directions of great interest include optimizing the solution obtained and other algorithms for fast image and video compression (MJPEG), including lossless ones.

The image compression software described in this article is the result of preliminary research and does not yet exist as a finished commercial product. The release of the FVJPEG codec and the corresponding SDK for fast Baseline JPEG image compression on NVIDIA GPUs for Windows / Linux is expected soon.

Source: https://habr.com/ru/post/139970/
