How we built a CinemaDNG converter and player on CUDA

I have already published two articles on Habr (1 and 2), both about implementing fast JPEG image compression on CUDA. Now I would like to talk about another, much more ambitious task: how we built a converter and video player for series of DNG images on CUDA. We achieved very high processing speed because all processing of the source DNG data is now performed on an NVIDIA video card.

Original DNG image taken from blackmagicdesign.com

Although the world already has a great many RAW converters that work with the DNG format, we decided to make one more, but a very fast one, which could also be used for culling and sorting footage. DNG video players exist too, but they usually work at reduced resolution, so viewing freshly shot DNG material at full resolution is a problem. With our converter, we tried to process images fast enough to view a series of DNG images in real time and at full resolution. Naturally, in addition to speed, we needed acceptable processing quality and noise reduction, and it seems to me that we succeeded.

Just in case, let me remind you that DNG is an open format, proposed by Adobe, for storing RAW data captured by a video or still camera. We will consider the video case, although for photos the task is almost the same.
The problem statement: a fairly fast SSD holds a series of images in DNG format (all frames are compressed) with a resolution of up to 4K or 4.6K (as produced, for example, by the latest URSA or URSA Mini video cameras from Blackmagic Design). We need to read them in real time, decode them, do all the necessary processing, and smoothly output the video to a monitor at a given frame rate in the range of 24-30 frames per second at full resolution (without proxies, that is, reduced copies).

We have been developing algorithms and software for image and video processing on CUDA for quite a long time, so we had our own fast SDK at our disposal, in which we had implemented all the necessary functionality for working with RAW data from video cameras. As a result, the entire image processing pipeline currently looks something like this:

1. Multi-threaded reading of DNG files from SSD
2. Parsing DNG files, getting tiles
3. Multi-threaded decoding of DNG images
4. Sending decoded images to a video card
5. Crop for DNG
6. Linearization of the source data and conversion to 16 bits
7. Applying black and white levels
8. White balance
9. Exposure correction
10. Noise reduction before debayering
11. Debayering (demosaicing)
12. Noise reduction after debayering
13. Color transformations, temperature and tint
14. Curves and levels in RGB
15. Curves and levels in HSV
16. Crop and resize (for a given zoom)
17. Sharpness
18. Applying the monitor profile and gamma
19. Data conversion to 8 bits per channel
20. Copying data from CUDA memory to OpenGL texture and then output to the monitor after receiving V-Sync
21. Calculation and display of histograms and RGB parades for each frame
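
Steps 6 and 7 of the list above can be illustrated with a minimal sketch. It assumes a simple linear mapping from the sensor range between the black and white levels to the full 16-bit range; the function name and the rounding rule are hypothetical, any lookup-table linearization is omitted, and the article does not describe how the SDK actually implements these stages.

```python
def normalize_to_16bit(sample, black_level, white_level):
    """Map a raw sensor sample to the 0..65535 range (illustrative sketch).

    Assumption: a plain linear stretch between the black and white levels;
    table-based linearization is not modeled here.
    """
    if white_level <= black_level:
        raise ValueError("white_level must be greater than black_level")
    # Subtract the black level and stretch to the 16-bit range in float.
    scaled = (sample - black_level) * 65535.0 / (white_level - black_level)
    # Round to the nearest integer, then clamp values outside the valid range.
    return max(0, min(65535, int(scaled + 0.5)))
```

For a 12-bit sample with an assumed black level of 256 and white level of 4095, values at or below 256 map to 0 and values at or above 4095 map to 65535.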

As the description shows, the internal data representation is 16 bits per channel. This is not as bad as it might seem at first glance. All parts of the code where accuracy is very important (noise reduction, resize, sharpening) are computed in float, and the final result of each processing stage is stored in 16 bits. Many debayering algorithms are integer algorithms, so 16 bits suit them well, but some stages still have to be done in float. We also sometimes merge neighboring processing stages, for example white balance and exposure correction, which reduces the number of intermediate roundings. We visually compared the results with a 32-bit implementation and found no significant difference. We believe this is because our software has relatively few intermediate processing stages.
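
The effect of merging neighboring stages can be shown with a small sketch. It assumes that white balance and exposure correction are both plain multiplicative gains and that each stored result is rounded to a 16-bit integer. The function names are hypothetical and the code is not taken from the SDK.

```python
def to_u16(x):
    """Round to the nearest integer and clamp to the unsigned 16-bit range."""
    return max(0, min(65535, int(x + 0.5)))

def separate_stages(value, wb_gain, exposure_gain):
    """Apply the two gains as separate stages, storing 16 bits after each."""
    after_wb = to_u16(value * wb_gain)        # first rounding
    return to_u16(after_wb * exposure_gain)   # second rounding

def fused_stage(value, wb_gain, exposure_gain):
    """Apply both gains in float and round only once when storing."""
    return to_u16(value * wb_gain * exposure_gain)
```

With a sample value of 3 and both gains equal to 1.5, the exact result is 6.75. The separate stages give 8 and the fused stage gives 7, so a single rounding stays closer to the exact value.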

I would like to say a few words about debayering, i.e., about demosaicing algorithms. Converters very often use a bilinear algorithm or its analogs to reduce computation time, because it is one of the fastest options. On the CPU this is indeed true, but if you look at the quality of the restored picture in terms of peak signal-to-noise ratio (PSNR), then on the standard Kodak image set used for testing debayers, the bilinear algorithm yields less than 31 dB. The HQLI and DFPD debayering algorithms from our program give 36 dB and 39 dB on the same set of frames. Since debayering performance on a GPU is many times higher than on a CPU, there is no need for a bilinear debayer, and better algorithms can be used. When we built a JPEG codec on the video card, we measured PSNR as a function of the quality factor for different debayers and got an interesting result: the low PSNR of the final picture is determined by the bilinear debayer, not by JPEG artifacts, at a quality of 75% or higher. The conclusion is simple: if quality is required, it is better not to use a bilinear debayer. Of course, the PSNR metric (like SSIM and others) is a fairly rough measure, but it is an objective criterion that works in most cases, although not always.
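
For reference, PSNR can be computed as in the sketch below, which uses the standard definition: 10 times the base-10 logarithm of the squared peak value divided by the mean squared error. The flat-list input and the 8-bit default peak value are assumptions made for illustration. The article does not say how the per-image or per-channel values were averaged over the Kodak set.

```python
import math

def psnr(reference, restored, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, an 8 dB gap such as the one between 31 dB and 39 dB corresponds to a mean squared error that is about 6.3 times lower.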

We should soon complete the development of a new debayering algorithm for the GPU, which gives a PSNR of 40.7 dB on the Kodak set. The CPU version is ready, and a command-line test application is publicly available. With it, you can test all of our debayering algorithms and compare them.

The new debayering algorithm is MG (multiple gradients). We developed it ourselves, and it is not available in other RAW converters.

Implementing such a general processing pipeline for 10/12/14-bit raw data in DNG format in real time, at a frame rate of 24-30 fps and resolutions from 2K to 4K, requires careful optimization of each algorithm. It also requires getting the highest possible speed out of the SSD, CPU, and GPU. If any one of these three hardware components is too slow, real-time playback is impossible. For 4K-4.6K resolutions, we achieved faster than real-time processing on NVIDIA GeForce GTX 980 and 1080 video cards.
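
To give a feel for the real-time constraint, the sketch below computes the time budget per frame and the uncompressed data rate of a Bayer stream. The 4096x2160 frame size and the 12-bit sample depth used in the note after the code are assumptions chosen for illustration, and the function names are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available for one frame at the given frame rate, in milliseconds."""
    return 1000.0 / fps

def raw_frame_bytes(width, height, bits_per_sample):
    """Size in bytes of one uncompressed, tightly packed Bayer frame."""
    total_bits = width * height * bits_per_sample
    return (total_bits + 7) // 8  # round up to whole bytes

def uncompressed_rate_mb_s(width, height, bits_per_sample, fps):
    """Uncompressed data rate in megabytes (10**6 bytes) per second."""
    return raw_frame_bytes(width, height, bits_per_sample) * fps / 1e6
```

For an assumed 4096x2160 frame at 12 bits per sample, one uncompressed frame is about 13.3 MB. At 24 fps the whole pipeline has about 41.7 ms per frame and the stream is about 318 MB/s; at 30 fps the budget is about 33.3 ms and the stream is about 398 MB/s. Since the frames on disk are compressed, the actual SSD read rate is lower; the compression ratio is not given in the article.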

On good hardware, our DNG player works smoothly, and the image can be scaled to full screen. Hardware resizing in OpenGL is disabled, because it defaults to a bilinear resize algorithm, which produces noticeable artifacts, especially when downscaling. To solve this problem, we always resize on CUDA using the Lanczos algorithm and pass OpenGL a finished image whose dimensions match the window size. This adds some latency, but the picture quality improves.
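
The idea behind Lanczos resizing can be sketched in one dimension. This is a plain CPU illustration, not the CUDA implementation from the SDK. The window size `a=3`, the edge clamping, and the widening of the kernel when downscaling are common choices that are assumed here. A 2D resize would apply the same procedure to rows and then to columns.

```python
import math

def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, otherwise 0."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def resize_row(row, out_len, a=3):
    """Resample a 1D list of samples to out_len samples."""
    in_len = len(row)
    scale = in_len / out_len
    # When downscaling, stretch the kernel so that it also acts as a low-pass filter.
    support = max(1.0, scale)
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale - 0.5   # output pixel center in source coordinates
        lo = math.floor(center - a * support)
        hi = math.ceil(center + a * support)
        acc = 0.0
        weight_sum = 0.0
        for j in range(lo, hi + 1):
            w = lanczos_kernel((j - center) / support, a)
            sample = row[min(max(j, 0), in_len - 1)]  # clamp indices at the edges
            acc += w * sample
            weight_sum += w
        out.append(acc / weight_sum)       # normalize so flat areas stay flat
    return out
```

Normalizing by the sum of the weights keeps constant regions unchanged, and resizing a row to its own length returns the original samples up to floating-point error.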

DNG image taken from this site; cinematographer Joe Brawley, test footage from the Blackmagic URSA Mini 4.6K.

To save the results in an arbitrary container, you can use an external FFmpeg, which the user must install separately; our program can launch it with a specified command line. For example, with an external FFmpeg you can compress the output into 10-bit 4:4:4 ProRes and save it in a MOV container. Without FFmpeg, the program itself can save the processed frames as a series of 16-bit TIFF images or 8/12-bit JPEG images, with the color profile embedded in the header of each frame.
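
As an illustration of what such a command line might look like, the sketch below builds an FFmpeg argument list that encodes a numbered TIFF sequence into 10-bit 4:4:4 ProRes in a MOV file. The file names, the frame rate, and the use of a TIFF sequence on disk are assumptions. The article does not show the command line the program actually passes to FFmpeg.

```python
def build_prores_command(input_pattern, output_path, fps=24):
    """Build an FFmpeg argument list for a ProRes 4444 encode (illustrative)."""
    return [
        "ffmpeg",
        "-framerate", str(fps),        # frame rate of the input image sequence
        "-i", input_pattern,           # e.g. a printf-style pattern such as frame_%06d.tif
        "-c:v", "prores_ks",           # FFmpeg's ProRes encoder
        "-profile:v", "4444",          # 4:4:4 profile
        "-pix_fmt", "yuv444p10le",     # 10-bit 4:4:4 pixel format
        output_path,
    ]
```

The resulting list could be passed to `subprocess.run`, for example `subprocess.run(build_prores_command("frame_%06d.tif", "out.mov"), check=True)`, where both file names are hypothetical.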

Photos can be processed in the same way. However, since photography has many different proprietary formats, full support would require integrating LibRaw; for now, the only option is to first convert the original data to DNG with Adobe DNG Converter. Our program was not originally aimed at photos, so many important features are missing. Our SDK, on which the software is based, already implements part of the necessary functionality, so we may also make a fast photo converter. In that case, batch processing of photos should be very fast, because JPEG compression is also performed on the video card. On a good card, the processing time for one 50-megapixel image is shorter than the time needed to load the image from an SSD. However, working with such large files requires a GPU with at least 8 GB of memory.

The image processing pipeline described above is not yet complete: there is no chromatic aberration suppression module, a 3D LUT cannot be connected yet, inter-frame noise reduction is not ready, the editing codec is still external, and a number of other things are still missing. All this will be done soon. However, it is already clear that the entire processing pipeline for a series of DNG images can be implemented on CUDA even on a single video card, in real time and at maximum resolution.

We are aware of Adobe Premiere Pro 2015, Blackmagic DaVinci Resolve 12, and many other universal professional solutions in this area, including ones that use video cards. Our goal was not to create competing programs for file management, nonlinear editing, grading, and compression with editing codecs; in these areas, existing solutions handle their tasks well, although there is room for acceleration and improvement. We built a solution for very fast and fairly high-quality data processing on the video card, and users can judge what we achieved.

I would like to note an important point: our approach is not about GPU acceleration of individual image processing algorithms, as is often the case in many applications. In our program, all processing of a series of DNG images is performed on the video card, and this is the fundamental difference from all known solutions in this area. From our point of view, this is the optimal approach for increasing speed and an interesting opportunity for improving quality in real-time applications.

So that the user can see the execution time of each stage of the overall image processing pipeline, we made a special module that measures and displays the GPU time of each algorithm for a given image. If you activate the Benchmarks module, it displays the main data on memory usage, the image parameters, and a list of the processing algorithms involved, along with their execution times.
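
The general idea of per-stage timing can be sketched as follows. This is a CPU-side illustration using a wall-clock timer; on the GPU, stage times are normally measured with CUDA events (`cudaEventRecord`, `cudaEventElapsedTime`). The function name and the stage list format are hypothetical and do not describe the Benchmarks module itself.

```python
import time

def run_with_timings(stages, data):
    """Run (name, function) stages in order and record each stage's time in ms."""
    timings = []
    for name, func in stages:
        start = time.perf_counter()
        data = func(data)                                   # output feeds the next stage
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings.append((name, elapsed_ms))
    return data, timings
```

Each entry in the returned list pairs a stage name with its measured time, which is the kind of per-algorithm breakdown that a benchmark view can display.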

To solve the problem of quickly previewing a DNG series, we made a separate operating mode. This feature was strongly requested by users who cull and sort material in DNG format. You can now launch the program from Explorer through the context menu and view the series of DNG images in that folder as video in the player.

A promising direction for further development is the creation of plug-ins for Adobe Premiere Pro and After Effects. This would make it possible to combine our processing speed and quality with the ability to save the results using a variety of editing codecs without FFmpeg. Such plug-ins could be used instead of an editing codec to produce processed DNG frames at 16 bits per channel and pass them directly to After Effects in real time, so that the ProRes decoding stage is no longer required.

We are also working on another interesting aspect of this project: we are building a fast JPEG2000 codec on the video card in order to use it as an editing codec. This codec will be able to work with 16-bit data in real time, which will provide higher quality than other editing codecs. Even 10/12-bit data can be compressed with better quality than is possible now. Such a JPEG2000 codec can work in real time and will give higher quality than the widely used ProRes, DNxHD, and DNxHR. Adobe Premiere Pro already includes a CPU-based JPEG2000 codec, but it is very slow, so in practice it is of little use. A fast JPEG2000 codec will radically change this and will improve the quality of intermediate materials for subsequent editing.

We are continuing to work on improving image processing quality, and in the near future we expect to release the new MG debayer, which should be no worse than the one in Adobe Camera Raw (the ACR debayer is very good from our point of view). Our DFPD debayer in the current release is clearly better than the one in Adobe Premiere Pro 2015. Our free version has noise reduction both before and after debayering, whereas the free version of DaVinci Resolve has none, and Premiere Pro has no noise reduction of its own even in the paid version. Each program has its pros and cons, so one could compare them at length, but in the end everyone finds the tool that meets their own needs and tasks.

The free demo version of the program for Windows 7/8/10 (64-bit) can be downloaded from here, and a test series of CinemaDNG images can be downloaded here. The program can also work with video from a Canon 5D Mark III running the alternative Magic Lantern firmware, after converting MLV to DNG.

I would appreciate comments on the quality and speed of the presented solution. Once again, I want to remind you that the program works only with NVIDIA graphics cards, and for 4K DNG images it is desirable to have at least 2 GB of GPU memory. Unfortunately, the program will not work on AMD or Intel video cards, nor will it work if almost all of the NVIDIA card's memory is occupied by other applications. To achieve high processing speed, you need a fast SSD, CPU, and GPU.

Source: https://habr.com/ru/post/306566/
