Efficient video encoding on Linux with Nvidia NVENC: part 2, supplementary

In the first part, I covered video encoding on Linux with Nvidia NVENC. As mentioned there, Nvidia limits desktop video cards to two simultaneous encoding sessions per system. This part is devoted to working around that limitation.

Environment

Everything described here was done on a machine with a GTX 970 and with FFmpeg installed and configured as discussed in the first part.

External manifestations

If you try to encode more than two video streams in parallel, FFmpeg reports an error:

 ...
 [nvenc @ 0x3187200] OpenEncodeSessionEx failed: 0xa - invalid license key?
 ...
 Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height

To reproduce this error reliably and conveniently, I started two ffmpeg transcodes and suspended each one with Ctrl+Z in the terminal (this sends SIGTSTP; sending SIGSTOP has the same effect). A suspended process keeps its encoding session open:

 $ /usr/local/bin/ffmpeg -y -i input.mov -vcodec nvenc output1.mp4
 ...
 Stream mapping:
   Stream #0:1 -> #0:0 (mpeg4 (native) -> h264 (nvenc))
   Stream #0:0 -> #0:1 (aac (native) -> aac (libfdk_aac))
 Press [q] to stop, [?] for help
 frame=   81 fps= 80 q=0.0 size=    1362kB time=00:00:03.24 bitrate=3444.9kbits/s
 [1]+  Stopped                 /usr/local/bin/ffmpeg -y -i input.mov -vcodec nvenc out1.mp4
 $ /usr/local/bin/ffmpeg -y -i input.mov -vcodec nvenc output2.mp4
 ...
 Stream mapping:
   Stream #0:1 -> #0:0 (mpeg4 (native) -> h264 (nvenc))
   Stream #0:0 -> #0:1 (aac (native) -> aac (libfdk_aac))
 Press [q] to stop, [?] for help
 frame=   81 fps= 80 q=0.0 size=    1362kB time=00:00:03.24 bitrate=3444.9kbits/s
 [2]+  Stopped                 /usr/local/bin/ffmpeg -y -i input.mov -vcodec nvenc out1.mp4
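
The same setup can also be scripted instead of typed by hand. The following is a minimal Python sketch, assuming a POSIX system; the helper names start_and_pause and resume_and_stop and the default delay are hypothetical and are not part of FFmpeg or any other tool mentioned here.

```python
import signal
import subprocess
import time


def start_and_pause(cmd, delay=3.0):
    """Start cmd, let it run for `delay` seconds, then suspend it.

    The suspended process stays alive and keeps whatever it has opened.
    (Hypothetical helper name, used only for this illustration.)
    """
    proc = subprocess.Popen(cmd)
    time.sleep(delay)                  # give the program time to initialize
    proc.send_signal(signal.SIGSTOP)   # suspend, much like Ctrl+Z in a terminal
    return proc


def resume_and_stop(proc):
    """Wake a suspended process, ask it to terminate, and wait for it."""
    proc.send_signal(signal.SIGCONT)
    proc.terminate()
    proc.wait()
```

Calling start_and_pause twice with the two ffmpeg command lines shown above (as argument lists) should leave two suspended encoders behind, and resume_and_stop cleans each of them up afterwards.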

ltrace

Let's look at the failure in more detail:

 $ ltrace /usr/local/bin/ffmpeg -y -i input.mov -vcodec nvenc out3.mp4 2>&1 | less
 ...
 dlsym(0x313e360, "cuInit") = 0x7f93974182c0
 dlsym(0x313e360, "cuDeviceGetCount") = 0x7f9397418760
 dlsym(0x313e360, "cuDeviceGet") = 0x7f93974185c0
 dlsym(0x313e360, "cuDeviceGetName") = 0x7f93974188e0
 dlsym(0x313e360, "cuDeviceComputeCapability") = 0x7f9397418f80
 dlsym(0x313e360, "cuCtxCreate_v2") = 0x7f9397419940
 dlsym(0x313e360, "cuCtxPopCurrent_v2") = 0x7f9397419df0
 dlsym(0x313e360, "cuCtxDestroy_v2") = 0x7f9397419af0
 dlopen("libnvidia-encode.so.1", 1) = 0x3231970
 dlsym(0x3231970, "NvEncodeAPICreateInstance") = 0x7f93970d4370
 posix_memalign(0x7fffb429d490, 32, 640, 0x7fffb429d3f8) = 0
 memset(0x3141420, '\0', 640) = 0x3141420
 free(0) = <void>
 pthread_mutex_lock(0x19a90e0, 8, 0xf8f340, 0x7fffb429d3f8) = 0
 __vsnprintf_chk(0x7fffb429c3b4, 1004, 1, -1) = 20
 __vsnprintf_chk(0x7fffb429cbb4, 1004, 1, -1) = 55
 __snprintf_chk(0x7fffb429cfa0, 1024, 1, 1024) = 75
 strcmp("[nvenc @ 0x3187200] OpenEncodeSe"..., "\n") = 81
 __strcpy_chk(0x19a8cc0, 0x7fffb429cfa0, 1024, 0) = 0x19a8cc0
 fputs("[nvenc @ 0x3187200]", 0x7f93a554e1c0[nvenc @ 0x3187200]) = 1
 fputs("OpenEncodeSessionEx failed: 0xa"..., 0x7f93a554e1c0OpenEncodeSessionEx failed: 0xa - invalid license key?
 ) = 1
 pthread_mutex_unlock(0x19a90e0, 0, 0x7fffb429cbb4, -1) = 0
 ...

We can see the error message being printed, but not what caused it: the library and its symbols (functions) are loaded dynamically, so the calls made through them do not show up in the trace.

FFmpeg source code

Let's find this message in the FFmpeg source code and use it as a starting point.

 ~/ffmpeg-2.7.1$ fgrep -rn OpenEncodeSessionEx
 ...
 libavcodec/nvenc.c:606:    nv_status = p_nvenc->nvEncOpenEncodeSessionEx(&encode_session_params, &ctx->nvencoder);
 libavcodec/nvenc.c:609:        av_log(avctx, AV_LOG_FATAL, "OpenEncodeSessionEx failed: 0x%x - invalid license key?\n", (int)nv_status);
 ...

That settles it: this is where we will set a breakpoint.

Light GDB

The GNU Debugger (GDB) is the standard debugger on Unix-like systems, used to examine a running program and track down errors in it.

To find your way around the machine code of a compiled application and relate it to the source code, the application should be built with debugging symbols. Above all, they contain the mapping between machine code and source lines.

In most distributions, packaged binaries have their debugging information stripped, and for some packages it is shipped as a separate package. In Ubuntu, these are usually packages with the -dbg suffix. In CentOS, you need to enable the repository with debugging symbols and use the debuginfo-install utility from yum-utils, which installs debugging symbols for a package and its dependencies.

In our case, with a self-compiled FFmpeg, the unstripped binary is available in the build directory under the name ffmpeg_g. We can run it under the debugger and immediately set a breakpoint on the desired line of source code.

 # gdb ffmpeg-2.7.1/ffmpeg_g
 GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
 Copyright (C) 2014 Free Software Foundation, Inc.
 ...
 Reading symbols from ffmpeg-2.7.1/ffmpeg_g...done.
 (gdb)

Set a breakpoint at the line we are interested in:

 (gdb) break nvenc.c:606
 Breakpoint 1 at 0x44a890: file libavcodec/nvenc.c, line 606.

Run the program, passing its command-line arguments to the run command:

 (gdb) run -i in.mov -vcodec nvenc out3.mp4
 ...
 Breakpoint 1, nvenc_encode_init (avctx=0x1b806e0) at libavcodec/nvenc.c:606
 606         nv_status = p_nvenc->nvEncOpenEncodeSessionEx(&encode_session_params, &ctx->nvencoder);
 (gdb) list
 601         }
 602
 603         encode_session_params.device = ctx->cu_context;
 604         encode_session_params.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
 605
 606         nv_status = p_nvenc->nvEncOpenEncodeSessionEx(&encode_session_params, &ctx->nvencoder);
 607         if (nv_status != NV_ENC_SUCCESS) {
 608             ctx->nvencoder = NULL;
 609             av_log(avctx, AV_LOG_FATAL, "OpenEncodeSessionEx failed: 0x%x - invalid license key?\n", (int)nv_status);
 610             res = AVERROR_EXTERNAL;
 (gdb)

The breakpoint fired, and we are at the right place in the code. For convenience, pressing Ctrl+X, Ctrl+A switches GDB to a split-screen mode that shows the source code alongside the command window.

Step through the code until this function call returns.

 606         nv_status = p_nvenc->nvEncOpenEncodeSessionEx(&encode_session_params, &ctx->nvencoder);
 (gdb) step
 603         encode_session_params.device = ctx->cu_context;
 (gdb) step
 606         nv_status = p_nvenc->nvEncOpenEncodeSessionEx(&encode_session_params, &ctx->nvencoder);
 (gdb) step
 607         if (nv_status != NV_ENC_SUCCESS) {

By the way, the last command entered can be repeated simply by pressing Enter. The function's return value is stored in the local variable nv_status. Let's see what it holds:

 (gdb) info locals
 ...
 nv_status = NV_ENC_ERR_OUT_OF_MEMORY
 ...

Now kill the suspended ffmpeg processes and, in the debugger, restart the program with the run command, which launches it with the same arguments. On reaching the same place, we see:

 (gdb) info locals
 ...
 nv_status = NV_ENC_SUCCESS
 ...

So the function that opens an encoding session returns NV_ENC_SUCCESS (0) on success, or NV_ENC_ERR_OUT_OF_MEMORY (10, the 0xa from the log message) if two encoding sessions are already open. The "invalid license key?" text is only FFmpeg's guess. Let's step into this function.
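
To make the link between the log message and the status name explicit, here is a small Python sketch that pulls the hex code out of the FFmpeg log line and maps it to a name. The table deliberately lists only the two values confirmed by the debugging session above, and the function name decode_nvenc_status is hypothetical.

```python
import re

# Only the two values observed in the debugger above are listed here.
NVENC_STATUS_NAMES = {
    0: "NV_ENC_SUCCESS",
    10: "NV_ENC_ERR_OUT_OF_MEMORY",
}


def decode_nvenc_status(log_line):
    """Extract the hex status from an 'OpenEncodeSessionEx failed' log line.

    Returns (code, name), or None if the line does not contain the message.
    (Hypothetical helper, used only for this illustration.)
    """
    match = re.search(r"OpenEncodeSessionEx failed: (0x[0-9a-fA-F]+)", log_line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, NVENC_STATUS_NAMES.get(code, "unknown")


line = "[nvenc @ 0x3187200] OpenEncodeSessionEx failed: 0xa - invalid license key?"
print(decode_nvenc_status(line))  # (10, 'NV_ENC_ERR_OUT_OF_MEMORY')
```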

Dark GDB

Run to the call of this function again and step into it.

 Breakpoint 1, nvenc_encode_init (avctx=0x1b806e0) at libavcodec/nvenc.c:606
 606         nv_status = p_nvenc->nvEncOpenEncodeSessionEx(&encode_session_params, &ctx->nvencoder);
 (gdb) layout asm

GDB switches to the assembly layout, with the disassembly shown alongside the command window.

If the screen gets cluttered, press Ctrl+L to force a redraw.

The instruction pointer is at the code that loads the function's arguments before the call. Let's go deeper:

 (gdb) set step-mode on
 (gdb) step
 ...

We end up inside a function in /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1:

   > | 0x7fffe289b010  mov    %rbp,-0x20(%rsp)  |
     | 0x7fffe289b015  mov    %r12,-0x18(%rsp)  |
     | 0x7fffe289b01a  mov    $0x6,%ebp         |
     | 0x7fffe289b01f  mov    %rbx,-0x28(%rsp)  |
     | 0x7fffe289b024  mov    %r13,-0x10(%rsp)  |
     | 0x7fffe289b029  mov    %rsi,%r12         |
     | 0x7fffe289b02c  mov    %r14,-0x8(%rsp)   |
     | 0x7fffe289b031  sub    $0xa8,%rsp        |
     | 0x7fffe289b038  test   %rdi,%rdi         |
     | 0x7fffe289b03b  sete   %dl               |
     | 0x7fffe289b03e  test   %rsi,%rsi         |
     | 0x7fffe289b041  sete   %al               |
     | 0x7fffe289b044  or     %al,%dl           |
     | 0x7fffe289b046  jne    0x7fffe289b060    |
     | 0x7fffe289b048  mov    (%rdi),%eax       |

We step through the entire function this way, noting on a piece of paper which way each conditional jump goes. When execution enters a call to another function, we leave it immediately with the finish command so as not to go too deep. We do this twice: once when all encoding sessions are taken and once when some are free.

Comparing the two runs, we find that they diverge at this point:

     | 0x7fffe289b319  callq  0x7fffe288d510    |
     | 0x7fffe289b31e  test   %eax,%eax         |
     | 0x7fffe289b320  mov    %eax,%ebp         |
     | 0x7fffe289b322  jne    0x7fffe289b332    |

Decision function:

   > | 0x7fffe288d510  mov    %rbx,-0x20(%rsp)  |
     | 0x7fffe288d515  mov    %rbp,-0x18(%rsp)  |
     | 0x7fffe288d51a  mov    %rdi,%rbx         |
     | 0x7fffe288d51d  mov    %r12,-0x10(%rsp)  |
     | 0x7fffe288d522  mov    %r13,-0x8(%rsp)   |
     | 0x7fffe288d527  sub    $0x28,%rsp        |
     | 0x7fffe288d52b  test   %rsi,%rsi         |
     | 0x7fffe288d52e  mov    %rsi,%r12         |
     | 0x7fffe288d531  mov    %rcx,%r13         |
     | 0x7fffe288d534  mov    $0x6,%ebp         |
     | 0x7fffe288d539  je     0x7fffe288d54d    |
     | 0x7fffe288d53b  dec    %edx              |
     | 0x7fffe288d53d  mov    $0xa,%bpl         |
     | 0x7fffe288d540  je     0x7fffe288d568    |
     | 0x7fffe288d542  cmpb   $0x1,0x10(%rbx)   |
     | 0x7fffe288d546  je     0x7fffe288d5a3    |
     | 0x7fffe288d548  mov    $0x2,%ebp         |
     | 0x7fffe288d54d  mov    %ebp,%eax         |
     | 0x7fffe288d54f  mov    0x8(%rsp),%rbx    |
     | 0x7fffe288d554  mov    0x10(%rsp),%rbp   |
     | 0x7fffe288d559  mov    0x18(%rsp),%r12   |
     | 0x7fffe288d55e  mov    0x20(%rsp),%r13   |
     | 0x7fffe288d563  add    $0x28,%rsp        |
     | 0x7fffe288d567  retq                     |

This function checks something in memory; if the check succeeds, it does some work and returns 0, otherwise it returns 10. That is exactly the error code that nvEncOpenEncodeSessionEx() returns on failure. Let's try ignoring this function's result, as if it had returned 0.

We stop right after callq 0x7fffe288d510, before test %eax,%eax, zero the register that holds the function's result, and let the program continue running:

 (gdb) set $eax = 0
 (gdb) continue

Transcoding started, and it even produces correct output. So we need the code to always behave this way. Let's fix it in libnvidia-encode.so.1 itself.

First we need to find where this code lives in the file on disk, that is, which file offset corresponds to the virtual address of the library code loaded into memory.

 (gdb) info proc mappings
 process 1692
 Mapped address spaces:

           Start Addr           End Addr       Size     Offset objfile
 ...
       0x7fffe2887000     0x7fffe28a8000    0x21000        0x0 /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.346.46
 ...

We are interested in the area around address 0x7fffe289b31e, which falls into this region. The offset in the file is then: address - start address of the mapping + offset of the mapping.

 (gdb) print/x 0x7fffe289b31e - 0x7fffe2887000 + 0x0
 $7 = 0x1431e
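
The same arithmetic can be written as a small Python function that also checks that the address really lies inside the mapping. This is only a sketch; the function name file_offset is hypothetical, and the numbers are the ones from the info proc mappings output above.

```python
def file_offset(address, start_addr, end_addr, map_offset):
    """Translate a virtual address inside a mapping to an offset in the file."""
    if not start_addr <= address < end_addr:
        raise ValueError("address is outside the mapping")
    return address - start_addr + map_offset


# Values taken from the `info proc mappings` output above.
offset = file_offset(0x7fffe289b31e, 0x7fffe2887000, 0x7fffe28a8000, 0x0)
print(hex(offset))  # 0x1431e
```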

Biew

It remains to patch the file itself. I found nothing better for this than biew (since renamed beye). Make a backup first, then open the file for editing:

 biew /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.346.46

In biew: F2 -> Disassembler, F5 -> 1431e.

The disassembly at that offset looks exactly like the code we need, so we have landed in the right place. We need the eax register to be 0 and the conditional jump never to be taken.

Pressing F4 turns on editing mode. Unlike hiew, biew has no convenient mode in which you type an instruction and the editor assembles it for you, so the opcodes have to be edited as raw bytes. For example:

The byte at offset 0x1431e is changed from 0x85 to 0x29, which turns the instruction "test eax, eax" into "sub eax, eax". The two bytes at offsets 0x14322 and 0x14323 (the conditional jump) are replaced with 0x90, the well-known nop opcode.
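
The byte edits described above can also be expressed as a small Python sketch that works on an in-memory copy of the file and refuses to change anything if the original bytes are not what it expects. Some assumptions: the helper name apply_patch is hypothetical; the expected original values 0x75 and 0x0e for the jump are derived from the disassembly (short jne from 0x7fffe289b322 to 0x7fffe289b332), since the text only states the 0x85 -> 0x29 change explicitly; and the offsets apply only to the libnvidia-encode.so.346.46 file examined here.

```python
# offset -> (expected original byte, new byte)
# The expected jne bytes are an assumption derived from the disassembly:
# a short jne is opcode 0x75, and 0x7fffe289b332 - 0x7fffe289b324 = 0x0e.
PATCH = {
    0x1431E: (0x85, 0x29),  # test %eax,%eax -> sub %eax,%eax
    0x14322: (0x75, 0x90),  # jne opcode     -> nop
    0x14323: (0x0E, 0x90),  # jne offset     -> nop
}


def apply_patch(data, patch=PATCH):
    """Return a patched copy of data; raise if any original byte differs."""
    patched = bytearray(data)
    for offset, (expected, new) in sorted(patch.items()):
        if patched[offset] != expected:
            raise ValueError(
                f"unexpected byte at {offset:#x}: {patched[offset]:#04x}"
            )
        patched[offset] = new
    return bytes(patched)
```

One way to use it is to read the backed-up library with pathlib.Path.read_bytes(), pass the result to apply_patch, and write the returned bytes to a new file. The check on the original bytes means a library from a different driver version is rejected rather than silently corrupted.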

Summary

The resulting solution works quite well. With standard tools you can achieve a lot and push the boundaries of what is possible.

Source: https://habr.com/ru/post/262563/
