
Combining video clips from multiple cameras and synchronizing them over time

In the remote surveillance system (VOS) reviewed in the previous article, the Kurento media server is used to manage the media streams and to record them, with each stream saved as a separate file. The problem is that when reviewing an exam protocol you need to play three streams simultaneously and in sync (the test-taker's webcam with sound, the proctor's webcam with sound, and the test-taker's desktop), and during the exam each stream can be broken into several fragments. This article is about how to solve this problem, and also how to organize uploading of the video to a WebDAV server, with just one bash script.

Playback of the VOS video archive

The Kurento media server saves media streams in the form they arrive from the client: the stream is essentially dumped into a webm file using the vp8 and vorbis codecs (mp4 is also supported). As a result, the saved files have variable video resolution and variable bit rate, because WebRTC dynamically changes the encoding parameters of the video and audio streams depending on the quality of the communication channels. During a proctoring session clients may connect and disconnect several times, which produces multiple files for each camera and screen, and also leads to time desynchronization if all these fragments are simply glued together.
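To see these variable parameters for yourself, any saved fragment can be inspected with ffprobe; a sketch (the file name is just an example of the naming scheme described below):

 ffprobe -v error -show_entries stream=codec_name,width,height,avg_frame_rate \
     -of default=noprint_wrappers=1 1000_camera1-56a8a7e3f9adc29c4dd74295.webm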

To play such recordings correctly, the following steps have to be performed:

  1. Transcode every saved fragment to a fixed resolution and frame rate.
  2. Create blank "stub" clips for the intervals where a camera has no video.
  3. Concatenate the fragments of each camera into a single, continuous file.
  4. Combine the three synchronized files into one composite video.

As a result, the recording can only be played after transcoding, but for this task that is acceptable: nobody is going to review a recording in the very second after it is finished. Moreover, deferred transcoding reduces the server load during proctoring sessions, because the re-encoding can be scheduled for the night, when the load is minimal.
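Deferring the work is easy to arrange with cron; a sketch of a crontab entry, assuming the script is saved as /opt/vos/convert-video.sh (both the path and the time are arbitrary):

 # run the archive conversion every night at 02:30
 30 2 * * * /opt/vos/convert-video.sh >> /var/log/vos-convert.log 2>&1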

Each proctoring session in VOS has its own unique identifier, which is transmitted to Kurento when a connection is established between the test-taker and the proctor. Within a session three streams are created, and they can be interrupted and resumed for technical reasons or at the proctor's initiative. Video files saved by Kurento are named in the format "timestamp_camera-session.webm" (as a regular expression: ^[0-9]+_[a-z0-9]+-[0-9a-f]{24}\.webm$), where timestamp is the file creation time in milliseconds; camera is the camera identifier that distinguishes the stream from the test-taker's webcam (camera1), the proctor's webcam (camera2) and the desktop stream (screen); session is the proctoring session identifier. After each proctoring session many video fragments are saved; the possible variants of video fragmentation are shown in the figure below.

Possible options for video fragmentation

The numbers 1-12 are timestamps; the bold lines are video clips of various lengths; the dashed lines are the missing parts that have to be added; the empty gaps are time intervals in which no camera has any video, and they should be excluded from the final recording.
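Everything below is keyed off these file names, so here is a minimal sketch of splitting such a name into its parts with plain parameter expansion (the name itself is made up but matches the mask above):

 video_file="1456321912345_camera1-56a8a7e3f9adc29c4dd74295.webm"
 timestamp=${video_file%%_*}                         # 1456321912345
 filename=${video_file#*_}                           # camera1-56a8a7e3f9adc29c4dd74295.webm
 camera=${filename%%-*}                              # camera1
 session=${filename#*-}; session=${session%.webm}    # 56a8a7e3f9adc29c4dd74295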

The output video file is a composite of three areas: two cameras with a resolution of 320x240 (4:3) and one screen with a resolution of 768x480 (16:10). The original image has to be scaled to the specified size; if its aspect ratio does not match, the whole image is fitted into the center of the specified rectangle and the empty areas are painted black. As a result, the camera layout should look like the picture below (blue and green are the webcams, red is the desktop).

Layout of the cameras on the composite screen

As a result, instead of multiple fragments, each proctoring session ends up with a single video file containing a recording of the whole session. In addition, the output file takes up less space, because the video frame rate is reduced to an acceptable minimum of 1-5 frames/s. The resulting file is uploaded to a WebDAV server, from which VOS requests it through the corresponding interface, taking the necessary access rights into account. The WebDAV protocol is quite common, so almost anything can serve as the storage; you can even use Yandex.Disk for this purpose.
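For reference, the global variables used by the upload step later in the article might look like this (all the values here are made-up placeholders):

 STORAGE_URL="https://webdav.example.com/vos-archive"
 STORAGE_USER="vos"
 STORAGE_PASS="secret"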

The implementation of all these functions fits into a small bash script, which additionally needs the ffmpeg and curl tools. First, the video files with dynamic resolution and bit rate have to be re-encoded with the required parameters for each camera. The function that transcodes an original video file to the specified resolution and frame rate looks like this:

scale_video_file() {
    local in_file="$1"
    local out_file="$2"
    local width="$3"
    local height="$4"
    ffmpeg -i "$in_file" -c:v vp8 -r:v ${FRAME_RATE} \
        -filter:v scale="'if(gte(a,4/3),${width},-1)':'if(gt(a,4/3),-1,${height})'",pad="${width}:${height}:(${width}-iw)/2:(${height}-ih)/2" \
        -c:a libvorbis -q:a 0 "${out_file}"
}

Particular attention should be paid to the ffmpeg scale filter: it fits the image into the given resolution even when the aspect ratio differs, and the pad filter fills the remaining empty space with black. FRAME_RATE is a global variable that holds the frame rate.
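A usage sketch: FRAME_RATE has to be set before calling the function, and each fragment is transcoded to the fixed size of its camera (the file names and directories here are made up):

 FRAME_RATE=5
 scale_video_file "src/1000_camera1-56a8a7e3f9adc29c4dd74295.webm" \
                  "out/1000_camera1-56a8a7e3f9adc29c4dd74295.webm" 320 240
 scale_video_file "src/3000_screen-56a8a7e3f9adc29c4dd74295.webm" \
                  "out/3000_screen-56a8a7e3f9adc29c4dd74295.webm" 768 480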

Next, you need a function that will create a stub file to fill in the gaps between the video files:

write_blank_file() {
    local out_file="$1"
    [ -e "${out_file}" ] && return
    local duration=$(echo $2 | LC_NUMERIC="C" awk '{printf("%.3f", $1 / 1000)}')
    local width="$3"
    local height="$4"
    ffmpeg -f lavfi -i "color=c=black:s=${width}x${height}:d=${duration}" -c:v vp8 -r:v ${FRAME_RATE} \
        -f lavfi -i "aevalsrc=0|0:d=${duration}:s=48k" -c:a libvorbis -q:a 0 "${out_file}"
}

This creates a video track of the given resolution, duration (passed in milliseconds) and frame rate, as well as an audio track with silence. All of it is encoded with the same codecs as the main video clips.
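A usage sketch: the call below would create a 4-second (4000 ms) black 320x240 stub whose file name places it in the timeline at timestamp 3000 (the path is made up):

 write_blank_file "out/3000_camera1-56a8a7e3f9adc29c4dd74295.webm" 4000 320 240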

The resulting video clips of each camera then need to be concatenated; the following function is used for this (OUTPUT_DIR is a global variable containing the path to the directory with the video clips):

concat_video_group() {
    local video_group="$1"
    ffmpeg -f concat \
        -i <(ls "${OUTPUT_DIR}" | grep -oe "^[0-9]\+_${video_group}$" | sort -n | xargs -I FILE echo "file ${OUTPUT_DIR%/}/FILE") \
        -c copy "${OUTPUT_DIR}/${video_group}"
    # remove the intermediate fragments once they have been concatenated
    ls "${OUTPUT_DIR}" | grep -oe "^[0-9]\+_${video_group}$" | xargs -I FILE rm "${OUTPUT_DIR%/}/FILE"
}

You will also need a function to determine the duration of the video file in milliseconds, here we use the ffprobe utility from the ffmpeg package:

get_video_duration() {
    local in_file="$1"
    ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "${in_file}" \
        | LC_NUMERIC="C" awk '{printf("%.0f", $1 * 1000)}'
}
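A usage sketch (the file name is made up):

 duration_ms=$(get_video_duration "out/1000_camera1-56a8a7e3f9adc29c4dd74295.webm")
 echo "fragment length: ${duration_ms} ms"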

Now that there is a transcoding function, a function that creates missing fragments of a given length and a function that glues the fragments together, we need a function that synchronizes the video clips from the different cameras, i.e. decides which fragments have to be created and of what length. The algorithm is as follows (a worked example of the result is given after the list):

  1. Retrieve a list of files with video clips, sorted according to their timestamp, which is the first part of the file name.
  2. Walk the list from top to bottom, building another list of the form "timestamp:flag:filename". The purpose of this list is to mark the start and end points of every video file (see the figure with the fragmentation example). For our example this is the following list:
     1:1:camera1-session.webm
     3:-1:camera1-session.webm
     7:1:camera1-session.webm
     10:-1:camera1-session.webm
     2:1:camera2-session.webm
     5:-1:camera2-session.webm
     8:1:camera2-session.webm
     10:-1:camera2-session.webm
     3:1:screen-session.webm
     6:-1:screen-session.webm
     8:1:screen-session.webm
     12:-1:screen-session.webm
  3. Supplement the resulting list with zero-duration entries (pairs of identical timestamps) for the earliest and the latest timestamps in the list. This is needed at the stage of calculating the missing intermediate video clips.
  4. Supplement the list with entries that correspond to the beginning and end of the intervals where there is no video from any of the cameras. In our example these are the entries "6:1:..." and "7:-1:...".
  5. Split the resulting list into three parts, one list per camera. Go through each list and invert it, i.e. instead of the list of existing fragments obtain the list of missing fragments.
  6. Convert the resulting list to the format "timestamp:duration:filename", so that the missing video clips can be created from it.
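
For the fragmentation example from the figure (treating the numbers 1-12 as timestamps), steps 1-6 should produce roughly the following list of missing fragments; this is a sketch of the expected intermediate data, not the output of a real run:

     3:3:camera1-session.webm
     10:2:camera1-session.webm
     1:1:camera2-session.webm
     5:1:camera2-session.webm
     7:1:camera2-session.webm
     10:2:camera2-session.webm
     1:2:screen-session.webm
     7:1:screen-session.webm

Note that the interval 6-7, where no camera has any video, gets no stub for any of the groups, so it simply disappears from the final recording.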

This algorithm is implemented by the following set of functions:

# Invert a list of start/end marks into a list of gaps.
# input:  timestamp:flag:filename
# output: timestamp:duration:filename
find_spaces() {
    local state=0 prev=0
    sort -n | while read item
    do
        arr=(${item//:/ })
        timestamp=${arr[0]}
        flag=${arr[1]}
        let state=state+flag
        if [ ${state} -eq 0 ]
        then
            let prev=timestamp
        elif [ ${prev} -gt 0 ]
        then
            let duration=timestamp-prev
            if [ ${duration} -gt 0 ]
            then
                echo ${prev}:${duration}:${arr[2]}
            fi
            prev=0
        fi
    done
}

# Add zero-duration marks at the earliest and the latest timestamps of the list
zero_marks() {
    sort -n | sed '1!{$!d}' | while read item
    do
        arr=(${item//:/ })
        timestamp=${arr[0]}
        for video_group in ${VIDEO_GROUPS}
        do
            echo ${timestamp}:1:${video_group}
            echo ${timestamp}:-1:${video_group}
        done
    done
}

# Mark the intervals where no camera has any video, so they are not filled with stubs
blank_marks() {
    find_spaces | while read item
    do
        arr=(${item//:/ })
        first_time=${arr[0]}
        duration=${arr[1]}
        let last_time=first_time+duration
        for video_group in ${VIDEO_GROUPS}
        do
            echo ${first_time}:1:${video_group}
            echo ${last_time}:-1:${video_group}
        done
    done
}

# Generate start/end marks for every saved fragment: timestamp:flag:filename
generate_marks() {
    ls "${OUTPUT_DIR}" | grep "^[0-9]\+_" | sort -n | while read video_file
    do
        filename=${video_file#*_}
        timestamp=${video_file%%_*}
        duration=$(get_video_duration "${OUTPUT_DIR%/}/${video_file}")
        echo ${timestamp}:1:${filename}
        echo $((timestamp+duration)):-1:${filename}
    done | tee >(zero_marks) >(blank_marks)
}

# Split the marks into per-camera groups and find the gaps in each of them
fragments_by_groups() {
    local cmd="tee"
    for video_group in ${VIDEO_GROUPS}
    do
        cmd="${cmd} >(grep :${video_group}$ | find_spaces)"
    done
    eval "${cmd} >/dev/null"
}

# Create the missing blank fragments
# (get_video_resolution returns the target width and height for the given camera, see the full script)
write_fragments() {
    while read item
    do
        arr=(${item//:/ })
        timestamp=${arr[0]}
        duration=${arr[1]}
        video_file=${arr[2]}
        write_blank_file "${OUTPUT_DIR%/}/${timestamp}_${video_file}" "${duration}" $(get_video_resolution "${video_file}")
    done
}

# Run the whole pipeline
generate_marks | fragments_by_groups | write_fragments
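The least obvious function here is fragments_by_groups: it builds a tee command with one process substitution per camera and evaluates it. Assuming VIDEO_GROUPS holds the three file names of the example session, the evaluated command looks roughly like this:

 tee >(grep :camera1-session.webm$ | find_spaces) \
     >(grep :camera2-session.webm$ | find_spaces) \
     >(grep :screen-session.webm$ | find_spaces) >/dev/null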

After the missing video clips are recreated, you can start combining them: the concat_video_group function defined above is applied to each of the three groups, i.e. to each camera identifier.

Now that there are three video files synchronized in time, they need to be combined into one composite screen, placing each stream in its designated area:

encode_video_complex() {
    local video_file="$1"
    local camera1="$2"
    local camera2="$3"
    local camera3="$4"
    ffmpeg \
        -i "${OUTPUT_DIR%/}/${camera1}" \
        -i "${OUTPUT_DIR%/}/${camera2}" \
        -i "${OUTPUT_DIR%/}/${camera3}" \
        -threads ${NCPU} -c:v vp8 -r:v ${FRAME_RATE} -c:a libvorbis -q:a 0 \
        -filter_complex "
            pad=1088:480 [base];
            [0:v] setpts=PTS-STARTPTS, scale=320:240 [camera1];
            [1:v] setpts=PTS-STARTPTS, scale=320:240 [camera2];
            [2:v] setpts=PTS-STARTPTS, scale=768:480 [camera3];
            [base][camera1] overlay=x=0:y=0 [tmp1];
            [tmp1][camera2] overlay=x=0:y=240 [tmp2];
            [tmp2][camera3] overlay=x=320:y=0;
            [0:a][1:a] amix" \
        "${OUTPUT_DIR%/}/${video_file}"
}

Here the ffmpeg pad filter creates an empty black canvas, and the cameras are then overlaid onto it in the specified order. The sound from the first two cameras is mixed together.

After the video has been processed and the output file obtained, it is uploaded to the server (the global variables STORAGE_URL, STORAGE_USER and STORAGE_PASS contain the WebDAV server address, the user name and the password, respectively):

upload() {
    local video_file="$1"
    [ -n "${video_file}" ] || return 1
    [ -z "${STORAGE_URL}" ] && return 0
    local http_code=$(curl -o /dev/null -w "%{http_code}" --digest --user ${STORAGE_USER}:${STORAGE_PASS} \
        -T "${OUTPUT_DIR%/}/${video_file}" "${STORAGE_URL%/}/${video_file}")
    # the server returns 201 when a new file is created and 204 when an existing file is overwritten
    test "${http_code}" = "201" -o "${http_code}" = "204"
}
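A usage sketch, assuming the STORAGE_* variables are already set and the combined file has been written to OUTPUT_DIR (the file name is made up):

 if upload "56a8a7e3f9adc29c4dd74295.webm"; then
     echo "upload ok"
 else
     echo "upload failed" >&2
 fi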

The full code of the script is posted on GitHub.
To test the operation of the algorithm, you can use the following generator, which creates video clips from the considered example:

#!/bin/bash

STORAGE_DIR="./storage"

write_blank_video() {
    local width="$1"
    local height="$2"
    local color="$3"
    local duration="$4"
    local frequency="$5"
    local out_file="$6-56a8a7e3f9adc29c4dd74295.webm"
    ffmpeg -y -f lavfi -i "color=c=${color}:s=${width}x${height}:d=${duration}" \
        -f lavfi -i "sine=frequency=${frequency}:duration=${duration}:sample_rate=48000,pan=stereo|c0=c0|c1=c0" \
        -c:a libvorbis \
        -vf "drawtext=fontfile=/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf: timecode='00\:00\:00\:00': r=30: x=10: y=10: fontsize=24: fontcolor=black: box=1: boxcolor=white@0.7" \
        -c:v vp8 -r:v 30 "${STORAGE_DIR%/}/${out_file}" </dev/null >/dev/null
}

# camera1
write_blank_video 320 200 blue 2 1000 1000_camera1
write_blank_video 320 200 blue 3 1000 7000_camera1
# camera2
write_blank_video 320 240 green 3 2000 2000_camera2
write_blank_video 320 240 green 2 2000 8000_camera2
# screen
write_blank_video 800 480 red 3 3000 3000_screen
write_blank_video 800 480 red 4 3000 8000_screen
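A sketch of running it (the script name is made up); the storage directory has to exist, since ffmpeg will not create it:

 mkdir -p ./storage
 bash ./generate-test-videos.sh
 ls ./storage
 # 1000_camera1-56a8a7e3f9adc29c4dd74295.webm  2000_camera2-56a8a7e3f9adc29c4dd74295.webm
 # 3000_screen-56a8a7e3f9adc29c4dd74295.webm   7000_camera1-56a8a7e3f9adc29c4dd74295.webm
 # 8000_camera2-56a8a7e3f9adc29c4dd74295.webm  8000_screen-56a8a7e3f9adc29c4dd74295.webm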


As a result, the problem is solved: the resulting script can be placed on the Kurento server and run on a schedule. After the created video files have been successfully uploaded to the WebDAV server, the original files can be deleted, thus archiving the video for later viewing in a convenient form.

Source: https://habr.com/ru/post/277179/

