Oh, I have a delay

People often come to us with this complaint, but let us clarify right away: they are usually men, and our business is video delivery.

What is this about? It is about reducing the delay between the moment something happens in front of the camera and the moment it reaches the viewer. Obviously, a broadcast lecture on quantum physics will take longer to get through to the viewer than a comedy club show, but we are concerned with the technical details here.

Before discussing delay (also called latency), we need to answer a very important question: why reduce it at all? People almost always want to reduce the delay, but it is not always necessary.
For example, a live broadcast of a heated political talk show is, in principle, worth holding about 3 minutes behind real time so that you can react quickly to dramatic turns in the discussion. A webinar or remote control of a drone, on the other hand, requires minimal delay so that people can comfortably interrupt each other and the payload lands exactly on target.

Before moving on, let's establish one important fact: reducing delay in video and audio broadcasting is expensive, and the cost grows nonlinearly. At some point you have to stop at the delay you actually need, and to do that you have to go back and understand why you need to reduce it.

This introduction is here for a reason. We surveyed our customers and found a contradictory wish: those who rebroadcast TV channels want low-latency streaming. Nobody explained why; they just want it.

Delay formation

Let's see how delay builds up during video transmission. A typical video delivery pipeline (the video path) looks like this:

the image from the camera sensor is captured into video memory
the video encoder puts the raw image into the encoder buffer
the video compression algorithm finds the best way to compress the several video frames in the buffer
the compressed video frame is sent to the server's video delivery buffer (if there is one)
the video frame is transmitted via UDP or copied into the kernel TCP buffer for sending
the bytes travel to the client and land in the kernel buffer for incoming network data
from there they reach the client application
they may be placed in a frame reordering buffer
from there, if necessary, they go into a buffer that compensates for fluctuations in network speed
from it they are sent to the decoder, which accumulates its own buffer for B-frames
the decoder decodes the frame and sends it for rendering

This is an approximate scheme: in some cases parts of it are dropped, in others extra buffers are added. But the overall picture is clear: buffers, buffers, and more buffers.

Why? Because buffering is the usual way to cut costs and increase overall throughput. There is one more thing: a buffer helps smooth out fluctuations. If the network transfer rate fluctuates, it does not matter: we download extra bytes or frames at the start, and while the connection recovers we play what is in the buffer.

In other words, let us note a rather simplified thesis: buffers are needed for optimization through batch processing of data and to compensate for variations in the characteristics of the video path.
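The trade-off between smoothing and delay can be illustrated with a small simulation. This is only a sketch, not a real player: the function name `late_frames`, the arrival times in `ARRIVALS_MS`, and the idea of a fixed playout delay are all assumptions made for the illustration.

```python
# Hypothetical arrival times (ms) of 13 frames that were sent every 40 ms.
# The network stalls after frame 4, and frames 5..10 all arrive at 400 ms.
ARRIVALS_MS = [0, 40, 80, 120, 160, 400, 400, 400, 400, 400, 400, 440, 480]


def late_frames(arrival_ms, frame_interval_ms, playout_delay_ms):
    """Count frames that arrive after the moment they should be displayed.

    Frame i is due at: first arrival + playout delay + i * frame interval.
    A larger playout delay is exactly what a client-side buffer buys.
    """
    first = arrival_ms[0]
    late = 0
    for i, arrived in enumerate(arrival_ms):
        due = first + playout_delay_ms + i * frame_interval_ms
        if arrived > due:
            late += 1
    return late


if __name__ == "__main__":
    for delay in (0, 100, 200):
        print(delay, "ms of buffer ->", late_frames(ARRIVALS_MS, 40, delay), "late frames")
```

With these made-up numbers, no buffering means several frames miss their display time, while 200 ms of buffering hides the stall completely at the cost of 200 ms of extra delay.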

To reduce the delay between what is happening and what the viewer sees, it is necessary to work systematically at every stage.

Details

Reading from the sensor

It seems like a trifle, but in good old analog television a line could start being displayed on the TV before the camera had finished capturing it (I am exaggerating here, but it would be interesting to know how it really works).

But on closer inspection, today's sensor is 2 megapixels at minimum, and often more. What is attached to it is by no means an Intel Xeon but a piece of hardware that barely copes and spends time simply copying data.

As far as I know, there are currently no widely used video transmission technologies that can work with raw video in pixel-streaming mode. That is, until the entire frame has been read from the sensor, nothing can be done with it.

I am not ready to give an exact estimate of the delay here.

Encoder buffer

The encoder performs a very resource-intensive task and also puts a heavy load on the data bus between memory and the processor. It has to try different combinations of compression options, find the differences between adjacent frames, and do a lot of complex math. Considering that raw FullHD video at 25 frames per second is on the order of a gigabit per second (about 100 megabytes per second), the load is huge. But please do not make the classic mistake of confusing processor load with delay. The time it takes to compress a frame is still less than 1/fps (otherwise there is no point in trying, because it will not work in real time anyway), while the delay the encoder creates is much larger.
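As a rough check of these figures, the raw bitrate can be computed directly. This is a back-of-the-envelope sketch; the function name `raw_bitrate_mbit` and the two pixel formats are assumptions chosen for illustration, not something the article specifies.

```python
def raw_bitrate_mbit(width, height, fps, bytes_per_pixel):
    """Bitrate of uncompressed video in megabits per second."""
    return width * height * bytes_per_pixel * fps * 8 / 1_000_000


def frame_budget_ms(fps):
    """Time available to compress one frame if the encoder is to keep up."""
    return 1000 / fps


# Assumed pixel formats: 8-bit YUV 4:2:0 (1.5 bytes/pixel) and 8-bit RGB (3 bytes/pixel).
YUV420_MBIT = raw_bitrate_mbit(1920, 1080, 25, 1.5)
RGB24_MBIT = raw_bitrate_mbit(1920, 1080, 25, 3)

if __name__ == "__main__":
    print(f"FullHD 25 fps, YUV 4:2:0: {YUV420_MBIT:.0f} Mbit/s")
    print(f"FullHD 25 fps, RGB:       {RGB24_MBIT:.0f} Mbit/s")
    print(f"Per-frame time budget:    {frame_budget_ms(25):.0f} ms")
```

Depending on the pixel format, the result is roughly 0.6 to 1.2 Gbit/s, which agrees with "on the order of a gigabit per second", and the per-frame budget at 25 fps is 40 ms.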

The encoder accumulates several consecutive raw frames in a buffer in order to produce the lowest possible bitrate at the highest possible quality. The buffer serves the following tasks:

maintaining the average bitrate of the stream. If you really want better quality in one frame, you have to squeeze the remaining frames
selecting the optimal frames to reference. Sometimes it is better to refer not to the previous frame but to the next one. This leads to frame reordering and traffic savings of up to 15-20%

You can play with this delay, but the first consequence will be a higher bitrate. There is a good post on low-latency encoding by the author of libx264.

In total, here you can get by with 1-2 frames (40 ms each), or you can spend up to 3-5 seconds but save bitrate.
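Why referencing a later frame forces the encoder to wait can be shown with a toy model of frame reordering. This is a simplified sketch under stated assumptions: each B-frame is assumed to reference only the next non-B frame, and the names `coded_order` and `max_reorder_delay_frames` are invented for the example. Real encoders use more elaborate reference structures.

```python
def coded_order(display_types):
    """Reorder frames so that each B-frame comes after the later frame it references.

    `display_types` is a string such as "IBBPBBP" in display order.
    Returns a list of (display_index, frame_type) in the order they can be emitted.
    """
    out, pending_b = [], []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append((index, frame_type))  # must wait for the next I/P frame
        else:
            out.append((index, frame_type))
            out.extend(pending_b)
            pending_b = []
    out.extend(pending_b)  # trailing B-frames with nothing later to reference
    return out


def max_reorder_delay_frames(display_types):
    """Largest number of frame intervals a frame waits before it can be emitted."""
    newest_captured = 0
    worst = 0
    for index, _ in coded_order(display_types):
        newest_captured = max(newest_captured, index)
        worst = max(worst, newest_captured - index)
    return worst


if __name__ == "__main__":
    print(coded_order("IBBPBBP"))
    print(max_reorder_delay_frames("IBBPBBP") * 40, "ms at 25 fps")
```

In this model a pattern with two consecutive B-frames costs two frame intervals (80 ms at 25 fps), while a stream without B-frames adds no reordering delay at all.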

Remember I said at the start that you would have to pay for low latency? Here you start paying in bitrate.

Server buffer

Perhaps the most frequent question we get about delay is: “I have a very long delay when broadcasting via HLS, where do I remove the buffer on your server?”

In fact, server-side buffering is quite possible. For example, when packing MPEG-TS it is tempting to hold back audio frames in order to put several of them into one PES packet. And when packaging for protocols such as HLS or DASH, you have to wait a few seconds anyway.

An important point here: in MPEG-TS, several audio frames are often packed into one PES packet. Theoretically, you could open a PES packet, start writing something into it and sending it to the network, then send a video frame, and then continue with another video frame. But there is a common problem: an audio PES packet carries its length in the header, which means the audio has to be accumulated first. Accumulating means buffering, and buffering means more delay.

Some servers buffer frames even with frame-by-frame protocols like RTMP in order to reduce CPU utilization, because sending 100 kilobytes once is cheaper than sending 50 kilobytes twice.

That is, everything here depends on the protocol: with an HLS or DASH server, buffering at least one segment (1-10 seconds) is inevitable. With a frame-based protocol it is not necessary: you can safely send frames to all clients one at a time, although this is rarely done anyway.
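The effect of segment length on delay can be estimated with simple arithmetic. The sketch below rests on an assumption that does not come from the article: the player is taken to start playback a fixed number of segments back from the end of the playlist (3 is used as the default here). The function name `segmented_latency_range_s` is also hypothetical, and encoder, network, and download time are ignored.

```python
def segmented_latency_range_s(segment_s, holdback_segments=3):
    """Return (best, worst) seconds behind live for a segment-based player.

    Assumption: playback starts at the beginning of the segment that is
    `holdback_segments` back from the end of the playlist.
    """
    # Best case: the newest listed segment has only just been completed.
    best = segment_s * holdback_segments
    # Worst case: the server has almost finished one more segment.
    worst = segment_s * (holdback_segments + 1)
    return best, worst


if __name__ == "__main__":
    for seconds in (10, 2, 1):
        print(seconds, "s segments ->", segmented_latency_range_s(seconds), "s behind live")
```

Under this assumption, 10-second segments put the viewer some 30 to 40 seconds behind live, and shortening the segments is the only lever available without changing the protocol.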

If we receive, for example, RTP from somewhere (from RTSP/RTP cameras), then in theory we can forward RTP packets to clients immediately after receiving them. This would radically reduce the delay to less than one frame. In practice this approach is rarely implemented, because it makes the programming enormously complex and sharply limits how flexibly the software can be used. Most often, video streaming servers work with frames stripped of containers and protocols.

One small detail: there is the low-latency CMAF initiative. The idea is that when a reference frame (a keyframe) arrives, the server announces a new segment to all clients. All clients immediately start downloading it and receive it frame by frame via HTTP progressive download.

The result is file-based delivery that can be cached on intermediate CDNs, plus the ability to receive frames without delay when connected to a server that can serve them without buffering.

This is still an initiative in development, but it can be interesting.

In total: you can in principle do without a frame buffer on the server if you are not using HLS, and even with HLS something can be worked out under special conditions.

Network send buffer

We have come to the very heart of the matter, the stumbling block and endless agonizing of video delivery: UDP or TCP? Losses or unpredictable delays? Or can the two be combined?

In theory, in an ideal world with no badly behaved routers, UDP either arrives at ping speed or is lost, while TCP may slow down sending.

As soon as we start sending video over TCP, the question is not only which protocol to choose for cutting the stream into frames, but also how large the output buffers in the kernel should be. The larger the kernel buffer, the easier sending is for the software and the fewer context switches are needed. Again, this comes at the cost of more delay.

We increase the kernel buffers and quickly lose control over the download speed: it becomes hard to control the sending of frames, and the server can no longer tell whether the client is keeping up with the video or not.
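How much delay a full kernel send buffer can hide is easy to estimate, and the buffer size itself is set with the standard `SO_SNDBUF` socket option. The sketch below is only an illustration: the helper names, the 64 KiB default, and the 4 Mbit/s stream are assumptions, and the kernel is free to round or adjust the requested size (Linux, for example, doubles it).

```python
import socket


def buffered_delay_ms(buffer_bytes, stream_bitrate_bps):
    """Time the data in a full send buffer represents at the stream's bitrate."""
    return buffer_bytes * 8 * 1000 / stream_bitrate_bps


def make_tcp_socket_with_small_sndbuf(sndbuf_bytes=64 * 1024):
    """Create a TCP socket and request a small kernel send buffer.

    The value actually applied can be read back with getsockopt and may differ
    from the request.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    return sock


if __name__ == "__main__":
    bitrate = 4_000_000  # assumed 4 Mbit/s video stream
    for size in (64 * 1024, 4 * 1024 * 1024):
        print(size, "bytes ->", round(buffered_delay_ms(size, bitrate)), "ms of video in the kernel")
    sock = make_tcp_socket_with_small_sndbuf()
    print("applied SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    sock.close()
```

At 4 Mbit/s, a 64 KiB buffer holds about 0.13 seconds of video, while a 4 MiB buffer can hold more than 8 seconds that the server has already "sent" but the client has not yet received.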

If we send over UDP, we have to decide what to do about packet loss. One option is to resend UDP packets (a kind of poor man's TCP), but that requires buffering on the client (see below). Another option is to build something like RAID-5 on top of the network: redundancy is added to each UDP packet, which makes it possible to recover one lost packet out of, say, 5 (see FEC, fountain codes, etc.). This may add delay on the server to compute the redundancy, and it also raises the bitrate by 10-30%. It is believed that redundancy requires no extra buffer on the client, or at most 1-2 frames rather than 5 seconds (125 frames).
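The RAID-5 analogy can be made concrete with the simplest form of FEC, a single XOR parity packet per group. This is a minimal sketch of the idea, not a production scheme: the group of 5 equal-length packets in `GROUP` and the function names are assumptions for the example, and a single parity packet can repair only one loss per group.

```python
def xor_parity(packets):
    """Return the byte-wise XOR of a list of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single lost packet (marked as None) in a group.

    XOR-ing the parity with every packet that did arrive leaves exactly
    the bytes of the one that did not.
    """
    missing = [i for i, packet in enumerate(received) if packet is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one lost packet per group")
    present = [packet for packet in received if packet is not None]
    return xor_parity(present + [parity])


# Hypothetical group of 5 equal-length packets protected by 1 parity packet.
GROUP = [b"frame-0!", b"frame-1!", b"frame-2!", b"frame-3!", b"frame-4!"]

if __name__ == "__main__":
    parity = xor_parity(GROUP)
    received = list(GROUP)
    received[2] = None  # simulate one packet lost in transit
    print(recover_missing(received, parity))
```

One parity packet per 5 data packets adds 20% to the bitrate, which sits inside the 10-30% range mentioned above, and the receiver only has to wait for the rest of the group rather than for a retransmission.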

There is a more sophisticated option: encode the video as H.264 SVC, i.e. put into one packet the data needed to reconstruct the frame at the lowest quality, into the next packet the data that improves the quality, and so on. These packets are then marked with different priority levels, and a smart, good, kind router along the way will surely figure it out and start dropping the least important packets, smoothly reducing the quality.

Let's go back to the real world.

With FEC, there are both good promises and the reality reported by Google: "XOR FEC does not work". The picture will not become clear for a long time yet. On the other hand, FEC has long been used in satellite delivery, but there is no other way to handle errors there.

With SVC, everything is fine except that it has not taken off. It resembles JPEG2000 or wavelets: good in every way, but something is missing for it to conquer the world. In practice it is used in closed video conferencing implementations where both the server and the client are under control, but you cannot just pick this mechanism up and use it.

R-UDP is in fact complex, effectively replaces TCP, is rarely used, and works well where HLS with its 30-second delay would also do. There is a danger of getting drawn into reimplementing TCP, which can be considered an almost impossible task.

It is believed that such a UDP approach is well suited to sending over links with a huge RTT and losses, because it does not slow down transmission to wait for acknowledgements. The important point is that in video streaming there is no reason to slow down the sender: traffic flows at exactly the rate at which it is needed. If you start slowing down, you might as well not transmit at all and choose a lower quality instead. TCP, in turn, is a very general delivery protocol, and it makes assumptions that do not hold for live broadcast:

either all the data must be delivered or the connection must be broken. This is not true for live broadcast: you can safely throw away whatever could not be sent, and it is better for the video to break up into blocks than to start freezing.
data transfer can be slowed down and then sped up later to catch up. This is also irrelevant for live broadcast: either the channel capacity is enough for the stream or it is not. The video stream will not flow faster or slower (without reconfiguring the transcoder)

As a consequence, a large ping over a long distance can start to slow TCP down even though the packets themselves travel fast. UDP forwards packets at the real-time rate: no faster and no slower, and no delivery acknowledgement is required.
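The reason a long round trip slows TCP down follows from a simple bound: a sender can have at most one window of unacknowledged data in flight per round trip. The sketch below computes that bound; the function name and the 65,535-byte window (the maximum without window scaling) are assumptions picked for illustration, and modern stacks usually negotiate larger windows.

```python
def tcp_max_throughput_mbit(window_bytes, rtt_ms):
    """Upper bound on TCP throughput: one window of data per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000


if __name__ == "__main__":
    window = 65_535  # assumed window size, in bytes
    for rtt in (10, 100, 300):
        print(f"RTT {rtt:>3} ms -> at most {tcp_max_throughput_mbit(window, rtt):.2f} Mbit/s")
```

With this window the ceiling is about 52 Mbit/s at 10 ms but under 2 Mbit/s at 300 ms, even if the link itself is far faster. Any loss makes it worse, because retransmission costs at least one more round trip.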

Delivery to the client

The delay added on the way from the server to the client is determined by packet transit time and the loss rate. With TCP, high loss leads to slow delivery because of retransmissions. With UDP, the recovery mechanisms kick in more often, or the video breaks up more often.

In any case, forcing the route helps a little here: do not send video directly from Moscow to Thailand, but route it through the Amazon cloud in Singapore (personal experience). There are no miracles, though: we hit the speed-of-light limit long ago, so nothing can be suggested other than physically moving closer.

This part can be as little as 10 ms or stretch beyond 300 ms (with such an RTT it is generally hard to achieve decent throughput).
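The speed-of-light limit can be put into numbers. This is a rough sketch: light in optical fiber is taken to travel at about 200,000 km/s (roughly two thirds of its speed in vacuum), the 7,000 km figure for Moscow to Thailand is an approximate straight-line distance assumed for the example, and real routes are longer and add router and queueing delays.

```python
SPEED_IN_FIBER_KM_S = 200_000  # approximate speed of light in optical fiber


def min_rtt_ms(distance_km):
    """Lower bound on round-trip time imposed by propagation alone."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_S * 1000


if __name__ == "__main__":
    print(f"7000 km -> at least {min_rtt_ms(7000):.0f} ms round trip")
```

So roughly 70 ms of round-trip time on such a path is pure physics; anything above that comes from routing, queueing, and retransmissions, and that is the only part a better route can improve.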

In rare cases a CDN solves such problems, but in practice you should not count on it, and you certainly should not trust marketers who are ready to promise anything.

The funny thing is that the main problem may arise on the last meter, between the Wi-Fi router and the laptop. Sometimes it is enough to plug a cable into the laptop to be surprised at how fast the Internet can be.

To be continued. In the next post we will look at what happens to the client.

Source: https://habr.com/ru/post/335208/
