
Fixing bugs with our own hands, or a bug that "nobody cares about"

Recently I already made some noise about a bug in TCP streaming from IP cameras, but at the time I attributed it exclusively to Chinese hardware. In fact, the problem is much wider. I have solved it for myself, and I am uploading fixed firmware for those affected.

Now sit back: I will walk through this bug in detail.

Briefly about the bug


For those who did not read my previous ramblings, and simply to state the problem, here is the bug in brief.
So you have a camera. You install it, connect, watch: everything is fine. Then you leave it running for a while and start seeing video artifacts (packets get lost for various reasons). "Aha!", you say, and switch the camera from UDP to TCP (packets cannot get lost there!). Now you observe an even more interesting picture: with enviable regularity the connection simply drops. The network is in order, no losses are visible, nothing suspicious at all, yet the camera keeps falling off...

Where the bug comes from


So, there is a camera broadcasting a stream to the network via RTSP.
The stream can be carried over UDP, in which case each network packet is (more or less) self-contained. The protocol, and everything on top of it, must be ready at any moment for any packet to be lost or reordered; both the protocol and the clients are designed for this.
The stream can also be carried over TCP. Since TCP is a byte-stream protocol with guaranteed delivery and (in theory) no packet boundaries, the protocol prefixes each frame with a marker and its length. This lets TCP be treated like UDP: read the marker and the length, then read exactly that many bytes, and you have received a packet; the task is reduced to the previous one.
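
The framing in question is RTSP "interleaved" mode (RFC 2326): each frame travels as '$', a one-byte channel id, a 16-bit big-endian length, then the payload. A minimal sketch in Python (my own illustration, not the author's code):

```python
import struct

def pack_interleaved(channel: int, payload: bytes) -> bytes:
    """Wrap one RTP packet in RTSP interleaved TCP framing:
    '$' marker, channel id, 16-bit big-endian length, payload."""
    return b"$" + struct.pack(">BH", channel, len(payload)) + payload

def unpack_interleaved(stream: bytes):
    """Parse back-to-back interleaved frames out of a byte stream."""
    frames, pos = [], 0
    while pos + 4 <= len(stream):
        if stream[pos:pos + 1] != b"$":
            raise ValueError("stream lost sync: no '$' at frame boundary")
        channel, length = struct.unpack(">BH", stream[pos + 1:pos + 4])
        frames.append((channel, stream[pos + 4:pos + 4 + length]))
        pos += 4 + length
    return frames
```

As long as every frame goes out whole, a client can read the stream forever with this trivial loop; the whole article is about what happens when that assumption breaks.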
But there are a couple of nuances. If the channel between the camera and the client is thinner than the stream the camera produces, then over UDP packets simply vanish, while over TCP retransmissions begin and unsent data starts piling up in the camera's memory. At some point that memory runs out.
To prevent this, every vendor's server side ends up throwing data away when it does not fit into the outgoing socket.
But this dropping must be done correctly: where a UDP packet is lost as a whole, a TCP write can be cut short at an arbitrary point. And here begins the very magic that is the root of all evil.
The best-known open-source RTSP streamer is the Live555 server.
In one form or another it underlies many derivatives used in camera manufacturers' firmware.

Consider the textbook implementation of sending packets over TCP, which can still be found in cameras in its original form.

Let's take a look at the sendRTPOverTCP function: sending is implemented head-on. Send the packet start marker '$'. Send the channel number as the next byte. Then form the length and send the next two bytes. And finally send the packet itself (the one that would have gone over UDP in a single send()).

Each send is checked to see whether the data went out (the socket is in non-blocking mode, so only what fits is sent). If send() returns something other than the full length, it is treated as an error: exit the send function and drop the packet.
Is it really dropped? No!

Let's start with the fact that sending one packet takes four separate send() calls, and an error in any of them means the rest are never made. So it may happen that only the '$' goes out and nothing else. Or the '$' and the channel number go out, but the length does not. Or the '$', channel number and length go out, but the packet itself does not. Or...

In non-blocking mode, send() copies as much of the passed buffer as it can into the outgoing socket buffer and tries to transmit it. It returns the number of bytes that went out or were queued in the send buffer. Once again: the number of bytes that WENT OUT or WERE QUEUED.

So it is also possible for half of the packet to go out, or half of the length field... As a result the stream gets corrupted, because a packet fails to go out as a whole. Simple clients that just read '$' + channel + length + length bytes of packet will break at these points: the "length" will be garbage, or after reading a whole "packet" there will be no '$' following it (since more was read than was actually sent).
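
The failure mode can be reproduced without a real socket. Below is a hypothetical simulation (ThrottledSocket and the function name are my own, not Live555 code): four separate sends, a short write in the middle, and a partial frame left on the wire:

```python
class ThrottledSocket:
    """Simulates a non-blocking socket whose send buffer has limited room:
    send() accepts at most `room` bytes and returns how many it took."""
    def __init__(self, room):
        self.room = room
        self.wire = bytearray()   # bytes the client will actually receive

    def send(self, data):
        taken = min(self.room, len(data))
        self.wire += data[:taken]
        self.room -= taken
        return taken

def naive_send_rtp_over_tcp(sock, channel, packet):
    """The textbook (buggy) pattern: four independent sends, bail out on
    the first short write, leaving a partial frame on the wire."""
    for chunk in (b"$", bytes([channel]),
                  len(packet).to_bytes(2, "big"), packet):
        if sock.send(chunk) != len(chunk):
            return False   # "drop" the packet... but bytes already went out
    return True
```

With room for only 3 bytes, the '$', the channel byte and one length byte escape onto the wire; the next frame's '$' will then be read by the client as the missing length byte, and synchronization is gone.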

At some point the bug was noticed and "fixed". Look at a more recent implementation: the send is now done in two steps instead of four; the packet prefix is assembled first, and then the packet itself is sent. Moreover, the sending is performed by a special function, sendDataOverTCP, which is supposed to guarantee that the whole buffer goes out and report whether there were problems.

It did not work out. Can you spot why?

The "guaranteed" send algorithm: do a send() on the non-blocking socket. If it returns an error, switch the socket to blocking mode and send again in blocking mode. Then switch the socket back to non-blocking and report that there was an error.

Once again: are all errors really caught? All of them? ;)

The main mistake: the first send() may have already sent something! So by repeating send() in blocking mode, we send the beginning of the packet a second time!

And there are two such calls. The first one, although invoked with forceSendToSucceed == False, may also have transmitted something. For example, 3 bytes: '$', the channel number and the high byte of the length. Then an error occurs, nothing more is sent, the next packet goes out, and its '$' arrives where the low byte of the length should be...
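
The duplication is easy to model. The sketch below is an illustrative simplification (DupDemoSocket and both function names are invented, not the Live555 source): a short non-blocking write followed by a blocking resend of the whole buffer puts the already-sent prefix on the wire twice:

```python
class DupDemoSocket:
    """Toy socket: the non-blocking send() has limited buffer room,
    the blocking send always takes everything."""
    def __init__(self, room):
        self.room = room
        self.wire = bytearray()   # bytes the client will actually receive

    def send(self, data):          # non-blocking: short writes possible
        taken = min(self.room, len(data))
        self.wire += data[:taken]
        self.room -= taken
        return taken

    def blocking_send(self, data): # blocking: always sends everything
        self.wire += data
        return len(data)

def flawed_guaranteed_send(sock, data):
    """The flawed 'fix': if the non-blocking send() did not take the whole
    buffer, retry in blocking mode -- but from the START of the buffer,
    so the bytes already accepted go out a second time."""
    sent = sock.send(data)
    if sent < len(data):
        sock.blocking_send(data)   # BUG: should be data[sent:]
    return True
```

The client now receives the 4-byte frame header preceded by its own first three bytes, and everything downstream is misparsed.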

Is the bug eternal, then? No! In December 2013 it was "fixed" again. Here is the final version.

This time it seems they thought of everything: if nothing went out, an error is returned and the packet is dropped as a whole. If something did go out, only the remainder is sent in blocking mode, and if the remainder goes out completely, success is returned. The next step then sends the packet data, and "everything is fine".

So what is wrong now? This: the packet now always goes out in its entirety, in blocking mode. A problem sending to one client therefore stalls all the clients connected to the camera.
Moreover, sendPacket() now never drops anything as long as at least one byte made it into the outgoing buffer. And since nobody aligned packet sizes to the socket buffer size, at least one byte almost always fits, so when sending problems arise, a packet is practically never dropped at all...

Well, at least the stream no longer gets corrupted. Thanks for that much. The main thing is not to hit an OOM in the meantime. But the video will start lagging further and further behind...

In other words, I consider the final solution in Live555 incorrect...

The correct solution (quite a simple one!) I will describe next time, to whet the reader's interest :) [upd: next time]

How widespread it is


So, the bug is widespread. The mistakes we saw above in the Live555 code are nothing out of the ordinary: these are the standard mistakes that practically every programmer working with the network repeats.

The bug shows up on a sea of Chinese cameras, and not only those based on Live555. I have met it in D-Link cameras. I have met it in a variety of brand-name cameras (which, as always, are built on modules from various Chinese manufacturers).

The probability of hitting the problem grows with the camera's resolution and bitrate. That is exactly why it went unnoticed for so long: the resolution-to-price ratio has only recently started growing rapidly, and FullHD and fatter cameras are coming into demand. Thanks to Chinese prices, it is on Chinese cameras that the bug is most often first noticed, so as always, the Chinese get the blame... although the mistakes were not made by Chinese programmers.

Diagnostics


If you have cameras and monitoring software, switch it to TCP mode for a while. If, despite a stable connection, you see regular disconnects, or if on an unstable link the software drops or loses the connection instead of showing whole video with occasional artifacts, you have a buggy camera-and-client pair.

To poke just your camera with a stick, you can use my script.
It is not intended for end users, as it was knocked together on my knee.

At the top of the script the following parameters are set: host is the camera's IP, url is the full stream path to the desired track. With a non-standard port you can also override port.

The dump parameter lets you write the RTP stream to a video file that can be played with, say, mplayer; dumpraw lets you write the raw stream as-is.

To break the stream more often, you can uncomment line 112 of the script (the one with time.sleep(st)). The if on line 176 selects the stream recovery mode: with False the stream is resynchronized; with True a full reconnect is performed. This lets you compare the recovery times.

Treatment methods


So, there is a bug. It is distributed very widely, but only started surfacing recently, and an enormous amount of hardware with this bug is already out in the wild. How do we treat it in this situation?

My personal opinion: the treatment must be two-sided. Camera firmware must be fixed, so that new hardware ships without the bug and existing cameras can be updated.

But clients must also be able to live with possible stream corruption. A corrupted stream does not have to mean a reconnect: just look for the beginning of the next whole packet. In essence this reduces the situation to UDP, except that integrity control and packet dropping move up into the application.
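
A minimal resynchronization heuristic could look like this (a proof-of-concept sketch under my own assumptions, not the author's script): scan the broken byte stream for the next offset that plausibly starts a whole frame.

```python
def resync(buf, known_channels=(0, 1, 2, 3), max_len=65535):
    """Find the next plausible interleaved-frame boundary in a corrupted
    stream: '$', a known channel id, a sane length, and another '$'
    (or end of buffer) right after the claimed payload."""
    for pos in range(len(buf) - 4):
        if buf[pos] != ord("$") or buf[pos + 1] not in known_channels:
            continue
        length = int.from_bytes(buf[pos + 2:pos + 4], "big")
        end = pos + 4 + length
        if length <= max_len and (end >= len(buf) or buf[end] == ord("$")):
            return pos   # plausible frame boundary found
    return -1            # need more data
```

A payload byte can of course also be 0x24, so a single check can misfire; a sturdier version would validate several consecutive frames (and the RTP headers inside them) before declaring the stream recovered.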

To get the bug fixed on the client side, you need to pester the support of each such client. Macroscop already replied in the previous post, so perhaps they will read this one too. I have now moved entirely to AxxonNext; their support has been notified, though I have been pestering them about this bug for a long time. Users who hit this problem: I urge you to create tickets and ask the makers of your software to act on their side. Erlyvideo recently added support for recovering a stream after corruption, based on my report.

The resynchronization code in my test script should not be considered optimal: it was written quickly. A more correct (resynchronizing earlier and more accurately) and faster implementation is possible; still, it is fine as a starting point and as a proof of concept.

To get the bug fixed on the camera side, I tried to pester everyone I could reach: I wrote to the Chinese seller of the cameras assembled from these modules. I wrote to other vendors. I tried to write to the camera manufacturer. I tried to write to the maker of the embedded Linux running on the camera. I wrote on Habr.

Unfortunately, the result was zero: I am too small a fish. Fortunately, Andrey Syomochkin (deepweb), who works at ipeye.ru, reached out to me.

IPEYE provides cloud video storage and its own cameras, based on exactly the same modules as my outdoor cameras: modules from TopSee (TS38). They heavily rework the interface and firmware functionality of these cameras. As I understand it, though, they have no source code for the original firmware; they go through the existing cameras, extracting the needed modules, replacing software, and so on. Since they provide a cloud, most of the cameras connect over the open Internet. With such remote cameras UDP becomes unpleasant: the probability of loss is too high, even though the channel is plenty thick. The receiving server is erlyvideo (that is, flussonic). With the old version of flussonic (without resynchronization) the number of camera drop-offs was simply huge. The updated version (with resynchronization) reduces the number of reconnects dramatically (although the volume of losses is still unpleasant).

But I digress. Andrey offered to test my corrected streamer on his rig. That gave me access to test cameras far more easily than trying to build the sources from sigrand.

Treatment


Unfortunately, this article has already grown enormous, so the treatment method I applied will be described in a separate article in the coming days.

In short: unpack the firmware, extract the streamer from it, find the bad spot, patch the binary, and rebuild the firmware.

Bonus Bugs


While dealing with this bug, Max Lapshin (@erlyvideo) ran into another interesting bug in one camera. On the whole the camera looks beautiful: it uses the correct buffering approach. If a packet does not go out all at once, it stays in the buffer and is retried later as time permits; packets that overstay are silently dropped, and dropped as a whole, including the '$'/channel/size header. So it behaves like a beautiful UDP-style drop over TCP.
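
That whole-frame buffering scheme can be sketched as follows (a hypothetical simplification; the class and its names are mine, not the camera's code): a frame either enters the per-client queue whole or is dropped whole, so the header and payload always travel together.

```python
from collections import deque

class PerClientQueue:
    """Sketch of sane per-client buffering: each interleaved frame is
    queued (and, on overflow, dropped) only as a whole, so the TCP
    stream can never be cut in the middle of a frame."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.frames = deque()

    def enqueue(self, frame: bytes) -> bool:
        if self.used + len(frame) > self.capacity:
            return False   # silent drop of the WHOLE frame, header included
        self.frames.append(frame)
        self.used += len(frame)
        return True

    def dequeue(self) -> bytes:
        frame = self.frames.popleft()
        self.used -= len(frame)
        return frame
```

The send loop then pops frames as the socket drains; a slow client only ever loses complete frames, never synchronization.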

But this buffering approach has one side effect: the reply to keep-alive requests sent through the TCP stream (GET_PARAMETER or OPTIONS) goes out the moment the server receives the request. Immediately. As soon as the request arrives. That means the reply can land even in the middle of a data packet! So if you try to decode the stream "as is" ('$' + channel + length...), then every time a keepalive is sent (every 30 seconds, since without them the camera tears down the stream after 45), an RTSP reply arrives instead of the expected '$': the stream breaks, ~100 bytes of "garbage" come in, and a glitch appears in the video that is only corrected by the next key frame.

Treating this bug on the client side is harder: you must first find and consume (remove) the RTSP...\r\n\r\n reply, and only then look for '$' + channel + length + video in the remainder. The latest version of flussonic already knows how to do this.
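
A sketch of that client-side fix (my own simplification, not flussonic's code): cut every complete RTSP reply out of the byte stream before the '$'-framed parsing runs.

```python
def excise_rtsp_replies(buf: bytes):
    """Remove complete RTSP replies from a byte stream, wherever they
    appear (even mid-frame), returning the cleaned stream plus the
    replies that were cut out. Incomplete replies are left for later."""
    replies = []
    while True:
        start = buf.find(b"RTSP/")
        if start < 0:
            return buf, replies
        end = buf.find(b"\r\n\r\n", start)
        if end < 0:
            return buf, replies   # reply not fully received yet; wait
        replies.append(buf[start:end + 4])
        buf = buf[:start] + buf[end + 4:]
```

Obvious limitation of this naive version: if the binary payload itself happens to contain the bytes "RTSP/", it misfires; a real implementation would also track the frame boundaries it already knows about.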

Pills


Details of the fix (what was done and how) will come later; for now, if your camera is built on a TS38 module, you can simply install firmware with my changes, where the TCP problem is completely cured.

So, I have built the following firmware images based on the latest version 2.5.0.6:
  1. firmware_TS38ABFG006-ONVIF-P2P-V2.5.0.6_20140126120110-TCPFIX.bin
  2. firmware_TS38CD-ONVIF-P2P-V2.5.0.6_20140126121011-TCPFIX.bin
  3. firmware_TS38HI-ONVIF-P2P-V2.5.0.6_20140126121444-TCPFIX.bin
  4. firmware_TS38LM-ONVIF-P2P-V2.5.0.6_20140126121913-TCPFIX.bin
  5. firmware_HI3518C-V4-ONVIF-V2.5.0.6_20140126124339-TCPFIX.bin


If you need a fix for some other module from the same manufacturer, write in the comments and I will see what I can do.

Source: https://habr.com/ru/post/219491/

