
VoIP is an umbrella term: a set of technologies, protocols and plain buzzwords that relate to transmitting voice (and video!) over computer networks (local or the Internet) instead of telephone lines. And yes, most telecom providers still carry voice over their own networks rather than the Internet, with expensive boxes that T1 and E1 lines plug into.
For most IT people who do not work in telecom, VoIP means RTP/RTCP for voice and video plus SIP for negotiating who transmits what and how. This bundle is what lets you connect office "SIP phones" to Bitrix24 or Asterisk. Both protocols can run over either TCP or UDP. Voice over RTP raises no questions: with the rarest exceptions it uses UDP, and codecs compensate for lost packets, so the other party hardly "croaks" even on a far from perfect link. With SIP the story is sadder.
In general, SIP is a kind of HTTP for telephony. There is a wonderful two-part post about it on Habr; if you are interested, I highly recommend it. The key difference from HTTP is that SIP has to work in both directions at all times. HTTP spent many years solving the task "send a message from the server to the client": it went through AJAX and WebSockets, and has now finally stabilized as HTTP/2. SIP had to solve this task right away: not only does the phone send a request saying "here I am, ready to receive calls", but the server must also be able to ask the client at any moment "will you take this call?".
And TCP, for all its merits, has its drawbacks. For example, connections can "rot". If you do not set a short keep-alive (and SIP was created in 1996) and do not constantly send pings, the "connection" can break on one side without the other side noticing. It is fine if the break happened on the client side: the client notices and reconnects. But what if it broke on the server side? Then you have to wait for the client's keep-alive or some other timeout before it reconnects. And in the meantime a call comes in for that client.
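To make the keep-alive point concrete, here is a minimal Python sketch of turning on TCP keep-alive probes for a socket. The function name and the timer values are made up for illustration, and the fine-grained options (TCP_KEEPIDLE and friends) exist only on some platforms, so the sketch sets them only where Python exposes them.

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keep-alive probes for an already created socket.

    idle, interval and probes are illustrative values, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The detailed timers are platform-specific: set only the ones that exist.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
```

Without something like this, a dead peer is usually discovered only on the next write, which is exactly the "rotten connection" described above.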

And 20 years ago, when all this was being designed, reconnection was a very sore issue. With the UDP version of SIP, clients send REGISTER packets from time to time (once an hour), saying "here I am, ready to receive calls". If the server is restarted or goes down, another server with the same IP address can pick up all these clients. With TCP, however, the clients have to reconnect. Ten thousand phones in one building (a usual situation) would all rush to reconnect and kill the server...
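For illustration, this is roughly what such an "I'm here" message looks like. It is a minimal Python sketch: the function name, user, domain and addresses are placeholders, and a real phone adds authentication and many more headers.

```python
import uuid

def build_register(user, domain, local_ip, local_port=5060, expires=3600, cseq=1):
    """Build a bare-bones SIP REGISTER request meant to be sent over UDP."""
    aor = f"sip:{user}@{domain}"  # the address being registered
    lines = [
        f"REGISTER sip:{domain} SIP/2.0",
        # The branch parameter must start with the magic cookie z9hG4bK.
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch=z9hG4bK{uuid.uuid4().hex}",
        "Max-Forwards: 70",
        f"From: <{aor}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <{aor}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        f"CSeq: {cseq} REGISTER",
        f"Contact: <sip:{user}@{local_ip}:{local_port}>",
        f"Expires: {expires}",  # 3600 seconds, the "once an hour" from the text
        "Content-Length: 0",
    ]
    # SIP is plain text with CRLF line endings and an empty line after the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

message = build_register("alice", "example.com", "192.0.2.10")
```

In this stripped-down form the message is a few hundred bytes. Since each REGISTER is a self-contained datagram, any server that answers on the same IP address can process it without a connection being re-established first.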
In general, everyone expected SIP to run over TCP, and it ended up on UDP. Now, 20 years later, more and more companies are switching to "TCP only". Nevertheless, a huge fleet of devices, clients and infrastructure has run on UDP, runs on UDP, and will keep running on UDP for the next few years.

And the combination of "SIP and UDP" has a problem, which is what this post is actually about. The problem is called "fragmentation". By itself, fragmentation of network packets is almost harmless: the network adapter has a configured MTU, the maximum size of a packet that can be transmitted in one piece. In most cases it is 1500 bytes. For TCP and the vast majority of UDP-based protocols this does not matter: they simply never send packets larger than the MTU, and use their own mechanisms to transfer large amounts of data in pieces, detecting lost pieces and resending them if necessary.
And SIP is like HTTP: textual. In recent years, a bunch of additional header fields and all sorts of XML/JSON payloads have been added to simple commands like "I'm here" and "you have a call". The resulting packets began to regularly exceed the MTU. From this point on, the story becomes quite sad.
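A back-of-the-envelope check of when a SIP message stops fitting into one packet, as a Python sketch. It assumes IPv4 without options (a 20-byte header) and the usual 8-byte UDP header; the function name is made up for illustration.

```python
import math

IP_HEADER = 20   # IPv4 header without options, bytes
UDP_HEADER = 8   # UDP header, bytes

def ip_fragments(payload_len, mtu=1500):
    """Number of IPv4 packets needed to carry one UDP datagram of payload_len bytes."""
    datagram = UDP_HEADER + payload_len   # what the IP layer has to carry
    per_packet = mtu - IP_HEADER          # room for data in one IP packet
    if datagram <= per_packet:
        return 1                          # fits, no fragmentation
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = per_packet // 8 * 8
    return math.ceil(datagram / per_fragment)

print(ip_fragments(1472))  # 1: the largest SIP message that fits at MTU 1500
print(ip_fragments(1473))  # 2: one extra byte and the packet is fragmented
print(ip_fragments(3000))  # 3: an INVITE with a fat payload
```

So with the default MTU everything above 1472 bytes of SIP text gets fragmented. RFC 3261 is even more conservative: it requires a congestion-controlled transport such as TCP once a request is larger than 1300 bytes and the path MTU is unknown.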
The modern Internet is, in fact, not very well adapted to fragmented network packets. Equipment manufacturers and software authors reasonably assume that most software does not need them: TCP splits data into pieces itself, and UDP carries real-time commands, video and sound, which have no need to exceed the MTU and where a lost packet is no disaster. So they do not really care how well their products support fragmented packets.
Switches can crash. "Corporate" hardware that builds its own add-ons on top of Ethernet/IP cannot handle fragmentation and silently ignores such packets. And on some pieces of hardware, the buffer for reassembling fragmented packets holds a whole 200 of them. It overflows surprisingly easily.
We see all of this in technical support as "calls work, except when they don't". It is good when there is a team of engineers who can request SIP logs from the telecom operator, read them carefully and point a finger at the "problem" spot. It is much worse when the company has a single Asterisk and a remote administrator who maintains it. In that case it is "just magic, sometimes calls do not get through".
Now everyone is trying to use SIP over TCP, but over the past few years the problem of infrastructure lagging behind has become very acute. More and more video and voice calls are made through browsers (the recently released Skype for Web alone says a lot). We know perfectly well how fast the web develops: the web developers who rushed in brought the familiar JSON and XML payloads with them, and videoconferences for 20-30 people are no longer the preserve of large companies, where bearded admins could spend weeks tuning things until they worked.
Conclusion: if you see SIP and UDP in the same place, be careful. And switch to TCP as soon as possible!
The links are borrowed from this article, which examines the problem in much more depth and also recommends using TCP if possible. And the picture to attract attention is from here.