If you ever need to quickly figure out what VoIP (voice over IP) is and what all its wild abbreviations mean, I hope this tutorial will help. Note up front that configuring supplementary telephony services (call transfer, voice mail, conference calls, and so on) is not covered here.
In general, a local subscriber is connected to a telephone provider over an ordinary telephone line as follows:
On the provider side (the PBX) there is a telephone module with an FXS (Foreign eXchange Subscriber) port. At home or in the office there is a telephone or fax machine with an FXO (Foreign eXchange Office) port and a dialer module.
In appearance the FXS and FXO ports are identical: both are ordinary 6-pin RJ11 connectors. They are easy to tell apart with a voltmeter, though. An FXS port always carries some voltage: 48/60 V when the handset is on-hook, or 6–15 V during a call. An FXO port that is not connected to a line always shows 0 V.
Transferring data over a telephone line requires additional logic: on the provider side it can be implemented with a SLIC (Subscriber Line Interface Circuit) module, and on the subscriber side with a DAA (Direct Access Arrangement) module.
Nowadays wireless DECT phones (Digital Enhanced Cordless Telecommunications, originally Digital European Cordless Telecommunications) are quite popular. In design they are similar to ordinary telephones: they also have an FXO port and a dialer module, plus a 1.9 GHz wireless link between the base station and the handset.
Subscribers are connected to the PSTN (Public Switched Telephone Network). The PSTN can be built on different technologies: ISDN, fiber optics, POTS, Ethernet. The special case of a PSTN that runs over a regular analog copper line is called POTS (Plain Old Telephone Service).
With the development of the Internet, telephony has moved to a new level. Fixed telephone sets are used less and less, mainly for business needs. DECT phones are a bit more convenient but are limited to the perimeter of the house. GSM phones are more convenient still but are limited by national borders (roaming is expensive). IP phones, including softphones, have no restrictions other than the need for Internet access.
Skype is the best-known example of a softphone. It can do a lot, but it has two important flaws: a closed architecture, and wiretapping by who knows which authorities. Because of the first, you cannot build your own private telephone network. Because of the second, it is unpleasant to know you may be listened to, especially during personal or commercial conversations.
Fortunately, there are open protocols for building your own communication networks with extra features: SIP and H.323. There are somewhat more softphones for SIP than for H.323, which can be explained by SIP's relative simplicity and flexibility. Sometimes, though, that flexibility can get badly in the way. Both SIP and H.323 use RTP to transfer media data.
Let's look at the basic principles of SIP to see how two subscribers are connected.
SIP (Session Initiation Protocol) is a protocol for establishing sessions (not only telephone ones). It is a text protocol that runs over UDP. SIP can also run over TCP, but that is rare.
SDP (Session Description Protocol) is a protocol for negotiating the type of data to be transmitted (codecs and their formats for audio and video; transmission rate and error correction for faxes) and the destination address (IP and port). It is also a text protocol. SDP parameters are carried in the body of SIP packets.
RTP (Real-time Transport Protocol) is a protocol for transmitting audio and video. It is a binary protocol that runs over UDP.
The general structure of SIP packets:
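As a rough illustration of that layout (a start line, then header lines, then an empty line, then an optional body), here is a minimal Python sketch. It is not the original figure: the function name, the addresses, and the tag values are all made up for the example, and repeated headers (several Via lines, say) are simply overwritten.

```python
def parse_sip_message(raw: str):
    """Split a SIP message into (start_line, headers, body).

    Simplification: a header that appears several times keeps only
    its last value.
    """
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body


# Hypothetical INVITE without a body, for illustration only.
EXAMPLE_INVITE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:alice@example.com>;tag=1928301774\r\n"
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: a84b4c76e66710@192.0.2.10\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

start_line, headers, body = parse_sip_message(EXAMPLE_INVITE)
print(start_line)
print(headers["CSeq"])
```

A request starts with a method and a URI, as here; a response would start with the protocol version and a status code, for example "SIP/2.0 200 OK".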
Here is an example of two SIP packets for one common procedure, call setup:
On the left is the SIP INVITE packet; on the right is the reply to it, SIP 200 OK.
The main fields are framed:
An SDP message consists of lines of FIELD=VALUE pairs. The main fields are:
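To make the FIELD=VALUE layout concrete, here is a small Python sketch that parses a hypothetical SDP body and pulls out the connection address, the audio port, and the offered payload types. The sample values are invented, and the fields shown (v, o, s, c, t, m, a) are just common SDP fields, not necessarily the ones the original list singled out.

```python
def parse_sdp(sdp: str):
    """Return the SDP lines as a list of (field, value) pairs."""
    fields = []
    for line in sdp.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("=")
        fields.append((key, value))
    return fields


def audio_summary(fields):
    """Extract (address, port, payload_types) for the audio stream."""
    address, port, payload_types = None, None, []
    for key, value in fields:
        if key == "c":
            # Connection line, e.g. "IN IP4 192.0.2.10"
            address = value.split()[-1]
        elif key == "m" and value.startswith("audio"):
            # Media line, e.g. "audio 49170 RTP/AVP 0 8"
            parts = value.split()
            port = int(parts[1])
            payload_types = [int(pt) for pt in parts[3:]]
    return address, port, payload_types


# Hypothetical SDP offer: G.711 ulaw (0) and alaw (8) on port 49170.
EXAMPLE_SDP = """v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0 8
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
"""

print(audio_summary(parse_sdp(EXAMPLE_SDP)))
```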
RTP packets contain audio or video data encoded in a specific format. The format is indicated in the PT (payload type) field. The table mapping values of this field to formats is given at https://en.wikipedia.org/wiki/RTP_audio_video_profile.
RTP packets also carry a unique SSRC identifier (it identifies the source of the RTP stream) and a timestamp (used to play sound or video back evenly).
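The fields mentioned above sit in the fixed 12-byte RTP header, so they are easy to pull out by hand. Below is a minimal Python sketch; the function name and the sample packet are made up, and header extensions and CSRC lists are ignored.

```python
import struct


def parse_rtp_header(packet: bytes):
    """Decode the fixed 12-byte RTP header (no CSRC list, no extension)."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the fixed header")
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,          # always 2 for RTP
        "marker": byte1 >> 7,
        "payload_type": byte1 & 0x7F,   # the PT field
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, PT 8 (G.711 alaw), 160 bytes of payload.
EXAMPLE_PACKET = struct.pack("!BBHII", 0x80, 8, 1, 160, 0x12345678) + bytes(160)

print(parse_rtp_header(EXAMPLE_PACKET))
```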
An example of two SIP subscribers interacting through a SIP server (Asterisk):
As soon as a SIP phone starts, the first thing it does is register on the remote server (the SIP registrar) by sending it a SIP REGISTER message.
To call a subscriber, the phone sends a SIP INVITE message whose body contains an SDP message specifying the audio/video transmission parameters (which codecs are supported, which IP and port to send audio to, and so on).
When the remote subscriber picks up the phone, we receive a SIP 200 OK message, also with SDP parameters, this time those of the remote subscriber. Using the sent and received SDP parameters, an RTP audio/video session or a T.38 fax session can be set up.
If the received SDP parameters do not suit us, or the intermediate SIP server decides not to pass RTP traffic through itself, the SDP renegotiation procedure, the so-called REINVITE, is performed. Incidentally, this procedure is the reason free SIP proxy servers have one drawback: if both subscribers are on the same local network and the proxy server is behind NAT, then after the RTP traffic is redirected neither subscriber will hear the other.
After the call is over, the subscriber who hangs up sends a SIP BYE message.
Sometimes, after a session is established, access to supplementary services (call hold, transfer, voice mail, and so on) is needed during the call. These services react to certain combinations of pressed buttons.
On a regular phone line there are two ways to dial: pulse dialing and tone (DTMF) dialing.
During a conversation, the pulse method is inconvenient for transmitting a pressed button. Transmitting “0” takes approximately 1 second (10 pulses of 100 ms each: 60 ms line break, 40 ms line closure) plus 200 ms for the pause between digits. In addition, characteristic clicks are often heard during pulse dialing. Therefore, conventional telephony uses only tone mode to access supplementary services.
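The arithmetic above is easy to check with a few lines of Python. This is only a sketch: it assumes the usual mapping where digit N is sent as N pulses and "0" as 10 pulses, and it counts one 200 ms pause after every digit, including the last one.

```python
PULSE_MS = 100             # 60 ms line break + 40 ms line closure
INTERDIGIT_PAUSE_MS = 200  # pause between digits


def pulses_for_digit(digit: int) -> int:
    """Digit 0 is sent as 10 pulses, any other digit N as N pulses."""
    return 10 if digit == 0 else digit


def pulse_dial_time_ms(number: str) -> int:
    """Total time needed to pulse-dial a string of digits."""
    total = 0
    for ch in number:
        total += pulses_for_digit(int(ch)) * PULSE_MS + INTERDIGIT_PAUSE_MS
    return total


print(pulse_dial_time_ms("0"))    # 1200
print(pulse_dial_time_ms("911"))  # 1700
```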
In VoIP telephony, information about pressed buttons can be transmitted in three ways: inside the audio data (Inband DTMF), as RFC2833 events in RTP packets, or in SIP INFO messages.
Transmitting DTMF inside the audio data (Inband) has several drawbacks: the overhead of generating or embedding tones and of detecting them, the limitations of some codecs, which can distort DTMF tones, and poor reliability (if some packets are lost, a single key press can be detected as two).
The main difference between DTMF over RFC2833 and over SIP INFO is this: if the SIP proxy server can hand RTP off directly between subscribers, bypassing itself (for example, canreinvite = yes in Asterisk), the server will not see the RFC2833 packets, so the supplementary services become unavailable. SIP packets always pass through the SIP proxy servers, so with SIP INFO the supplementary services always work.
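For reference, an RFC2833 DTMF event travels as a 4-byte RTP payload: the event code, an end flag plus volume, and a duration in timestamp units. Here is a minimal Python sketch of packing and unpacking that payload; the function names and the default volume are my own choices for the example.

```python
import struct

# Event codes for the 16 DTMF keys: digits 0-9, then * # A B C D.
EVENT_CODES = {str(d): d for d in range(10)}
EVENT_CODES.update({"*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15})


def encode_dtmf_event(key: str, duration: int, end: bool = False, volume: int = 10) -> bytes:
    """Build the 4-byte telephone-event payload.

    duration is in RTP timestamp units (800 units = 100 ms at 8000 Hz);
    volume is the tone level in -dBm0 (0..63).
    """
    flags = (0x80 if end else 0) | (volume & 0x3F)
    return struct.pack("!BBH", EVENT_CODES[key], flags, duration)


def decode_dtmf_event(payload: bytes):
    """Return (key, end_flag, volume, duration) from a telephone-event payload."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    key = next(k for k, code in EVENT_CODES.items() if code == event)
    return key, bool(flags & 0x80), flags & 0x3F, duration


payload = encode_dtmf_event("5", duration=800, end=True)
print(payload.hex())
print(decode_dtmf_event(payload))
```

The RTP payload type used for these events is dynamic and is agreed on in SDP, which is why a server that does not see the RTP stream never sees the key presses.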
As already mentioned, the RTP protocol is used to transfer media data. In the RTP packets, the format of the transmitted data (codec) is always indicated.
There are many voice codecs with different trade-offs between bitrate, quality, and complexity; some are open and some are proprietary. Every softphone supports the G.711 alaw/ulaw codecs: they are very simple to implement and the sound quality is decent, but they require 64 kbps of bandwidth. The G.729 codec, by comparison, needs only 8 kbps, but it loads the processor heavily and, besides, it is not free.
Faxes are usually transmitted using either the G.711 codec or the T.38 protocol. Sending a fax over G.711 amounts to sending it with the T.30 protocol as if over a normal telephone line, except that the analog signal from the line is digitized according to the alaw/ulaw law. This is also called Inband T.30 fax transmission.
Fax machines use the T.30 protocol to negotiate their parameters: transmission speed, datagram size, type of error correction. The T.38 protocol is based on T.30, but unlike Inband transmission it analyzes the generated and received T.30 commands. So what is transmitted is not raw data but recognized fax control commands.
T.38 commands are carried by UDPTL, a UDP-based protocol that is used only for T.38. TCP and RTP can also carry T.38 commands, but they are used much less often.
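Keep in mind that 64 kbps and 8 kbps are codec bitrates only; on the wire every packet also carries RTP, UDP, and IP headers. The Python sketch below estimates the real load under assumptions that are mine, not the article's: 20 ms of audio per packet, IPv4 without options, and no link-layer framing.

```python
# Per-packet header overhead: IPv4 (20) + UDP (8) + RTP (12) bytes.
IP_UDP_RTP_OVERHEAD_BYTES = 20 + 8 + 12


def wire_bitrate_kbps(codec_kbps: float, packet_ms: float = 20) -> float:
    """Approximate IP-level bitrate of one RTP stream, in kbps."""
    payload_bytes = codec_kbps * 1000 * packet_ms / 1000 / 8
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + IP_UDP_RTP_OVERHEAD_BYTES) * 8 * packets_per_second / 1000


print(wire_bitrate_kbps(64))  # G.711: 80.0
print(wire_bitrate_kbps(8))   # G.729: 24.0
```

So under these assumptions the gap between the two codecs on the wire is closer to 3x than to the 8x the raw codec bitrates suggest.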
The main advantages of T.38 are a reduction in the load on the network and greater reliability compared to the Inband transmission of a fax.
The procedure for sending a fax in T.38 mode is as follows:
Over the Internet it is preferable to send faxes with T.38. If the fax is sent inside an office or between sites with a stable connection, Inband T.30 transmission can be used. In that case echo cancellation must be disabled before the fax is transmitted so that it does not introduce additional distortion.
Fax transmission is covered in great detail in the book "Fax, Modem, and Text for IP Telephony" by David Hanes and Gonzalo Salgueiro.
We have covered the protocols for establishing a call session (SIP/SDP) and the way audio is transmitted over RTP. One important question remains: sound quality. On the one hand, sound quality is determined by the chosen codec. On the other hand, additional DSP (digital signal processing) procedures are needed. These procedures account for the peculiarities of VoIP telephony: a good headset is not always used, packets get dropped on the Internet, packets sometimes arrive unevenly, and network bandwidth is not unlimited either.
Basic procedures that improve sound quality:
VAD (Voice activity detector) is a procedure that classifies frames as containing voice (active voice frames) or silence (inactive voice frames). This separation can significantly reduce network load, since conveying silence requires much less data (it is enough to transmit the noise level, or nothing at all).
Some codecs (GSM, G.729) already include a VAD; for others (G.711, G.722, G.726) it has to be implemented separately.
If the VAD is configured to transmit noise-level information, special SID (Silence Insertion Descriptor) packets are sent using RTP payload type 13, CN (Comfort Noise).
It is worth noting that SID packets can be dropped by SIP proxy servers, so for testing it is advisable to route RTP traffic around the SIP servers.
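To give a feel for what a VAD does, here is about the simplest possible version in Python: a frame counts as voice if its mean energy exceeds a fixed threshold. This is only a sketch; the frame length, the threshold, and the test tone are arbitrary assumptions, and real detectors (such as the one built into G.729) use more features and adaptive thresholds.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def detect_voice(samples, frame_len=160, threshold=1e-3):
    """Return one flag per complete frame: True = active, False = silence.

    frame_len=160 corresponds to 20 ms at 8000 Hz.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy(frame) > threshold)
    return flags


def make_tone(count, freq_hz=440, rate=8000, amplitude=0.5):
    """Stand-in for speech: a sine wave with samples in the range [-1, 1]."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / rate) for n in range(count)]


silence = [0.0] * 160
signal = silence + make_tone(160) + silence
print(detect_voice(signal))  # [False, True, False]
```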
CNG (comfort noise generation) is a procedure that generates comfort noise from the information in SID packets. VAD and CNG thus work as a pair, but CNG is in much less demand, since its effect is not always noticeable, especially at low volume.
PLC (packet loss concealment) is a procedure that restores the audio stream when packets are lost. Even at 50% packet loss, a good PLC algorithm can achieve acceptable speech quality. There will be distortion, of course, but the words can be made out.
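The crudest form of concealment is to replay the last good frame, fading it out if several frames in a row are missing. The Python sketch below shows just that idea; it is not one of the good algorithms mentioned above, and the function name and the attenuation factor are assumptions made for the example.

```python
def conceal_losses(frames, frame_len=160, attenuation=0.5):
    """Replace lost frames (None) with a faded copy of the last good frame.

    Each consecutive loss multiplies the replay gain by `attenuation`,
    so a long burst of losses decays towards silence.
    """
    output = []
    last_good = [0.0] * frame_len
    gain = 1.0
    for frame in frames:
        if frame is None:
            gain *= attenuation
            output.append([s * gain for s in last_good])
        else:
            gain = 1.0
            last_good = frame
            output.append(frame)
    return output


# Two frames lost in a row between two good ones (4 samples per frame).
received = [[1.0] * 4, None, None, [2.0] * 4]
print(conceal_losses(received, frame_len=4))
```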
The easiest way to emulate packet loss on Linux is the tc utility from the iproute package together with the netem module. It shapes outgoing traffic only.
An example of starting network emulation with 50% packet loss:
tc qdisc add dev eth1 root netem loss 50%
Disable emulation:
tc qdisc del dev eth1 root
A jitter buffer is a procedure for removing the jitter effect, in which the interval between received packets varies widely and, in the worst case, packets arrive out of order. Jitter also causes interruptions in speech. To eliminate it, the receiving side needs a packet buffer large enough to restore the original packet order and release the packets at a fixed interval.
Jitter can also be emulated with the tc utility (the gap between a packet's expected and actual arrival time can reach 500 ms):
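The reordering part of a jitter buffer can be sketched in a few lines of Python with a heap keyed by the RTP sequence number. This is a simplification under my own assumptions: packets are released once the buffer holds more than a fixed number of them rather than by a playout timer, and sequence number wraparound and late packets are ignored.

```python
import heapq


class JitterBuffer:
    """Holds up to `depth` packets and releases them in sequence order."""

    def __init__(self, depth=3):
        self.depth = depth
        self._heap = []

    def push(self, sequence, payload):
        """Store a received packet; arrival order does not matter."""
        heapq.heappush(self._heap, (sequence, payload))

    def pop_ready(self):
        """Release the oldest packets while the buffer is over its depth."""
        ready = []
        while len(self._heap) > self.depth:
            ready.append(heapq.heappop(self._heap))
        return ready

    def flush(self):
        """Release everything that is left, in order (end of stream)."""
        ready = []
        while self._heap:
            ready.append(heapq.heappop(self._heap))
        return ready


buf = JitterBuffer(depth=2)
played = []
for seq in [1, 3, 2, 5, 4, 6]:       # packets arrive out of order
    buf.push(seq, b"audio")
    played += [s for s, _ in buf.pop_ready()]
played += [s for s, _ in buf.flush()]
print(played)  # [1, 2, 3, 4, 5, 6]
```

A deeper buffer tolerates worse reordering but adds delay, which is the trade-off any real jitter buffer has to balance.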
tc qdisc add dev eth1 root netem delay 500ms reorder 99%
LEC (Line Echo Canceller) is a procedure for eliminating local echo, where the remote subscriber starts to hear his own voice. Its essence is to subtract the received signal, scaled by a certain coefficient, from the transmitted signal.
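The "subtract with a coefficient" idea can be shown in Python with a deliberately simplified single-tap model. It assumes the echo is an undelayed, scaled copy of the received signal and estimates the scale by least squares; real line echo cancellers use adaptive multi-tap filters that also handle delay. The function names and sample values are made up.

```python
def estimate_echo_gain(received, transmitted):
    """Least-squares estimate of how much of `received` leaks into `transmitted`."""
    numerator = sum(r * t for r, t in zip(received, transmitted))
    denominator = sum(r * r for r in received)
    return numerator / denominator if denominator else 0.0


def cancel_echo(received, transmitted):
    """Subtract the scaled received signal from the signal about to be sent."""
    gain = estimate_echo_gain(received, transmitted)
    return [t - gain * r for r, t in zip(received, transmitted)]


# The far-end voice leaks into the outgoing signal at a quarter of its level;
# the local subscriber is silent, so everything outgoing is echo.
far_end = [1.0, -2.0, 3.0, -4.0]
outgoing = [0.25 * s for s in far_end]
print(cancel_echo(far_end, outgoing))
```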
Echo can occur for several reasons:
Finding the cause (acoustic or electrical echo) is easy: the subscriber on whose side the echo is created should switch off the microphone. If the echo still occurs, it is electrical.
For more information about VoIP and DSP procedures, see the book VoIP Voice and Fax Signal Processing. A preview is available on Google Books.
This completes the theoretical overview of VoIP. If there is interest, a practical implementation of a mini-PBX on a real hardware platform can be covered in the next article.
Questions and comments are welcome. They will be answered by the author of the article, Dmitry Valento, a software engineer at the Promwad electronics design center.
Source: https://habr.com/ru/post/188336/