IP telephony: from copper wires to digital signal processing

If you ever need to quickly figure out what VoIP (voice over IP) is and what all those wild abbreviations mean, I hope this tutorial helps. One note up front: configuring supplementary telephony services (call transfer, voice mail, conference calls, etc.) is not covered here.

So, here is what we will cover below the cut:

Basic concepts of telephony: types of devices, wiring diagrams
SIP / SDP / RTP protocol bundle: how it works
How information about the pressed buttons is transmitted
How voice and faxes are transmitted
Digital signal processing and quality assurance in IP telephony

1. Basic telephony concepts

In general, a local subscriber is connected to a telephone provider over an ordinary telephone line as follows:

On the provider side (PBX), there is a telephone module with an FXS (Foreign eXchange Subscriber) port. At home or in the office, a telephone or fax machine with an FXO (Foreign eXchange Office) port and a dialer module is installed.

The FXS and FXO ports look identical: both are ordinary 6-pin RJ11 connectors. They are easy to tell apart with a voltmeter, though. An FXS port always carries some voltage: 48/60 V when the handset is on-hook, or 6–15 V during a call. An FXO port that is not connected to a line always reads 0 V.

Transferring data over a telephone line requires additional logic: on the provider side it can be implemented in a SLIC (Subscriber Line Interface Circuit) module, and on the subscriber side in a DAA (Direct Access Arrangement) module.

Cordless DECT (Digital European Cordless Telecommunications) phones are quite popular nowadays. Internally they resemble ordinary telephones: they also have an FXO port and a dialer module, plus a 1.9 GHz radio module for communication between the base station and the handset.

Subscribers are connected to the PSTN (Public Switched Telephone Network), the public telephone network. The PSTN can be built on different technologies: ISDN, optical fiber, POTS, Ethernet. POTS (Plain Old Telephone Service) is the special case of the PSTN that uses a regular analog copper line.

With the development of the Internet, telephony has moved to a new level. Fixed telephone sets are used less and less, mainly for business needs. DECT phones are a bit more convenient, but limited to the perimeter of the house. GSM phones are more convenient still, but limited to the country’s borders (roaming is expensive). IP phones, including softphones, have no restrictions other than the need for Internet access.

Skype is the best-known example of a softphone. It can do a lot, but it has two important flaws: a closed architecture and wiretapping by you-know-which authorities. Because of the first, you cannot build your own small telephone network. Because of the second, it is unpleasant to know that you may be listened to, especially during personal and commercial conversations.

Fortunately, there are open protocols for building your own communication networks with all the extras: SIP and H.323. Somewhat more softphones support SIP than H.323, which can be explained by its relative simplicity and flexibility. Sometimes, though, this flexibility can throw a big spanner in the works. Both SIP and H.323 use RTP to transfer media data.

Let's look at the basic principles of SIP to see how two subscribers get connected.

2. Description of the SIP / SDP / RTP protocol bundle

SIP (Session Initiation Protocol) is a protocol for establishing sessions (not only telephone ones). It is a text protocol that usually runs over UDP. SIP over TCP is also possible, but rare.

SDP (Session Description Protocol) is a protocol for negotiating the type of transmitted data (codecs and their formats for audio and video; transmission rate and error correction for faxes) and its destination address (IP and port). It is also a text protocol. SDP parameters are carried in the body of SIP packets.

RTP (Real-time Transport Protocol) is an audio / video transmission protocol. This is a binary protocol over UDP.

The general structure of SIP packets:

Start-Line: indicates the SIP method (command) in a request, or the result of the SIP method in a response.
Headers: additional information for the Start-Line, formatted as lines containing ATTRIBUTE: VALUE pairs.
Body: binary or text data. Usually used to carry SDP parameters or messages.

Here is an example of two SIP packets for one common procedure, call setup:

On the left are the contents of the SIP INVITE packet; on the right is the response to it, SIP 200 OK.

The main fields are framed:

Method / Request-URI - contains the SIP method and the URI. In the example a session is being established (the INVITE method), and the subscriber being called is 555@192.168.1.200.
Status-Code - the response code to the previous SIP command. In this example the command succeeded (code 200), i.e. subscriber 555 picked up the phone.
Via - the address at which subscriber 777 is waiting for a response. In the 200 OK message this field is copied from the INVITE message.
From / To - the display name and address of the sender and the recipient of the message. In the 200 OK message these fields are copied from the INVITE message.
CSeq - contains the sequence number of the command and the name of the method the message belongs to. In the 200 OK message this field is copied from the INVITE message.
Content-Type - the type of data carried in the Body block, in this case SDP data.
Connection Information - the IP address to which the second subscriber must send RTP packets (or UDPTL packets in the case of T.38 fax transmission).
Media Description - the port to which the second subscriber must send the specified data. In this case it is audio (audio RTP/AVP), followed by a list of supported data types: the PCMU, PCMA and GSM codecs and DTMF signals.

An SDP message consists of lines containing FIELD=VALUE pairs. The main fields are:

o - Origin, the name of the session organizer and the session identifier.
c - Connection Information, field described earlier.
m - Media Description, field described earlier.
a - media attributes that specify the format of the transmitted data. For example, they indicate the direction of the audio (receive, transmit, or both: sendrecv) and, for codecs, the payload type number and the sampling frequency (rtpmap).
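To make the packet structure more tangible, here is a minimal Python sketch that splits a SIP message into Start-Line, Headers and Body, and then pulls the c, m and a=rtpmap fields out of the SDP. The INVITE text is hypothetical: only the subscribers 555 and 777, the address 192.168.1.200 and the codec list come from the description above; the other addresses, the port, the tag, the branch and the Call-ID are made up for illustration. A real parser has to handle much more (repeated headers, compact header names, several m= sections).

```python
# Hypothetical INVITE: addresses, port, tag, branch and Call-ID are invented.
SIP_INVITE = (
    "INVITE sip:555@192.168.1.200 SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.168.1.100:5060;branch=z9hG4bK-example\r\n"
    "From: \"777\" <sip:777@192.168.1.200>;tag=example-tag\r\n"
    "To: <sip:555@192.168.1.200>\r\n"
    "Call-ID: example-call-id@192.168.1.100\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Type: application/sdp\r\n"
    "\r\n"
    "v=0\r\n"
    "o=777 1 1 IN IP4 192.168.1.100\r\n"
    "s=-\r\n"
    "c=IN IP4 192.168.1.100\r\n"
    "t=0 0\r\n"
    "m=audio 40000 RTP/AVP 0 8 3 101\r\n"
    "a=rtpmap:0 PCMU/8000\r\n"
    "a=rtpmap:8 PCMA/8000\r\n"
    "a=rtpmap:3 GSM/8000\r\n"
    "a=rtpmap:101 telephone-event/8000\r\n"
    "a=sendrecv\r\n"
)


def parse_sip(message):
    """Split a SIP message into (start line, header dict, body)."""
    head, _, body = message.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split at the first colon only
        headers[name.strip()] = value.strip()
    return lines[0], headers, body


def parse_sdp(body):
    """Return (connection address, media port, {payload type: format})."""
    address, port, formats = None, None, {}
    for line in body.splitlines():
        field, _, value = line.partition("=")
        if field == "c":
            address = value.split()[-1]       # "IN IP4 <address>"
        elif field == "m":
            port = int(value.split()[1])      # "<media> <port> <proto> <formats>"
        elif field == "a" and value.startswith("rtpmap:"):
            number, name = value[len("rtpmap:"):].split(" ", 1)
            formats[int(number)] = name
    return address, port, formats


start_line, headers, body = parse_sip(SIP_INVITE)
address, port, formats = parse_sdp(body)
print(start_line)
print(headers["CSeq"], "|", headers["Content-Type"])
print(address, port, formats)
```

The address and port found this way are where the other side would send its RTP packets.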

RTP packets contain audio / video data encoded in a specific format. This format is indicated in the PT (payload type) field. A table mapping the values of this field to specific formats is given at https://en.wikipedia.org/wiki/RTP_audio_video_profile.

RTP packets also carry a unique SSRC identifier (which identifies the source of the RTP stream) and a timestamp (used to play back audio or video evenly).
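As an illustration, the fields mentioned above can be read from the 12-byte fixed RTP header with Python's struct module. This is a minimal sketch: it ignores CSRC lists, header extensions and padding, and the example packet (sequence number, timestamp, SSRC, payload bytes) is made up.

```python
import struct


def parse_rtp_header(packet):
    """Decode the 12-byte fixed RTP header into a dict of fields."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the 12-byte fixed header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,           # 2 for current RTP
        "csrc_count": first & 0x0F,
        "marker": second >> 7,
        "payload_type": second & 0x7F,   # PT: 0 = PCMU, 8 = PCMA, 3 = GSM, ...
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, PT 8 (PCMA), 160 payload bytes (20 ms at 8 kHz).
example = struct.pack("!BBHII", 0x80, 8, 1000, 160000, 0x1234ABCD) + b"\xd5" * 160
header = parse_rtp_header(example)
print(header)
```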

An example of the interaction of two SIP subscribers through a SIP server (Asterisk):

As soon as a SIP phone starts, the first thing it does is register on the remote server (the SIP registrar) by sending it a SIP REGISTER message.

When a subscriber is called, a SIP INVITE message is sent. Its body carries an SDP message that specifies the audio / video transmission parameters (which codecs are supported, which IP and port to send audio to, etc.).

When the remote subscriber picks up the phone, we receive a SIP 200 OK message, also with SDP parameters, this time those of the remote subscriber. Using the sent and received SDP parameters, an RTP audio / video session or a T.38 fax session can be set up.

If the received SDP parameters do not suit us, or the intermediate SIP server decides not to pass RTP traffic through itself, the SDP renegotiation procedure, the so-called REINVITE, is performed. Incidentally, this procedure is the reason for one drawback of free SIP proxy servers: if both subscribers are on the same local network and the proxy server is behind NAT, then after the RTP traffic is redirected, neither subscriber will hear the other.

After the call is over, the subscriber who hangs up sends a SIP BYE message.

3. Transmitting information about the pressed buttons

Sometimes, after a session is established, access to supplementary services (call hold, transfer, voice mail, etc.) is needed during the call. These services react to certain combinations of pressed buttons.

On a regular phone line there are two ways to dial:

Pulse - historically the first method, used mainly in rotary-dial phones. A digit is dialed by repeatedly breaking and closing the telephone line circuit, as many times as the digit indicates.
Tone - dialing with DTMF (Dual-Tone Multi-Frequency) codes: each button of the phone has its own combination of two sinusoidal signals (tones). Using the Goertzel algorithm, it is fairly easy to determine which button is pressed.
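To show how tone detection can work, here is a minimal Python sketch of the Goertzel algorithm applied to the standard DTMF frequency grid. The function names, the 8 kHz sampling rate and the 50 ms test tone are assumptions made for illustration. The detector simply picks the strongest row and column frequency, so it always returns some key; a real detector would also check absolute levels, the balance between the two tones and the tone duration.

```python
import math

LOW_FREQS = (697, 770, 852, 941)        # row frequencies, Hz
HIGH_FREQS = (1209, 1336, 1477, 1633)   # column frequencies, Hz
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, rate):
    """Signal power near `freq`, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_key(samples, rate=8000):
    """Pick the strongest low and high DTMF frequency and map them to a key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, LOW_FREQS[i], rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, HIGH_FREQS[i], rate))
    return KEYS[row][col]


def make_tone(key, rate=8000, duration=0.05):
    """Synthesize a clean test tone for `key` (sum of two sinusoids)."""
    row = next(i for i, keys in enumerate(KEYS) if key in keys)
    low, high = LOW_FREQS[row], HIGH_FREQS[KEYS[row].index(key)]
    count = int(rate * duration)
    return [
        math.sin(2 * math.pi * low * n / rate) + math.sin(2 * math.pi * high * n / rate)
        for n in range(count)
    ]


for key in "159#":
    print(key, "->", detect_key(make_tone(key)))
```

Compared with a full FFT, Goertzel evaluates only the eight frequencies of interest, which is why it is a common choice for DTMF detection on small processors.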

The pulse method is inconvenient for transmitting pressed buttons during a conversation. Transmitting a “0” takes approximately 1 second (10 pulses of 100 ms each: 60 ms line break, 40 ms line closure) plus 200 ms for the pause between digits. In addition, characteristic clicks are often heard during pulse dialing. Therefore, conventional telephony uses only tone dialing to access supplementary services.
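The arithmetic above is easy to check with a few lines of Python. The timings are the ones given in the text; the function name and the assumption that the 200 ms pause follows every digit are mine.

```python
def pulse_dial_time_ms(digits, break_ms=60, make_ms=40, pause_ms=200):
    """Total time to pulse-dial `digits`, using the timings from the text."""
    total = 0
    for digit in digits:
        pulses = 10 if digit == "0" else int(digit)  # "0" is sent as 10 pulses
        total += pulses * (break_ms + make_ms) + pause_ms
    return total


print(pulse_dial_time_ms("0"))    # 10 * 100 ms + 200 ms = 1200
print(pulse_dial_time_ms("555"))  # 3 * (5 * 100 ms + 200 ms) = 2100
```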

In VoIP telephony, information about the pressed buttons can be transmitted in three ways:

DTMF Inband - an audio tone is generated and transmitted within the audio data (the current RTP channel); this is ordinary tone dialing.
RFC2833 - a special telephone-event RTP packet is generated, containing information about the key pressed, the volume and the duration. The RTP payload type number used for DTMF RFC2833 packets is specified in the body of the SDP message, for example: a=rtpmap:98 telephone-event/8000.
SIP INFO - a SIP INFO packet is created with information about the key pressed, the volume and the duration.
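For the RFC2833 method, the telephone-event payload is just 4 bytes: the event code, an end-of-event flag with a 6-bit volume, and a 16-bit duration in RTP timestamp units. Below is a minimal Python sketch that packs and unpacks this payload. The function names and the example values are hypothetical, and the surrounding RTP header is left out.

```python
import struct

# Event codes for DTMF keys: digits 0-9, then * # A B C D.
EVENT_CODES = {str(d): d for d in range(10)}
EVENT_CODES.update({"*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15})


def pack_telephone_event(key, duration, volume=10, end=False):
    """Build the 4-byte payload: event | E,R,volume | duration (timestamp units)."""
    flags = (0x80 if end else 0x00) | (volume & 0x3F)  # reserved bit stays 0
    return struct.pack("!BBH", EVENT_CODES[key], flags, duration)


def unpack_telephone_event(payload):
    """Decode the 4-byte payload back into its fields."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    key = next(k for k, code in EVENT_CODES.items() if code == event)
    return {
        "key": key,
        "end": bool(flags & 0x80),
        "volume": flags & 0x3F,   # tone level, expressed as a positive number of -dBm0
        "duration": duration,
    }


# Key "5" held for 100 ms at an 8000 Hz clock = 800 timestamp units.
payload = pack_telephone_event("5", duration=800, volume=10, end=True)
print(payload.hex(), unpack_telephone_event(payload))
```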

Transmitting DTMF within the audio data (Inband) has several drawbacks: the overhead of generating / embedding tones and detecting them, restrictions imposed by some codecs that can distort DTMF tones, and poor reliability (if some packets are lost, the same key press may be detected twice).

The main difference between DTMF RFC2833 and SIP INFO: if the SIP proxy server is able to pass RTP directly between subscribers, bypassing the server itself (for example, canreinvite=yes in Asterisk), the server will not see the RFC2833 packets, so supplementary services become unavailable. SIP packets always pass through SIP proxy servers, so with SIP INFO supplementary services always work.

4. Voice and fax transmission

As already mentioned, the RTP protocol is used to transfer media data. In the RTP packets, the format of the transmitted data (codec) is always indicated.

There are many voice codecs with different trade-offs between bitrate, quality and complexity, both open and proprietary. Every softphone supports the G.711 alaw / ulaw codecs: they are very simple to implement and the sound quality is decent, but they require 64 kbps of bandwidth. The G.729 codec, by comparison, requires only 8 kbps, but it loads the processor heavily and, besides, is not free.
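Note that 64 kbps and 8 kbps are the codec bitrates only. On the network, every packet also carries RTP (12 bytes), UDP (8 bytes) and IP headers. The sketch below estimates the resulting bandwidth under assumptions of mine: 20 ms of audio per packet, IPv4 (20-byte header), and no link-layer overhead.

```python
def ip_bandwidth_kbps(codec_kbps, packet_ms=20, header_bytes=12 + 8 + 20):
    """Bandwidth of one RTP stream direction, including RTP/UDP/IPv4 headers."""
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000


print("G.711:", ip_bandwidth_kbps(64), "kbps")  # 160-byte payload per packet
print("G.729:", ip_bandwidth_kbps(8), "kbps")   # 20-byte payload per packet
```

Under these assumptions G.711 takes about 80 kbps and G.729 about 24 kbps per direction, so the real saving from a low-bitrate codec is smaller than the 8:1 ratio of the codec bitrates suggests.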

For fax transmission, either the G.711 codec or the T.38 protocol is usually used. Sending a fax over the G.711 codec is equivalent to sending it with the T.30 protocol as if over a normal telephone line, except that the analog signal from the line is digitized using A-law or μ-law companding. This is also called Inband T.30 fax transmission.

Fax machines use the T.30 protocol to negotiate their parameters: transmission speed, datagram size, type of error correction. The T.38 protocol is based on T.30, but unlike Inband transmission, the generated and received T.30 commands are analyzed. Thus it is not the raw data that is transmitted, but the recognized fax control commands.

T.38 commands are carried by UDPTL, a UDP-based protocol that is used only for T.38. TCP and RTP can also carry T.38 commands, but they are used much less often.

The main advantages of T.38 are a reduction in the load on the network and greater reliability compared to the Inband transmission of a fax.

The procedure for sending a fax in T.38 mode is as follows:

A normal voice connection is established using any codec.
When paper is loaded into the sending fax machine, it periodically sends the T.30 CNG (Calling Tone) signal, which means it is ready to send a fax.
On the receiving side, the T.30 CED (Called Terminal Identification) signal is generated, which indicates readiness to receive a fax. This signal is sent either after the “Receive Fax” button is pressed or automatically by the fax machine.
On the sending side, the CED signal is detected and the SIP REINVITE procedure takes place, with the SDP message specifying the T.38 type: m=image 39164 udptl t38.

T.38 is preferable for faxing over the Internet. If a fax needs to be sent within an office or between sites with a stable connection, Inband T.30 fax transmission can be used. In that case, echo cancellation must be disabled before the fax is transmitted, so that it does not introduce additional distortion.

Fax transmission is covered in great detail in the book "Fax, Modem, and Text for IP Telephony" by David Hanes and Gonzalo Salgueiro.

5. Digital Signal Processing (DSP). Ensuring the quality of sound in IP-telephony, examples of testing

We have covered the protocols for establishing a call session (SIP / SDP) and the way audio is transmitted over the RTP channel. One important question remains: sound quality. On the one hand, sound quality is determined by the selected codec. On the other hand, additional digital signal processing (DSP) procedures are needed. These procedures take into account the peculiarities of VoIP telephony: a good headset is not always used, packets get dropped on the Internet, packets sometimes arrive unevenly, and network bandwidth is not unlimited either.

Basic procedures that improve sound quality:

VAD (Voice activity detector) - a procedure for determining frames that contain a voice (active voice frame) or silence (inactive voice frame). This separation can significantly reduce network load, since the transfer of information about silence requires much less data (you just need to transmit the noise level or not to transmit anything at all).

Some codecs already contain VAD procedures (GSM, G.729), for others (G.711, G.722, G.726) they need to be implemented.

If the VAD is configured to transmit information about the noise level, special SID (Silence Insertion Descriptor) packets are sent using RTP payload type 13, CN (Comfort Noise).

It is worth noting that SID packets can be dropped by SIP proxy servers, so for testing it is advisable to route RTP traffic directly, bypassing the SIP servers.
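The simplest possible VAD can be sketched as a per-frame energy threshold. The Python example below is only an illustration of the idea: the -40 dB threshold, the 160-sample frame (20 ms at 8 kHz) and the function names are my assumptions, and practical VADs rely on more than signal energy (for example, adaptive noise estimates and hangover time).

```python
import math


def frame_energy_db(frame):
    """Mean power of a frame of 16-bit samples, in dB relative to full scale."""
    power = sum(x * x for x in frame) / len(frame)
    return 10 * math.log10(power / (32768.0 ** 2) + 1e-12)


def classify_frames(samples, frame_len=160, threshold_db=-40.0):
    """Label each complete frame True (active voice) or False (silence)."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags


# Test signal: one frame of faint noise, then one frame of a loud 400 Hz tone.
quiet = [30 if n % 2 else -30 for n in range(160)]
loud = [int(8000 * math.sin(2 * math.pi * 400 * n / 8000)) for n in range(160)]
print(classify_frames(quiet + loud))
```

Frames labeled False are the ones for which nothing, or only a SID packet, would need to be sent.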

CNG (comfort noise generation) - a procedure for generating comfort noise based on the information from SID packets. VAD and CNG thus work together, but the CNG procedure is much less in demand, since its work is not always noticeable, especially at low volume.

PLC (packet loss concealment) - the process of restoring the audio stream when packets are lost. Even with 50% packet loss, a good PLC algorithm can achieve acceptable speech quality. There will be distortion, of course, but the words can be made out.
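To illustrate the idea, here is about the simplest concealment strategy: replace a lost frame with the last received one, fading it out when several frames are lost in a row. This is a sketch under my own assumptions (a lost frame is represented by None, the gain is halved on each consecutive loss); it is not one of the good algorithms mentioned above, which typically extrapolate the signal using pitch information.

```python
def conceal_losses(frames, frame_len=160, attenuation=0.5):
    """Replace lost frames (None) with a faded copy of the last good frame."""
    output = []
    last_good = [0] * frame_len   # silence until the first frame arrives
    gain = 1.0
    for frame in frames:
        if frame is None:
            gain *= attenuation   # fade further on every consecutive loss
            output.append([int(x * gain) for x in last_good])
        else:
            gain = 1.0
            last_good = frame
            output.append(list(frame))
    return output


received = [[100, -100, 100, -100], None, None, [40, -40, 40, -40]]
print(conceal_losses(received, frame_len=4))
```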

The easiest way to emulate packet loss (in Linux) is to use the tc utility from the iproute package with the netem module. It shapes outgoing traffic only.

An example of starting network emulation with a loss of 50% of packets:

tc qdisc add dev eth1 root netem loss 50%

Disable emulation:

 tc qdisc del dev eth1 root

Jitter buffer - a procedure for eliminating the effect of jitter, where the interval between received packets varies greatly and, in the worst case, packets arrive out of order. This effect also causes interruptions in speech. To eliminate it, the receiving side needs a packet buffer large enough to restore the original order of the packets and play them out at the specified interval.

The jitter effect can also be emulated using the tc utility (the interval between the expected arrival time of the packet and the actual one can reach 500 ms):

 tc qdisc add dev eth1 root netem delay 500ms reorder 99%
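A jitter buffer of the kind described above can be sketched in Python with a heap ordered by RTP sequence number. The class below is a simplified illustration with assumptions of mine: a fixed prefill depth of 3 packets, no handling of 16-bit sequence number wraparound, and no adaptation of the depth to the measured jitter.

```python
import heapq


class JitterBuffer:
    """Reorders packets by sequence number and releases one per playout tick."""

    def __init__(self, depth=3):
        self.depth = depth      # packets to collect before playout starts
        self.heap = []          # (sequence number, payload), smallest first
        self.next_seq = None    # sequence number due at the next playout tick

    def push(self, seq, payload):
        """Store an arriving packet; drop it if its playout time has passed."""
        if self.next_seq is not None and seq < self.next_seq:
            return
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the payload due now, or None while prefilling or on a gap."""
        if self.next_seq is None:
            if len(self.heap) < self.depth:
                return None     # still prefilling
            self.next_seq = self.heap[0][0]
        seq = self.next_seq
        self.next_seq += 1
        while self.heap and self.heap[0][0] < seq:
            heapq.heappop(self.heap)    # discard stale duplicates
        if self.heap and self.heap[0][0] == seq:
            return heapq.heappop(self.heap)[1]
        return None             # packet missing: a job for the PLC


buffer = JitterBuffer(depth=3)
for seq in (1, 3, 2, 5, 4):     # packets arrive out of order
    buffer.push(seq, "frame-%d" % seq)
played = [buffer.pop() for _ in range(6)]
print(played)
```

The larger the depth, the more reordering and delay variation the buffer can absorb, at the cost of added end-to-end delay.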

LEC (Line Echo Canceller) - a procedure for eliminating local echo, where the remote subscriber starts to hear his own voice. Its essence is to subtract the received signal, scaled by a certain coefficient, from the transmitted signal.

Echo can occur for several reasons:

acoustic echo due to poor audio path (sound from the speaker gets into the microphone);
electrical echo due to an impedance mismatch between the telephone and the SLIC module. In most cases, it occurs in circuits that convert a 4-wire telephone line to a 2-wire one.

Finding out the cause (acoustic or electrical echo) is easy: the subscriber on whose side the echo is created should switch off the microphone. If the echo is still there, it is electrical.
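The "subtract the received signal with a certain coefficient" idea behind the LEC can be sketched with an adaptive filter that learns the coefficients on its own. The example below uses the NLMS (normalized least mean squares) algorithm, which is my choice for illustration and not necessarily what a particular echo canceller uses. The echo path is hypothetical (the received signal comes back 2 samples later at half amplitude), and there is no near-end speech, so double-talk handling is left out.

```python
import random


def nlms_echo_cancel(far_end, near_end, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the far-end echo from the near-end signal."""
    weights = [0.0] * taps
    history = [0.0] * taps    # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, near_end):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # what is actually sent to the remote side
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


random.seed(1)
far = [random.uniform(-1.0, 1.0) for _ in range(2000)]   # received signal
near = [0.0, 0.0] + [0.5 * x for x in far[:-2]]          # its echo in the transmit path
residual, weights = nlms_echo_cancel(far, near)
print([round(w, 3) for w in weights])
print(max(abs(e) for e in residual[-200:]))
```

After adaptation the third coefficient approaches 0.5 and the others approach 0, which matches the simulated echo path, and the residual echo becomes negligible.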

For more information about VoIP and DSP procedures, see the book "VoIP Voice and Fax Signal Processing". A preview is available on Google Books.

This completes the theoretical overview of VoIP. If there is interest, an example of a practical implementation of a mini-PBX on a real hardware platform may be covered in the next article.

Questions and comments are welcome. They will be answered by the author of the article, Dmitry Valento, a software engineer at the Promwad electronics design center.

Source: https://habr.com/ru/post/188336/
