Inside the protocol that browsers use to transmit voice and video

WebRTC, the technology behind voice and video calls in the browser (as well as real-time transfer of arbitrary data, peer-to-peer NAT traversal and screen capture), has never been easy. It has a long history, incompatibilities between browsers, confusing documentation, and a large number of problems to solve and protocols to use. The ability to make and receive calls from the browser has always been one of the key features of our Voximplant platform, and since we know this area well, we try to follow interesting articles and adapt them for the Habr audience. Below is a translation of a recent article from the team at callstats.io, a service that collects call quality statistics for the browser. In this short article they talk about RTP, the protocol the browser actually uses to transmit packets with voice or video.

WebRTC uses the Secure Real-time Transport Protocol (SRTP) to encrypt messages, authenticate them and verify their integrity, and to protect RTP data from replay attacks. It is a security layer that provides confidentiality by encrypting the RTP payload and authenticating the packets. These security features are key to the reliability of WebRTC, and underneath all of them is RTP (Real-time Transport Protocol). But what is RTP, and how does it work?

What is RTP?

RTP is a network protocol designed for multimedia communications (VoIP, video conferencing, telepresence), multimedia streaming (video on demand, live streaming) and broadcast media. The protocol was first defined by the Internet Engineering Task Force (IETF) in RFC 1889. RTP was originally created to support video conferences with geographically distributed participants, and its development was led by the IETF Audio/Video Transport working group. The current revision of the standard, RFC 3550, has been in use for 15 years!

RTP is built on the fundamental principles of application-level framing and integrated layer processing. It describes media types and packet payload formats, the mechanism for synchronizing media streams, what to do with lost and reordered packets, and how to monitor the state of the transmitted media.
Information about the quality of the media stream is carried by RTP's companion protocol, RTCP (RTP Control Protocol).

When using RTP, the sender packs the media stream into RTP packets and periodically sends an RTCP Sender Report so that the media streams can be synchronized with each other. The receiver maintains a jitter buffer to put the received packets in the correct order and to play the media stream according to the timing information in those packets. If a packet is lost, the receiver requests it again if possible, or conceals the problem by interpolating the audio or letting the video break up into colored blocks. Finally, the receiver reports coarse or detailed statistics in an RTCP Receiver Report. These statistics let the sender choose the bitrate, switch codecs and decide how much error correction to apply.
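The reordering job of the jitter buffer described above can be sketched roughly as follows. This is a toy illustration, not how any browser implements it: the `JitterBuffer` class, its `depth` parameter and the packet payloads are all made up for the example, and it ignores timing, packet loss and sequence number wraparound.

```python
import heapq
from typing import List, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload)


class JitterBuffer:
    """Toy reordering buffer: holds up to `depth` packets before release."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth
        self._heap: List[Packet] = []

    def push(self, seq: int, payload: bytes) -> List[Packet]:
        """Queue a packet; return the packets that are now ready to play."""
        heapq.heappush(self._heap, (seq, payload))
        ready = []
        while len(self._heap) > self.depth:
            ready.append(heapq.heappop(self._heap))
        return ready

    def flush(self) -> List[Packet]:
        """Drain whatever is left, lowest sequence number first."""
        return [heapq.heappop(self._heap) for _ in range(len(self._heap))]


buf = JitterBuffer(depth=2)
played: List[Packet] = []
for seq in (1, 3, 2, 4, 5):  # packets 2 and 3 arrive swapped
    played += buf.push(seq, b"frame-%d" % seq)
played += buf.flush()
order = [seq for seq, _ in played]
```

A deeper buffer tolerates more reordering but adds playback delay, which is the trade-off a real jitter buffer keeps adjusting.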

RTP packet header format

The key fields of the RTP packet header are the synchronization source, the timestamp, the sequence number and the payload type.

1. Synchronization source (SSRC). Identifies where the media stream comes from. It is especially useful when a source sends several media streams that need to be synchronized.

2. RTP timestamp. Lets the receiver assemble media frames from RTP packets and play the media stream at the right time.

3. RTP sequence number. Exactly what it sounds like: it is used to detect lost packets and to put the ones that did arrive in order. It is UDP, after all.

4. Payload type. Identifies the encoding of the media data in the packet, that is, the codec.
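To make the four fields concrete, here is a minimal sketch of reading them from the 12-byte fixed RTP header, using the field layout from RFC 3550. The `RtpHeader` type, the `parse_rtp_header` function and the sample packet are names and values invented for this example; CSRC lists, padding and header extensions are deliberately ignored.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the 12-byte fixed header")
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,            # top 2 bits of the first byte
        marker=bool(second & 0x80),    # top bit of the second byte
        payload_type=second & 0x7F,    # remaining 7 bits
        sequence_number=seq,
        timestamp=timestamp,
        ssrc=ssrc,
    )


# Hypothetical packet: version 2, payload type 96, sequence number 1000.
sample = struct.pack("!BBHII", 0x80, 96, 1000, 160000, 0x1234ABCD) + b"payload"
header = parse_rtp_header(sample)
```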

RTCP reports

The specification defines three kinds of RTCP reports: Sender Reports from the sender, Receiver Reports from the receiver, and Extended Reports, which any participant can send.

RTCP Sender Reports

Sent by the sender to synchronize media streams. The timestamps of all outgoing streams are mapped to the clock of the sending machine, so the receiver knows how the streams should be played back relative to each other. The same report contains the number of packets and bytes sent.
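A Sender Report pairs an RTP timestamp with the sender's wall-clock time, and that pair is what makes cross-stream synchronization possible. The sketch below shows the arithmetic under simplifying assumptions: the wall-clock time is already converted to seconds as a float, and `rtp_to_wallclock` and all sample values are hypothetical.

```python
def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_wallclock: float,
                     clock_rate: int) -> float:
    """Map an RTP timestamp to sender wall-clock seconds using the
    (RTP timestamp, wall-clock time) pair from the last Sender Report."""
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF   # RTP timestamps are 32-bit
    if delta >= 0x80000000:                     # packet is older than the report
        delta -= 0x100000000
    return sr_wallclock + delta / clock_rate


# Hypothetical values: 48 kHz audio and 90 kHz video, both reported at t = 1000.0 s.
audio_time = rtp_to_wallclock(144_000, 96_000, 1000.0, 48_000)
video_time = rtp_to_wallclock(90_000, 0, 1000.0, 90_000)
```

Both packets map to the same moment on the sender's clock, so the receiver should play them together even though their RTP timestamps use different clock rates and offsets.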

RTCP Receiver Reports

The receiver inspects the incoming streams and reports what it sees in RTCP Receiver Report packets. The report contains the current packet loss, the jitter (the variation in packet arrival times, which the jitter buffer has to absorb while it waits for late packets and reorders shuffled ones) and the highest sequence number received. Some of this data is used to calculate the round-trip time.
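Two of the numbers mentioned above can be sketched in a few lines. The jitter update follows the interarrival jitter estimator from RFC 3550, and the round-trip time uses the "last SR" (LSR) and "delay since last SR" (DLSR) fields of the Receiver Report. The function names and the sample values are chosen for this example only.

```python
def update_jitter(jitter: float, prev_transit: int, transit: int) -> float:
    """One step of the interarrival jitter estimator from RFC 3550.

    transit = arrival time minus RTP timestamp, both in RTP clock units.
    """
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0


def round_trip_seconds(arrival: int, lsr: int, dlsr: int) -> float:
    """RTT as seen by the sender; all inputs are in units of 1/65536 s."""
    return ((arrival - lsr - dlsr) & 0xFFFFFFFF) / 65536.0


jitter = update_jitter(0.0, prev_transit=500, transit=516)
rtt = round_trip_seconds(arrival=0xB7108000, lsr=0xB7052000, dlsr=0x00054000)
```

The division by 16 smooths the estimate so that a single late packet does not swing the reported jitter too far.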

RTCP Extended Reports

Used by both senders and receivers to exchange more detailed metrics about what is happening between them. These metrics include the performance of the endpoints themselves, network state, jitter buffer state, packet delay variation, delay information, the number of unprocessed packets, QoS and more. You can also add your own metrics to this packet so that both sides can track application-specific parameters.

What are the payload formats for RTP?

The payload format is determined by what the specification calls the encoding, and there are three options. It can be a codec, such as H.264, H.263, H.261, MPEG-2, JPEG, G.711, G.722 or AMR. It can be a generic payload format, such as Forward Error Correction (FEC), NACK and the like. And finally, it can be multiplexed media (several media streams inside one).

The specification strictly defines the format for each codec and has two sets of rules: aggregation and fragmentation. Aggregation rules describe how RTP works with codecs that produce packets smaller than the MTU, such as audio codecs. Fragmentation rules, on the contrary, cover codecs that produce large frames, for example video I-frames. RTP defines its own fragmentation because IP fragmentation usually does not work for UDP: NATs and firewalls simply drop such packets silently.
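The two ideas can be illustrated with a generic sketch. Real payload formats (for H.264, for instance) define their own fragmentation and aggregation headers, so the `fragment` and `aggregate` helpers, the 1200-byte limit and the 2-byte length prefix below are assumptions made only for illustration.

```python
import struct
from typing import List, Tuple


def fragment(frame: bytes, max_payload: int = 1200) -> List[Tuple[bytes, bool]]:
    """Split one large frame into (chunk, is_last) pairs that fit the MTU."""
    chunks = [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]
    return [(chunk, i == len(chunks) - 1) for i, chunk in enumerate(chunks)]


def aggregate(frames: List[bytes]) -> bytes:
    """Pack several small frames into one payload, each with a 2-byte length."""
    return b"".join(struct.pack("!H", len(f)) + f for f in frames)


pieces = fragment(b"\x00" * 3000)      # e.g. a large video I-frame
bundle = aggregate([b"abc", b"de"])    # e.g. two short audio frames
```

The `is_last` flag plays the role that the marker bit often plays for video: it tells the receiver that the frame is complete and can be handed to the decoder.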

What are Header Extensions used for in RTP?

Header extensions carry information that is independent of the media payload. This is usually information that has to be delivered in real time, that is, more often than RTCP reports are sent.

For example, in interactive media streams (such as a video chat) RTP packets are sent every few tens of milliseconds. An RTP header extension can be used to signal lost and received packets, which lets the other side react faster than it could with NACK/ACK in periodic RTCP reports.

Header extensions are backward compatible: if a participant does not understand the format, it simply ignores that part of the packet header. The specification describes header extensions as a generic mechanism, so they do not have to be defined separately for each codec.

They are often used to carry network state and application-specific data, such as audio levels for several channels in a conference.
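As a rough illustration, the sketch below parses the one-byte header extension format from RFC 8285 and reads an audio level laid out as in RFC 6464 (one voice-activity bit plus a 7-bit level). The function name, the extension ID 1 and the sample bytes are assumptions for this example; in a real session the ID is negotiated in SDP. Returning an empty result for an unknown profile mirrors the "ignore what you do not understand" rule.

```python
import struct
from typing import Dict


def parse_one_byte_extensions(ext: bytes) -> Dict[int, bytes]:
    """Parse an RTP header extension block in the RFC 8285 one-byte format."""
    if len(ext) < 4:
        return {}
    profile, words = struct.unpack("!HH", ext[:4])
    if profile != 0xBEDE:          # not the one-byte format: ignore the block
        return {}
    body = ext[4:4 + 4 * words]    # length is given in 32-bit words
    found: Dict[int, bytes] = {}
    i = 0
    while i < len(body):
        if body[i] == 0:           # padding byte
            i += 1
            continue
        ext_id = body[i] >> 4
        length = (body[i] & 0x0F) + 1
        if ext_id == 15:           # reserved ID: stop parsing
            break
        found[ext_id] = body[i + 1:i + 1 + length]
        i += 1 + length
    return found


# Hypothetical block: extension ID 1 carrying one audio-level byte, then padding.
sample = struct.pack("!HH", 0xBEDE, 1) + bytes([0x10, 0x9E, 0x00, 0x00])
extensions = parse_one_byte_extensions(sample)
voice_active = bool(extensions[1][0] >> 7)
level = extensions[1][0] & 0x7F    # audio level, expressed as -dBov
```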

What is the “reporting interval” for RTCP?

Using RTP looks like a closed loop: we send RTP packets and receive RTCP packets with feedback, almost like TCP with its ACKs. The reporting interval is usually chosen so that the volume of RTCP traffic is much smaller than the volume of media traffic. The choice depends on the number of streams to be synchronized and the available bandwidth.

In theory, the bandwidth should be divided evenly among the participants of an audio or video conference. In practice, applications estimate it from the expected number of simultaneously active participants. For an audio conference this is usually one participant: if several people start talking at once, nobody will understand anything. Video conferences are more complicated, since showing video from several participants at once is a popular scenario. In such situations the reporting interval is calculated individually for each participant.

5% of the session bandwidth is allocated to RTCP packets.

For scenarios with many receivers and few senders (a webinar, a voice conference), a quarter of the RTCP bandwidth is shared evenly among the senders and the remaining three quarters among the receivers. This split lets newly connected devices quickly get the CNAME and the timestamps needed for synchronization. And so that newly connected devices can quickly announce themselves, their initial RTCP interval is half that of the other participants.

The recommended minimum interval between RTCP packets is 5 seconds.

This value can be reduced to 360 divided by the session bandwidth in kbit/s (the result is in seconds) when data flows in both directions and extra information has to be delivered quickly to keep packet loss under control.
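The numbers above turn into simple arithmetic. The following sketch only restates the rules from this section; the function names and the sample bandwidths are made up, and the full interval algorithm in RFC 3550 (randomization, membership counting) is not reproduced here.

```python
from typing import Dict


def rtcp_bandwidth_shares(session_kbps: float) -> Dict[str, float]:
    """Split the session bandwidth: 5% for RTCP, of which 1/4 goes to senders."""
    rtcp = session_kbps / 20          # 5% of the session bandwidth
    return {
        "rtcp": rtcp,
        "senders": rtcp / 4,          # shared evenly among all senders
        "receivers": rtcp * 3 / 4,    # shared evenly among all receivers
    }


def reduced_min_interval(session_kbps: float) -> float:
    """Reduced minimum RTCP interval in seconds: 360 / bandwidth in kbit/s."""
    return 360.0 / session_kbps


shares = rtcp_bandwidth_shares(1000)       # a hypothetical 1 Mbit/s session
interval = reduced_min_interval(720)       # a hypothetical 720 kbit/s session
break_even = reduced_min_interval(72)
```

The reduced value only drops below the default 5 seconds when the session bandwidth is above 72 kbit/s, which is the `break_even` case here.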

Extended RTP Profile for RTCP Feedback

If a client notices packet loss or network problems, it cannot send an RTCP packet immediately and has to wait until the end of the interval, which, mind you, is 5 seconds. To address this, there is the Extended RTP Profile for RTCP-Based Feedback, which relaxes the RTCP timing rules.

If both devices support this profile, they can agree to send RTCP packets more often than the base specification allows, as long as the average RTCP rate stays within the specified limit. The same extension defines several additional messages that clients can use to describe what is happening with the media: Negative Acknowledgment (NACK), Picture Loss Indication (PLI), Slice Loss Indication (SLI) and Reference Picture Selection Indication (RPSI).
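As an example of such a message, a generic NACK in RFC 4585 reports losses as pairs of a packet ID (PID) and a 16-bit bitmask (BLP) that covers the 16 sequence numbers following the PID. The sketch below builds those pairs from a list of lost sequence numbers; `build_nack_entries` is a hypothetical helper, and sequence number wraparound is ignored to keep it short.

```python
from typing import Iterable, List, Tuple


def build_nack_entries(lost: Iterable[int]) -> List[Tuple[int, int]]:
    """Group lost sequence numbers into (PID, BLP) pairs for a generic NACK."""
    entries: List[Tuple[int, int]] = []
    for seq in sorted(set(lost)):
        if entries and 0 < seq - entries[-1][0] <= 16:
            pid, blp = entries[-1]
            # Bit 0 of BLP stands for PID + 1, bit 15 for PID + 16.
            entries[-1] = (pid, blp | (1 << (seq - pid - 1)))
        else:
            entries.append((seq, 0))
    return entries


# Hypothetical loss pattern: three packets close together and one far away.
nack = build_nack_entries([100, 101, 103, 130])
```

Packing nearby losses into one bitmask keeps the feedback small, which matters because RTCP still has to fit in its share of the bandwidth.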

Source: https://habr.com/ru/post/354502/
