Audio and video in the Tox messenger

While some people are thinking about regulating instant messengers, others are developing distributed ones. A previous publication covered the Tox core API, using a simple echo bot as an example. Tox development has not stood still: on November 3rd the Tox core gained a new audio and video call subsystem, ToxAV, which is the subject of this publication.

Introduction

ToxAV is based on the Opus codec for audio and the VP8 codec for video, as implemented in the libopus and libvpx libraries, respectively.
As the input and output format for audio, the ToxAV API uses PCM with 16-bit samples, 1 or 2 channels, and a sampling rate of 8, 12, 16, 24 or 48 kHz. The duration of one data frame must be 2.5, 5, 10, 20, 40 or 60 ms (a requirement of the Opus codec).
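The relationship between sampling rate, frame duration and the number of samples per frame can be sketched as follows. This is only an illustration of the constraints listed above; the function and constant names are my own and are not part of ToxAV or pytoxcore.

```python
# Values taken from the paragraph above; the names are illustrative only.
VALID_RATES_HZ = (8000, 12000, 16000, 24000, 48000)
VALID_FRAME_MS = (2.5, 5, 10, 20, 40, 60)


def samples_per_frame(sampling_rate, frame_ms):
    """Return the number of samples per channel in one audio frame."""
    if sampling_rate not in VALID_RATES_HZ:
        raise ValueError("unsupported sampling rate: %r" % sampling_rate)
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError("unsupported frame duration: %r" % frame_ms)
    # Every valid combination divides evenly, so the result is a whole number.
    return int(sampling_rate * frame_ms / 1000)


print(samples_per_frame(48000, 20))   # 960
print(samples_per_frame(8000, 2.5))   # 20
```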

As an input and output format for video, the ToxAV API uses frames in YUV420 format (aka IYUV and I420), which can be relatively easily converted into the more familiar RGB color space.
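To get a feel for the format, here is a small sketch of how large the three planes of a YUV420 frame are. It assumes even frame dimensions and tightly packed planes (no row padding); the helper name is hypothetical.

```python
def i420_plane_sizes(width, height):
    """Return (y_size, u_size, v_size) in bytes for a tightly packed I420 frame."""
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even frame dimensions")
    y_size = width * height                      # one luma byte per pixel
    chroma_size = (width // 2) * (height // 2)   # one chroma byte per 2x2 block
    return y_size, chroma_size, chroma_size


y_size, u_size, v_size = i420_plane_sizes(640, 480)
print(y_size, u_size, v_size, y_size + u_size + v_size)
# 307200 76800 76800 460800
```

For comparison, the same 640x480 frame in 24-bit RGB takes 921600 bytes, twice as much.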

ToxAV does not provide any means of capturing data from audio and video devices or of playing it back; all of this is left to the application developer. For example, the µTox client uses OpenAL for audio and V4L (Video4Linux) for video devices, and the qTox client uses FFmpeg for video. In Python (which is used for the examples below) we can use PyAudio for audio and OpenCV for video.

The general cycle of working with ToxAV can be presented sequentially in the form:

1. Initializing ToxCore (described in detail in a previous publication).
2. Initializing the ToxAV subsystem (toxav_new).
3. Setting callback functions for event handling (toxav_callback_*) - the event handlers are called from the main loop (4) and usually contain the main logic of the application.
4. The main loop (toxav_iterate) and event handling.
5. Pausing for toxav_iteration_interval and returning to the previous step.

Since the main way to learn how the Tox API works is to read the source code (written in C), I will use the Python wrapper (pytoxcore on GitHub) to keep the rest of the text simple. For those who do not want to build the library from source, there are also links to ready-made binary packages for common distributions.

When using the Python wrapper, you can get help on the library as follows:

    $ python
    >>> from pytoxcore import ToxAV
    >>> help(ToxAV)
    class ToxAV(object)
     |  ToxAV object
     ...
     |  toxav_answer(...)
     |      toxav_answer(friend_number, audio_bit_rate, video_bit_rate)
     |      Accept an incoming call.
     |      If answering fails for any reason, the call will still be pending and it is possible to try and answer it later. Audio and video receiving are both enabled by default.
     |
     |  toxav_audio_receive_frame_cb(...)
     |      toxav_audio_receive_frame_cb(friend_number, pcm, sample_count, channels, sampling_rate)
     |      This event is triggered when a audio data received.
     ...

Below, each step of working with the API is covered in a little more detail.

ToxAV Initialization

To initialize the ToxAV subsystem, a previously initialized ToxCore instance is passed as a parameter to the toxav_new call. Only one ToxAV instance can be created for a single ToxCore instance. In the Python wrapper, the toxav_new call is hidden inside the constructor, and initialization looks like this:

    from pytoxcore import ToxCore, ToxAV

    class EchoBot(ToxCore):
        ...

    class EchoAVBot(ToxAV):
        def __init__(self, core):
            super(EchoAVBot, self).__init__(core)
            ...

    bot = EchoBot(options)
    botav = EchoAVBot(bot)

To destroy a ToxAV instance, use the toxav_kill call or, in the Python wrapper, destroy the class instance: the toxav_kill call is hidden in the destructor.

Setting callback functions

In the Python wrapper, the supported callback functions are connected automatically. The handlers themselves are methods of a ToxAV subclass and have the _cb suffix. In all handlers, one of the parameters is friend_number - the integer friend identifier, the same one that is used in the corresponding ToxCore methods.

toxav_call_cb (friend_number, audio_enabled, video_enabled) - incoming call. The additional arguments are flags indicating whether the contact has audio and video enabled. A contact that started with audio or video disabled can turn them on later, which triggers a call state change event. The call can be accepted by calling toxav_answer or rejected by calling toxav_call_control.

toxav_call_state_cb (friend_number, state) - call state change. The state argument is a bit mask of the following constants:

TOXAV_FRIEND_CALL_STATE_ERROR - timeout for transferring data to a contact. This state is the last state for the call (at this point the call is completed) and cannot be combined with other states.
TOXAV_FRIEND_CALL_STATE_FINISHED - normal call termination. This state is the last state for the call (at this point the call is completed) and cannot be combined with other states.
TOXAV_FRIEND_CALL_STATE_SENDING_A - the contact has started the audio stream.
TOXAV_FRIEND_CALL_STATE_SENDING_V - the contact has started the transmission of the video stream.
TOXAV_FRIEND_CALL_STATE_ACCEPTING_A - the contact has started receiving an audio stream.
TOXAV_FRIEND_CALL_STATE_ACCEPTING_V - the contact has started receiving a video stream.
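A handler for this event usually needs to unpack the bit mask. The sketch below shows the idea. Note that the numeric bit values here are placeholders chosen for the illustration (I believe they mirror the C enum, but have not verified this against the wrapper); real code should use the TOXAV_FRIEND_CALL_STATE_* constants exported by the library. The function names are my own.

```python
# Placeholder bit values for illustration only; in real code take them from
# the TOXAV_FRIEND_CALL_STATE_* constants provided by the library.
STATE_FLAGS = [
    ("ERROR", 1),
    ("FINISHED", 2),
    ("SENDING_A", 4),
    ("SENDING_V", 8),
    ("ACCEPTING_A", 16),
    ("ACCEPTING_V", 32),
]


def describe_call_state(state, flags=STATE_FLAGS):
    """Return the names of all flags set in the state bit mask."""
    return [name for name, bit in flags if state & bit]


def call_is_over(state, error_bit=1, finished_bit=2):
    """ERROR and FINISHED are terminal states: the call no longer exists."""
    return bool(state & (error_bit | finished_bit))


print(describe_call_state(4 | 16))  # ['SENDING_A', 'ACCEPTING_A']
print(call_is_over(2))              # True
```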

toxav_bit_rate_status_cb (friend_number, audio_bit_rate, video_bit_rate) - network congestion event, raised when the core cannot send data at the requested bitrate. The parameters are the new bitrates that the core suggests for the audio and video streams. The core first tries to lower the video bitrate, since video takes up most of the bandwidth, and only after the video stream has been turned off does it try to lower the audio bitrate. The application can ignore these recommendations or call toxav_bit_rate_set (friend_number, audio_bit_rate, video_bit_rate) to set new bitrate values. Bitrates are specified in Kb/s (kilobits per second); a value of 0 disables the corresponding stream, and a value of -1 leaves the previous bitrate unchanged.

For the Opus audio codec, the minimum bitrate is 6 Kb/s, and to my ear values above 16-32 Kb/s make no audible difference for sound captured from a webcam.

For the VP8 video codec, I could not find recommended bitrate values as a function of frame size and FPS. In general, the following baseline bitrates are recommended for different video resolutions:

frame size, px	bitrate, Kb/s
320x240	400
480x270	700
1024x576	1500
1280x720	2500
1920x1080	4000
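For frame sizes that are not in the table, one simple approach is to pick the row with the nearest larger frame area. This is just a sketch of one possible policy, not something ToxAV prescribes; the function name is hypothetical.

```python
# (width, height, bitrate in Kb/s), copied from the table above.
VIDEO_BIT_RATES = [
    (320, 240, 400),
    (480, 270, 700),
    (1024, 576, 1500),
    (1280, 720, 2500),
    (1920, 1080, 4000),
]


def suggest_video_bit_rate(width, height):
    """Return the bitrate of the smallest table entry that covers the frame area."""
    area = width * height
    for table_width, table_height, bit_rate in VIDEO_BIT_RATES:
        if area <= table_width * table_height:
            return bit_rate
    # Larger than anything in the table: fall back to the last row.
    return VIDEO_BIT_RATES[-1][2]


print(suggest_video_bit_rate(640, 480))    # 1500
print(suggest_video_bit_rate(1280, 720))   # 2500
```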

toxav_audio_receive_frame_cb (friend_number, pcm, sample_count, channels, sampling_rate) - receiving a frame of audio data. The parameters are the data buffer, the number of samples, the number of channels and the sampling rate (Hz). For stereo sound, the 16-bit samples for the left and right channels are interleaved. The buffer size in bytes is sample_count * channels * 2, and the buffer duration in milliseconds is sample_count * 1000 / sampling_rate.
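The interleaved layout can be illustrated with the standard array module. The sketch assumes that the pcm buffer arrives as a bytes object with samples in the host byte order; the helper name is my own.

```python
from array import array


def split_stereo(pcm):
    """Split interleaved 16-bit stereo PCM into left and right sample arrays."""
    samples = array("h")        # signed 16-bit integers
    samples.frombytes(pcm)
    # Even positions hold the left channel, odd positions the right channel.
    return samples[0::2], samples[1::2]


interleaved = array("h", [100, -100, 200, -200, 300, -300])
left, right = split_stereo(interleaved.tobytes())
print(list(left))    # [100, 200, 300]
print(list(right))   # [-100, -200, -300]
```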

Since PCM is ordinary pulse-code modulation, the buffer data can be passed to any DAC with little or no conversion, or saved to a WAV file by adding the appropriate header describing the format. Be prepared, however, for the caller to change any parameter of the audio stream during the conversation.
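Saving received audio to a WAV file does not even require writing the header by hand: the standard wave module does it. A minimal sketch, assuming a little-endian host (WAV stores samples little-endian) and stream parameters that stay constant for the lifetime of the file; write_wav is a hypothetical helper.

```python
import io
import wave


def write_wav(target, pcm, channels, sampling_rate):
    """Write 16-bit PCM to a WAV file (a path or a seekable binary file object)."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)              # 16-bit samples
        wav.setframerate(sampling_rate)
        wav.writeframes(pcm)


# 20 ms of silence, mono, 8 kHz: 160 samples of 2 bytes each.
buf = io.BytesIO()
write_wav(buf, b"\x00\x00" * 160, channels=1, sampling_rate=8000)

buf.seek(0)
with wave.open(buf, "rb") as wav:
    print(wav.getnchannels(), wav.getframerate(), wav.getnframes())
    # 1 8000 160
```

If the caller changes the number of channels or the sampling rate mid-call, the simplest option is to start a new file.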

toxav_video_receive_frame_cb (friend_number, width, height, y, u, v, ystride, ustride, vstride) - receiving a frame of video data in YUV420 format. This is the original ToxAV callback. An example of converting YUV420 to BGR can be found in the µTox source code (yuv420tobgr).
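For reference, here is a deliberately slow pure-Python sketch of such a conversion (to RGB rather than BGR). It assumes the common BT.601 limited-range coefficients and positive strides; I have not checked that these match µTox's yuv420tobgr exactly, so treat it as an illustration of the plane layout and strides rather than a drop-in replacement.

```python
def _clamp(value):
    """Round and clamp a float to the 0..255 byte range."""
    return max(0, min(255, int(round(value))))


def yuv420_to_rgb(width, height, y, u, v, ystride, ustride, vstride):
    """Convert YUV420 planes to packed RGB24 (BT.601 limited range assumed)."""
    rgb = bytearray(width * height * 3)
    for row in range(height):
        for col in range(width):
            # A stride is the length of one plane row in bytes, padding included.
            luma = y[row * ystride + col] - 16
            # Each chroma sample is shared by a 2x2 block of pixels.
            cb = u[(row // 2) * ustride + col // 2] - 128
            cr = v[(row // 2) * vstride + col // 2] - 128
            i = (row * width + col) * 3
            rgb[i] = _clamp(1.164 * luma + 1.596 * cr)
            rgb[i + 1] = _clamp(1.164 * luma - 0.392 * cb - 0.813 * cr)
            rgb[i + 2] = _clamp(1.164 * luma + 2.017 * cb)
    return bytes(rgb)


# A 2x2 white frame: Y=235, U=V=128.
white = yuv420_to_rgb(2, 2, bytes([235] * 4), bytes([128]), bytes([128]), 2, 1, 1)
print(white == b"\xff" * 12)   # True
```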

toxav_video_receive_frame_cb (friend_number, width, height, rgb) - receiving a frame in RGB or BGR format. This is an additional callback of the Python wrapper that is not present in ToxAV. It exists so that the conversion is done quickly inside the library rather than in Python (in fact, yuv420tobgr from the µTox code is used). The RGB or BGR format is selected by calling toxav_video_frame_format_set with one of the following parameters:

TOXAV_VIDEO_FRAME_FORMAT_BGR - BGR format (used in OpenCV).
TOXAV_VIDEO_FRAME_FORMAT_RGB - RGB format.
TOXAV_VIDEO_FRAME_FORMAT_YUV420 - YUV420 format (the original ToxAV callback is used).

Hereinafter, RGB / BGR means a 24-bit format without an alpha channel, in which the red, green and blue components take 8 bits each. This format is sometimes referred to as RGB24 / BGR24, depending on the order of the red and blue components.
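Since the two formats differ only in the order of the first and third byte of each pixel, converting between them is a matter of swapping those bytes. A small sketch (the helper name is my own):

```python
def swap_red_blue(pixels):
    """Convert packed RGB24 to BGR24 or back by swapping bytes 0 and 2 of each pixel."""
    if len(pixels) % 3:
        raise ValueError("buffer length must be a multiple of 3")
    out = bytearray(pixels)
    out[0::3] = pixels[2::3]
    out[2::3] = pixels[0::3]
    return bytes(out)


rgb = b"\x01\x02\x03\x04\x05\x06"        # two pixels
print(swap_red_blue(rgb))                # b'\x03\x02\x01\x06\x05\x04'
print(swap_red_blue(swap_red_blue(rgb)) == rgb)   # True
```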

As with the audio stream, the format of the video stream can be changed at any time by the caller (for example, the frame size will change).

Event handling

To handle ToxAV events, the toxav_iterate method is called periodically (similar to the core's tox_iterate method), with the recommended interval between calls equal to the value returned by toxav_iteration_interval (by analogy with tox_iteration_interval).

It is recommended to run the ToxAV loop in a separate thread: when there are no audio / video calls, toxav_iteration_interval returns 200 ms, so the thread will sleep most of the time:

    import threading
    import time

    class EchoAVBot(ToxAV):
        def __init__(self, core):
            ...
            self.running = True
            self.iterate_thread = threading.Thread(target=self.iterate_cb)
            self.iterate_thread.start()

        def iterate_cb(self):
            # Serve ToxAV events until stop() is called.
            while self.running:
                self.toxav_iterate()
                interval = self.toxav_iteration_interval()  # milliseconds
                time.sleep(float(interval) / 1000.0)

        def stop(self):
            self.running = False
            self.iterate_thread.join()
            self.toxav_kill()

Since Python has a GIL (Global Interpreter Lock), we mostly do not have to worry about synchronization between threads, but in other languages there can be unpleasant surprises, because toxav_*_cb events can be triggered both from the thread that serves tox_iterate and from the thread that serves toxav_iterate. One such event is the call termination event toxav_call_state_cb. Use the appropriate synchronization primitives, and keep in mind the possibility of a deadlock inside the Tox core.
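One way to sidestep the question of which thread a callback runs on is to have the callbacks do nothing but record the event, and to process all events in a single thread. Below is a minimal sketch of this pattern using the standard queue module (Python 3 naming); the EventQueue class is my own illustration and is not part of pytoxcore.

```python
import queue
import threading


class EventQueue(object):
    """Thread-safe mailbox: callbacks post events, one thread drains them."""

    def __init__(self):
        self._events = queue.Queue()

    def post(self, name, *args):
        # Safe to call from any thread, e.g. from inside a *_cb handler.
        self._events.put((name, args))

    def drain(self):
        """Return all pending events without blocking."""
        events = []
        while True:
            try:
                events.append(self._events.get_nowait())
            except queue.Empty:
                return events


events = EventQueue()
worker = threading.Thread(target=events.post, args=("call_state", 0, 2))
worker.start()
worker.join()
print(events.drain())   # [('call_state', (0, 2))]
```

Because the handlers never call back into the library while it is iterating, this arrangement also avoids taking application locks inside a callback.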

Call management

To make an outgoing call, the toxav_call (friend_number, audio_bit_rate, video_bit_rate) method is used, where the arguments are the contact's identifier from the contact list and the outgoing audio and video bitrates (Kb/s). A bitrate of 0 disables sending of the corresponding stream, which can be enabled later by calling toxav_bit_rate_set.

ToxAV has no built-in ring timeout of the kind we are used to: an unanswered outgoing call lasts until the application itself cancels it on its own internal timeout (the application "hangs up"), or until the contact goes offline (the network connection is lost), in which case the toxav_call_state_cb event fires with the TOXAV_FRIEND_CALL_STATE_FINISHED parameter. By default, receiving audio and video data from the contact is allowed; this can be changed by calling toxav_call_control.
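Such an application-level ring timeout can be kept in a small helper that remembers when each outgoing call was placed. This is a sketch under my own assumptions (the class, its method names and the 30-second default are hypothetical); the caller supplies the current time, for example from time.monotonic().

```python
class RingTimeouts(object):
    """Track unanswered outgoing calls and report those that rang too long."""

    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s
        self._deadlines = {}   # friend_number -> time at which to give up

    def started(self, friend_number, now):
        """Call right after placing an outgoing call."""
        self._deadlines[friend_number] = now + self.timeout_s

    def answered_or_finished(self, friend_number):
        """Call from the call state handler once the call is answered or over."""
        self._deadlines.pop(friend_number, None)

    def expired(self, now):
        """Return (and forget) the friends whose calls should be cancelled."""
        overdue = sorted(n for n, deadline in self._deadlines.items()
                         if now >= deadline)
        for friend_number in overdue:
            del self._deadlines[friend_number]
        return overdue


timeouts = RingTimeouts(timeout_s=30.0)
timeouts.started(friend_number=5, now=100.0)
print(timeouts.expired(now=110.0))   # []
print(timeouts.expired(now=130.0))   # [5]
```

In the main loop, each friend number returned by expired() would then be passed to toxav_call_control with TOXAV_CALL_CONTROL_CANCEL.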

An incoming call is determined by the toxav_call_cb event, after which the call can be either completely ignored or processed by answering the call or changing its state through toxav_call_control .

To answer a call, the toxav_answer (friend_number, audio_bit_rate, video_bit_rate) call is used, where the arguments are the identifier of the contact making the incoming call and the outgoing audio and video bitrates (Kb/s). A bitrate of 0 disables sending of the corresponding stream, which can be enabled later by calling toxav_bit_rate_set.

To control the state of a call, the toxav_call_control (friend_number, control) call is used, where the following constants can be used as a control parameter:

TOXAV_CALL_CONTROL_RESUME - resume a call that was previously paused. Cannot be used until the call is answered.
TOXAV_CALL_CONTROL_PAUSE - put a call on pause (hold). Cannot be used until the call is answered.
TOXAV_CALL_CONTROL_CANCEL - reject an incoming call or end an active call.
TOXAV_CALL_CONTROL_MUTE_AUDIO - ask the other party to stop transmitting audio data (the other party may ignore this request).
TOXAV_CALL_CONTROL_UNMUTE_AUDIO - ask the other party to resume transmitting audio data.
TOXAV_CALL_CONTROL_HIDE_VIDEO - ask the other party to stop transmitting video data (the other party may ignore this request).
TOXAV_CALL_CONTROL_SHOW_VIDEO - ask the other party to resume transmitting video data.
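Putting toxav_answer and toxav_call_control together, the decision made on an incoming call can be sketched as a small function. To keep the example runnable without the library, it is exercised here against a stand-in object; FakeAV, handle_incoming_call, the default bitrates and the "CANCEL" placeholder are all my own illustration. In a real bot the av argument would be the ToxAV instance and cancel_control the TOXAV_CALL_CONTROL_CANCEL constant.

```python
def handle_incoming_call(av, friend_number, accept, cancel_control,
                         audio_bit_rate=32, video_bit_rate=0):
    """Answer the call with the given outgoing bitrates (Kb/s) or reject it."""
    if accept:
        # A video bitrate of 0 means we do not send video for now.
        av.toxav_answer(friend_number, audio_bit_rate, video_bit_rate)
    else:
        av.toxav_call_control(friend_number, cancel_control)


class FakeAV(object):
    """Stand-in that records calls so the sketch runs without pytoxcore."""

    def __init__(self):
        self.calls = []

    def toxav_answer(self, friend_number, audio_bit_rate, video_bit_rate):
        self.calls.append(("answer", friend_number, audio_bit_rate, video_bit_rate))

    def toxav_call_control(self, friend_number, control):
        self.calls.append(("control", friend_number, control))


av = FakeAV()
handle_incoming_call(av, 3, accept=True, cancel_control="CANCEL")
handle_incoming_call(av, 4, accept=False, cancel_control="CANCEL")
print(av.calls)
# [('answer', 3, 32, 0), ('control', 4, 'CANCEL')]
```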

Transmission of audio and video streams

The audio stream is transmitted by calling toxav_audio_send_frame (friend_number, pcm, sample_count, channels, sampling_rate), where the arguments are the contact's identifier from the contact list, the PCM data buffer, the number of samples, the number of channels and the sampling rate. As with toxav_audio_receive_frame_cb, the 16-bit samples of a stereo stream must be interleaved: left channel, then right channel.

Recall that the maximum number of channels is 2 (stereo), the sampling rate can be 8, 12, 16, 24 or 48 kHz and, because of the way the Opus codec works, the number of samples must correspond to a frame length of 2.5, 5, 10, 20, 40 or 60 ms.
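In practice, audio captured from a device rarely arrives in chunks of exactly the right size, so it has to be cut into frames before sending. A minimal sketch of such slicing follows; the function name is hypothetical, and the leftover tail is returned so that it can be prepended to the next captured chunk.

```python
def split_into_frames(pcm, channels, sampling_rate, frame_ms=20):
    """Cut 16-bit PCM into frames of frame_ms each.

    Returns (sample_count, frames, remainder), where sample_count is the
    number of samples per channel in every frame.
    """
    sample_count = int(sampling_rate * frame_ms / 1000)
    frame_bytes = sample_count * channels * 2      # 2 bytes per 16-bit sample
    usable = len(pcm) - len(pcm) % frame_bytes
    frames = [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]
    return sample_count, frames, pcm[usable:]


sample_count, frames, rest = split_into_frames(bytes(2000), 1, 8000)
print(sample_count, len(frames), len(rest))   # 160 6 80
```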

To send a video frame, the toxav_video_send_frame call is used, which in the Python wrapper is renamed to toxav_video_send_yuv420_frame (friend_number, width, height, y, u, v), where the arguments are the contact's identifier from the contact list, the frame width and height in pixels, the Y (luminance) buffer and the U and V (chrominance) buffers.

An example of converting BGR to YUV420 can be found in the µTox source code (bgrtoyuv420). For convenience, the Python wrapper implements the toxav_video_send_bgr_frame (friend_number, width, height, bgr) and toxav_video_send_rgb_frame (friend_number, width, height, rgb) methods, which perform these conversions themselves (bgrtoyuv420 from the µTox code is used).
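To show what such a conversion involves, here is a slow pure-Python sketch going from RGB to YUV420. It assumes BT.601 limited-range coefficients and even frame dimensions, and it takes the chroma of each 2x2 block from its top-left pixel instead of averaging; it is not a copy of µTox's bgrtoyuv420, whose exact arithmetic I have not checked.

```python
def _clamp(value):
    """Round and clamp a float to the 0..255 byte range."""
    return max(0, min(255, int(round(value))))


def rgb_to_yuv420(width, height, rgb):
    """Convert packed RGB24 to tightly packed Y, U and V planes."""
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even frame dimensions")
    y = bytearray(width * height)
    u = bytearray((width // 2) * (height // 2))
    v = bytearray(len(u))
    for row in range(height):
        for col in range(width):
            i = (row * width + col) * 3
            r, g, b = rgb[i], rgb[i + 1], rgb[i + 2]
            y[row * width + col] = _clamp(0.257 * r + 0.504 * g + 0.098 * b + 16)
            # One chroma sample per 2x2 block, taken from its top-left pixel.
            if row % 2 == 0 and col % 2 == 0:
                j = (row // 2) * (width // 2) + col // 2
                u[j] = _clamp(-0.148 * r - 0.291 * g + 0.439 * b + 128)
                v[j] = _clamp(0.439 * r - 0.368 * g - 0.071 * b + 128)
    return bytes(y), bytes(u), bytes(v)


# A 2x2 black frame maps to Y=16, U=V=128.
print(rgb_to_yuv420(2, 2, b"\x00" * 12))
# (b'\x10\x10\x10\x10', b'\x80', b'\x80')
```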

Examples

Python is a fairly convenient language for prototyping and for learning how things work. To demonstrate the above, I wrote several examples:

echobot.py - a plain text echo bot that replies with the text sent to it. It also demonstrates managing contacts, status, avatars and files, and serves as the base for the other examples.

echoavbot.py - an audio and video echo bot that sends the caller's audio and video streams back to them. It can be handy for studying how the network affects transmission quality and latency, and for testing your own equipment.

avbot.py - a bot that simulates the other party and transmits audio and video data from its own devices. It can be handy for keeping an eye on your pets while you are away :)

Conclusion

The Tox project is alive and actively developing. The break with the Tox Foundation did not slow development down; if anything, it seems to have given the project a second wind: pull requests are being accepted, tickets are being sorted out, the new website is being actively filled in, and work on the documentation is under way. I hope that the group chat branch will soon be merged into the core and that the core will take on a finished, pre-release shape.