This article discusses features of iOS's low-level audio API that we ran into while developing Viber: choosing the hardware buffer size and the behavior of AudioUnit when the sample rate changes.
For working with sound programmatically on iOS, Apple provides four groups of APIs, each designed for a particular class of tasks:
- AVFoundation lets you play and record files and in-memory buffers, with the ability to use the hardware or software implementations of some audio codecs provided by the platform. It is recommended when there are no stringent requirements for low-latency recording and playback.
- OpenAL is intended for rendering and playing three-dimensional sound and for applying sound effects; it is used mainly in games. It provides low-latency playback but no way to record sound.
- AudioQueue is a basic API for recording and playing audio streams, with the ability to use codecs provided by the platform. It will not give you the minimum possible delay, but it is extremely simple to use.
- Finally, AudioUnit is a powerful and rich API for working with audio streams. Compared to Mac OS X, not all of it is accessible to the programmer on iOS, but it is best suited for recording and playing sound as close to the hardware as possible.
AudioUnit
Quite a lot has been written about initializing and using AudioUnit, including examples in the official documentation, so let us look at some less trivial aspects of its configuration and use. The units that interact with sound as close to the hardware as possible are RemoteIO and VoiceProcessingIO. VoiceProcessingIO adds to RemoteIO the ability to control additional OS-level processing that improves voice quality and performs automatic gain control (AGC). From the programmer's point of view, both units have an “input” and an “output”, to which two buses are connected.
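For reference, here is a minimal sketch of obtaining such a unit (an illustration rather than Viber's actual code; error handling is omitted):

#include <AudioUnit/AudioUnit.h>

// A sketch: create a VoiceProcessingIO unit and enable recording on bus 1.
// Use kAudioUnitSubType_RemoteIO for the plain RemoteIO variant.
static AudioUnit CreateVoiceIOUnit(void)
{
    AudioComponentDescription desc = {0};
    desc.componentType         = kAudioUnitType_Output;
    desc.componentSubType      = kAudioUnitSubType_VoiceProcessingIO;
    desc.componentManufacturer = kAudioUnitManufacturer_Apple;

    AudioComponent comp = AudioComponentFindNext(NULL, &desc);
    AudioUnit ioUnit = NULL;
    AudioComponentInstanceNew(comp, &ioUnit);

    // Recording (bus 1) is disabled by default for the I/O units.
    UInt32 enableInput = 1;
    AudioUnitSetProperty(ioUnit, kAudioOutputUnitProperty_EnableIO,
                         kAudioUnitScope_Input, 1, &enableInput, sizeof(enableInput));
    return ioUnit;
}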

The programmer can set and query the audio stream format on these buses. By querying the format of bus 1 at the AudioUnit input, you find out the parameters of the stream the hardware records from the microphone; by setting the format of bus 1 at the output, you define the format in which the audio stream will be delivered to the application. Likewise, by setting the format of bus 0 at the AudioUnit input we declare the format of the audio we will supply for playback, and by querying the format of bus 0 at the output we find out which format the hardware uses for playback.
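A rough sketch of querying and setting these formats, assuming ioUnit was obtained as in the previous sketch; the 16-bit mono 16 kHz PCM format is just an example choice, and error codes are ignored:

#include <AudioUnit/AudioUnit.h>

static void ConfigureStreamFormats(AudioUnit ioUnit)
{
    // Bus 1, input scope: the format the microphone hardware delivers (query only).
    AudioStreamBasicDescription hwInputFormat;
    UInt32 size = sizeof(hwInputFormat);
    AudioUnitGetProperty(ioUnit, kAudioUnitProperty_StreamFormat,
                         kAudioUnitScope_Input, 1, &hwInputFormat, &size);

    // The format the application wants to work with: 16-bit mono linear PCM at 16 kHz.
    AudioStreamBasicDescription appFormat = {0};
    appFormat.mSampleRate       = 16000;
    appFormat.mFormatID         = kAudioFormatLinearPCM;
    appFormat.mFormatFlags      = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked;
    appFormat.mChannelsPerFrame = 1;
    appFormat.mBitsPerChannel   = 16;
    appFormat.mBytesPerFrame    = 2;
    appFormat.mFramesPerPacket  = 1;
    appFormat.mBytesPerPacket   = 2;

    // Bus 1, output scope: recorded audio is handed to the application in this format.
    AudioUnitSetProperty(ioUnit, kAudioUnitProperty_StreamFormat,
                         kAudioUnitScope_Output, 1, &appFormat, sizeof(appFormat));

    // Bus 0, input scope: the application supplies audio for playback in this format.
    AudioUnitSetProperty(ioUnit, kAudioUnitProperty_StreamFormat,
                         kAudioUnitScope_Input, 0, &appFormat, sizeof(appFormat));
}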
Buffers are exchanged with AudioUnit in two callbacks with the signature:
typedef OSStatus (*AURenderCallback)(void *inRefCon,
                                     AudioUnitRenderActionFlags *ioActionFlags,
                                     const AudioTimeStamp *inTimeStamp,
                                     UInt32 inBusNumber,
                                     UInt32 inNumberFrames,
                                     AudioBufferList *ioData);
The input callback is called when the unit is ready to hand us a buffer of data recorded from the microphone; to get this data into the application, you call AudioUnitRender inside this callback. The render callback is called when the unit asks the application for playback data, which must be written into the ioData buffer. These callbacks run in the context of AudioUnit's internal thread and must finish as quickly as possible; ideally their work is limited to copying ready-made buffers. This adds difficulty to organizing the audio processing in terms of thread synchronization.
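For illustration, a sketch of an input callback that pulls the recorded buffer out of AudioUnit; the RecorderState structure and its fields are assumptions made for this example:

#include <AudioUnit/AudioUnit.h>

// Hypothetical application state passed through inRefCon.
typedef struct {
    AudioUnit        ioUnit;
    AudioBufferList *bufferList;   // preallocated for the maximum expected buffer size
} RecorderState;

static OSStatus InputCallback(void *inRefCon,
                              AudioUnitRenderActionFlags *ioActionFlags,
                              const AudioTimeStamp *inTimeStamp,
                              UInt32 inBusNumber,
                              UInt32 inNumberFrames,
                              AudioBufferList *ioData)   // NULL for the input callback
{
    RecorderState *state = (RecorderState *)inRefCon;
    // Pull inNumberFrames of recorded audio from bus 1 into our own buffer list.
    OSStatus err = AudioUnitRender(state->ioUnit, ioActionFlags, inTimeStamp,
                                   inBusNumber, inNumberFrames, state->bufferList);
    if (err == noErr) {
        // Ideally just copy state->bufferList into a lock-free ring buffer and return.
    }
    return err;
}

// Registering the callback for the recording side (bus 1):
static void RegisterInputCallback(AudioUnit ioUnit, RecorderState *state)
{
    AURenderCallbackStruct cb = { InputCallback, state };
    AudioUnitSetProperty(ioUnit, kAudioOutputUnitProperty_SetInputCallback,
                         kAudioUnitScope_Global, 1, &cb, sizeof(cb));
}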
In addition to the buffers, these callbacks receive a timestamp of the form:
struct AudioTimeStamp {
    Float64   mSampleTime;
    UInt64    mHostTime;
    Float64   mRateScalar;
    UInt64    mWordClockTime;
    SMPTETime mSMPTETime;
    UInt32    mFlags;
    UInt32    mReserved;
};
This timestamp can (and should) be used to detect missing samples during recording and playback; a sketch of such a check follows the list below. The main reasons samples get lost are:
- Switching the recording or playback device (speaker / headphones / Bluetooth). Losing part of the samples in this case is unavoidable; the timestamp can be used to correct further audio processing, for example synchronization with a video stream or recalculation of the timestamp field of an RTP packet.
- Too high CPU load, leaving the AudioUnit thread too little time to run. This is fixed by optimizing the algorithms or by dropping support for underpowered devices.
- Errors in thread synchronization when working with the audio buffers. Here correct use of lock-free structures, ring buffers and GCD helps (although GCD is not always a good fit for near-real-time problems). System Trace in Instruments can help identify the causes of thread-synchronization issues.
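A sketch of such a check, assuming the callback keeps a hypothetical expectedSampleTime variable between invocations (in a real application it would also have to be reset when the audio route changes):

#include <AudioUnit/AudioUnit.h>

// Inside the input (or render) callback: detect a gap by comparing mSampleTime
// of the current buffer with the position predicted from the previous one.
static Float64 expectedSampleTime = -1.0;   // hypothetical application state

static void CheckForDroppedSamples(const AudioTimeStamp *inTimeStamp, UInt32 inNumberFrames)
{
    if (expectedSampleTime >= 0.0 && inTimeStamp->mSampleTime > expectedSampleTime) {
        // This many samples never reached the application.
        Float64 lost = inTimeStamp->mSampleTime - expectedSampleTime;
        // e.g. insert that much silence or advance the RTP timestamp accordingly.
        (void)lost;
    }
    expectedSampleTime = inTimeStamp->mSampleTime + inNumberFrames;
}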
Hardware Buffer Size
In the ideal case, to obtain the minimum recording delay there would be no intermediate buffering at all. In the real world, however, hardware and software are optimized to work with groups of consecutive samples rather than with single samples. iOS does let you adjust the size of the hardware buffer: the audio session property PreferredHardwareIOBufferDuration requests the desired buffer duration in seconds, and CurrentHardwareIOBufferDuration returns the actual one (a sketch of using these properties is given at the end of this section). The possible durations depend on the sampling rate currently in use. For example, by default, when playing through the built-in speaker and recording through the built-in microphone, the hardware operates at 44100 Hz. The minimum buffer the audio subsystem operates with is 256 samples, and the size is normally a power of two (these values were obtained experimentally and are not documented). The buffer duration can therefore be:
256/44100 = 5.805ms
512/44100 = 11.61ms
1024/44100 = 23.22ms
If you use a bluetooth headset with a sampling frequency of 16000Hz, then the size of the hardware buffer can be:
256/16000 = 16ms
512/16000 = 32ms
1024/16000 = 64ms
The hardware buffer duration affects not only the delay but also the number of samples AudioUnit exchanges with the application on each callback. If the sample rates at the input and output of AudioUnit coincide, the callback receives a buffer equal in duration to the hardware buffer and is called at regular intervals. Accordingly, if the application's algorithms are designed for 10 ms chunks, intermediate buffering on the application side will be needed in any case, since AudioUnit cannot be configured to work with buffers of arbitrary duration. The hardware buffer size is best chosen experimentally, taking the performance of specific devices into account: reducing it improves latency, but adds thread-switching overhead for the callbacks and increases the likelihood of dropped samples under high CPU load.
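A sketch of requesting and reading back the buffer duration via the audio session C API of that era (iOS5/iOS6; the audio session is assumed to be already initialized and active, and error codes are ignored):

#include <AudioToolbox/AudioToolbox.h>

// Ask the hardware for the preferred buffer duration and read back what was granted.
static Float32 RequestIOBufferDuration(Float32 preferredSeconds)
{
    AudioSessionSetProperty(kAudioSessionProperty_PreferredHardwareIOBufferDuration,
                            sizeof(preferredSeconds), &preferredSeconds);

    Float32 actualSeconds = 0;
    UInt32 size = sizeof(actualSeconds);
    AudioSessionGetProperty(kAudioSessionProperty_CurrentHardwareIOBufferDuration,
                            &size, &actualSeconds);
    return actualSeconds;   // e.g. ~0.0058 s (256 samples) at 44100 Hz
}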
Buffering when changing the sampling rate
In VoIP applications it does not always make sense to process audio at a sampling rate above 16000 Hz. Besides, it is easier to abstract away from the hardware sampling rate, since it can change at any moment when the audio route switches. When configuring AudioUnit you can set the sampling rate of the stream it exchanges with the application. Consider how this works for recording in the following example:
SampleRateHW = 44100 Hz (hardware sampling rate)
SampleRateAU = 16000 Hz (sampling rate set on the AudioUnit bus)
Hardware buffer = 1024 samples
1024 * 16000 / 44100 ≈ 371.5 samples per hardware buffer after resampling
After resampling, an integer number of samples goes to the output, while the fractional remainder is kept inside the resampler's filter state. The behavior of AudioUnit here differs noticeably between iOS5 and iOS6.
iOS5
In iOS5, AudioUnit exchanges buffers whose size is a power of two, so on the first call the application receives 256 samples (16 ms @ 16 kHz). The remaining 371 - 256 = 115 samples stay inside AudioUnit.

The second time the callback is called, the application will again receive a buffer of 256 samples: some of the data in it will be from the previous hardware buffer, and some from the new one.

In the third call, the remainder accumulated after resampling will allow 512 samples to be transferred to the application immediately.

Then, again, the application receives 256 samples.

Thus, when recording with resampling, the callback is called at regular intervals, but the size of the buffer it receives is not constant (though always a power of two).
iOS6
In iOS6 the restriction on the size of the buffer passed between the application and AudioUnit was removed, which eliminates the intermediate buffering during resampling and thus reduces the delay. The application receives buffers of 371 and 372 samples alternately.
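To make the difference concrete, here is a toy model (an assumption reconstructed from the observations above, not an API) of how many samples each callback delivers when a 1024-sample hardware buffer at 44100 Hz is resampled to 16000 Hz:

#include <math.h>
#include <stdio.h>

static void simulate(int powerOfTwoChunks, int calls)
{
    const double perBuffer = 1024.0 * 16000.0 / 44100.0;   // ~371.52 samples per HW buffer
    double accumulated = 0.0;
    for (int i = 0; i < calls; ++i) {
        accumulated += perBuffer;
        int delivered = powerOfTwoChunks
            ? 1 << (int)floor(log2(accumulated))   // iOS5: largest power-of-two block available
            : (int)accumulated;                    // iOS6: everything that is ready
        accumulated -= delivered;                  // fractional remainder stays in the resampler
        printf("callback %d: %d samples\n", i, delivered);
    }
}

int main(void)
{
    printf("iOS5-like behaviour:\n");
    simulate(1, 4);   // 256, 256, 512, 256
    printf("iOS6-like behaviour:\n");
    simulate(0, 4);   // 371, 372, 371, 372
    return 0;
}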
The CoreAudio API can hardly be called clear and well documented. Many of its features have to be learned experimentally, and one must remember that the behavior may differ between OS versions. For those interested in real-time audio processing, in addition to the Apple documentation we can recommend the “iZotope iOS Audio Programming Guide”.