
This is a continuation of the
first article about LPCNet . In the first demo, we presented an
architecture that combines signal processing and deep learning to improve the efficiency of neural speech synthesis. This time, we turn LPCNet into a very low bitrate neural speech codec (see the
scientific article ). It can be used on current hardware and even on phones.
For the first time, a neural vocoder runs in real time on a single phone CPU core rather than on a high-end GPU. The resulting bitrate of 1600 bps is about ten times lower than what conventional wideband codecs use. The quality is much better than that of existing very low bitrate vocoders and is comparable to more traditional codecs operating at higher bitrates.
Waveform Coders and Vocoders
There are two broad classes of speech codecs: waveform coders and vocoders. Waveform coders include Opus, AMR/AMR-WB and all the codecs that can also be used for music. They try to make the decoded waveform as close as possible to the original, usually subject to perceptual considerations. Vocoders, on the other hand, are really synthesizers. The encoder extracts information about the pitch and the shape of the vocal tract, transmits it to the decoder, and the decoder re-synthesizes the speech. It is almost like speech recognition followed by text-to-speech, except that the encoder is much simpler/faster than a speech recognizer (and transmits a bit more information).
Vocoders have existed since the 70s, but since their decoders perform speech synthesis, they cannot sound much better than conventional speech synthesis systems, which until recently sounded pretty awful. That is why vocoders have mostly been limited to bitrates below 3 kbps. On top of that, waveform coders simply provide better quality. Things stayed that way until neural speech synthesis systems such as
WaveNet appeared . Suddenly, synthesis started to sound much better and, of course, people soon wanted
to turn WaveNet into a vocoder .
LPCNet Overview
WaveNet produces very high-quality speech but requires hundreds of gigaflops of computing power. LPCNet reduces that complexity significantly. It is based on WaveRNN, which improves on WaveNet by using a recurrent neural network (RNN) and sparse matrices. LPCNet further improves on WaveRNN with
linear prediction (LPC), which served older vocoders well. LPC predicts a sample from a linear combination of the previous samples and, most importantly, does so many times more cheaply than a neural network. Of course, it is not all-powerful (otherwise the vocoders of the 70s would have sounded great), but it takes a serious load off the neural network. This makes it possible to use a smaller network than WaveRNN without sacrificing quality.
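To make the linear prediction idea concrete, here is a minimal sketch (not the actual LPCNet code) of a 16th-order predictor: each sample is estimated as a weighted sum of the 16 previous samples, and the network only has to model what is left over.

```c
#include <stddef.h>

#define LPC_ORDER 16  /* 16th-order prediction, as used for 16 kHz speech */

/* Predict sample x[n] as a linear combination of the previous samples,
 * given the predictor coefficients lpc[0..LPC_ORDER-1] (assumes n >= LPC_ORDER). */
static float lpc_predict(const float *x, size_t n, const float lpc[LPC_ORDER])
{
    float pred = 0.f;
    for (int i = 0; i < LPC_ORDER; i++)
        pred += lpc[i] * x[n - 1 - i];
    return pred;
}

/* The excitation (what the neural network actually has to model) is the
 * difference between the real sample and its prediction. */
static float lpc_excitation(const float *x, size_t n, const float lpc[LPC_ORDER])
{
    return x[n] - lpc_predict(x, n, lpc);
}
```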
A closer look at LPCNet. The yellow part on the left is computed once per frame and its output feeds the sample-rate network on the right (in blue). The prediction block predicts the sample at time t from the previous samples and the linear prediction coefficients.

Compression characteristics
LPCNet synthesizes speech from vectors of 20 features per 10-ms frame. Of these, 18 are
cepstral coefficients representing the shape of the spectrum. The remaining two describe the pitch: one for the pitch period and one for the
pitch correlation (how strongly the signal correlates with itself when delayed by one pitch period). Stored as floating-point values, all of this information adds up to 64 kbps for storage or transmission. That is far too much, considering that even the Opus codec provides very high-quality speech at just 16 kbps (for 16 kHz mono). Clearly, some heavy compression is needed here.
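As a quick sanity check on that number: 20 features per frame × 32 bits per float × 100 frames per second = 64,000 bits per second, i.e. 64 kbps before any quantization.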
Pitch
All codecs rely heavily on pitch, but unlike waveform coders, where pitch "merely" helps reduce redundancy, vocoders have no fallback: if the pitch estimate is wrong, they generate bad-sounding (or even unintelligible) speech. Without going into details (see the scientific article), the LPCNet encoder tries very hard not to get the pitch wrong. The search starts by looking for
correlations in the speech signal over time. Below is how a typical search works.
The pitch period is the interval over which the pitched signal repeats itself. The animation searches for the period that maximizes the correlation between the signal x(n) and its delayed copy x(n-T). The value of T with the maximum correlation is the pitch period.

This information needs to be coded with as few bits as possible without degrading the result too much. Since we naturally perceive frequency on a logarithmic scale (for example, each musical octave doubles the frequency), it makes sense to code the pitch logarithmically. The pitch of the speaking voice for most people (we are not trying to cover sopranos here) lies between 62.5 and 500 Hz. With six bits (64 possible values), we get a resolution of about a quarter tone (the difference between the notes C and D is one tone).
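Here is a simplified sketch of the two ideas above: an open-loop autocorrelation search for the pitch period, followed by quantization of the result on a logarithmic scale. The constants and helper names are illustrative only; the real LPCNet encoder search is more elaborate.

```c
#include <math.h>

#define FS          16000   /* sample rate in Hz */
#define MIN_PERIOD     32   /* 500 Hz  */
#define MAX_PERIOD    256   /* 62.5 Hz */

/* Open-loop pitch search: find the delay T that maximizes the normalized
 * correlation between x[n] and x[n-T] over a window of `len` samples.
 * x points to the start of the window and must have at least MAX_PERIOD
 * samples of history before it. */
static int pitch_search(const float *x, int len, float *best_corr)
{
    int best_T = MIN_PERIOD;
    *best_corr = -1.f;
    for (int T = MIN_PERIOD; T <= MAX_PERIOD; T++) {
        float xy = 0.f, xx = 0.f, yy = 0.f;
        for (int n = 0; n < len; n++) {
            xy += x[n] * x[n - T];
            xx += x[n] * x[n];
            yy += x[n - T] * x[n - T];
        }
        float corr = xy / sqrtf(xx * yy + 1e-15f);
        if (corr > *best_corr) {
            *best_corr = corr;
            best_T = T;
        }
    }
    return best_T;
}

/* Quantize the pitch on a log scale over the 62.5-500 Hz (3 octave) range,
 * e.g. quantize_pitch(period, 6) for a 6-bit index. */
static int quantize_pitch(int period, int nb_bits)
{
    float f0 = (float)FS / period;
    float octaves = log2f(f0 / 62.5f);        /* 0..3 over the allowed range */
    int levels = (1 << nb_bits) - 1;
    int idx = (int)floorf(octaves / 3.f * levels + 0.5f);
    if (idx < 0) idx = 0;
    if (idx > levels) idx = levels;
    return idx;
}
```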
So, are we done with pitch? Not so fast. People do not speak like robots from 1960s films; the pitch can vary even within a 40-ms packet. We account for that with a pitch modulation parameter: 3 bits encode a difference of up to 2.5 semitones between the beginning and the end of the packet. Finally, we need to code the pitch correlation, which distinguishes voiced sounds (such as vowels) from unvoiced ones (such as s and f). Two bits are enough for the correlation.
Cepstrum
While the pitch carries the expressive characteristics of speech (prosody, emotion, emphasis, ...), the spectral envelope determines
what was said (with the exception of tonal languages such as Chinese, where pitch also matters for meaning). The vocal folds produce roughly the same sound for any vowel; it is the shape of the vocal tract that determines which sound is actually produced. The vocal tract acts as a filter, and the encoder's job is to estimate that filter and transmit it to the decoder. This can be done efficiently by transforming the spectrum into a
cepstrum (yes, that is "spectrum" with some of the letters reordered; we think we are funny in digital signal processing).
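As a rough illustration of what that transform looks like, here is a minimal sketch: log-compress the energy in each of 18 frequency bands, then decorrelate the result with a DCT. The band layout, windowing and scaling used by the real LPCNet feature extraction differ from this simplified version.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define NB_BANDS 18  /* number of bands / cepstral coefficients */

/* Simplified cepstrum computation: take the log of the energy in each band,
 * then decorrelate across bands with a DCT-II. */
static void band_energies_to_cepstrum(const float energy[NB_BANDS],
                                      float cepstrum[NB_BANDS])
{
    float log_e[NB_BANDS];
    for (int i = 0; i < NB_BANDS; i++)
        log_e[i] = log10f(energy[i] + 1e-10f);   /* avoid log(0) */

    /* DCT-II: c[k] = sum_i log_e[i] * cos(pi * k * (i + 0.5) / N) */
    for (int k = 0; k < NB_BANDS; k++) {
        float sum = 0.f;
        for (int i = 0; i < NB_BANDS; i++)
            sum += log_e[i] * cosf((float)M_PI * k * (i + 0.5f) / NB_BANDS);
        cepstrum[k] = sum;
    }
}
```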
For a 16 kHz input signal, the cepstrum is essentially a vector of 18 numbers every 10 ms, which we need to compress as much as possible. Since a 40-ms packet contains four such vectors and they usually resemble each other, we want to remove as much redundancy as possible. This can be done by using neighboring vectors as predictors and transmitting only the difference between the prediction and the real value. At the same time, we do not want to depend too much on previous packets in case one of them gets lost. Sounds like a problem that has been solved before...
If all you have is a hammer, everything looks like a nail. (Abraham Maslow)

If you have spent a lot of time working
with video codecs , you have probably come across the concept of B-frames. Unlike video codecs, which split one frame across many packets, we have the opposite situation: many frames in one packet. We start by coding a
key frame , i.e. an independently coded vector, at the
end of the packet. This vector is coded without prediction and takes 37 bits: 7 for the overall energy (the first cepstral coefficient) and 30 for the remaining parameters, using
vector quantization (VQ). Then come the (hierarchical) B-frames. From the two keyframes (the one in the current packet and the one in the previous packet), we predict the cepstrum lying between them. As the predictor for coding the difference between the real value and the prediction, we can use either of the two keyframes or their average. We again apply VQ and code this vector with a total of 13 bits, including the predictor selection. That leaves just two vectors and very few bits. We use the last 3 bits simply to select predictors for the remaining vectors. Of course, all of this is much easier to understand with a picture:
Prediction and quantization of the cepstrum for packet k. Green vectors are quantized independently, blue ones with prediction, and red ones use prediction without residual quantization. The predictions are shown by arrows.
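To spell out the scheme in the figure, here is a hypothetical decoder-side sketch of how the four 10-ms cepstral vectors of a packet could be reconstructed. The vq_decode() helper, the predictor selectors and the exact bit layout are placeholders for illustration, not the actual LPCNet implementation.

```c
#include <string.h>

#define NB_BANDS 18

/* Stub codebook lookup so the sketch compiles; a real decoder would index
 * trained VQ codebooks here. */
static void vq_decode(int index, float out[NB_BANDS])
{
    (void)index;
    memset(out, 0, NB_BANDS * sizeof(float));
}

/* Predict a vector from its two neighbors: 0 = copy left, 1 = copy right,
 * 2 = average of the two. */
static void predict(const float *left, const float *right, int sel,
                    float out[NB_BANDS])
{
    for (int i = 0; i < NB_BANDS; i++) {
        if (sel == 0)      out[i] = left[i];
        else if (sel == 1) out[i] = right[i];
        else               out[i] = 0.5f * (left[i] + right[i]);
    }
}

/* Reconstruct the four 10-ms cepstral vectors c[0..3] of a 40-ms packet.
 * prev_key is the keyframe (last vector) of the previous packet. */
static void decode_packet_cepstra(const float prev_key[NB_BANDS],
                                  int key_idx,             /* 30-bit independent VQ  */
                                  int delta_idx,           /* 13-bit residual VQ     */
                                  int delta_sel,           /* predictor for c[1]     */
                                  const int interp_sel[2], /* predictors for c[0], c[2] */
                                  float c[4][NB_BANDS])
{
    float pred[NB_BANDS], res[NB_BANDS];

    /* 1. Keyframe at the end of the packet, coded independently. */
    vq_decode(key_idx, c[3]);

    /* 2. Middle vector: predicted from the two keyframes, plus a coded residual. */
    predict(prev_key, c[3], delta_sel, pred);
    vq_decode(delta_idx, res);
    for (int i = 0; i < NB_BANDS; i++) c[1][i] = pred[i] + res[i];

    /* 3. Remaining vectors: prediction only, no residual. */
    predict(prev_key, c[1], interp_sel[0], c[0]);
    predict(c[1], c[3], interp_sel[1], c[2]);
}
```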
Putting it all together

Adding up everything above, we get 64 bits per 40-ms packet, or 1600 bits per second. If you want to compute the compression ratio: uncompressed wideband speech is 256 kbps (16 kHz, 16 bits per sample), so the compression ratio is 160x! Of course, you can always play with the quantizers to get a lower or higher bitrate (with a corresponding effect on quality), but you have to start somewhere. Here is a table showing where the bits go.
Bit allocation |
Parameter | Bits |
Pitch period | 6 |
Pitch modulation | 3 |
Pitch correlation | 2 |
Energy | 7 |
Independent cepstrum VQ (40 ms) | 30 |
Predicted cepstrum VQ (20 ms) | 13 |
Cepstrum interpolation (10 ms) | 3 |
Total | 64 |
With 64 bits per 40-ms packet and 25 packets per second, that comes out to 1600 bps.
Implementation
The LPCNet source code is available under a BSD license. It includes a library that makes the codec easy to use. Note that development is not finished: both the bit-stream format and the API are
bound to change. The repository also contains a demo application,
lpcnet_demo
, that makes it easy to test the codec from the command line. For complete instructions, see the README.md file.
For those who want to dig deeper, it is also possible to train new models and/or use LPCNet as a building block for other applications, such as speech synthesis (LPCNet is just one component of a synthesizer; it does not perform synthesis by itself).
Performance
Neural speech synthesis is resource-hungry. At last year's ICASSP, W. Bastiaan Kleijn and colleagues from Google/DeepMind presented a
WaveNet-based 2400 bps codec that takes its bit stream from Codec2. While it sounds amazing, its computational complexity of hundreds of gigaflops means it cannot run in real time without an expensive GPU and a lot of effort.
By contrast, our 1600 bps codec requires only about 3 gigaflops and is designed to run in real time on much more affordable hardware; in fact, it can be used in real applications today. Getting there required writing some optimized code for the AVX2/FMA and Neon instruction sets (intrinsics only, no assembly). Thanks to that, we can now encode (and, more importantly, decode) speech in real time not only on a PC but also on reasonably modern phones.
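To give a feel for what that intrinsics code looks like, here is a minimal AVX2/FMA sketch of the kind of dot product that dominates the run time; the actual LPCNet code is more involved (sparse block matrices, several outputs computed per pass). It is built with -mavx2 -mfma and assumes the vector length is a multiple of 8.

```c
#include <immintrin.h>

/* Minimal AVX2/FMA dot product: accumulate 8 float32 lanes at a time with
 * fused multiply-add, then reduce the 8 partial sums horizontally. */
static float dot_product_avx2(const float *a, const float *b, int len)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < len; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb */
    }
    /* Horizontal sum of the 8 partial sums. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}
```

The table below shows the performance on x86 and ARM processors.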
Performance |
CPU | Frequency | % of one core | Faster than real time |
AMD 2990WX (Threadripper) | 3.0 GHz* | 14% | 7.0x |
Intel Xeon E5-2640 v4 (Broadwell) | 2.4 GHz* | 20% | 5.0x |
Snapdragon 855 (Cortex-A76 in the Galaxy S10) | 2.82 GHz | 31% | 3.2x |
Snapdragon 845 (Cortex-A75 in the Pixel 3) | 2.5 GHz | 68% | 1.47x |
AMD A1100 (Cortex-A57) | 1.7 GHz | 102% | 0.98x |
BCM2837 (Cortex-A53 in the Raspberry Pi 3) | 1.2 GHz | 310% | 0.32x |
* turbo mode | | | |
The numbers are quite interesting. Only Broadwell and Threadripper are shown, but on x86, Haswell and Skylake processors have similar performance (once clock frequency is taken into account). ARM processors, however, differ markedly from each other. Even accounting for the difference in clock frequency, the A76 is five to six times faster than the A53: not surprising, since the A53 is mostly designed for power efficiency (for example, in big.LITTLE systems). Still, LPCNet can run in real time on a modern phone using only one core. It would be nice to also run it in real time on a Raspberry Pi 3; we are not there yet, but nothing is impossible.
On x86, it is still a bit of a mystery why performance tops out at about one fifth of the theoretical maximum. Matrix-vector products are known to be less efficient than matrix-matrix products because they require more loads per operation, specifically one matrix element load for each FMA. On the one hand, performance behaves as if it were bound by an L2 cache delivering only 16 bytes per cycle; on the other hand, Intel claims that L2 can deliver up to 32 bytes per cycle on Broadwell and 64 bytes per cycle on Skylake.
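As a back-of-the-envelope check using the numbers above: a Broadwell core with two 256-bit FMA units can in theory sustain 2 × 8 × 2 = 32 floating-point operations per cycle, while 3 gigaflops at 20% of a 2.4 GHz core works out to 3×10^9 / (0.2 × 2.4×10^9) ≈ 6 operations per cycle, i.e. roughly one fifth of the peak.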
Results
We conducted MUSHRA-like listening tests to compare coding quality. The conditions tested were:
- Reference: the original audio (if your result is better than the original, something is clearly wrong with your test)
- LPCNet at 1600 bps: our demo
- Uncompressed LPCNet: "LPCNet with 122 equivalent units" from the first article
- Opus wideband at 9000 bps: the lowest bitrate at which Opus 1.3 encodes wideband audio
- MELP at 2400 bps: a well-known low bitrate vocoder (similar in quality to Codec2)
- Speex at 4000 bps: this wideband vocoder should never be used, but it makes a good low anchor
In the first test (set 1) we used eight utterances from two male and two female speakers. The files in set 1 come from the same database (i.e. the same recording conditions) used for training, although those particular speakers were excluded from the training set. In the second test (set 2) we used some of the (uncompressed) files from the Opus testing, recorded under various conditions, to check that LPCNet generalizes at least somewhat. Both tests had 100 participants, so the error bars are quite small. See the results below.
Subjective quality (MUSHRA) in the two tests.

Overall, LPCNet at 1600 bps looks good: much better than MELP at 2400 bps and not far behind Opus at 9000 bps. At the same time, uncompressed LPCNet is slightly better in quality than Opus at 9000 bps. This suggests it is possible to deliver better quality than Opus at bitrates in the 2000-6000 bps range.
Listen for yourself
Here are some samples from the listening test:
Female (set 1)
Male (set 1)
Mixed (set 2)
Where can this be used?
We believe this is a cool technology in its own right, but it also has practical applications. Here are a few examples.
VoIP in countries with poor connectivity
Not everyone has access to a high-speed connection at all times. In some countries, connectivity is very slow and unreliable. A 1600 bps speech codec works fine under such conditions, even when packets are transmitted several times for extra reliability. Of course, given the overhead of packet headers (40 bytes for IP + UDP + RTP), it is better to use longer packets of 40, 80 or 120 ms.
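To make the header overhead concrete: at 1600 bps, a 40-ms packet carries only 8 bytes of payload, so with one frame per packet the 40 bytes of headers push the total to 48 bytes per 40 ms, or about 9.6 kbps on the wire. Packing three frames (120 ms) into each packet brings that down to 64 bytes per 120 ms, or roughly 4.3 kbps.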
Amateur / HF Radio
For the past ten years,
David Rowe has been working on speech coding for radio communication. He developed
Codec2 , which transmits voice at bitrates from 700 to 3200 bps. Over the past year, David and I have been discussing how to improve Codec2 with neural synthesis, and now we have finally done it. On his blog, David has
written about his own implementation of an LPCNet-based codec for integration into
FreeDV .
Improved robustness to packet loss
Being able to code a decent-quality bit stream with very few bits is useful for adding redundancy on unreliable channels. Opus has a forward error correction (FEC) mechanism, known as LBRR, that encodes the previous frame at a lower bitrate and sends it in the current frame. It works well but adds significant overhead. Duplicating the stream at 1600 bps is much more efficient.
Plans
There are still many possibilities to explore with LPCNet. One is improving existing codecs (such as Opus). Like other codecs, Opus degrades quickly at very low bitrates (below 8000 bps) because the waveform coder does not have enough bits to match the original closely. But the linear prediction information it transmits is enough for LPCNet to synthesize decent-sounding speech, better than Opus itself can manage at that bitrate. On top of that, the remaining information Opus transmits (the prediction residual) helps LPCNet synthesize an even better result. In a sense, LPCNet can act as a fancy postfilter that improves the quality of Opus (or any other codec) without changing the bit stream (i.e., keeping full compatibility).
Additional resources
- J.-M. Valin, J. Skoglund, A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet, submitted to Interspeech 2019, arXiv:1903.12087.
- J.-M. Valin, J. Skoglund, LPCNet: Improving Neural Speech Synthesis Through Linear Prediction, Proc. ICASSP, 2019, arXiv:1810.11846.
- A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, 2016.
- N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, K. Kavukcuoglu, Efficient Neural Audio Synthesis, 2018.
- W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, T. C. Walters, Wavenet Based Low Rate Speech Coding, 2018.
- LPCNet source code .
- FreeDV codec based on David Rowe's LPCNet implementation.
- Join the development discussion on #opus at irc.freenode.net (→ web interface )