
Speech Recognition on STM32F4-Discovery

In this article I want to show how you can recognize speech on a microcontroller using the STM32F4-Discovery board. Since speech recognition is a rather complicated task even for a computer, in this case it is performed by a Google service. Speech recognition done this way can be useful in various tasks, for example in a "smart home" device.

The STM32F4-Discovery board differs markedly from the STM32-Discovery board often mentioned in articles. It carries an STM32F407VGT6 microcontroller based on the Cortex-M4F core, with 1 MB of Flash and 192 KB of RAM. The controller can run at a clock frequency of up to 168 MHz.

An audio DAC with a built-in amplifier (its output is connected to the headphone jack) and a digital MEMS microphone are installed on the board, so it is easy to build a device that works with sound on top of the STM32F4-Discovery.

Speech recognition using Google Voice Search is described here: Article.
In order to recognize a spoken phrase using the microcontroller, you need to perform a number of steps:
• Record sound to controller memory.
• Perform audio coding.
• Connect to the Google server.
• Send a POST request and encoded audio data to the server.
• Receive a response from the server.

Voice recording
Since the board already has a digital microphone, we will record sound with it. It is a PDM microphone, with only two signal lines: clock and data. While a clock signal is applied, the microphone's data output carries a signal encoded with PDM modulation (for more information, see Wikipedia: Pulse-density modulation). On the STM32F4-Discovery the microphone is connected to SPI/I2S, so to receive data from it, it is enough to configure I2S for reception and to read the data register in the I2S interrupt handler. This data is stored in the controller's memory, and once enough of it has been accumulated, it is filtered, yielding several samples of audio data.
Working with a microphone is described in the document AN3998 from ST - it explains the principle of operation of the microphone, the features of its connection and describes how to work with the filtering function.
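To give an idea of what the filtering step does, here is a minimal sketch of PDM-to-PCM conversion. This is NOT the PDM_Filter function from ST's library described in AN3998 (which uses proper decimation filters); it is a crude illustration that produces one 16-bit PCM sample from 64 one-bit PDM samples by counting ones, i.e. a moving-average low-pass filter. The function name and the scaling are my own choices.

```c
#include <stdint.h>

/* Crude PDM-to-PCM decimator for illustration only (the real firmware
 * uses ST's PDM filter library, see AN3998): one 16-bit PCM sample is
 * produced from 64 one-bit PDM samples by counting the ones, which is
 * a simple moving-average low-pass filter. */
#define PDM_BITS_PER_SAMPLE 64

int16_t pdm_to_pcm(const uint8_t *pdm, int byte_count)
{
    int ones = 0;
    for (int i = 0; i < byte_count; i++) {
        uint8_t b = pdm[i];
        while (b) {              /* count set bits in this byte */
            ones += b & 1;
            b >>= 1;
        }
    }
    /* Map the bit density 0..64 onto a signed range around zero. */
    return (int16_t)((ones - PDM_BITS_PER_SAMPLE / 2) * 512);
}
```

An all-zero PDM stream maps to a large negative sample, an all-ones stream to a large positive one, and a 50% density (e.g. alternating bits) to silence.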

Among the various examples for the board on the ST site there is an example of working with sound, although it is rather elaborate: it shows how to play sound from the controller's memory and from a USB flash drive connected to the board, and also demonstrates recording sound to the flash drive. I took the playback and recording code from there. However, this code contained quite a few errors and shortcomings; the example was probably written in a hurry.

Coding of recorded sound
Working with the speech recognition service has been described on the Internet more than once. In all cases the authors use the FLAC audio codec, since Google otherwise expects a non-standard variant of Speex encoding.
This is evident from the Chromium browser code: The code responsible for recording the sound.
The description of the POST request indicates the data type "audio/x-speex-with-header-byte".
FLAC, however, cannot be encoded on the STM32: there are no FLAC libraries for it. But the Speex codec has been ported to STM32, so I used that codec for encoding. From the Chromium code it is easy to see what Google's modification of the codec is: before each frame of encoded audio data, an extra byte is inserted that holds the length of the frame in bytes.
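The header-byte framing inferred from the Chromium source can be sketched as follows; the function name and buffer handling are illustrative, not taken from the original firmware:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the "x-speex-with-header-byte" framing: before each
 * encoded Speex frame, one byte holding the frame length in bytes is
 * inserted. Returns the total number of bytes written to out, which
 * must have room for frame_len + 1 bytes. */
int add_header_byte(const uint8_t *frame, uint8_t frame_len, uint8_t *out)
{
    out[0] = frame_len;               /* length prefix */
    memcpy(out + 1, frame, frame_len);
    return frame_len + 1;
}
```

For the 20-byte frames produced here, each framed unit is therefore 21 bytes long.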

Audio is recorded and encoded at the same time, using double buffering: while 160 samples of audio data are written into one buffer, the data from the other buffer is encoded into the Speex format. The encoded data is stored in the controller's memory. Recording lasts 2 seconds at a sampling rate of 8 kHz, i.e. 100 frames of 160 samples (20 ms each); each frame encodes to 20 bytes plus the header byte, which gives 2100 bytes of encoded audio data.
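The double-buffering scheme can be sketched like this. The buffer names, the ready flag, and the callback structure are my own illustration of the ping-pong idea, not the original source:

```c
#include <stdint.h>

/* Ping-pong buffering sketch: the audio interrupt fills one 160-sample
 * buffer while the main loop encodes the other. */
#define FRAME_SAMPLES 160

static int16_t pcm_buf[2][FRAME_SAMPLES];
static volatile int fill_idx = 0;     /* buffer the ISR writes into  */
static volatile int frame_ready = 0;  /* set when a buffer fills up  */
static int sample_pos = 0;

/* Called from the audio interrupt for every filtered PCM sample. */
void on_pcm_sample(int16_t s)
{
    pcm_buf[fill_idx][sample_pos++] = s;
    if (sample_pos == FRAME_SAMPLES) {
        sample_pos = 0;
        fill_idx ^= 1;       /* swap buffers */
        frame_ready = 1;     /* main loop may now encode the other one */
    }
}

/* Main loop side: returns the completed buffer to encode, or 0. */
int16_t *get_frame_to_encode(void)
{
    if (!frame_ready)
        return 0;
    frame_ready = 0;
    return pcm_buf[fill_idx ^ 1];    /* the buffer just completed */
}
```

Each returned 160-sample frame is then fed to the Speex encoder while the interrupt keeps filling the other buffer.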

Contacting Google
An RN-XV WiFi board is used to connect to the Internet. It carries the RN-171 WiFi module (on the underside of the board), an antenna, 3 LEDs and pin headers. Communication with the module goes over UART, so 4 wires are enough to work with it. The board cost $35 at SparkFun, where I ordered it; the bare RN-171 module costs $30. More information about the module can be found on the SparkFun website: RN-XV WiFly Module.

In order to transfer data to the server, you need to connect to it via TCP, and then send a request of this type:

POST http://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=ru-RU HTTP/1.1@#Content-type: audio/x-speex-with-header-byte; rate=8000@#Connection: close@#Content-length: 2100@#@#

The program replaces the @# characters with CRLF. After sending the request, you need to send the 2100 bytes of encoded audio data to the server. After receiving all the data, the server performs the recognition and returns the recognized string along with additional information, after which the connection is closed.
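The @#-to-CRLF expansion described above can be sketched as a small helper; the function name and buffer-size handling are my own:

```c
#include <string.h>

/* Expand a request template in which "@#" stands in for "\r\n" into
 * the buffer that is actually sent over TCP. Returns the number of
 * bytes written (excluding the terminating NUL). */
int expand_crlf(const char *tpl, char *out, int out_size)
{
    int n = 0;
    for (const char *p = tpl; *p && n < out_size - 1; p++) {
        if (p[0] == '@' && p[1] == '#') {
            if (n + 2 >= out_size)
                break;               /* no room for CRLF + NUL */
            out[n++] = '\r';
            out[n++] = '\n';
            p++;                     /* skip the '#' */
        } else {
            out[n++] = *p;
        }
    }
    out[n] = '\0';
    return n;
}
```

Storing the template with @# keeps the string readable in the source while still producing the header terminator sequence CRLF CRLF (the trailing @#@#) that HTTP requires.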

Once the server's response has been received, the program extracts the recognized string from it and outputs it through another UART of the microcontroller. Data from this UART goes to a computer running a terminal program, in whose window the recognized phrase appears. After that, the controller is ready to record a new phrase.
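Extracting the recognized string can be done with a simple scan rather than a full JSON parser. As far as I recall, the v1 API replied with a "hypotheses" array whose entries carried an "utterance" field; treat that field name, and this naive extractor, as an assumption rather than a documented contract:

```c
#include <string.h>

/* Naively pull the recognized phrase out of the server's JSON reply by
 * scanning for the assumed "utterance" key; no escape handling, no
 * real JSON parsing. Returns 1 on success, 0 if the key is absent. */
int extract_utterance(const char *json, char *out, int out_size)
{
    const char *key = "\"utterance\":\"";
    const char *p = strstr(json, key);
    if (!p)
        return 0;
    p += strlen(key);
    int n = 0;
    while (*p && *p != '"' && n < out_size - 1)
        out[n++] = *p++;           /* copy until the closing quote */
    out[n] = '\0';
    return 1;
}
```

On a small microcontroller this kind of fixed-key scan is a reasonable trade-off against linking a JSON library.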

The resulting construction looks like this:

And here is how it works:


Update:

After I posted the article, I managed to implement starting the recording when a loud sound (such as speech) appears. To do this, the program continuously records and encodes sound. The encoded data is placed in an array; when its end is reached, writing wraps around to the beginning. At the same time, the program constantly checks whether a loud sound has appeared. When it does, the program saves the current value of the write pointer and keeps recording for 2 seconds. After recording stops, the program copies the data into another buffer. Since it is known at what point the sound appeared, data from shortly before that moment can be taken as well, so the first sounds of the word are not lost.
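The ring-buffer capture described above can be sketched as follows. The sizes (ring depth, pre-trigger margin) and function names are illustrative choices of mine, not values from the original firmware:

```c
#include <stdint.h>
#include <string.h>

/* VAD capture sketch: encoded 21-byte Speex frames are written into a
 * circular array; on a loud sound the write position is saved,
 * recording continues for 2 s, then the capture is linearized starting
 * a few frames BEFORE the trigger so the word's onset survives. */
#define FRAME_BYTES     21   /* 20-byte Speex frame + length byte */
#define RING_FRAMES    150   /* ~3 s of 20 ms frames              */
#define CAPTURE_FRAMES 100   /* 2 s worth of frames               */
#define PRE_TRIGGER      5   /* frames kept before the loud sound */

static uint8_t ring[RING_FRAMES][FRAME_BYTES];
static int wr = 0;           /* next slot to write */

/* Called for every freshly encoded frame. */
void ring_put(const uint8_t *frame)
{
    memcpy(ring[wr], frame, FRAME_BYTES);
    wr = (wr + 1) % RING_FRAMES;
}

/* After 2 s of post-trigger recording, copy CAPTURE_FRAMES frames,
 * starting PRE_TRIGGER frames before the saved trigger position,
 * into a linear buffer ready to send to the server. */
void collect(int trigger_pos, uint8_t *out)
{
    int start = (trigger_pos - PRE_TRIGGER + RING_FRAMES) % RING_FRAMES;
    for (int i = 0; i < CAPTURE_FRAMES; i++)
        memcpy(out + i * FRAME_BYTES,
               ring[(start + i) % RING_FRAMES], FRAME_BYTES);
}
```

The pre-trigger margin is what keeps the first phonemes of the phrase, since the loudness check inevitably fires a little after speech has started.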

Video of the VAD in operation:


The program is written in IAR.
The program can play back the recorded phrase before sending it to the server. To do this, just uncomment a few lines in the main function.

There are several projects in the attached archive:
my_audio_test - records and immediately plays back sound from the microphone.
speex_out - plays Speex audio stored in the controller's flash memory.
speex_rec - records and encodes sound with Speex for 2 seconds, then plays the recording back.
speech_wifi - the speech recognition project itself; this project uses WiFi.
speech_wifi_vad - the speech recognition project with VAD; this project uses WiFi.

www.dropbox.com/s/xke5rq8lzi980x5/NEW_VOICE.zip?dl=0

Source: https://habr.com/ru/post/146501/

