Voice-controlled smart extension cord for Christmas garlands (esp8266 + stm32)

Hi, Habr. Last year I built a "smart" power strip for controlling Christmas tree garlands, but never got around to writing an article about it. Time to fix that.

The Christmas tree itself

There are 3 garlands on the Christmas tree and a family of glowing white bears under it. With that many garlands, the question is how to control them: crawling under the tree every time to plug the right garlands in or out is a dubious pleasure.

Of course, plenty of "smart sockets" are on sale, but I have not seen one with voice control and 4 sockets in a single device, without extra wires and power supplies.

So let's start

A "Pilot" extension cord turned out to be the ideal enclosure. I spotted one in the nearest shop and bought it.

A 5 V power supply and a shield with 4 relays turned up in my stock of small parts bought on AliExpress.

Now for the most important part, the "brains". The brains of the project are a Wiieva board, which has everything we need: a screen, a microphone, WiFi, and an Arduino form factor compatible with the shield. The WiFi module is built on the hugely popular esp8266; peripheral control and audio processing run on an stm32f105rbt.

Assembling the smart extension cord

We cut a hole in the enclosure for the screen. The cutout cost us one socket and the old switch, a small loss.
On the back of the case, double-sided tape goes along the bottom so that the board sits more tightly.

We split the bus that the sockets are connected to and run separate wires from each socket. Then we mount the power section with the power supply.

Connecting the brains: we attach the Wiieva board and the relay shield

We put all the components in place

Top view of the assembled "smart extension cord"

A bit of aesthetics: we print a lid on a 3D printer

The result

How the software and hardware work

The hardest part of the project is the components responsible for audio input and output. In general, there are two main approaches to recording and recognizing speech:

On-device recognition
Pros: no Internet connection required
Cons: needs a lot of computing power, very limited vocabulary, high error rate
Recognition in the cloud, for example Google or Yandex
Pros: good quality, practically unlimited vocabulary
Cons: Internet connection required, higher latency

For an IoT device whose processor has 64 KB of RAM and runs at 160 MHz, reliable on-board recognition of voice commands is out of reach. At best it can be taught to recognize a few words, and only after being trained on your voice.

That is why speech recognition is done with the Google speech recognition service. Recording sound from a microphone and sending it to Google speech recognition does not sound like a hard task. On an esp8266-based device, however, it is far from trivial.

The esp8266 has no decent ADC, and the one on board is technically incapable of recording anything but noise. So, to capture sound good enough for speech recognition, you need at least an external ADC or, better, an external processor with the microphone connected to it. After trying several options, I settled on an stm32 plus a digital PDM microphone.

The next task is control and data transfer between the stm32 and the esp8266. UART and I2C were ruled out immediately as too slow, and I decided to use SPI. SPI is a synchronous interface with fixed roles: master and slave. In the stm32 + esp8266 pair, the main program logic runs on the esp8266, while the stm32 is a coprocessor that works with the peripherals. It is therefore logical to make the esp8266 the master and the stm32 the slave.

This combination gave a good result: clean sound from the microphone, with no interference or extraneous noise. Alas, the audio idyll lasted only until I started sending the captured sound over WiFi to Google via an HTTP connection.

An unpleasant thing happened: as long as there is no active WiFi transmission, the sound is recorded in perfect quality. As soon as active packet transmission over WiFi begins, the sound is immediately distorted by crackling. The oscilloscope showed that active WiFi transmission puts substantial noise on the power bus, and filtering it out at the circuit level is difficult.

So, as often happens, the hardware problem had to be cured in software. The logical solution is to store the sound in a buffer and send it to the cloud over HTTP once the phrase is finished. Storing it in a buffer sounds like no big deal, but then we remember that we have only 40 KB of free RAM. Even at an 8 kHz sampling rate, 40 KB holds only a little over 2 seconds of uncompressed speech. That is not enough.

The solution was to compress the sound with the SPEEX codec first: it yields a stream of about 2 KB per second, which is more than enough to hold any voice command entirely in memory. The end of the phrase is detected with a VAD (Voice Activity Detector) algorithm.
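A rough check of this buffer arithmetic can be sketched as follows. The 8 kHz rate, the 40 KB of free RAM and the 2 KB per second Speex rate come from the text; the 16-bit mono sample width is my assumption, and the helper name capacityMs is hypothetical.

```cpp
#include <cstdint>

// Figures taken from the text.
constexpr std::uint32_t kFreeRamBytes   = 40 * 1024; // about 40 KB of free RAM
constexpr std::uint32_t kSampleRateHz   = 8000;      // digitization frequency
constexpr std::uint32_t kSpeexBytesPerS = 2000;      // about 2 KB per second after Speex

// Assumption: 16-bit mono PCM (the sample width is not stated in the article).
constexpr std::uint32_t kBytesPerSample = 2;

// Milliseconds of audio that fit into bufferBytes at the given byte rate.
constexpr std::uint32_t capacityMs(std::uint32_t bufferBytes, std::uint32_t bytesPerSecond) {
    return static_cast<std::uint32_t>(
        static_cast<std::uint64_t>(bufferBytes) * 1000 / bytesPerSecond);
}

// Uncompressed PCM: 16000 bytes per second.
constexpr std::uint32_t kRawMs   = capacityMs(kFreeRamBytes, kSampleRateHz * kBytesPerSample);
// Speex-compressed stream: 2000 bytes per second.
constexpr std::uint32_t kSpeexMs = capacityMs(kFreeRamBytes, kSpeexBytesPerS);
```

Under these assumptions the raw stream fills the free RAM in roughly two and a half seconds, while the compressed stream leaves room for about twenty seconds, so a buffer of 2000*5 bytes is enough for a five-second command.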

Voilà: this design worked and began to recognize spoken phrases confidently.

About the Wiieva board

A lyrical digression is probably in order here. Those who have read this far will most likely be asking: was all this effort really just for voice control of a Christmas tree? The simple answer is, of course, no. A couple of years ago, when the esp8266 first came into my hands, I had the idea of attaching cloud speech recognition to it. In my free time I slowly worked on the project together with an electronics engineer I know, and the result was the Wiieva board and the configuration described above. Along the way the project picked up a bunch of features, for example an mp3 player with a speaker, an Arduino-compatible form factor, temperature / humidity / pressure sensors, a touch screen, USB, an IR diode and a MicroSD slot.

So the main work had been done before the “Christmas Tree” project started, and the “Christmas Tree” project itself took a couple of evenings, including the surgery of fitting the board into the extension cord and writing a sketch with the high-level logic.

Sketch with logic

The program is written as a sketch for the esp8266 Arduino environment.

Besides recognizing voice commands, the sketch has a UI: a screensaver with a pretty Christmas tree and a control screen with on / off buttons for the garlands.
Besides local control, there is an HTTP API for switching the garlands on and off, so that the tree can be controlled from the common smart home interface.
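Voice commands, the touch buttons and the HTTP API all end up switching the same four relays, and the sketch passes the action around as one of the strings "on", "off" or "toggle". A minimal sketch of that shared state logic is below. The RelayBank class and its methods are hypothetical names of mine, and a real sketch would also drive the relay pins.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

constexpr std::size_t kRelayCount = 4; // the shield carries 4 relays

// Hypothetical in-memory model of the relay shield (no GPIO access here).
class RelayBank {
public:
    // Applies "on", "off" or "toggle" to one relay; returns false on bad input.
    bool apply(std::size_t index, const std::string& cmd) {
        if (index >= kRelayCount) return false;
        if (cmd == "on") state_[index] = true;
        else if (cmd == "off") state_[index] = false;
        else if (cmd == "toggle") state_[index] = !state_[index];
        else return false;
        return true;
    }

    // Applies the same command to every relay.
    bool applyAll(const std::string& cmd) {
        bool ok = true;
        for (std::size_t i = 0; i < kRelayCount; ++i) ok = apply(i, cmd) && ok;
        return ok;
    }

    bool isOn(std::size_t index) const {
        assert(index < kRelayCount);
        return state_[index];
    }

private:
    std::array<bool, kRelayCount> state_{}; // all relays start switched off
};
```

Keeping the state in one place means every control path can call the same two functions, whichever interface the request came from.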

As a bonus, the sketch can play mp3 files from a microSD card; I put several Christmas tunes on it. An extra feature, so to speak, to keep up the New Year mood.

Sketch Sources

Initialization and start of recognition

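// Include the recorder library
#include <WiievaRecorder.h>

// Recorder buffer: 2000*5 bytes, about 5 seconds at the Speex rate of 2 KB per second
WiievaRecorder recorder (2000*5);

// Recording start time and time of the last voice activity
unsigned long timeRecorderStart = 0, timeRecorderEnd = 0;
bool wasVAD = false;

void startRecognize () {
    // Start recording from the microphone in Speex format
    recorder.start (AIO_AUDIO_IN_SPEEX);
    Serial.printf ("Start recording\n");
    timeRecorderStart = millis();
    timeRecorderEnd = 0;
    wasVAD = false;
}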

Recognition and command execution

void processRecognize () {
    if (!timeRecorderStart) {
        return;
    }

    // Run the recorder and check for voice activity
    bool res = recorder.run ();
    bool vad = recorder.checkVad();
    if (vad && !wasVAD) {
        Serial.printf("VAD: speech started\n");
    }
    wasVAD = wasVAD || vad;

    // During the first 3 seconds, or while speech is detected, push the end time forward
    if (millis () - timeRecorderStart < 3000 || vad)
        timeRecorderEnd = millis ();

    // Keep recording until there has been no activity for 500 ms
    if (res && (!timeRecorderEnd || millis () - timeRecorderEnd < 500))
        return;

    recorder.stop();
    timeRecorderStart = 0;

    if (!wasVAD) {
        // No speech was detected: nothing to send
        return;
    }

    // Send the recording to Google speech recognition with an HTTP POST
    HTTPClient http;
    http.begin(url);
    http.addHeader ("Content-Type","audio/x-speex-with-header-byte; rate=8000");
    int httpCode = http.sendRequest ("POST",&recorder,recorder.recordedSize());
    if(httpCode > 0) {
        Serial.printf("[HTTP] POST... code: %d\n", httpCode);
        String payload = http.getString();
        Serial.println(payload);

        String cmd = "toggle";
        // Instead of parsing the JSON, simply look for keywords in the response
        // (the keyword literals are empty in this listing)
        if (payload.indexOf ("")>=0 || payload.indexOf ("")>=0)
            cmd = "off";
        else if (payload.indexOf ("")>=0 || payload.indexOf ("")>=0)
            cmd = "on";

        if (payload.indexOf ("")>=0)
            startPlay();
        else if (payload.indexOf ("")>=0)
            controlAllRelay (cmd);
        else {
            // Individual garlands
            if (payload.indexOf ("")>=0) controlRelay (0,cmd);
            if (payload.indexOf ("")>=0) controlRelay (1,cmd);
            if (payload.indexOf ("")>=0|| payload.indexOf ("")>=0) controlRelay (2,cmd);
            if (payload.indexOf ("")>=0) controlRelay (3,cmd);
        }
    }
    http.end();
}
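The timing rule buried in processRecognize can be isolated into a small class, which makes it easier to test off the device. This is a restatement of the logic above, not the sketch's actual code: the class name, method names and constants are mine, while the 3000 ms and 500 ms values are the ones used in the listing.

```cpp
#include <cassert>
#include <cstdint>

// Decides when a recording should stop, following the rule in processRecognize:
// keep going for the first 3 seconds, then stop after 500 ms without speech.
class PhraseEndDetector {
public:
    explicit PhraseEndDetector(std::uint32_t startMs)
        : startMs_(startMs), lastActiveMs_(startMs) {}

    // Returns true while recording should continue.
    // bufferHasRoom plays the role of the recorder.run() result.
    bool update(std::uint32_t nowMs, bool vad, bool bufferHasRoom) {
        heardSpeech_ = heardSpeech_ || vad;
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (nowMs - startMs_ < kGraceMs || vad) lastActiveMs_ = nowMs;
        return bufferHasRoom && (nowMs - lastActiveMs_ < kHangoverMs);
    }

    // True if speech was detected at any point; otherwise there is nothing to send.
    bool heardSpeech() const { return heardSpeech_; }

private:
    static constexpr std::uint32_t kGraceMs = 3000;   // time allowed to start speaking
    static constexpr std::uint32_t kHangoverMs = 500; // silence that ends the phrase
    std::uint32_t startMs_;
    std::uint32_t lastActiveMs_;
    bool heardSpeech_ = false;
};
```

In a loop this would be called once per iteration with the current millis() value, and the recording is stopped and sent as soon as update returns false and heardSpeech is true.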

Under the hood

Sound digitization

The PDM microphone is connected to the SPI / I2S2 interface of the stm32 processor. As a reference I used this Application Note from ST.

To avoid loading the processor, data from I2S is received via DMA into a circular ping-pong buffer.
The received PDM data is processed in DMA interrupts. Working with DMA interrupts is fairly standard for stm32:
there are two interrupt flags, one for the filling of the lower half of the buffer and one for the upper half. The interrupt handler selects the half of the buffer that already holds ready data.
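The half-buffer selection can be modelled on a desktop machine without any stm32 registers. The sketch below only illustrates the ping-pong idea: the class and method names are hypothetical, the buffer is tiny for readability, and on real hardware the two callbacks would be invoked from the DMA half-transfer and transfer-complete interrupts.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHalfWords = 4; // tiny on purpose; real buffers are larger

// Host-side model of a circular DMA buffer split into two halves.
class PingPongBuffer {
public:
    // Memory the DMA controller would fill in circular mode.
    std::array<std::uint16_t, 2 * kHalfWords> dma{};

    // "Half transfer" event: the lower half is complete, DMA now fills the upper half.
    void onHalfTransfer() { ready_ = dma.data(); }

    // "Transfer complete" event: the upper half is complete, DMA wraps to the lower half.
    void onTransferComplete() { ready_ = dma.data() + kHalfWords; }

    // Returns the finished half exactly once, or nullptr if nothing new is pending.
    // On a real MCU this hand-off would need a volatile flag or a critical section.
    const std::uint16_t* takeReady() {
        const std::uint16_t* half = ready_;
        ready_ = nullptr;
        return half;
    }

private:
    const std::uint16_t* ready_ = nullptr;
};
```

The point of the scheme is that the processing code always reads the half the DMA controller has just left, so the two never touch the same memory at the same time as long as processing finishes within one half-buffer period.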

The buffer is then converted from PDM to ordinary PCM: a set of samples (signal level values) at the required sampling rate.
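To show what such a conversion involves, here is a deliberately simplified sketch that turns the density of ones in the PDM bit stream into a signed 16-bit sample. It uses a plain boxcar (counting) decimator, which is my simplification and not the filter from the ST Application Note; the function name and the bitsPerSample parameter are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Collapses every bitsPerSample PDM bits into one signed 16-bit PCM sample.
// A production filter would use a proper low-pass stage; this one only counts ones.
std::vector<std::int16_t> pdmToPcm(const std::vector<std::uint8_t>& pdm,
                                   std::size_t bitsPerSample) {
    assert(bitsPerSample > 0 && bitsPerSample % 8 == 0);
    const std::size_t bytesPerSample = bitsPerSample / 8;

    std::vector<std::int16_t> pcm;
    pcm.reserve(pdm.size() / bytesPerSample);

    // Trailing bytes that do not fill a whole sample are ignored.
    for (std::size_t i = 0; i + bytesPerSample <= pdm.size(); i += bytesPerSample) {
        std::size_t ones = 0;
        for (std::size_t j = 0; j < bytesPerSample; ++j) {
            std::uint8_t b = pdm[i + j];
            while (b != 0) {
                ones += b & 1u;
                b >>= 1;
            }
        }
        // Map the density of ones to the signed range: none -> -32767, all -> +32767.
        const long centered = 2 * static_cast<long>(ones) - static_cast<long>(bitsPerSample);
        pcm.push_back(static_cast<std::int16_t>(
            centered * 32767 / static_cast<long>(bitsPerSample)));
    }
    return pcm;
}
```

With bitsPerSample set to 64, a run of all-ones bytes maps to full positive scale, all-zero bytes to full negative scale, and an alternating pattern such as 0xAA to silence.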

After conversion and resampling, the PCM data is appended to the pdm_samples_buf ring buffer.
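A ring buffer of this kind decouples the producer (the interrupt handler) from the consumer (the main loop). The fixed-size version below is a generic sketch, not the firmware's pdm_samples_buf implementation; the class and its interface are assumptions, and it leaves out the volatile or atomic handling that a producer running in an interrupt would need.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity FIFO for one producer and one consumer.
template <typename T, std::size_t N>
class RingBuffer {
    static_assert(N > 0, "capacity must be positive");

public:
    // Adds an element; returns false (and drops it) when the buffer is full.
    bool push(const T& value) {
        if (count_ == N) return false;
        data_[head_] = value;
        head_ = (head_ + 1) % N;
        ++count_;
        return true;
    }

    // Removes the oldest element into out; returns false when the buffer is empty.
    bool pop(T& out) {
        if (count_ == 0) return false;
        out = data_[tail_];
        tail_ = (tail_ + 1) % N;
        --count_;
        return true;
    }

    std::size_t size() const { return count_; }

private:
    std::array<T, N> data_{};
    std::size_t head_ = 0;  // next write position
    std::size_t tail_ = 0;  // next read position
    std::size_t count_ = 0; // number of stored elements
};
```

Dropping samples when the buffer is full is one possible policy; it keeps the producer from ever blocking, at the cost of a gap in the audio if the consumer falls behind.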

Speex coding

The next stage of the pipeline is compressing the sound with the SPEEX codec. Encoding audio is a resource-intensive process that takes a lot of CPU time, so calling it from the interrupt handler is not a good idea.

Therefore, the encoding happens asynchronously, in the main program loop (code, part two).

Alongside the SPEEX encoding, the VAD algorithm analyzes the signal for voice activity.
The Speex-encoded speech is put into another ring buffer, speex_buf, from which it is then transferred to the esp8266.

Transferring the encoded buffer from stm32 to esp8266

The interface between the esp8266 and the stm32 follows a command -> response principle: the esp8266 sends a command, the stm32 processes it and returns a response. For some commands, a data buffer is transmitted along with the command body and / or the response body.

On the esp8266 side, the algorithm turned out to be very simple:
send the command to read the data buffer, then read the data.

This is what it looks like on the esp8266 side:
recorder code
SPI code

On the stm32 side, the task looks more complicated:
the SPI interrupt handler parses the command code and performs the required actions depending on it. In our case, that means sending data from the SPEEX ring buffer to SPI.
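The command -> response exchange can be modelled on a desktop machine with both sides as plain C++ objects. Everything specific in this sketch is an assumption made for illustration: the command code 0x01, the reply layout of one length byte followed by the payload, the 32-byte chunk limit and all names. The real Wiieva protocol may differ on every one of these points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

constexpr std::uint8_t kCmdReadAudio = 0x01; // hypothetical command code
constexpr std::size_t kMaxChunk = 32;        // hypothetical payload limit per reply

// Models the stm32 (slave) side: owns the encoded audio queue and answers commands.
class Stm32Side {
public:
    // Stands in for the Speex encoder filling the ring buffer.
    void feed(const std::vector<std::uint8_t>& encoded) {
        queue_.insert(queue_.end(), encoded.begin(), encoded.end());
    }

    // Called when a command byte arrives; replies with [length][payload...].
    std::vector<std::uint8_t> onCommand(std::uint8_t cmd) {
        std::vector<std::uint8_t> reply;
        if (cmd != kCmdReadAudio) {
            reply.push_back(0); // unknown command: empty payload
            return reply;
        }
        const std::size_t n = std::min(queue_.size(), kMaxChunk);
        reply.push_back(static_cast<std::uint8_t>(n));
        for (std::size_t i = 0; i < n; ++i) {
            reply.push_back(queue_.front());
            queue_.pop_front();
        }
        return reply;
    }

private:
    std::deque<std::uint8_t> queue_;
};

// Models the esp8266 (master) side: sends the read command and appends the payload.
std::size_t readAudio(Stm32Side& slave, std::vector<std::uint8_t>& out) {
    const std::vector<std::uint8_t> reply = slave.onCommand(kCmdReadAudio);
    assert(!reply.empty() && reply.size() == 1u + reply[0]);
    out.insert(out.end(), reply.begin() + 1, reply.end());
    return reply[0];
}
```

The master simply repeats readAudio until it gets a zero-length reply, which matches the division of labour in the text: the polling side stays trivial, and the side that owns the data does the bookkeeping.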

Instead of a conclusion

Many interesting details, such as playing mp3, hooking up a graphics library, implementing the screen and touch panel drivers, integrating with a smart home and much, much more, had to be left out of this article; otherwise there would have been too much text.

I still plan to finish activating speech recognition with a hot word, for example "Christmas tree". To do this, I plan to bring a small piece of pocketsphinx on board and do something like MFCC + DTW on the device ...
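For readers unfamiliar with DTW (dynamic time warping), the sketch below shows the textbook algorithm on sequences of single numbers. It is not code from the project: a hot-word detector would compare sequences of MFCC vectors rather than scalars, and the function name is mine.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Classic DTW distance between two non-empty sequences, using |x - y| as the
// local cost and two rolling rows instead of the full cost matrix.
double dtwDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    const double inf = std::numeric_limits<double>::infinity();

    std::vector<double> prev(b.size() + 1, inf);
    std::vector<double> cur(b.size() + 1, inf);
    prev[0] = 0.0; // aligning two empty prefixes costs nothing

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = inf; // a non-empty prefix of a cannot match an empty prefix of b
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const double cost = std::fabs(a[i - 1] - b[j - 1]);
            // Extend the cheapest of: step in a, step in b, or step in both.
            cur[j] = cost + std::min({prev[j], cur[j - 1], prev[j - 1]});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

The appeal for a hot word is that the same word spoken faster or slower still yields a small distance to the stored template, because the alignment is allowed to stretch in time.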