
Ultrafast speech recognition without servers using a real example


In this article I will show in detail how to quickly and correctly wire up Russian speech recognition based on the Pocketsphinx engine (for iOS there is the OpenEars port), using a real hello-world example of controlling home appliances.
Why home appliances? Because such an example lets you evaluate the speed and accuracy that can be achieved with completely local speech recognition, without servers such as Google ASR or Yandex SpeechKit.
I also attach the complete source code of the program and an Android build to the article.



Why should I?


Having stumbled upon an article about bolting Yandex SpeechKit onto an iOS application, I asked the author why he wanted to use server-based speech recognition for his program (in my opinion it was unnecessary and led to some problems). In return I got a question: could I describe in more detail the use of alternative methods for projects where there is no need to recognize arbitrary text and the dictionary consists of a finite set of words, preferably with an example of practical application?

Why do we need something else besides Yandex and Google?


As that very “practical application” I chose the topic of voice control of a smart home.
Why this example? Because it shows several advantages of completely local speech recognition over recognition using cloud solutions: above all speed, accuracy, independence from the network, and voice activation that works out of the box.

Note
Let me say right away that these advantages can be considered advantages only for a certain class of projects, where we know in advance exactly which dictionary and grammar the user will operate with; that is, when we do not need to recognize arbitrary text (for example, an SMS message or a search query). Otherwise, cloud recognition is indispensable.


But Android can already recognize speech without the Internet!

Yes... but only on Jelly Bean, and only from half a meter away, no more. And that recognition is the same dictation, just using a much smaller model, so we can neither control it nor tune it. What it will return next time is unknown. Although for SMS it is just right!

What do we do?



We will implement a voice remote control for home appliances that works accurately and quickly, from several meters away, even on cheap, sluggish and very inexpensive Android smartphones, tablets and watches.
The logic will be simple but very practical. We activate the microphone and pronounce one or more device names. The application recognizes them and switches them on or off depending on their current state, or asks them for their state and announces it in a pleasant female voice, for example the current temperature in the room.

We will activate the microphone either by voice, by tapping the microphone icon, or even simply by putting a hand over the screen. The screen itself can be completely turned off.

There are plenty of practical use cases:
In the morning, without opening your eyes, you slap your palm onto the smartphone on the bedside table and command “Good morning!”: a script runs, the coffee maker switches on and starts buzzing, pleasant music plays, the curtains slide apart.
Or you hang a cheap smartphone (a couple of thousand rubles, no more) on the wall in every room. You come home after work and command into the emptiness “Smart home! Light, TV!” What happens next, I think, needs no explanation.


The video shows what came out in the end. Below we discuss the technical implementation, with excerpts from actually working code and a bit of theory.

What is Pocketsphinx



Pocketsphinx is an open-source speech recognition engine for Android. It also has ports for iOS, Windows Phone, and even JavaScript.
It allows us to run speech recognition directly on the device and, at the same time, tailor it specifically to our task. It also offers voice activation “out of the box” (see below).

We will feed the recognition engine a Russian language model (you can find it in the source code) and a grammar of user queries. This is exactly what our application will recognize; it cannot recognize anything else, and therefore will almost never return something we do not expect.

JSGF Grammar
Pocketsphinx, like many other similar projects, uses the JSGF grammar format. It lets you describe quite flexibly the phrases the user will pronounce. In our case the grammar is built from the names of the devices present in our network, something like this (the device names below are placeholders):
<commands> = device1 | device2 | device3;



Pocketsphinx can also work with a statistical language model, which makes it possible to recognize spontaneous speech that is not described by a context-free grammar. But for our task this is simply not needed: our grammar consists only of device names. After recognition, Pocketsphinx returns an ordinary string of text in which the devices are listed one after another.

 #JSGF V1.0;
 grammar commands;
 public <command> = <commands>+;
 <commands> = device1 | device2 | device3;


The plus sign means that the user can name not one, but several devices in a row.
The application receives the list of devices from the smart home controller (see below) and generates such a grammar in the Grammar class.
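
The Grammar class itself is in the project sources; purely as a rough illustration of the idea (the class and method names below are made up for this sketch, not the project's actual API), generating such a JSGF string from a list of device names could look like this:

 // Illustrative sketch only: building a JSGF grammar string from device names.
 // The project's real Grammar class may differ.
 public class GrammarSketch {
     public static String buildJsgf(java.util.List<String> deviceNames) {
         StringBuilder sb = new StringBuilder();
         sb.append("#JSGF V1.0;\n");
         sb.append("grammar commands;\n");
         // "+" lets the user name several devices in one utterance
         sb.append("public <command> = <commands>+;\n");
         sb.append("<commands> = ");
         for (int i = 0; i < deviceNames.size(); i++) {
             if (i > 0) sb.append(" | ");
             sb.append(deviceNames.get(i).toLowerCase());
         }
         sb.append(";\n");
         return sb.toString();
     }
 }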

Transcriptions



A grammar describes what the user can say. For Pocketsphinx to know how it will be pronounced, every word in the grammar needs a description of how it sounds in the corresponding language model, that is, a transcription. The set of transcriptions is called a dictionary.

Transcriptions are described using a special syntax. For example, for the words “умный дом” (“smart home”):

 умный   uu m n ay j
 дом     d oo m


In principle, nothing complicated. A doubled vowel in a transcription denotes stress. A doubled consonant denotes a soft consonant followed by a vowel. All possible combinations for all the sounds of the Russian language can be found in the language model itself.

Clearly, we cannot describe all the transcriptions in our application in advance, because we do not know beforehand what names the user will give their devices. So we generate such transcriptions “on the fly”, following some rules of Russian phonetics. The PhonMapper class does exactly that: it takes a string as input and generates the correct transcription for it.
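
Just to illustrate the idea of such on-the-fly transcription (this is a deliberately naive sketch, not the project's actual PhonMapper, and it ignores stress marking and most phonetic rules):

 // Naive illustration of letter-to-phoneme mapping for Russian words.
 // The real PhonMapper in the project applies far more complete rules.
 import java.util.HashMap;
 import java.util.Map;

 public class PhonMapperSketch {
     private static final Map<Character, String> MAP = new HashMap<>();
     static {
         MAP.put('у', "u");  MAP.put('м', "m");  MAP.put('н', "n");
         MAP.put('ы', "ay"); MAP.put('й', "j");  MAP.put('д', "d");
         MAP.put('о', "o");  MAP.put('а', "a");  MAP.put('в', "v");
         // ... the remaining letters are omitted for brevity
     }

     // Returns a space-separated phoneme string for a single word.
     public static String transcribe(String word) {
         StringBuilder sb = new StringBuilder();
         for (char c : word.toLowerCase().toCharArray()) {
             String phone = MAP.get(c);
             if (phone == null) continue;  // skip characters we do not know
             if (sb.length() > 0) sb.append(' ');
             sb.append(phone);
         }
         return sb.toString();
     }
 }

With such a mapping, transcribe("дом") would return "d o m"; the real class additionally handles stress (the doubled vowel, giving "d oo m") and soft consonants.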

Voice activation


Voice activation is the ability of the speech recognition engine to “listen to the air” all the time in order to react to a predetermined phrase (or phrases), discarding all other sounds and speech. This is not the same as describing a grammar and simply turning the microphone on. I will not go into the theory of this problem or the mechanics of how it works here. I will only say that the programmers working on Pocketsphinx implemented such a function recently, and it is now available out of the box in the API.

One thing is definitely worth mentioning. For an activation phrase you need to specify not only the transcription but also a suitable threshold value. Too small a value leads to many false positives (when you did not say the activation phrase but the system recognizes it anyway); too large a value makes the system immune to the phrase, so it never triggers. This setting is therefore particularly important. The approximate range of values is from 1e-1 to 1e-40, depending on the activation phrase.

Activation by proximity sensor
This task is specific to our project and is not directly related to recognition. The code can be seen right in the main activity .
The activity implements a SensorEventListener and, when an approach is detected (the sensor value is less than its maximum range), starts a timer that checks after a short delay whether the sensor is still blocked. This is done to eliminate false positives.
When the sensor is unblocked again, we stop recognition and get the result (see the description below).
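
A simplified sketch of this idea might look as follows (mHandler, the delay value and the callback name here are illustrative; the project's actual code in the main activity may differ):

 // Sketch of proximity-sensor activation: start command recognition while the
 // sensor is covered, stop and fetch the result when it is uncovered again.
 private final SensorEventListener mProximityListener = new SensorEventListener() {
     @Override
     public void onSensorChanged(SensorEvent event) {
         boolean covered = event.values[0] < event.sensor.getMaximumRange();
         if (covered) {
             // Re-check after a short delay to filter out accidental triggers
             mHandler.postDelayed(mStartIfStillCoveredCallback, 500);
         } else {
             mHandler.removeCallbacks(mStartIfStillCoveredCallback);
             mRecognizer.stop();  // stopping delivers the final result to onResult
         }
     }

     @Override
     public void onAccuracyChanged(Sensor sensor, int accuracy) { }
 };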


We start recognition


Pocketsphinx provides a convenient API for configuring and running the recognition process. These are the SpeechRecognizer and SpeechRecognizerSetup classes.
Here is how recognition is configured and launched:

 // Generate the grammar and dictionary for the device names and the hotword
 PhonMapper phonMapper = new PhonMapper(getAssets().open("dict/ru/hotwords"));
 Grammar grammar = new Grammar(names, phonMapper);
 grammar.addWords(hotword);

 // Paths to the acoustic model, dictionary and grammar files on disk
 DataFiles dataFiles = new DataFiles(getPackageName(), "ru");
 File hmmDir = new File(dataFiles.getHmm());
 File dict = new File(dataFiles.getDict());
 File jsgf = new File(dataFiles.getJsgf());
 copyAssets(hmmDir);
 saveFile(jsgf, grammar.getJsgf());
 saveFile(dict, grammar.getDict());

 // Configure the engine: acoustic model, dictionary and keyword threshold
 mRecognizer = SpeechRecognizerSetup.defaultSetup()
         .setAcousticModel(hmmDir)
         .setDictionary(dict)
         .setBoolean("-remove_noise", false)
         .setKeywordThreshold(1e-7f)
         .getRecognizer();

 // Two searches: one for the activation phrase, one for the command grammar
 mRecognizer.addKeyphraseSearch(KWS_SEARCH, hotword);
 mRecognizer.addGrammarSearch(COMMAND_SEARCH, jsgf);


Here we first copy all the necessary files to disk (Pocketsphinx requires the acoustic model, grammar and dictionary with transcriptions to be on disk). Then the recognition engine itself is configured: the paths to the model and dictionary files are specified, along with some parameters (the sensitivity threshold for the activation phrase). Next, the path to the grammar file is set, as well as the activation phrase.

As you can see from this code, a single engine is configured at once both for the grammar and for recognition of the activation phrase. Why? So that we can quickly switch between what we need to recognize at any given moment. Here is how recognition of the activation phrase is started:

 mRecognizer.startListening(KWS_SEARCH); 

And this is how speech is recognized against the given grammar:

 mRecognizer.startListening(COMMAND_SEARCH, 3000); 

The second (optional) argument is the number of milliseconds after which recognition terminates automatically if nobody says anything.
As you can see, a single engine can be used to solve both tasks.

How to get the recognition result


To get the recognition result, you must also register an event listener that implements the RecognitionListener interface.
It has several methods that Pocketsphinx calls when one of the recognition events occurs, such as onBeginningOfSpeech, onEndOfSpeech, onPartialResult and onResult.


By implementing the onPartialResult and onResult methods one way or another, you can change the recognition logic and obtain the final result. Here is how it is done in our application:

 @Override
 public void onEndOfSpeech() {
     Log.d(TAG, "onEndOfSpeech");
     if (mRecognizer.getSearchName().equals(COMMAND_SEARCH)) {
         mRecognizer.stop();
     }
 }

 @Override
 public void onPartialResult(Hypothesis hypothesis) {
     if (hypothesis == null) return;
     String text = hypothesis.getHypstr();
     if (KWS_SEARCH.equals(mRecognizer.getSearchName())) {
         startRecognition();
     } else {
         Log.d(TAG, text);
     }
 }

 @Override
 public void onResult(Hypothesis hypothesis) {
     mMicView.setBackgroundResource(R.drawable.background_big_mic);
     mHandler.removeCallbacks(mStopRecognitionCallback);
     String text = hypothesis != null ? hypothesis.getHypstr() : null;
     Log.d(TAG, "onResult " + text);
     if (COMMAND_SEARCH.equals(mRecognizer.getSearchName())) {
         if (text != null) {
             Toast.makeText(this, text, Toast.LENGTH_SHORT).show();
             process(text);
         }
         mRecognizer.startListening(KWS_SEARCH);
     }
 }


When we receive the onEndOfSpeech event while we are recognizing a command to be executed, we must stop recognition; onResult will then be called immediately.
In onResult we check what has just been recognized. If it is a command, we run it and switch the engine back to recognizing the activation phrase.
In onPartialResult we are only interested in recognition of the activation phrase. If we detect it, we immediately start the command recognition process. Here is what that looks like:

 private synchronized void startRecognition() {
     if (mRecognizer == null || COMMAND_SEARCH.equals(mRecognizer.getSearchName())) return;
     mRecognizer.cancel();
     new ToneGenerator(AudioManager.STREAM_MUSIC, ToneGenerator.MAX_VOLUME)
             .startTone(ToneGenerator.TONE_CDMA_PIP, 200);
     post(400, new Runnable() {
         @Override
         public void run() {
             mMicView.setBackgroundResource(R.drawable.background_big_mic_green);
             mRecognizer.startListening(COMMAND_SEARCH, 3000);
             Log.d(TAG, "Listen commands");
             post(4000, mStopRecognitionCallback);
         }
     });
 }

Here we first play a short beep to let the user know that we heard them and are ready for their command. During this time the microphone should be off, so we start recognition after a small timeout (slightly longer than the beep, so as not to hear its echo). We also schedule a callback that will forcibly stop recognition if the user speaks for too long; in this case it is 3 seconds.

How to turn a recognized string into commands


Well, here everything is specific to the particular application. In our bare-bones example we simply pull the device names out of the recognized string, look for the device we need, and either change its state with an HTTP request to the smart home controller or report its current state (as in the case of the thermostat). This logic can be seen in the Controller class.
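
Purely as an illustration of what such logic could look like (the Device type, the device list and the controller URL below are assumptions made for this sketch, not the project's actual Controller API), the process(text) method called from onResult might do something along these lines:

 // Illustrative sketch: map the recognized string to devices and fire HTTP requests.
 // Device, mDevices and the controller endpoint are invented for this example.
 private void process(String recognizedText) {
     for (final Device device : mDevices) {
         if (recognizedText.contains(device.getName().toLowerCase())) {
             new Thread(new Runnable() {
                 @Override
                 public void run() {
                     try {
                         // Hypothetical smart home controller endpoint that toggles a device
                         URL url = new URL("http://controller.local/toggle?device=" + device.getId());
                         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                         conn.getResponseCode();  // send the request
                         conn.disconnect();
                     } catch (IOException e) {
                         Log.e(TAG, "Failed to reach the controller", e);
                     }
                 }
             }).start();
         }
     }
 }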

How to synthesize speech


Speech synthesis is the inverse of recognition: here you need to turn a string of text into speech so that the user can hear it.
In the case of the thermostat, we need to make our Android device speak the current temperature. Using the TextToSpeech API this is quite easy to do (thanks to Google for the beautiful female TTS voice for Russian):

 private void speak(String text) {
     synchronized (mSpeechQueue) {
         mRecognizer.stop();
         mSpeechQueue.add(text);
         HashMap<String, String> params = new HashMap<String, String>(2);
         params.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID, UUID.randomUUID().toString());
         params.put(TextToSpeech.Engine.KEY_PARAM_STREAM, String.valueOf(AudioManager.STREAM_MUSIC));
         params.put(TextToSpeech.Engine.KEY_FEATURE_NETWORK_SYNTHESIS, "true");
         mTextToSpeech.speak(text, TextToSpeech.QUEUE_ADD, params);
     }
 }
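
The snippet above assumes that mTextToSpeech has already been created; a minimal initialization sketch (not necessarily how the project does it) could look like this:

 // Minimal TextToSpeech initialization sketch; mUtteranceCompletedListener is shown below.
 mTextToSpeech = new TextToSpeech(this, new TextToSpeech.OnInitListener() {
     @Override
     public void onInit(int status) {
         if (status == TextToSpeech.SUCCESS) {
             mTextToSpeech.setLanguage(new Locale("ru"));
             mTextToSpeech.setOnUtteranceCompletedListener(mUtteranceCompletedListener);
         }
     }
 });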


It is probably stating the obvious, but recognition must be turned off before synthesis starts. On some devices (for example, all Samsung phones) it is simply impossible to listen to the microphone and synthesize speech at the same time.
The end of speech synthesis (that is, the moment the synthesizer finishes speaking the text) can be tracked in a listener:

 private final TextToSpeech.OnUtteranceCompletedListener mUtteranceCompletedListener =
         new TextToSpeech.OnUtteranceCompletedListener() {
             @Override
             public void onUtteranceCompleted(String utteranceId) {
                 synchronized (mSpeechQueue) {
                     mSpeechQueue.poll();
                     if (mSpeechQueue.isEmpty()) {
                         mRecognizer.startListening(KWS_SEARCH);
                     }
                 }
             }
         };


In it we simply check whether anything is left in the synthesis queue and, if there is nothing else to speak, turn activation-phrase recognition back on.

Is that all?


Yes! As you can see, recognizing speech quickly and accurately directly on the device is not difficult, thanks to wonderful projects like Pocketsphinx. It provides a very convenient API that can be used for tasks involving voice command recognition.

In this example we attached recognition to a very concrete task: voice control of smart home devices. Thanks to local recognition we achieved very high speed and minimized errors.
Clearly, the same code can be used for other voice-related tasks; it does not have to be a smart home.

All the source code and the application build can be found in the repository on GitHub.
On my YouTube channel you can also see some other voice control implementations, not only for smart home systems.

Source: https://habr.com/ru/post/237589/

