
Pocketsphinx. Speech Recognition and Voice Control in Linux

- Is everything all right, Leon?
The speakers are turned up to the maximum; I frown and answer:
- Yes. Sound quieter.
- Sound quieter, - Windows-Home agrees, - quieter, quieter...
- Enough, Vika.
S. Lukyanenko, "Labyrinth of Reflections"

Introduction


In 1997 Lukyanenko predicted a combination of CLI and voice control for the desktop. Today, however, voice control remains a fairly narrow niche.
Voice control means interacting with a device through spoken commands. Do not confuse it with speech recognition: for voice control it is enough that the device reacts to the single command you need (your dog can't work as a typist, can it?). Speech recognition is a far broader problem: the device must convert everything you say into text. As you might guess, current speech recognition is still superficial compared to human abilities.
The functionality discussed in this article can be used, for example, to build a smart home or simply to control a computer. Frankly, a couple of paragraphs would be enough to describe computer control, but I will try to show you the basics of working with CMU Sphinx.
By the way, about 70 percent of what is described here also applies to Windows users.

For Linux, two mature speech recognition projects are mentioned most often: CMU Sphinx and Julius .

I will not touch Julius in this article, because there are plenty of guides on using it (including in the Russian-speaking part of the web). This article is about CMU Sphinx.

Description and Installation


On the official site, the latest version of pocketsphinx and sphinxbase is 0.8. The repository of my Debian Squeeze only has the obsolete sphinx2 branch, so we will build from source (check your distribution's repositories: recent versions of Ubuntu and Fedora should carry current packages). If needed, instead of pocketsphinx, which is written in C, you can use Sphinx4 in Java ( more ).
Download from here the source code of pocketsphinx and sphinxbase ("support library required by Pocketsphinx and Sphinxtrain") and build them.
There should be no problems:
 ./configure
 make
 checkinstall

Do not forget to run after installation:
 ldconfig 
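For reference, a minimal end-to-end build could look like this (a sketch; the tarball names assume version 0.8 downloaded into the current directory, and checkinstall may ask a few interactive questions):
 tar xzf sphinxbase-0.8.tar.gz && cd sphinxbase-0.8
 ./configure && make && sudo checkinstall
 cd ..
 tar xzf pocketsphinx-0.8.tar.gz && cd pocketsphinx-0.8
 ./configure && make && sudo checkinstall
 cd .. && sudo ldconfig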


Ready-made Debian Squeeze x86 packages

To work with /dev/dsp, install the oss-compat package, as the FAQ advises.

Basic use


The project offers a basic example to test that everything works: recognizing the English phrase "go forward ten meters".
Well, let's try.
We will use the batch recognition utility pocketsphinx_batch (read its man page before use). There is also a tool for recognition "from the microphone", pocketsphinx_continuous (its syntax is similar to pocketsphinx_batch).
We will work in a separate directory, for example ~/sphinx.
The syntax of our command is:
 pocketsphinx_batch -argfile argfile 2>./errors 

-argfile: the name of a file in the current directory containing all the arguments.
For convenience, we redirect stderr to a file.
Contents of argfile:
 -hmm /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k -lm /usr/local/share/pocketsphinx/model/lm/en/turtle.DMP -dict /usr/local/share/pocketsphinx/model/lm/en/turtle.dic -cepdir /home/saint/sphinx -ctl ctlfile -cepext .raw -adcin true -hyp outname 

-hmm: path to the directory containing the acoustic model files (templates for individual sounds).
-lm: path to the trigram language model file (you can read about it here ).
-dict: path to the pronunciation dictionary file.
-cepdir: path to the directory with the sound files. Be careful: if you put -cepdir into the argument file, the shorthand ~/sphinx is not expanded correctly: you have to write the full path. If you pass the argument on the command line, you can use the short form.
-ctl: file with the names of the files to be processed. We will take goforward.raw from the pocketsphinx source tree (there are a couple of other *.raw files there too - you can recognize them as well).
-cepext: extension of the sound files.
-adcin: indicates that the input is a raw file.
-hyp: name of the file the recognized text will be written to.
The arguments with paths to the model files are mandatory. Remember that many parameters have default values (see stderr). So to work with a *.raw file you must set the extension explicitly, otherwise the default .mfc extension will be used (and of course we have no such files in this basic example, so errors will follow).
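For a quick live test from the microphone, a similar pocketsphinx_continuous call can be used (a sketch reusing the same model paths as in the argfile above):
 pocketsphinx_continuous -hmm /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k -lm /usr/local/share/pocketsphinx/model/lm/en/turtle.DMP -dict /usr/local/share/pocketsphinx/model/lm/en/turtle.dic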
As a result of execution, we will have the following contents in the outname file:
 go forward ten meters (goforward -26532) 

At the same time, you can view, compile and run a similar C program in the directory with the goforward.raw file (example from the developers).
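If you want to build that example yourself, something along these lines should work once pocketsphinx and sphinxbase are installed (a sketch; hello_ps.c stands for whatever you named the source file):
 gcc -o hello_ps hello_ps.c $(pkg-config --cflags --libs pocketsphinx sphinxbase)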
To test my own examples, I decided to keep things simple and used sox (check that this package is installed).
We will write the sound as follows (you can read man sox ):
- for raw
 rec -r 16k -e signed-integer -b 16 -c 1 filename.raw 

- for wav
 rec -r 16k -e signed-integer -b 16 -c 1 filename.wav 

End the recording with Ctrl+C.
My sox complained that it could not use the requested sampling rate: can't set sample rate 16000; using 48000 . Don't let that bother you: it is brazenly lying - in fact everything is fine.
I recorded and recognized raw and wav files with various examples from the bundled dictionaries - everything was recognized quite acceptably.
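If you already have a suitable wav file (recorded at 16 kHz, 16-bit, mono, as above), sox can also turn it into a raw file, roughly like this:
 sox filename.wav -r 16k -e signed-integer -b 16 -c 1 filename.raw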

Adapting the acoustic model


Adapting an acoustic model should improve recognition for a particular voice, pronunciation, accent, or environment. Let's walk through the process.

Download the files suggested at the first link into a separate directory; we will work there.
Now dictate the sentences from the arctic20.txt file following the sample: you should end up with twenty files, named in order arctic_0001.wav ... arctic_0020.wav .
To simplify the recording, use the suggested script:
 for i in `seq 1 20`; do fn=`printf arctic_%04d $i`; read sent; echo $sent; rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null; done < arctic20.txt 

Accordingly, to listen to what you have recorded, run:
 for i in *.wav; do play $i; done 

Copy the acoustic model we have been working with from /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k into our working directory.
Now let's create the acoustic feature files (a reminder: we are working in the directory with the *.wav files).
 sphinx_fe -argfile hub4wsj_sc_8k/feat.params -samprate 16000 -c arctic20.listoffiles -di . -do . -ei wav -eo mfc -mswav yes 

As a result, we obtain * .mfc files.
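The arctic20.listoffiles passed via -c is simply a list of the utterance base names, one per line (arctic_0001 through arctic_0020, without extensions); a quick way to generate it:
 for i in `seq 1 20`; do printf "arctic_%04d\n" $i; done > arctic20.listoffiles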
Download the extra pack (89.0 MB); take the mixture_weights file from it (located in pocketsphinx-extra/model/hmm/en_US/hub4_wsj_sc_3s_8k.cd_semi_5000) and place it in the directory with the acoustic model.
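In other words (assuming the extra pack was unpacked in the working directory):
 cp pocketsphinx-extra/model/hmm/en_US/hub4_wsj_sc_3s_8k.cd_semi_5000/mixture_weights hub4wsj_sc_8k/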
You also need to convert the mdef file of the acoustic model into a text format:
 pocketsphinx_mdef_convert -text hub4wsj_sc_8k/mdef hub4wsj_sc_8k/mdef.txt 

Now, in the terminology of the adaptation guide, we will accumulate the observation counts from the recorded data. Copy the bw utility from /usr/local/libexec/sphinxtrain/bw into the working directory (do not forget to install sphinxtrain first!).
 ./bw -hmmdir hub4wsj_sc_8k -moddeffn hub4wsj_sc_8k/mdef.txt -ts2cbfn .semi. -feat 1s_c_d_dd -svspec 0-12/13-25/26-38 -cmn current -agc none -dictfn arctic20.dic -ctlfn arctic20.fileids -lsnfn arctic20.transcription -accumdir . 

Run and see:
SYSTEM_ERROR: "corpus.c", line 339: Unable to open arctic20.fileids for reading: No such file or directory
Obviously, the developers' right hand does not know what the left one is doing (to say nothing of the documentation being out of date).
Rename arctic20.listoffiles to arctic20.fileids in the working directory:
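 mv arctic20.listoffiles arctic20.fileids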
Now everything works.
We perform MLLR adaptation (it is effective when the amount of adaptation data is limited):
 cp /usr/local/libexec/sphinxtrain/mllr_solve /your/work/dir/mllr_solve 

 ./mllr_solve -meanfn hub4wsj_sc_8k/means -varfn hub4wsj_sc_8k/variances -outmllrfn mllr_matrix -accumdir . 

This command creates the adaptation data file mllr_matrix .
Now, when recognizing with the adapted model, you can add the parameter -mllr /path/to/mllr_matrix .
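For example, reusing the argfile from the basic test, the call might look like this (a sketch):
 pocketsphinx_batch -argfile argfile -mllr /path/to/mllr_matrix 2>./errors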
In parallel, let's apply another adaptation method: MAP.
 cp /usr/local/libexec/sphinxtrain/map_adapt /your/work/dir/map_adapt 

Make a copy of the model:
 cp -a hub4wsj_sc_8k hub4wsj_sc_8kadapt 

And perform the MAP adaptation:
 ./map_adapt -meanfn hub4wsj_sc_8k/means -varfn hub4wsj_sc_8k/variances -mixwfn hub4wsj_sc_8k/mixture_weights -tmatfn hub4wsj_sc_8k/transition_matrices -accumdir . -mapmeanfn hub4wsj_sc_8kadapt/means -mapvarfn hub4wsj_sc_8kadapt/variances -mapmixwfn hub4wsj_sc_8kadapt/mixture_weights -maptmatfn hub4wsj_sc_8kadapt/transition_matrices 

Now let's create a sendump file, which is smaller in size:
 cp /usr/local/libexec/sphinxtrain/mk_s2sendump /your/work/dir/mk_s2sendump 

 ./mk_s2sendump -pocketsphinx yes -moddeffn hub4wsj_sc_8kadapt/mdef.txt -mixwfn hub4wsj_sc_8kadapt/mixture_weights -sendumpfn hub4wsj_sc_8kadapt/sendump 

Adaptation is complete.

Adaptation testing


The essence of the experiment: we record several samples on which the original acoustic model stumbles, and process them with the adapted acoustic models.
Create a test subdirectory in the working directory, and inside it a wav subdirectory that will hold our test recordings.
Here, to back up the result of the experiment, I am posting my samples and the two adapted acoustic models.
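My adaptation-test.fileids simply lists the base names of the test recordings in the wav subdirectory; assuming they are called test1.wav, test2.wav and test3.wav (as in the results below), it can be created like this:
 printf 'test1\ntest2\ntest3\n' > adaptation-test.fileids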

Checking (keep in mind that adaptation will not give a one hundred percent correct result: the adapted models will still make the same kind of mistakes, just less often. My rather illustrative samples were far from my first attempt: there were plenty of recordings on which all the models got it wrong):
1. Recognition using the base model:
 pocketsphinx_batch -hmm /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k -lm /usr/local/share/pocketsphinx/model/lm/en/turtle.DMP -dict /usr/local/share/pocketsphinx/model/lm/en/turtle.dic -cepdir wav -ctl adaptation-test.fileids -cepext .wav -adcin yes -hyp adaptation-test.hyp 

Result:
 hello halt say forty (test1 -27391)
 go forward ten meter (test2 -35213)
 hall doing home (test3 -30735)

2. Recognition using the model with MLLR adaptation: when I passed the path to my matrix via the -mllr parameter, I got a segmentation fault (I did not dig into it). Without this option the result is completely identical to that of the original model.
However, the manual says that MLLR adaptation works best for a continuous model (i.e., for Sphinx4).
3. Recognition using the model with MAP-adaptation:
 pocketsphinx_batch -hmm ../hub4wsj_sc_8kadapt -lm /usr/local/share/pocketsphinx/model/lm/en/turtle.DMP -dict /usr/local/share/pocketsphinx/model/lm/en/turtle.dic -cepdir wav -ctl adaptation-test.fileids -cepext .wav -adcin yes -hyp adaptation-test.hyp 

Result:
 hello rotate display (test1 -28994)
 go forward ten meters (test2 -33877)
 lost window (test3 -29293)


As you can see, the result matches what was actually recorded, word for word. Adaptation really works!

Russian language in Pocketsphinx


Download from here the Russian models created on voxforge . Various other models can be found here and elsewhere on the Internet.

We will implement an example of voice control of a computer in Russian, which means we need our own language model and our own dictionary (most likely some of our words are not in the common examples).

Creating your own static language model

In general, for a small number of words a JSGF grammar can be used instead of a static language model. However, that is a special case, which we will look at below.
A guide to creating a language model is here .
We will create using CMUCLMTK.
Download it and build it.
First, create a text file with the sentences for our language model, one sentence per line, each wrapped in <s> and </s> tags.
lmbase.txt
 <s> ... </s>
 <s> ... </s>
 <s> ... </s>
 ...


Next, create a vocabulary file:
 text2wfreq < lmbase.txt | wfreq2vocab > lmbase.tmp.vocab
 cp lmbase.tmp.vocab lmbase.vocab

Create a language model in arpa-format:
 text2idngram -vocab lmbase.vocab -idngram lmbase.idngram < lmbase.txt
 idngram2lm -vocab_type 0 -idngram lmbase.idngram -vocab lmbase.vocab -arpa lmbase.arpa

And create a DMP model.
 sphinx_lm_convert -i lmbase.arpa -o lmbase.lm.DMP 


Creating your own dictionary

Clone the utility from GitHub:
 git clone https://github.com/zamiron/ru4sphinx/ yourdir 

Go to the directory ./yourdir/text2dict and create there a text file my_dictionary with your word list (each word on a new line).
Example my_dictionary
browser
louder
shut down
run
window
post office
expand
roll up
roll up
terminal
hush

Then we execute:
 perl dict2transcript.pl my_dictionary my_dictionary_out 

And your dictionary is created.

Now let's try to recognize the words present in the dictionary (fortunately, our example contains some of them). Do not forget to specify your own language model and dictionary in the arguments - everything should work. If you wish, you can also adapt the acoustic model (I warn you right away: when using the bw utility, the -svspec option is not needed for adapting most acoustic models).
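For example, an argfile for batch recognition with the Russian model could look roughly like this (a sketch: the acoustic model path and the working directory are placeholders to adjust for your setup):
 -hmm /path/to/russian/acoustic/model
 -lm /path/to/workdir/lmbase.lm.DMP
 -dict /path/to/workdir/my_dictionary_out
 -cepdir /path/to/workdir
 -ctl ctlfile
 -cepext .wav
 -adcin true
 -hyp outname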

Using a JSGF grammar file instead of a static language model

Read here.
Syntax:
 #JSGF V1.0;
 grammar test;
 public <test> = ( <a> | <a> <b> );
 <a> = ( ... | ... | ... | ... );
 <b> = [ ... ];

"|" denotes a selection condition. Those. we can say "quieter" or "close the window." True, compared to using the language model, there is one drawback: we need to speak much more articulately.
We specify the created jsgf file with the -jsgf parameter (the -lm parameter is not needed in this case).
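For instance, a live test with a grammar might be run like this (a sketch; commands.jsgf and the acoustic model path are placeholders):
 pocketsphinx_continuous -hmm /path/to/russian/acoustic/model -jsgf commands.jsgf -dict my_dictionary_out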

Implementation of voice control


My goal was not to build a fancy control interface: everything here will be very primitive (if you have the desire and the ability, you can look at the abandoned Gnome Voice Control project).
We will proceed as follows:
1. Record a command and recognize it.
2. Write the recognized text to a file and, according to it, execute the corresponding action.
As test commands we will use decreasing and increasing the sound volume.

After carefully reading the sox manual, I decided to stop recording after one second of silence, with a silence threshold of 3.8% (the threshold is clearly an individual value that depends on your microphone and environment).
Unfortunately, I did not find a pocketsphinx_batch option that outputs only the recognized words, so I will use sed:
 cat ~/sphinx/rus/outname | sed 's/\( (.*\)//' 

This construction takes a line like "our command (audio -4023)" and removes the space before the opening bracket, the bracket itself and everything after it. As a result we get a line like "our command", which is exactly what we need.
Here is the script itself:
 #!/bin/bash
 rec -r 16k -e signed-integer -b 16 -c 1 test.wav silence 0 1 00:01 3.8%
 pocketsphinx_batch -argfile argfile 2>./errors
 a=$(cat ~/sphinx/rus/outname | sed 's/\( (.*\)//')
 case $a in
   "...")   # here the pattern is the recognized Russian word for "quieter"
     amixer -q sset Master 10-
     a=$(amixer sget Master | grep "Mono: Playback")
     notify-send "$a"
     ;;
   "...")   # here the pattern is the recognized Russian word for "louder"
     amixer -q sset Master 10+
     a=$(amixer sget Master | grep "Mono: Playback")
     notify-send "$a"
     ;;
   *)
     ;;
 esac

In response to the "quieter" or "louder" commands, the script performs the corresponding action and reports it via notify-send.
Unfortunately, it only works when launched from a terminal (otherwise the sound is not recorded). Still, it gives an idea of voice control (perhaps you can suggest a better approach).

A few words in conclusion


When I started writing this article, I simply wanted to say that free speech recognition is now far less dismal than it may seem. There are engines, and work on acoustic models is under way at www.voxforge.org (you can help them by reading prompts aloud). At the same time, working with recognition is not hard for an ordinary user.
A few days ago it was announced that Pocketsphinx or Julius would be used in Ubuntu for tablets. I hope this makes the topic look a little more relevant and interesting.
In this article I tried to cover the main points of working with Pocketsphinx, talking more about theory than practice. Still, you could see that recognition is not fiction: it works.
Remember that I have only touched on the surface: the documentation on the official sites and various forums describes much more. Experiment, tell us about your experience in the comments, share ideas, good thoughts, and ready-made solutions for voice control.

Source: https://habr.com/ru/post/167479/

