- Is everything alright, Leon?
The speakers are turned up to maximum; I frown and answer:
- Yes. Turn the sound down.
"The sound is quieter," Windows-Home agrees, "quieter, quieter..."
- Enough, Vika.
S. Lukyanenko, "Labyrinth of Reflections"
Introduction
Back in 1997 Lukyanenko predicted that the desktop would combine a CLI with voice control. Today, however, voice control remains a fairly narrow niche.
Voice control is interaction with a device by means of spoken commands. Do not confuse this concept with speech recognition. For voice control it is enough that the device reacts to the one command you need (after all, your dog cannot work as a typist, can it?). Speech recognition is a much broader problem: here the device must convert everything you say into text. As you can easily guess, speech recognition is currently implemented only superficially compared to human ability.
The functionality discussed in this article can be used, for example, to organize a smart home right now, or simply to control a computer. Frankly, a couple of paragraphs would be enough to describe computer control, but I will try to show you the basics of working with CMU Sphinx.
By the way, about 70 percent of what is described here also applies to Windows users. For Linux, two mature speech recognition projects are mentioned most often: CMU Sphinx and Julius (links to articles on the topic let you get acquainted with the subject in advance). In this article I will not touch Julius, since there are enough guides on its use (including on the Runet); this article is about CMU Sphinx.
Description and Installation
On the official site, the latest version of pocketsphinx and sphinxbase is 0.8. The repository of my Debian Squeeze contains only the obsolete sphinx2 branch, so we will build from source (check your distribution's repositories: recent versions of Ubuntu and Fedora should have current versions). If you prefer, instead of pocketsphinx, written in C, you can use Sphinx4, written in Java (more).
Download the source code of pocketsphinx and sphinxbase ("support library required by Pocketsphinx and Sphinxtrain") from here and build both. No problems should arise:
./configure
make
checkinstall
Do not forget to run after installation:
ldconfig
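Spelled out for both packages, the full sequence looks roughly like this (sphinxbase must be built and installed before pocketsphinx; the directory names assume the 0.8 tarballs were unpacked into the current directory):
cd sphinxbase-0.8
./configure && make && checkinstall   # run checkinstall as root or via sudo
cd ../pocketsphinx-0.8
./configure && make && checkinstall
ldconfig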
Ready-made Debian Squeeze x86 packages. To work with /dev/dsp, install the oss-compat package, as the FAQ suggests.
Basic use
The project suggests checking that everything works on a basic example: recognizing the English phrase "go forward ten meters".
Well, let's try.
We will use the batch recognition utility pocketsphinx_batch (read its man page before using it). There is also a tool for recognition "from the microphone", pocketsphinx_continuous (its syntax is similar to that of pocketsphinx_batch).
We will work in a separate directory, for example ~/sphinx.
The syntax of our command is:
pocketsphinx_batch -argfile argfile 2>./errors
-argfile: the name of a file in the current directory containing all the arguments.
For convenience, we will redirect stderr to a file.
Contents of argfile:
-hmm /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k
-lm /usr/local/share/pocketsphinx/model/lm/en/turtle.DMP
-dict /usr/local/share/pocketsphinx/model/lm/en/turtle.dic
-cepdir /home/saint/sphinx
-ctl ctlfile
-cepext .raw
-adcin true
-hyp outname
-hmm: path to the directory containing the acoustic model files (templates for individual sounds).
-lm: path to the trigram language model file (you can read about it here).
-dict: path to the pronunciation dictionary file.
-cepdir: path to the directory with the sound files. Be careful: if you put -cepdir into the argument file, the ~/sphinx shorthand is not expanded correctly, so you have to write out the full path. If you pass this argument on the command line after the command, you can use the abbreviated path.
-ctl: file with the names of the files to be processed. We will take the goforward.raw file from the pocketsphinx source tree (there are a couple of other *.raw files there as well, which you can also recognize); see the example after this list.
-cepext: extension of the files to be processed.
-adcin: indicates that the input files are raw audio samples rather than precomputed feature files.
-hyp: name of the file to which the recognized text will be written.
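For reference, the ctlfile in this example contains just the base name of the recording, one entry per line and without an extension (the extension is supplied by -cepext):
goforward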
The arguments with the paths to the model files are mandatory. Remember that many parameters have default values (see stderr). Because of this, to work with a *.raw file you must set the extension explicitly; otherwise the default value, the .mfc extension, will be used (and of course there are no such files in the basic example, so errors will occur).
As a result of the run, the outname file will contain the following:
go forward ten meters (goforward -26532)
You can also view, compile and run a similar C program (an example from the developers) in the directory with the goforward.raw file.
For my own test recordings I decided to keep things simple and use sox (check whether you have this package installed).
We will record sound as follows (see man sox):
- for raw
rec -r 16k -e signed-integer -b 16 -c 1 filename.raw
- for wav
rec -r 16k -e signed-integer -b 16 -c 1 filename.wav
Stop recording with Ctrl+C.
While doing so, my sox complained that it could not use the requested sampling rate:
can't set sample rate 16000; using 48000
can't set sample rate 16000; using 48000
Take note: it is brazenly lying; in fact everything is in order.
I recorded and recognized both raw and wav files with various examples from the bundled dictionaries; everything was recognized quite acceptably.
Adapting the acoustic model
Adapting an acoustic model should improve recognition for a particular voice, pronunciation, accent or environment. Let us walk through this process.
Download the files suggested at the first link into a separate directory, in which we will work.
Now dictate the sentences from the arctic20.txt file following the sample: you should end up with twenty files, named in order from arctic_0001.wav to arctic_0020.wav.
To simplify the recording, use the proposed script:
for i in `seq 1 20`; do fn=`printf arctic_%04d $i`; read sent; echo $sent; rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null; done < arctic20.txt
Accordingly, to listen back to what you have recorded, run:
for i in *.wav; do play $i; done
Copy the acoustic model we have been working with from /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k into our working directory.
Now we will create the acoustic feature files (reminder: we are working in the directory with the *.wav files).
sphinx_fe -argfile hub4wsj_sc_8k/feat.params -samprate 16000 -c arctic20.listoffiles -di . -do . -ei wav -eo mfc -mswav yes
As a result, we obtain the *.mfc files.
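The arctic20.listoffiles control file referenced here simply lists the base names of the recordings, one per line and without extensions. If you need to (re)create it by hand, a one-liner along these lines should do (my own sketch, not part of the official tutorial):
for i in `seq 1 20`; do printf "arctic_%04d\n" $i; done > arctic20.listoffiles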
Download the extra pack (89.0 MB); take from it the file called mixture_weights, located in pocketsphinx-extra/model/hmm/en_US/hub4_wsj_sc_3s_8k.cd_semi_5000, and place it in the directory with the acoustic model.
You also need to convert the mdef file of the acoustic model into a text format:
pocketsphinx_mdef_convert -text hub4wsj_sc_8k/mdef hub4wsj_sc_8k/mdef.txt
Now, in the terminology of the adaptation guide, we will collect the accumulated observation data. Copy the bw utility from /usr/local/libexec/sphinxtrain/bw to the working directory (do not forget to install sphinxtrain first!).
./bw -hmmdir hub4wsj_sc_8k -moddeffn hub4wsj_sc_8k/mdef.txt -ts2cbfn .semi. -feat 1s_c_d_dd -svspec 0-12/13-25/26-38 -cmn current -agc none -dictfn arctic20.dic -ctlfn arctic20.fileids -lsnfn arctic20.transcription -accumdir .
Run and see:
SYSTEM_ERROR: "corpus.c", line 339: Unable to open arctic20.fileids for reading: No such file or directory
Obviously, the developers' right hand does not know what the left is doing (not to mention that the documentation is out of date).
Rename arctic20.listoffiles to arctic20.fileids in the working directory.
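For example, a plain rename will do:
mv arctic20.listoffiles arctic20.fileids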
Now everything works.
We perform MLLR adaptation (effective when the amount of adaptation data is limited):
cp /usr/local/libexec/sphinxtrain/mllr_solve /your/work/dir/mllr_solve
./mllr_solve -meanfn hub4wsj_sc_8k/means -varfn hub4wsj_sc_8k/variances -outmllrfn mllr_matrix -accumdir .
This command will create the adaptation data file mllr_matrix.
Now, when recognizing with the adapted model, you can add the parameter -mllr /path/to/mllr_matrix.
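For example, reusing the argument file from the basic example (just a sketch; adjust the path to wherever your matrix ended up):
pocketsphinx_batch -argfile argfile -mllr ./mllr_matrix 2>./errors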
In parallel, let us apply another adaptation method: MAP.
cp /usr/local/libexec/sphinxtrain/map_adapt /your/work/dir/map_adapt
Make a copy of the model:
cp -a hub4wsj_sc_8k hub4wsj_sc_8kadapt
And perform the MAP adaptation:
./map_adapt -meanfn hub4wsj_sc_8k/means -varfn hub4wsj_sc_8k/variances -mixwfn hub4wsj_sc_8k/mixture_weights -tmatfn hub4wsj_sc_8k/transition_matrices -accumdir . -mapmeanfn hub4wsj_sc_8kadapt/means -mapvarfn hub4wsj_sc_8kadapt/variances -mapmixwfn hub4wsj_sc_8kadapt/mixture_weights -maptmatfn hub4wsj_sc_8kadapt/transition_matrices
Now we will create a sendump file, which is smaller in size:
cp /usr/local/libexec/sphinxtrain/mk_s2sendump /your/work/dir/mk_s2sendump
./mk_s2sendump -pocketsphinx yes -moddeffn hub4wsj_sc_8kadapt/mdef.txt -mixwfn hub4wsj_sc_8kadapt/mixture_weights -sendumpfn hub4wsj_sc_8kadapt/sendump
Adaptation is complete.
Adaptation testing
The essence of the experiment: we record several samples on which the original acoustic model stumbles, and then process them with the adapted acoustic models.
Create a subdirectory test in the working directory, and inside it a subdirectory wav, which will hold our test recordings.
Here, to back up the results of the experiment, I have posted my samples and the two adapted acoustic models.
Now the check itself (keep in mind that adaptation will not give a one-hundred-percent correct result: the adapted models will still make mistakes, they will just make them less often. My rather illustrative recordings were far from the first attempt: there were plenty of recordings on which all of the models got it wrong):
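For reference, adaptation-test.fileids is assumed here to list the base names of the test recordings in the wav subdirectory, one per line (the .wav extension again comes from -cepext):
test1
test2
test3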
1. Recognition using the base model:
pocketsphinx_batch -hmm /usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k -lm /usr/local/share/pocketsphinx/model/lm/en/turtle.DMP -dict /usr/local/share/pocketsphinx/model/lm/en/turtle.dic -cepdir wav -ctl adaptation-test.fileids -cepext .wav -adcin yes -hyp adaptation-test.hyp
Result:
hello halt say forty (test1 -27391)
go forward ten meter (test2 -35213)
hall doing home (test3 -30735)
2. Recognition using the model with MLLR adaptation: when I pointed the -mllr parameter at my matrix, a segmentation fault occurred (I did not dig into it). When recognizing without this option, the result is completely identical to that of the original model.
However, the manual states that MLLR adaptation is best suited for continuous models (i.e. for Sphinx4).
3. Recognition using the model with MAP-adaptation:
pocketsphinx_batch -hmm ../hub4wsj_sc_8kadapt -lm /usr/local/share/pocketsphinx/model/lm/en/turtle.DMP -dict /usr/local/share/pocketsphinx/model/lm/en/turtle.dic -cepdir wav -ctl adaptation-test.fileids -cepext .wav -adcin yes -hyp adaptation-test.hyp
Result:
hello rotate display (test1 -28994)
go forward ten meters (test2 -33877)
lost window (test3 -29293)
As you can see, the result matches the recordings exactly. Adaptation really works!
Russian language in Pocketsphinx
Download the Russian models, created on voxforge, from here. A variety of other models can be found here and simply around the Internet.
We will implement an example of voice control of a computer in Russian, which means we need our own language model and our own dictionary (most likely, some of our words will not be found in the generic examples).
Creating your own static language model
In general, for a small number of words a jsgf grammar can be used instead of a static language model. However, that is a special case, and we will consider it below.
A guide to creating a language model is here.
We will create ours using CMUCLMTK. Download it and build it.
First, create a text file with the sentences for our language model.
Each line of lmbase.txt contains one phrase, wrapped in the sentence markers <s> and </s>.
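Purely as an illustration (these phrases are mine, not the author's; in the actual Russian setup they would of course be Russian phrases built from the words you want to recognize):
<s> run browser </s>
<s> louder </s>
<s> quieter </s>
<s> close the window </s>
<s> run terminal </s>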
Next, create the vocabulary file:
text2wfreq < lmbase.txt | wfreq2vocab > lmbase.tmp.vocab
cp lmbase.tmp.vocab lmbase.vocab
Create a language model in arpa-format:
text2idngram -vocab lmbase.vocab -idngram lmbase.idngram < lmbase.txt
idngram2lm -vocab_type 0 -idngram lmbase.idngram -vocab lmbase.vocab -arpa lmbase.arpa
And create a DMP model:
sphinx_lm_convert -i lmbase.arpa -o lmbase.lm.DMP
Creating your own vocabulary
Clone the utility from GitHub:
git clone https://github.com/zamiron/ru4sphinx/ yourdir
Go to the ./yourdir/text2dict directory and create there a text file my_dictionary with your word list (each new word on a new line).
Example my_dictionary:
browser
louder
shut down
run
window
mail
expand
roll up
terminal
quieter
Then we execute:
perl dict2transcript.pl my_dictionary my_dictionary_out
And your dictionary is created.
Now let us try to recognize the words present in the dictionary (luckily, some of them occur in our example). Do not forget to specify your own language model and dictionary in the arguments, and everything should work. If you wish, you can adapt the acoustic model as well (fair warning: when using the bw utility, the -svspec option is not needed for adapting most acoustic models).
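For orientation, an argument file for the Russian setup could look roughly like this (a sketch: the -hmm path depends on where you unpacked the voxforge model, and the remaining paths assume the files created above live in a ~/sphinx/rus working directory):
-hmm /path/to/russian/acoustic/model
-lm lmbase.lm.DMP
-dict my_dictionary_out
-cepdir /home/saint/sphinx/rus
-ctl ctlfile
-cepext .raw
-adcin true
-hyp outname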
Using a JSGF grammar file instead of a static language model
You can read about it here. The syntax: "|" denotes a choice, i.e. we can say either "quieter" or "close the window". True, compared with using a language model there is one drawback: you have to speak much more distinctly.
We pass the created jsgf file with the -jsgf parameter (the -lm parameter is not needed in this case).
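A minimal sketch of such a grammar (the file, grammar and rule names are my own; substitute the words from your own dictionary):
#JSGF V1.0;
grammar commands;
public <command> = louder | quieter | close the window;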
Implementation of voice control
My goal was not to implement a polished control interface: everything here will be very primitive (if you have the desire and the opportunity, you can look at the abandoned Gnome Voice Control project).
We will act as follows:
1. Record a command and recognize it.
2. Write the recognized text to a file and execute the corresponding command based on it.
As test commands we will use decreasing and increasing the sound volume.
After carefully reading the sox manual, I decided to end the recording after a second of silence with a silence threshold of 3.8% (the threshold is clearly an individual value and depends on your microphone and environment).
Unfortunately, I did not find a pocketsphinx_batch option for outputting only the recognized words, so I will use sed:
cat ~/sphinx/rus/outname | sed 's/\( (.*\)//'
This construction takes a line like "our command (audio -4023)" and removes the space before the opening bracket, the bracket itself and everything after it. As a result we get a line like "our command", which is exactly what we need.
Here is the script itself:
In response to the "quieter" or "louder" commands, the script performs the corresponding actions and reports them via notify-send.
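A minimal sketch of what such a script could look like (my own reconstruction, not the author's code; it assumes an argfile pointing at the Russian model and a ctlfile containing the single entry command, with -cepext .raw, -adcin true and -hyp outname, as in the batch examples above):
#!/bin/bash
# Rough voice-control loop: record a command, recognize it, act on it.
cd ~/sphinx/rus
while true; do
    # record until about a second of silence at a 3.8% threshold (see above)
    rec -r 16k -e signed-integer -b 16 -c 1 command.raw silence 1 0.1 3.8% 1 1.0 3.8% 2>/dev/null
    pocketsphinx_batch -argfile argfile 2>./errors
    text=$(sed 's/\( (.*\)//' outname)
    case "$text" in
        louder)  amixer -q set Master 5%+ ; notify-send "Volume" "louder"  ;;
        quieter) amixer -q set Master 5%- ; notify-send "Volume" "quieter" ;;
    esac
done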
Unfortunately, it works only when launched from a terminal (otherwise the audio is not recorded). Still, it gives an idea of voice control (feel free to suggest a better approach).
A few words as a conclusion
When I started writing this article, I simply wanted to show that free speech recognition is now far less bleak than it may seem. There are engines, and work is being done on acoustic models at www.voxforge.org (you can help them by reading texts aloud). At the same time, working with recognition is not something difficult for an ordinary user.
A few days ago it was announced that either Pocketsphinx or Julius will be used in Ubuntu for tablets. I hope that in this light the topic looks a little more relevant and interesting.
In this article I tried to cover the main points of working with Pocketsphinx, talking more about theory than practice. Still, you could see that recognition is not fiction; it works.
Remember that I have touched only the surface: the documentation on the official sites and on various forums describes much more. Experiment, tell us about your experience in the comments, and share ideas, good thoughts and ready-made solutions for voice control.