This is a tutorial on using the pocketsphinx library in Python. I hope it helps you get up to speed with this library quickly and avoid the pitfalls I stepped into.
It all started when I wanted to build myself a voice assistant in Python. Initially I decided to use the speech_recognition library for recognition. As it turned out, I'm not the only one. For recognition I used Google Speech Recognition, since it was the only engine that did not require any keys, passwords, and so on. For speech synthesis I took gTTS. In the end it turned out to be almost a clone of this assistant, which is why I could not leave things there.
And not only because of that: I had to wait a long time for a response (the recording did not stop immediately, and sending speech to the recognition server and text to the synthesis server took a lot of time), the speech was not always recognized correctly, I had to shout from closer than half a meter to the microphone and speak very clearly, the speech synthesized by Google sounded awful, and there was no activation phrase, meaning audio was constantly being recorded and sent to the server.
The first improvement was speech synthesis using Yandex SpeechKit Cloud:
import requests

URL = ('https://tts.voicetech.yandex.net/generate?text=' + text +
       '&format=wav&lang=ru-RU&speaker=ermil&key=' + key +
       '&speed=1&emotion=good')
response = requests.get(URL)
if response.status_code == 200:
    with open(speech_file_name, 'wb') as file:
        file.write(response.content)
Then it was recognition's turn. The inscription "CMU Sphinx (works offline)" on the library page immediately caught my eye. I will not go over the basic concepts of pocketsphinx, because chubakur did that before me (many thanks to him) in this post.
I'll say right away that installing pocketsphinx is not easy (at least it wasn't for me). Installation via pip will only work if you have SWIG. Otherwise, to install pocketsphinx you need to go here and download the installer (msi). Please note: the installer is only for Python 3.5! A plain pip install pocketsphinx
will not work: it will fail with an error complaining about a wheel.
Pocketsphinx can recognize speech both from a microphone and from a file. It can also search for hot phrases (this did not work out well for me: for some reason the code that should run when a hot word is detected runs several times, even though I pronounced the word only once). Pocketsphinx differs from cloud solutions in that it works offline and can work with a limited vocabulary, which increases accuracy. If you are interested, there are examples on the library page. Pay attention to the "Default config" item.
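For reference, hot-phrase search is done by passing keyphrase and kws_threshold to LiveSpeech instead of a language model. This sketch follows the example in the library's README; the threshold value is something you have to tune for your microphone (a smaller value, i.e. a larger negative exponent, means fewer false triggers, which may help with the repeated-detection problem I mention above):

```python
from pocketsphinx import LiveSpeech

# Keyword-spotting mode: no language model, just one hot phrase.
# kws_threshold controls detection sensitivity and needs tuning.
speech = LiveSpeech(lm=False, keyphrase='forward', kws_threshold=1e-20)
for phrase in speech:
    print(phrase.segments(detailed=True))
```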
Out of the box, pocketsphinx comes with English acoustic and language models and a dictionary. The Russian ones can be downloaded at this link. Unpack the archive. Then move the folder <your_folder>/zero_ru_cont_8k_v3/zero_ru.cd_cont_4000 into the folder C:/Users/tutam/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/pocketsphinx/model, where <your_folder> is the folder into which you unpacked the archive. The moved folder is the acoustic model. Do the same with the files ru.lm and ru.dic from the folder <your_folder>/zero_ru_cont_8k_v3/. The file ru.lm is the language model, and ru.dic is the dictionary. If you did everything correctly, the following code should work.
import os
from pocketsphinx import LiveSpeech, get_model_path

model_path = get_model_path()
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'ru.dic')
)
print("Say something!")
for phrase in speech:
    print(phrase)
Check beforehand that the microphone is connected and working. If "Say something!" does not appear for a long time, that is normal: most of that time goes into creating the LiveSpeech instance, which takes so long because the Russian language model weighs more than 500 (!) MB. Creating my LiveSpeech instance takes about 2 minutes.
This code should recognize almost any phrase you say. Frankly, the accuracy is awful. But it can be fixed, and the creation of LiveSpeech can be sped up too.
Instead of a language model, you can make pocketsphinx work with a simplified grammar. This is done with a jsgf file, and using one speeds up the creation of the LiveSpeech instance. How to create grammar files is described here. If a language model is present, the jsgf file is ignored, so if you want to use your own grammar file you need to write this:
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=False,
    jsgf=os.path.join(model_path, 'grammar.jsgf'),
    dic=os.path.join(model_path, 'ru.dic')
)
Naturally, the grammar file must be placed in the folder C:/Users/tutam/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/pocketsphinx/model. One more note about jsgf: when using it you will have to speak more clearly and separate the words.
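For a concrete picture, a minimal grammar file might look like this (a made-up yes/no grammar for illustration; the grammar and rule names are arbitrary, and the full syntax is defined by the JSGF specification):

```jsgf
#JSGF V1.0;
grammar commands;
public <answer> = да | нет;
```

Save it as grammar.jsgf in the model folder and reference it through the jsgf parameter as shown above.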
A dictionary is a set of words and their transcriptions; the smaller it is, the higher the recognition accuracy. To create a dictionary with Russian words, you need the ru4sphinx project. Download and unpack it. Then open a text editor and write the words that should be in the dictionary, each on a new line; save the file as my_dictionary.txt in the text2dict folder, in UTF-8 encoding. Then open the console and run:
C:\Users\tutam\Downloads\ru4sphinx-master\ru4sphinx-master\text2dict> perl dict2transcript.pl my_dictionary.txt my_dictionary_out.txt
Open my_dictionary_out.txt and copy its contents. Open a text editor, paste the copied text and save the file as my_dict.dic (select "all files" instead of "text file"), in UTF-8 encoding. Then pass your dictionary to LiveSpeech:
speech = LiveSpeech(
    verbose=False,
    sampling_rate=16000,
    buffer_size=2048,
    no_search=False,
    full_utt=False,
    hmm=os.path.join(model_path, 'zero_ru.cd_cont_4000'),
    lm=os.path.join(model_path, 'ru.lm'),
    dic=os.path.join(model_path, 'my_dict.dic')
)
Some transcriptions may need to be corrected.
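For orientation when editing, a .dic file is plain text with one word per line followed by its phones. The transcriptions below are purely illustrative (the real ones are whatever dict2transcript.pl generates for your words):

```
да d a
нет n' e t
```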
Using pocketsphinx via speech_recognition only makes sense if you are recognizing English. In speech_recognition you cannot specify an empty language model and use jsgf, so you will have to wait about 2 minutes for every fragment to be recognized. Verified.
Having burned a few evenings on this, I realized I had wasted my time. Even with a dictionary of two words (yes and no), sphinx manages to make mistakes, and often. It eats 30-40% of a Celeron, and with the language model a fat chunk of memory as well. Yandex, meanwhile, recognizes almost any speech accurately and consumes neither memory nor CPU. So decide for yourself whether it is worth taking on at all.
PS: this is my first post, so I'm looking forward to advice on the design and content of the article.
Source: https://habr.com/ru/post/351376/