Voice control for a media center

Voice control of a computer has probably been the dream of science fiction writers for as long as science fiction has existed. What, if not a lively dialogue with a machine, lets it imitate artificial intelligence and gives us reason to expect that sooner or later the coffee grinders will go crazy, take over the world and plug insignificant humans into the Matrix?

The first attempts at speech recognition date back to the middle of the last century, and once personal computers became widespread it was only natural to want to use their power for the task. I remember that about 15 years ago there were already Windows programs that let you create macros bound to voice commands. With their help I left my guests in awe when, in response to a rude request to get lost, Windows shut down and gave way to the classic message "It is now safe to turn off your computer." These programs worked by comparing the incoming command with samples recorded in advance. The comparison was based on analysis of the sound wave, and the drawback of this approach is obvious: commands had to be pronounced with the same intonation and, preferably, in the same state of consciousness.
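To see why waveform comparison is so sensitive to intonation, here is a minimal sketch in Python. It is not how those old programs were actually implemented (the article does not say); it simply assumes the comparison is a normalized correlation between a stored sample and the incoming sound, and uses synthetic sine tones in place of real recordings. The names `normalized_correlation` and `tone` are made up for the illustration.

```python
import math

def normalized_correlation(a, b):
    """Pearson correlation between two sample sequences (1.0 = identical shape)."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den else 0.0

def tone(freq_hz, n_samples=800, rate=8000):
    """Synthetic stand-in for a recorded command: a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n_samples)]

template = tone(200)        # the command as it was recorded in advance
same_voice = tone(200)      # the same command, same intonation
higher_pitch = tone(230)    # the same "command" spoken at a slightly higher pitch

score_same = normalized_correlation(template, same_voice)
score_shifted = normalized_correlation(template, higher_pitch)
print(f"same intonation:  {score_same:.3f}")
print(f"shifted pitch:    {score_shifted:.3f}")
```

Under these assumptions the identical tone correlates almost perfectly with the template, while a modest pitch shift makes the score collapse towards zero, which is the effect described above.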

An image search result for "voice control computer". Harrison Ford seems to be telling us to “enhance 34 to 36”, whatever that means...

A more logical approach is to analyze the phonetic features of the spoken phrase and try to match each word against a dictionary. This reduces the influence of such things as manner of speech and even some “defects of diction” on the recognition result. So how can Russian speech be recognized with good quality? The first thing that comes to mind is Google, which provides a suitable API. Some people have integrated this API into their “smart home” quite successfully: a script sends every spoken phrase to the corporation and then tries to match the recognized text against one of the predefined commands. Naturally, I dismissed this option right away, otherwise I would have to shut the system down every time I needed to discuss how best to get rid of the corpse. Besides, nobody knows how long this freebie will last or whether Google will suddenly decide to block the service.
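The matching step mentioned above, comparing recognized text against a fixed set of commands, could look roughly like the Python sketch below. It is an assumption about how such a script might be written, not a reproduction of anyone's actual setup: the command list, the function name `match_command` and the similarity cutoff are all invented for the example, and the recognized text is assumed to arrive as a plain string from whatever recognizer is used.

```python
import difflib

# Hypothetical command list; a real setup would define its own.
COMMANDS = [
    "turn on the light",
    "turn off the light",
    "play music",
    "pause",
]

def match_command(recognized_text, commands=COMMANDS, cutoff=0.6):
    """Return the command closest to the recognized text, or None.

    `cutoff` is the minimum similarity ratio (0..1) used by difflib;
    0.6 is an arbitrary starting point, not a tuned value.
    """
    text = recognized_text.lower().strip()
    matches = difflib.get_close_matches(text, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_command("Turn on the lights"))
print(match_command("tell me a story about dragons please"))
```

A slightly misrecognized phrase still lands on the nearest command, while unrelated speech falls below the cutoff and is ignored. Restricting the set of acceptable phrases like this is also what makes a small dictionary so forgiving.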
Therefore, when I realized one day that I wanted to talk to my HTPC, I turned to offline recognition systems. I started with one of the most popular, CMU Sphinx. The first phrase I tried to get across to it, again and again, was “turn on the light!”. Here is the log of my testing:

paradise
thinking
and can drink a pint
then at the top of the experience corpse
true nose vodka world
again and again
that fact
five
about it
first this morning
right here

In other words, it might do as a lyrics generator for Zemfira, but it is not fit for real use. Adapting the acoustic model and restricting the dictionary did not improve the situation much.

At this point I came to the conclusion that, for now, the sanest way to set up voice control is to negotiate with the soulless piece of hardware in the language of the probable adversary. It is no secret that English is simpler than Russian in many respects, including phonetically, which matters most to us here. And the functionality needed for English speech recognition is already built into recent versions of Windows. “Excuse me, but we never went to Oxford!” some readers will object. And rightly so: Voronezh Construction College prepares you for life in the real world much better. But perfect pronunciation, as it turned out, is not necessary at all. If the computers of the future understand even the slurred mumbling of Harrison Ford, why should we be any worse? My own accent, for example, is a mixture of Borat and some insane Russian general from a trashy Hollywood film, as you can see in the video below. I even took the trouble to add subtitles, because I can hardly understand what I am saying there myself.

How does it work?

A product called VoxCommando (~$27) acts as a layer between the user and Windows Speech Recognition. Using the Windows facilities, the program recognizes a phrase and compares it with user-defined commands. Because the dictionary is restricted, recognition accuracy is close to 100%.

VoxCommando comes with a large number of useful plug-ins, including one for XBMC, which interested me most. Besides the XBMC plug-in, the following are also noteworthy:

A plug-in for EventGhost: I use it to send IR signals to control the TV and receiver.
A plug-in for arbitrary HTTP requests: I use it to call the API of the Yandex translator, the same one that translated “snake scale” as “snake scale”.
There are also plug-ins for Vera and X10 that let you control home automation, such as lighting.

Configuring voice commands. The left pane lists the commands with their phrases and variations. The right pane is the editor for the current command, with the list of actions to perform (in this case, a call to XBMC through its JSON-RPC API).
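For readers who have not seen the XBMC JSON-RPC API, the Python sketch below shows roughly what such a call looks like when made by hand. It is not what VoxCommando does internally (the article does not describe that). The phrase table, the function names, the address `http://localhost:8080/jsonrpc` and the player id `1` are all assumptions for illustration; the web server must be enabled in XBMC, and the method names should be checked against the API version of your installation.

```python
import json
import urllib.request

# Hypothetical mapping from spoken phrases to JSON-RPC calls.
# playerid 1 is assumed to be the active video player.
PHRASE_TO_CALL = {
    "pause": ("Player.PlayPause", {"playerid": 1}),
    "stop": ("Player.Stop", {"playerid": 1}),
    "go home": ("Input.Home", {}),
}

def build_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body as a Python dict."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": params or {},
        "id": request_id,
    }

def payload_for_phrase(phrase):
    """Return the request for a recognized phrase, or None if it is unknown."""
    call = PHRASE_TO_CALL.get(phrase.lower().strip())
    if call is None:
        return None
    method, params = call
    return build_request(method, params)

def send_request(payload, url="http://localhost:8080/jsonrpc"):
    """POST the request to the media center (URL is an assumed default)."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Only builds and prints the request; call send_request() to actually send it.
print(json.dumps(payload_for_phrase("pause")))
```

Keeping the phrase table separate from the transport code means new commands can be added without touching the networking part, which is essentially what the command editor in the screenshot offers through a GUI.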

VoxCommando can use the Text-to-Speech engines installed in the system, so you can try to set up a full-fledged dialogue with the machine. I did not dwell on this: I just taught the young lady to answer “I am” to the question “Who’s your daddy?” and left it at that.

Microphone

Another important issue is the choice of microphone. Anyone who has dealt with speech recognition knows that a headset works best. But giving orders to an artificial intelligence while wearing a pile of wires and plastic on your head is not cyberpunk at all; in any science fiction film you would be laughed at. Some people quite successfully use Kinect or a device such as The Voice Tracker, but these have plenty of drawbacks: the range at which speech is picked up well is rather limited, they depend heavily on background noise, and they produce false positives on the content being played. It is quite possible that the hero of a melodrama, in the middle of a declaration of love, will casually utter the name of a porno-grind album, and the media center will take it as an unambiguous signal that it is time to enjoy some fine art.

In search of a solution to this problem, I came across the Amulet Remote. It looks like an ordinary MCE remote control, but in addition to an infrared transmitter it contains a wireless microphone that is activated when the device is held upright.

Amulet Remote. When the device is held upright, the logo on the remote lights up red, hinting that it wants to talk.

Despite some shortcomings (short battery life and learning problems compared to conventional remotes), I think this is currently the most successful device for HTPC voice control. The Amulet Remote now sells for $69, but since the manufacturer ships only to the United States, you will have to use a forwarding company for delivery. Recognition quality with the Amulet Remote is very high, which is not surprising: the device was developed in Ireland and was apparently put through severe stress testing with an Irish accent.

Conclusion

The setup described above can be used not only to control a media center but also to manage various “smart home” systems, as well as for most other tasks that call for automation, whether that means calling a web service, launching an application or sending an IR signal. For example, with a voice command you can check the weather or turn on the air conditioner. The only thing it cannot do yet is go out for beer, but let us hope for further technical progress in that direction.

Source: https://habr.com/ru/post/217789/
