Speech recognition on your own website: Speereo recognition test bench

Hello, Habr readers! We are glad to be on Habr, and we hope to stay here for a long time and to be useful both to you and to ourselves.

So, with some trepidation, on to our first post!

Problem

Websites often ask users to type queries into forms: the names of railway stations or airports in a ticket booking service, street names in a map search, the names of products or product groups in an online store and, finally, the ordinary search on a site or forum.
In all of these cases, the user is choosing from a list known in advance or from an indexed set of words and phrases.
Entering such text queries is especially inconvenient on a smartphone. Sometimes it is so inconvenient that we give up on the service, decide to “do it later” and forget.

Solution

We offer an alternative solution to such problems: our own cloud-based continuous speech recognition, embedded in your website.

How does it work?

On the web page, next to the input field for the search query, you place our branded microphone button. Then the following happens: after the button is clicked, a sound file is recorded on the client side and sent to our server. The server recognizes the voice signal in a fraction of a second and sends the recognized text, or its number in the list, back to the client or directly to your server.
Example: on a ticket booking page, instead of choosing a city from a drop-down menu or a text search, you click the microphone icon and say “from Moscow”, then “to St. Petersburg”, and then choose the date: “the day after tomorrow” or “May 9”. You get back the same result as with a traditional search.
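The round trip described above can be sketched roughly as follows. This is only an illustration: the endpoint URL, the `button` query parameter, the WAV content type and the reply format are all assumptions made up for the sketch, not the actual Speereo interface. The request is built but not sent.

```python
import urllib.request

# Hypothetical endpoint, for illustration only (not a real Speereo URL).
RECOGNIZER_URL = "https://recognizer.example.com/recognize"


def build_request(audio: bytes, button_id: str) -> urllib.request.Request:
    """Package a recorded utterance as an HTTP POST for the recognition server.

    `button_id` (hypothetical) tells the server which phrase list to use.
    """
    return urllib.request.Request(
        url=f"{RECOGNIZER_URL}?button={button_id}",
        data=audio,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def interpret_reply(reply: str, phrases: list, reply_is_number: bool) -> str:
    """Turn the server reply into a phrase.

    Assumption: the reply is either the recognized text itself, or the
    1-based number of the phrase in the list configured for this button.
    """
    reply = reply.strip()
    if reply_is_number:
        return phrases[int(reply) - 1]
    return reply


cities = ["Moscow", "St. Petersburg"]
request = build_request(b"...recorded audio bytes...", "from_city")
print(request.get_method(), request.full_url)
print(interpret_reply("2", cities, reply_is_number=True))
```

Returning a number instead of text is convenient when the list entries map directly to database keys on your side.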

What is the difference?

The difference between our approach and that of, say, Google's speech recognition is as follows: in each case we recognize a limited set of phrases defined in advance. Google recognizes everything, without restricting the developer.

Now two questions. First: which is easier to use? At first glance, the Google engine. Whatever the user says ends up in the input form. However, if you need specific data rather than informational noise, you will have to write a handler that filters out invalid data and recognition errors (and there will be errors!). In our case, this is not necessary. Anything that is not on the list simply never reaches the form.
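To make the point about the handler concrete, here is a minimal sketch of the kind of filter you would have to write around a free-form recognizer. The function names and the normalization rule are our own illustration and are not part of any recognizer's API.

```python
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def accept_free_form(hypothesis: str, allowed: List[str]) -> Optional[str]:
    """Return the matching list entry, or None if the text is not on the list."""
    by_key = {normalize(item): item for item in allowed}
    return by_key.get(normalize(hypothesis))


stations = ["Moscow", "St. Petersburg", "Nizhny Novgorod"]
print(accept_free_form("  st. petersburg ", stations))    # St. Petersburg
print(accept_free_form("Saint Peter's bird", stations))   # None
```

Note that such a filter can only reject a misrecognized phrase; it cannot recover what the user actually said. A recognizer that is restricted to the list from the start never produces the off-list variant at all.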

The second question is recognition quality, and this is the primary concern! The more errors in voice input, the worse the usability. If recognition accuracy drops below 90%, it is a disaster. Here we win, and theory and common sense explain why: the narrower the set of possible commands, the lower the probability of error. The speech signal by itself does not carry enough information for recognition; the human brain “fills in” 10–15, and sometimes up to 90, percent of the meaning (or of the quality of the sound signal) through understanding, discarding acoustically similar but obviously incorrect variants of the phrase. This is easy to prove: try writing down a phrase in an unfamiliar language by ear.

Google's speech recognition system was originally created for web search, and the role of “understanding” in it is played by the search index. Google does not recognize what nobody has already searched for a hundred times; possible errors are smoothed over by the fact that the query you need, even if it is recognized incorrectly, will still appear in the top lines of the search results, so you will not perceive it as an error.

For form input, this approach does not work. You can be sure there will be too many mistakes.
Why do we make fewer mistakes? In our technology, the role of “understanding” is played by the list of phrases that are possible in this particular context. This list is created directly by the developer, that is, the site administrator. It is foolish to load the system with an exhaustive search over millions of alternatives if you only need to recognize the destination station when ordering a train ticket. There are only about five thousand stations, and the Speereo recognition system handles them very well.

Over time, we will spare you even the work of making lists in most standard cases. They will already be publicly available on our server as the fruit of collective effort. The sets will be along the lines of “yes/no in all possible non-abusive variants”, “all streets of city N”, “names of all drugs in the Russian Federation”, “all first names”, “all surnames”, “all musical groups by genre”, “all movies by genre” and the like. Speaking of surnames: a frequently used input field, isn't it? Well, my surname is not in Google's dictionary, and I cannot enter it by voice at all. With Speereo, a surname entered once by hand is added to the list of voice commands. So if the words you need are not in Google's dictionary, their engine does not suit you, and ours does. Another nice touch: we work with any browser and platform, though not for free (beware of Greeks bearing gifts).

How to test our cloud solution and recognition quality:

For testing, we offer three test benches, each configured to recognize all Moscow streets and house numbers from 1 to 300:
- a bench with file upload and Silverlight;
- a bench with file upload via Flash;
- a “pure experiment” bench, with upload of files recorded on your PC.

~~To check the performance of the Speereo recognition system, please register. This is needed to distribute the load on the server (an attempt to fight the Habr effect).~~ Due to complaints, registration has been removed! Try it to your heart's content!
The three test variants differ as follows: the file upload page is the purest experiment, letting you test the system under all the conditions we prefer (quality of the sound signal and so on); the Silverlight page is an express testing method, for which you need to install Silverlight to activate the microphone; the Flash page is the fastest way, with the most sensitive sound input (do not shout into the microphone, or lower the recording level). Note that the second and third methods, although faster, do not guarantee full compliance with the recognition conditions, because they depend on your system sound settings, which can reduce the quality of the recorded and transmitted signal.

NB: We should say up front that Flash is much more sensitive to the recorded signal than Silverlight, so do not raise your voice during tests, or lower the recording level. ~~Speaking for myself: the recording level I need for Flash on my laptop is 30-40% lower than the one I set for Silverlight.~~ Fixed, thanks to everyone for the comments.

How to connect the "recognition cloud" to your site

Check out the technical details of the connection:
a) Send us the command lists for each button as text files, one command per line. Put the number or name of the button in the file name.
b) Test the recognizer for a few days at the address given in our reply email.
c) Choose a tariff, make the minimum payment and start using it!
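Step a) can be prepared with a few lines of code. The sketch below writes one text file per button with one command per line; the `.txt` extension, UTF-8 encoding and the button names are our assumptions, since the exact naming convention is not specified here.

```python
from pathlib import Path
from typing import Dict, List


def write_command_lists(lists: Dict[str, List[str]], out_dir: Path) -> List[Path]:
    """Write one file per button, named after the button, one command per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for button, commands in lists.items():
        path = out_dir / f"{button}.txt"  # hypothetical naming scheme
        path.write_text("\n".join(commands) + "\n", encoding="utf-8")
        written.append(path)
    return written


# Example lists for a ticket booking page (button names are illustrative).
command_lists = {
    "button_from": ["from Moscow", "from St. Petersburg"],
    "button_to": ["to Moscow", "to St. Petersburg"],
}
for path in write_command_lists(command_lists, Path("speereo_lists")):
    print(path.name)
```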

Tariffs

Tariff "one to one"
4 recognitions per second *, 1 month - 99 thousand rubles. The minimum advance payment is 99 thousand rubles.

Tariff "one to ten"
4 recognitions per 10 seconds *, 1 month - 9.9 thousand rubles. The minimum advance payment is 9.9 thousand rubles.

Tariff "one to one hundred"
4 recognitions per 100 seconds *, 1 month - 990 rubles. The minimum advance payment is 4500 rubles.

Tariff "web king"
1 recognition for 10 kopecks **. The minimum advance payment is 10 thousand rubles.

Tariff "with syrup"
Recognition rate on request. 1 recognition for 3 kopecks. Call us.

Tariff "cloud in pants"
Placement of the recognizer on the client's server.
Annual payment - call.

* - the guaranteed rate at which the cloud returns recognition results, with a recognition delay of no more than one second. The server's connection speed is sufficient to receive the corresponding volume of sound files.

** - the guaranteed rate at which the cloud returns recognition results, with a recognition delay of no more than one second, is 4 per second.
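To compare the flat tariffs with the pay-per-recognition "web king" tariff, a bit of arithmetic helps. The sketch below assumes a 30-day month and reads the flat prices as 99,000, 9,900 and 990 rubles per month; it ignores the minimum advance payments.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month

# name: (monthly price in rubles, guaranteed recognitions, per this many seconds)
FLAT_TARIFFS = {
    "one to one": (99_000, 4, 1),
    "one to ten": (9_900, 4, 10),
    "one to one hundred": (990, 4, 100),
}
PER_RECOGNITION_KOPECKS = 10  # "web king" tariff


def monthly_capacity(count: int, per_seconds: int) -> int:
    """Maximum recognitions per month at the guaranteed rate."""
    return count * SECONDS_PER_MONTH // per_seconds


def break_even(monthly_price_rub: int) -> int:
    """Recognitions per month at which pay-per-use costs as much as the flat fee."""
    return monthly_price_rub * 100 // PER_RECOGNITION_KOPECKS


for name, (price, count, per_seconds) in FLAT_TARIFFS.items():
    print(name, monthly_capacity(count, per_seconds), break_even(price))
# one to one 10368000 990000
# one to ten 1036800 99000
# one to one hundred 103680 9900
```

Under these assumptions, each flat tariff becomes cheaper than paying 10 kopecks per recognition once the monthly volume exceeds roughly 9.5% of the guaranteed capacity.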

Source: https://habr.com/ru/post/120023/
