
No, this post is not yet another canard about legal disputes, news along the lines of "the head of one company insulted the other", or speculation on that subject. It is about the speech recognition and text-to-speech engines provided by Google and Microsoft, and how well they get along with each other.
As you know, both Google and Microsoft have speech recognition and text-to-speech. Google's tools are online and are used for translation and search; Microsoft's are built into the operating system and serve as an additional way of interacting with the interface. Let's try to cross a bulldog with a rhinoceros and see how well these things work with each other. To do this, I will take 10 fairly well-known English phrases (I have no illusions at all about Russian-language support), generate audio files from them with both companies' engines, and try to recognize those files (again, in two ways).
Tools used
- Google text-to-speech: Google Translate
- Google speech-to-text: the program by the respected Yakhnev, which had to be slightly corrected (long live open source)
- Microsoft text-to-speech: the Anna engine
- Microsoft speech-to-text: Windows Speech Recognition
Test phrases
- May the Force be with you.
- A martini. Shaken, not stirred.
- History of the past events.
- Leap for mankind
- Do the right thing. It will gratify some people and astonish the rest.
- I have a dream that one day this nation will rise up.
- Elementary, my dear Watson.
- Life was like a box of chocolates: you never know what you're gonna get.
- Behind every great fortune there is a crime.
- Genius is one percent inspiration and ninety-nine percent perspiration.
If you are bored, try to recall where each phrase comes from (without Google, otherwise it's no fun).
So, here are the resulting audio files.
Recognition Results
Here's how Google recognized the audio that it itself generated:
- may the force be with you - 100%
- a martini shaken not stirred - 100%
- 500 error - 0%
- that's 1 small step for man 1 giant leap for mankind - 92%
- direct - 77%
- I have a dream that 1 day this nation will rise up - 100%
- elementary my dear watson - 100%
- youn’t know what you’re gonna get - 100%
- back every great fortune terrace brookline - 50%
- genius is 1 percent inspiration and 99 percent perspiration - 100%
Average result: 82%. Note that Google could not recognize the third phrase at all; it returned an error.
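The post does not say how the per-phrase percentages were computed. As a minimal sketch, and purely as an assumption rather than the method used here, one way to score a transcript against its reference phrase is a word-level similarity ratio. The helper names `words` and `similarity_percent` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def words(text: str) -> list[str]:
    """Lowercase the text and keep only word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity_percent(reference: str, transcript: str) -> float:
    """Word-level similarity between reference and transcript, from 0 to 100."""
    # SequenceMatcher.ratio() is 2 * matches / (len(a) + len(b)).
    matcher = SequenceMatcher(None, words(reference), words(transcript))
    return round(100 * matcher.ratio(), 1)


print(similarity_percent("May the Force be with you.", "may the force be with you"))
print(similarity_percent("A martini. Shaken, not stirred.", "m martini shaken not stirred"))
```

With this metric, a transcript that writes "1" where the reference says "one" counts as a mismatch unless numbers are normalized first, so the scores it gives will not necessarily match the ones listed in this post.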
Here's how Google recognized the audio created by Microsoft's voice engine:
- may the force be with you - 100%
- m martini shaken not stirred - 80%
- history of the past few bands - 93%
- that's 1 small step for man 1 giant leap for mankind - 92%
- astonish arrest - 85%
- I have a dream that 1 day this nation will rise up - 100%
- elementary my dear watson - 100%
- life is like a box of chocolates - 93%
- fortune there is a crime - 100%
- genius is 1 percent inspiration and 99 percent perspiration - 100%
Average result: 94%. Google understands Microsoft about 12 percentage points better than it understands itself!
Funny, isn't it? Although, if you think about it, there is nothing strange here. Microsoft's Anna sounds stricter, makes a pause between words, and to the human ear sounds more mechanical than the Google translator voice. So it is logical that Google's more "human" voice is recognized less well.
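The two averages can be re-derived from the per-phrase scores in the result lists. The sketch below only repeats that arithmetic; the variable names are my own.

```python
from statistics import mean

# Per-phrase scores copied from the two result lists in this post.
google_on_google = [100, 100, 0, 92, 77, 100, 100, 100, 50, 100]
google_on_microsoft = [100, 80, 93, 92, 85, 100, 100, 93, 100, 100]

avg_google = mean(google_on_google)
avg_microsoft = mean(google_on_microsoft)
difference = avg_microsoft - avg_google

print(f"Google voice:    {avg_google:.1f}%")
print(f"Microsoft voice: {avg_microsoft:.1f}%")
print(f"Difference:      {difference:.1f} percentage points")
```

The unrounded averages are 81.9% and 94.3%, a gap of 12.4 percentage points.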
As for recognizing the audio files with Windows, failure awaited me. First, my Russian-language Windows cannot do it at all (but that is a trifle), and second, Microsoft's voice recognition works on completely different principles. It is built on training and gets better the longer you teach the computer to understand you. I could not decide whether, for this experiment, the engine should not be trained at all (in which case I don't even understand how to start it) or trained "until blue in the face" until everything is recognized, so I decided not to run that experiment. If anyone is interested in doing it, here once again are a link to the test audio files and an article about how to make a program that recognizes text from audio files rather than from microphone input.
Since the experiment turned out to be focused on Google technologies, I am posting this topic in the Google blog.