
Microsoft's neural networks now recognize human speech as well as humans do. A report from the company's Speech & Dialog research group says that its speech recognition system now makes mistakes about as often as professional transcribers, and in some cases it makes even fewer errors.
In the tests, the word error rate (WER) was 5.9%, lower than the 6.3% that Microsoft reported a month earlier and the lowest result ever recorded. The team attributes this not to a breakthrough in algorithms or data, but to careful tuning of existing AI architectures. The main difficulty is that even when the audio track is of good quality and contains no extraneous noise, the algorithm must deal with different voices, interruptions, hesitations and other nuances of live human speech.
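For reference, the word error rate quoted here is the standard metric of the field: the minimum number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. Below is a minimal sketch of that standard computation (not code from the Microsoft study; the example sentences are made up for illustration):

```python
# Minimal sketch of the standard word error rate (WER) computation:
# the Levenshtein (edit) distance between hypothesis and reference word
# sequences, divided by the number of reference words. Not Microsoft's code;
# the example sentences below are invented for illustration.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of substitutions, deletions and insertions
    # needed to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference -> WER = 0.2 (20%)
print(wer("the system makes few errors",
          "the system made few errors"))
```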
To test how closely the algorithm can match human abilities, Microsoft hired outside transcribers to keep the experiment clean. The company had already prepared a correct transcript of the audio file, which was given to the specialists. The transcribers worked in two stages: first one person transcribed the audio fragment, then a second person listened and corrected errors in the transcript. Measured against the correct transcripts used for the standardized tests, the experts made 5.9% errors when transcribing a conversation on a specific topic, and 11.3% errors on a free-form dialogue. After 2,000 hours of training on human speech, the Microsoft system scored 5.9% and 11.1% errors on the same audio files, respectively. This means the computer can now recognize the words in a conversation as well as a person can. With that, the team fulfilled a goal it had set for itself less than a year earlier, and the result significantly exceeded expectations.
Now Microsoft intends to repeat the same result in noisy environments, for example while driving on a highway or at a party. In addition, the company plans to focus on more effective ways to help the technology recognize individual speakers when they talk at the same time, and to make sure the AI works well with a wider variety of voices, regardless of age or accent. Realizing these capabilities is crucial for the future and goes well beyond simple transcription.
To achieve these results, the researchers used the company's own Computational Network Toolkit (CNTK). This neural network toolkit's ability to quickly run learning algorithms across multiple computers equipped with graphics processors greatly increased the speed at which they could do research and, ultimately, reach the human level.

This level of accuracy was made possible by the use of three variants of convolutional neural networks. The first was the VGG architecture, characterized by a large number of hidden layers. Compared with the networks previously used for image recognition, it uses smaller (3x3) filters stacked more deeply, with up to five convolutional layers before each pooling step. The second network is modeled on the ResNet architecture, which adds skip (residual) connections; the only difference is that the developers applied batch normalization before the ReLU activation.
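To give a concrete feel for these two ideas, here is a minimal sketch in PyTorch (not the CNTK models Microsoft actually trained; the channel counts and input shapes are made up for illustration): a VGG-style stack of small 3x3 convolutions before a single pooling step, and a residual block with a skip connection in which batch normalization is applied before the ReLU.

```python
# Minimal PyTorch sketch (not Microsoft's CNTK implementation; channel counts
# and input shapes are arbitrary) of the two ideas described above: a VGG-style
# stack of small 3x3 convolutions before pooling, and a residual block with a
# skip connection where batch normalization comes before the ReLU activation.
import torch
import torch.nn as nn

class VGGStyleBlock(nn.Module):
    """Several 3x3 convolutions stacked before a single pooling step."""
    def __init__(self, in_ch: int, out_ch: int, num_convs: int = 3):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                 kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(kernel_size=2))   # pool only after the stack
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class ResidualBlock(nn.Module):
    """3x3 conv -> batch norm -> ReLU, twice, plus a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))      # batch norm before ReLU
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                     # skip (residual) connection

# Toy forward pass over a batch of 40x40 "spectrogram" patches (shapes made up)
x = torch.randn(2, 1, 40, 40)
y = ResidualBlock(16)(VGGStyleBlock(1, 16)(x))
print(y.shape)  # torch.Size([2, 16, 20, 20])
```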
The last network in the list is LACE (layer-wise context expansion with attention). It is a variant of a time-delay neural network in which each higher layer is a nonlinear transformation of a weighted sum of windows of lower-layer frames. In other words, each higher layer uses a wider context than the layers below it: the lower layers focus on extracting simple local structure, while the higher layers extract more complex structure that covers broader contexts.
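Again purely as an illustration, the time-delay idea behind LACE can be sketched with 1-D convolutions over time (this is not the actual LACE model; layer widths and window sizes are arbitrary): each layer forms a nonlinear transformation of weighted sums over a window of lower-layer frames, and stacking layers widens the temporal context that each higher layer sees.

```python
# Simplified sketch of the time-delay idea behind LACE (not the actual LACE
# model; layer widths and window sizes are arbitrary). Each layer is a
# nonlinear transformation of weighted sums over a window of lower-layer
# frames, implemented here as a 1-D convolution over time; stacking layers
# means each higher layer covers a wider temporal context.
import torch
import torch.nn as nn

class TimeDelayLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, window: int = 3):
        super().__init__()
        # weighted sum over a window of frames, followed by a nonlinearity
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=window,
                              padding=window // 2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):        # x: (batch, features, time)
        return self.relu(self.conv(x))

# Three stacked layers with 3-frame windows: the top layer's receptive field
# covers 7 input frames, i.e. a wider context than the layers below it.
model = nn.Sequential(
    TimeDelayLayer(40, 128),     # sees 3 frames of input
    TimeDelayLayer(128, 128),    # sees 5 frames of input
    TimeDelayLayer(128, 128),    # sees 7 frames of input
)

frames = torch.randn(1, 40, 100)    # 100 frames of 40-dim acoustic features (made up)
print(model(frames).shape)          # torch.Size([1, 128, 100])
```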

For the company, this achievement is another step towards easy and natural communication with computers. But as long as a computer cannot understand the meaning of what is said to it, it will not be able to correctly execute a command or answer a question. That task is much more difficult, and it forms the basis of what Microsoft plans to do in the coming years. Earlier this year, Satya Nadella said that artificial intelligence is the “future of the company,” and its ability to communicate with people has become a cornerstone. “The next frontier is the move from recognition to understanding,” said Geoffrey Zweig, head of the Speech & Dialog research group.
Despite the obvious success, there is one big difference between the automatic system and the work of human transcribers: the system cannot understand subtle conversational nuances like the sound “uh”. We may utter this sound involuntarily to fill a pause while thinking about what to say next, or it can be a signal, like “uh-huh”, that the other person may continue speaking. Professional transcribers can tell these apart, but such small cues are lost on the artificial intelligence, which cannot understand the context in which a particular sound was made.
“Five years ago I would not even have thought that we could achieve such a result. I simply would not have thought it was possible,” said Harry Shum, executive vice president and head of Microsoft's Artificial Intelligence and Research Group.
The first research in the field of speech recognition dates back to the 1970s, when the United States Defense Advanced Research Projects Agency (DARPA) set the task of creating breakthrough technology in the interests of national security. Over the following decades, most of the largest IT companies and many research organizations joined the race. “This achievement is the culmination of more than twenty years of effort,” notes Geoffrey Zweig.
Microsoft believes that the results of its speech recognition work will have a major impact on the development of the company's consumer and business products, whose number will grow significantly. At a minimum, the Xbox and Cortana will gain new features from the existing work. In addition, every user will be able to use the tools to instantly convert speech to text.