
A new record in speech recognition: the algorithm's error rate is down to 5.5%


IBM 100: The Origins of Speech Recognition

On average, a person misses or misunderstands one or two of every 20 words spoken by an interlocutor. Over a five-minute conversation, the number of words a listener fails to hear or recognizes incorrectly can reach 80. Quite a lot, isn't it? And what about computers: what is their error rate?

Last year, IBM announced a record in speech recognition technology: the service's error rate had fallen to 6.9%. Since then, the company has improved the system considerably, and in 2017 it set a new record of 5.5%.

And this is not about recognizing carefully articulated speech, such as sentences read by a professional announcer. No, 5.5% is the error rate for recognizing a conversation between two ordinary people discussing, for example, whether to buy a car, or other everyday topics.
This achievement was made possible by combining LSTM (Long Short-Term Memory) and WaveNet language models with three other acoustic models. As a result, in some cases the computer recognizes speech with even fewer errors than a human does (the human average is 5.9%). But IBM's developers decided not to stop there: they now aim to bring the error rate down to 5.1%.

The speech models currently in use are self-learning. They learn not only from successful recognition of difficult passages of speech, but also from failures, much as a person does. Over time, the system reduces its error rate on human speech, improving the overall result.

Experts believe that computer systems can set new records, though the 5.1% error-rate target still poses a challenge for scientists and engineers. Moreover, standard tests cannot identify all the problem areas that may arise when developing specialized speech recognition systems. “For example, different data sets may be more or less sensitive to different aspects of the problem,” says Yoshua Bengio, one of the experts working on speech recognition algorithms.

Incidentally, the measured performance of speech recognition technologies depends heavily on the evaluation methodology. For example, the error rates mentioned above were obtained under the SWITCHBOARD evaluation methodology. There is another benchmark, called CallHome, which measures recognition errors on conversations between family members discussing random topics. On CallHome, the human error rate is 6.8%, while the best result achieved by a machine so far is 10.3%. Impressive, but the machine has not yet reached the human level.
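The error rates quoted throughout this article are word error rates (WER): the number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the length of the reference. As a minimal sketch (the sentence pair below is an invented example, not taken from the SWITCHBOARD or CallHome corpora), WER can be computed with a word-level Levenshtein distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words,
    computed via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("by" for "buy") and one deletion ("instead")
# over a 7-word reference gives a WER of 2/7, roughly 28.6%.
print(word_error_rate("we could buy a used car instead",
                      "we could by a used car"))
```

A WER of 5.5% thus means roughly one word wrong out of every 18, which is why the article compares it to a human listener missing one or two words out of 20.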



“The ability to recognize speech as well as a person does is a challenge for machine learning specialists, since human speech, especially on random topics, is extremely complex,” says Julia Hirschberg, a professor at Columbia University. “Another problem is assessing the level of speech recognition by humans themselves, since different people differ greatly in their ability to understand their interlocutors' speech. When we compare a person and a machine, it is very important to take two things into account: the efficiency of the algorithm and the method of estimating the error rate.”

According to Gartner analysts, IBM's achievements could shape the future of the entire field of artificial intelligence and the Internet of Things.

“With the proliferation of digital assistants such as Alexa and Google Assistant, reducing errors in human speech recognition could drive the widespread adoption of speech interfaces in both consumer and enterprise applications,” said Gartner analyst Mark Hung.

Source: https://habr.com/ru/post/325098/
