The history of the development of speech recognition systems: how we came to Siri
Looking back, the development of speech recognition technology resembles watching a child mature: first recognizing individual words, then handling large vocabularies, and finally answering questions quickly, the way Siri does.
Listening to Siri, with her slightly snarky sense of humor, it is easy to appreciate how far the speech recognition industry has come. Let's take a look back at the decades that made it possible to control devices using voice alone.
1950s and 1960s: Baby talk
The first speech recognition systems could only understand numbers (given the complexity of human language, it made sense for engineers to focus on digits first). Bell Laboratories developed the Audrey system, which recognized digits spoken by a single voice. Ten years later, in 1962, IBM demonstrated its own creation, the "Shoebox" system, which understood 16 words in English.
Laboratories in the USA, Japan, England, and the USSR developed several more devices that recognized individual spoken sounds, extending speech recognition technology to support four vowels and nine consonants. The results were far from impressive, but these first attempts were a remarkable start, especially considering how primitive the computers of that era were.
1970s: Systems gradually gain popularity
Speech recognition systems made great strides in the seventies thanks to interest and funding from the US Department of Defense. Its DARPA Speech Understanding Research (SUR) program, which ran from 1971 to 1976, was one of the largest in the history of speech recognition and, among other things, funded the Harpy system at Carnegie Mellon University. Harpy understood 1,011 words, roughly the vocabulary of an average three-year-old child.
Harpy was a significant milestone because it introduced a more efficient search approach called beam search, "searching a finite-state network of possible sentences" (Readings in Speech Recognition).
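This is not Harpy's actual implementation, but the core idea of beam search can be sketched in a few lines: at each step, expand every surviving hypothesis, then keep only the few highest-scoring ones instead of exploring every possible sentence. The word candidates and their scores below are invented for illustration.

```python
import math

def beam_search(steps, beam_width=2):
    """Keep only the `beam_width` best partial hypotheses at each step.

    `steps` is a list of dicts mapping a candidate word to its
    log-probability at that time step -- a toy stand-in for the scores
    a real recognizer would compute from the audio.
    """
    beams = [([], 0.0)]  # (word sequence, cumulative log-probability)
    for candidates in steps:
        # Expand every surviving hypothesis with every candidate word.
        expanded = [
            (seq + [word], score + logp)
            for seq, score in beams
            for word, logp in candidates.items()
        ]
        # Prune: keep only the highest-scoring hypotheses.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

# Hypothetical per-step scores for the utterance "recognize speech".
steps = [
    {"recognize": math.log(0.6), "wreck a nice": math.log(0.4)},
    {"speech": math.log(0.7), "beach": math.log(0.3)},
]
best_sequence, best_score = beam_search(steps, beam_width=2)[0]
print(best_sequence)  # ['recognize', 'speech']
```

The pruning step is what made systems like Harpy tractable: the number of hypotheses stays fixed instead of growing exponentially with sentence length.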
The seventies also saw several other milestones in this technology, such as the founding of Threshold Technology, the first commercial speech recognition company, which introduced a system that could interpret different voices.
1980s: Speech recognition turns to prediction
In the next decade, thanks to new approaches and technologies, the vocabulary of such systems grew from several hundred to several thousand words, with the potential to recognize an unlimited number of words. One of the reasons was a new statistical method known as the hidden Markov model, which estimated the probability that unknown sounds formed words rather than simply matching them against fixed templates.
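The hidden Markov model's probabilistic idea can be illustrated with a toy Viterbi decoder: given a sequence of observed sound classes, it finds the most likely sequence of hidden "word" states. The two states, sound classes, and all probabilities below are invented for illustration, not taken from any real recognizer.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # V[t][s] = probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # Best previous state to transition from, weighted by emission.
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_state = max(V[-1], key=V[-1].get)
    return path[best_state]

# Hypothetical model: two "word" states emitting coarse sound classes.
states = ["hello", "yellow"]
start_p = {"hello": 0.6, "yellow": 0.4}
trans_p = {"hello": {"hello": 0.7, "yellow": 0.3},
           "yellow": {"hello": 0.4, "yellow": 0.6}}
emit_p = {"hello": {"h-sound": 0.8, "y-sound": 0.2},
          "yellow": {"h-sound": 0.3, "y-sound": 0.7}}
print(viterbi(["h-sound", "h-sound"], states, start_p, trans_p, emit_p))
```

Because the model reasons in probabilities, it can pick the best word even when the audio is ambiguous, which is exactly what let 1980s systems scale past rigid template matching.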
With an extended vocabulary, speech recognition began making its way into commercial applications for business and specialized industries such as medicine. It even entered ordinary people's homes in 1987 in the form of Worlds of Wonder's Julie doll, which children could train to recognize their voice ("Finally, a doll that understands you").
Although recognition software such as Kurzweil's speech-to-text program could recognize up to 5,000 words, these programs had a huge drawback: they supported only discrete dictation, meaning you had to pause after each word for the program to process it.
1990s: Automatic speech recognition goes to the masses
In the nineties, computers finally got processors fast enough to make speech recognition software viable.
In 1990, the first publicly available program, Dragon Dictate, appeared, with a staggering price of $9,000. Seven years later, an improved version, Dragon NaturallySpeaking, was released. It recognized continuous speech, so you could speak at a normal pace of about 100 words per minute. You still had to train the program for 45 minutes before using it, though, and it still carried a high price of $695.
In 1996, BellSouth launched VAL, the first voice portal: a dial-in interactive speech recognition system that provided information based on what you said into the phone. VAL paved the way for all the inaccurate voice-activated phone menus that annoyed callers for the next 15 years.
2000s: Speech recognition stagnates, until Google appears
By 2001, speech recognition accuracy had reached about 80 percent, and progress stalled. Recognition systems worked well when the vocabulary was limited, but they still had to "guess" among similar-sounding words using statistical models, and the language universe kept growing along with the Internet.
Did you know that voice recognition and voice commands were built into Windows Vista and Mac OS X? Most users never even realized such functionality existed. Windows Speech Recognition and OS X voice commands were interesting, but not as accurate or convenient as the keyboard and mouse.
Speech recognition technology got a second wind after one important event: the appearance of the Google Voice Search application for the iPhone. Its impact was significant for two reasons. First, phones and other mobile devices are ideal candidates for speech recognition, and the desire to replace tiny on-screen keyboards with alternative input methods was strong. Second, Google could offload the processing to its cloud data centers, applying their full power to large-scale data analysis, matching users' spoken words against the enormous number of voice-query samples it collected.
In short, the bottleneck for speech recognition had always been the availability of data and the ability to process it efficiently. The application added billions of searches to the analysis in order to better predict what you said.
In 2010, Google added personalized recognition to Voice Search on Android phones. The software could record a user's voice queries to build a more accurate voice model. The company also added voice recognition to its Chrome browser in mid-2011. Remember how this all started with 10 digits and worked up to several thousand words? The Google system now draws on 230 billion words from actual user queries.
Then Siri appeared. Like Google Voice Search, it relies on cloud computing. It uses what it knows about you to generate a response rooted in context, and it answers your requests with something like a personality. Speech recognition turned from a tool into entertainment.
The future: Accurate and ubiquitous speech recognition
The boom in speech recognition applications shows that its time has come, and we can expect a huge number of them in the future. These applications will not only let you control your computer by voice or convert speech to text; they will also distinguish between different languages and let you choose a voice assistant from a range of options.
It is likely that speech recognition technology will spread to other types of devices. It is not hard to imagine a future where we control coffee makers by voice, talk to printers, and tell the lights to turn themselves off.