Speech recognition. Part 2. Typical speech recognition system structure
Speech recognition is a multi-level pattern recognition problem in which acoustic signals are analyzed and structured into a hierarchy of structural elements (for example, phonemes), words, phrases, and sentences. Each level of the hierarchy may impose constraints, for example, the possible word sequences or the known types of pronunciation, which reduce the number of recognition errors at the lower levels. The more a priori information we know (or assume) about the input signal, the better we can process and recognize it. The structure of a standard speech recognition system is shown in the figure. Let us consider the basic elements of this system.
Raw speech. Typically, a stream of audio data recorded at a high sampling rate (about 20 kHz when recording from a microphone, or 8 kHz when recording from a telephone line).
Signal analysis. The incoming signal must first be transformed and compressed to facilitate subsequent processing. There are various methods for extracting useful parameters and compressing the source data by a factor of tens without losing useful information. The most commonly used methods are:
Fourier analysis;
linear predictive coding of speech;
cepstral analysis.
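The first of these methods, Fourier analysis, can be sketched as follows: the signal is split into short overlapping frames and each frame's magnitude spectrum is computed. This is a minimal illustration, not a production front-end; the helper names (`frame_signal`, `dft_magnitudes`) and the naive DFT are my own choices for clarity.

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a signal into frames of frame_len samples, advancing by hop."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum of one frame (first half of the bins)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

# A 100 Hz sine sampled at 8 kHz (telephone rate), split into 10 ms frames.
rate = 8000
signal = [math.sin(2 * math.pi * 100 * t / rate) for t in range(800)]
frames = frame_signal(signal, frame_len=80, hop=80)   # 80 samples = 10 ms
spectrum = dft_magnitudes(frames[0])
# Bin k covers k * rate / 80 Hz, so the 100 Hz tone peaks in bin 1.
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
```

A real system would also apply a window function (e.g. Hamming) to each frame before the transform, and would typically go on to compute cepstral coefficients from the spectrum.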
Speech frames. The result of signal analysis is a sequence of speech frames. Usually, each speech frame is the result of analyzing the signal over a short period of time (about 10 ms) and contains information about that segment (about 20 coefficients). To improve recognition quality, information about the first or second derivative of these coefficients can be added to the frames to describe the dynamics of speech.
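Those "derivative" (delta) features can be approximated as differences between neighboring frames' coefficients. The sketch below is a simplified version: real systems usually use a regression over several surrounding frames rather than a plain two-point difference.

```python
def delta(frames):
    """First-order deltas: central difference of each coefficient
    across neighboring frames (edges use the nearest available frame)."""
    deltas = []
    for i in range(len(frames)):
        prev = frames[max(i - 1, 0)]
        nxt = frames[min(i + 1, len(frames) - 1)]
        deltas.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return deltas

# Three toy frames of 3 coefficients each (values are invented).
frames = [[0.0, 1.0, 2.0], [1.0, 1.0, 2.0], [2.0, 1.0, 2.0]]
# Append the deltas, doubling each frame's length: statics plus dynamics.
augmented = [f + d for f, d in zip(frames, delta(frames))]
```

Here only the first coefficient changes over time, so only its delta is nonzero; the augmented middle frame becomes `[1.0, 1.0, 2.0, 1.0, 0.0, 0.0]`.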
Acoustic models. To analyze the composition of speech frames, a set of acoustic models is required. Consider the two most common ones.
Template model. A stored acoustic example of the recognized structural unit (a word, a command) serves as the acoustic model. The variability of recognition with such a model is achieved by storing different variants of pronunciation of the same element (many speakers repeating the same command many times). It is used mainly for recognizing whole words (command systems).
State model. Each word is modeled as a sequence of states indicating the set of sounds that may be heard in a given segment of the word, based on probabilistic rules. This approach is used in larger systems.
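A state model can be illustrated with a toy left-to-right word model, where each state assigns probabilities to the (discretized) sounds it may emit. Everything here, the word, the sound labels, and the probabilities, is invented for illustration; real systems model emissions over continuous frame vectors and add transition probabilities.

```python
# Toy left-to-right "state model" of the word "go": one state per segment,
# each with a probability distribution over sound labels (values invented).
word_model = [
    {"g": 0.8, "k": 0.2},   # state 0: usually sounds like "g"
    {"o": 0.9, "u": 0.1},   # state 1: usually sounds like "o"
]

def sequence_probability(model, sounds):
    """Probability that the states, visited in order, emit these sounds."""
    if len(sounds) != len(model):
        return 0.0
    p = 1.0
    for state, sound in zip(model, sounds):
        p *= state.get(sound, 0.0)   # unknown sound -> probability 0
    return p

p_go = sequence_probability(word_model, ["g", "o"])   # 0.8 * 0.9
p_ku = sequence_probability(word_model, ["k", "u"])   # 0.2 * 0.1
```

The model still assigns a small probability to the mispronounced variant `["k", "u"]`, which is exactly how such models absorb pronunciation variability without storing every variant explicitly.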
Acoustic analysis. This consists of comparing the acoustic models against each frame of speech, producing a matrix that matches the sequence of frames against the set of acoustic models. For the template model, this matrix holds the Euclidean distance between the template frame and the recognized frame (that is, it calculates how far the received signal deviates from the recorded template and finds the template that best fits the received signal). For state-based models, the matrix consists of the probabilities that a given state could generate a given frame.
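For the template case, this matching matrix is just pairwise Euclidean distances between template frames and input frames. A minimal sketch (the helper names are mine):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_matrix(template, frames):
    """Row i, column j = distance of input frame j from template frame i."""
    return [[euclidean(t, f) for f in frames] for t in template]

# Two-coefficient toy frames (values invented for illustration).
template = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.0, 0.1], [0.9, 1.0]]
m = match_matrix(template, frames)
# Small values on the diagonal mean the input tracks the template closely.
```

For a state model, the same matrix shape would hold emission probabilities instead of distances, with large values (rather than small ones) indicating a good match.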
Time alignment. Used to handle the temporal variability that arises when words are pronounced (for example, "stretched" or "swallowed" sounds).
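The classic technique for this step is dynamic time warping (DTW), which finds the cheapest alignment between two sequences while allowing either one to be locally stretched. A minimal sketch over 1-D sequences (real systems align sequences of coefficient vectors):

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Minimal dynamic time warping cost between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A "stretched" pronunciation of the same pattern (toy values).
template = [1, 2, 3, 2, 1]
stretched = [1, 1, 2, 2, 3, 3, 2, 2, 1]
aligned = dtw_distance(template, stretched)
# A naive frame-by-frame comparison of the overlapping prefix disagrees:
naive = sum(abs(x - y) for x, y in zip(template, stretched))
```

DTW aligns the stretched version to the template at zero cost, while the naive frame-by-frame comparison accumulates spurious differences, which is exactly the temporal variability this stage must absorb.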
Word sequence. As its final result, the speech recognition system produces the sequence (or several candidate sequences) of words that most likely corresponds to the input speech stream.
UPD: Moved to the "Artificial Intelligence" hub. If there is interest, I will continue publishing there.