Epigraph
Speech recognition is indeed a rather underdeveloped field in Russia. Google announced a system for recording and recognizing telephone conversations long ago ... Unfortunately, I have not yet heard of a system of comparable scale and recognition quality for Russian.
Still, one should not assume that everything abroad was discovered long ago and that we will never catch up. While gathering material for this series, I had to dig through a mountain of foreign papers and theses. And those articles and dissertations were written by great American scientists such as
Huang Xuedong, Hisayoshi Kojima, DongSuk Yuk, et al.
Is it clear now who this branch of American science rests on? ;0)
In Russia, I know of only one serious company that has managed to bring domestic speech recognition systems to a commercial level: the Center for Speech Technologies. But perhaps, after this series of articles, someone will decide that developing such systems is both possible and worthwhile. After all, in terms of algorithms and mathematical apparatus, we are hardly behind at all.
Classification of speech recognition systems
Today the term "speech recognition" covers a whole field of scientific and engineering activity. In general, every speech recognition task comes down to extracting human speech from an input audio stream, classifying it, and responding to it appropriately. That may mean executing an action on a person's command, spotting a particular marker word in a large archive of telephone conversations, or taking dictation for voice text input.
Criteria for classifying speech recognition systems
Every such system has tasks it is designed to solve and a set of approaches used to solve them. Let us consider the main criteria by which human speech recognition systems can be classified, and how each criterion affects the operation of the system.
- Dictionary size. Obviously, the larger the dictionary built into the recognition system, the higher its word error rate. For example, a dictionary of 10 digits can be recognized almost without error, while the error rate for a 100,000-word dictionary can reach 45%. On the other hand, even a small dictionary can yield many recognition errors if its words are very similar to one another.
- Speaker-dependent or speaker-independent system. By definition, a speaker-dependent system is intended for use by a single user, while a speaker-independent system is designed to work with any speaker. Speaker independence is a difficult goal to achieve, because during training the system is tuned to the parameters of the speaker on whose examples it is trained. The recognition error rate of speaker-independent systems is usually 3-5 times higher than that of speaker-dependent ones.
- Isolated or continuous speech. If each word in speech is separated from the next by a stretch of silence, the speech is said to be isolated. Continuous speech consists of naturally pronounced sentences. Recognizing continuous speech is much harder, because the boundaries of individual words are not clearly defined and their pronunciation is heavily distorted by the slurring of adjacent sounds.
- Purpose. The purpose of the system determines the level of abstraction at which recognition takes place. In a command system (for example, voice dialing on a mobile phone), a whole word or phrase will most likely be recognized as a single speech element. A text dictation system, on the other hand, requires higher recognition accuracy and, when interpreting a spoken phrase, will most likely rely not only on what was said at that moment but also on how it relates to what was said before. The system must also contain a built-in set of grammatical rules that the spoken, recognizable text must satisfy. The stricter these rules, the easier the recognition system is to implement and the more limited the set of sentences it will be able to recognize (a minimal sketch of such a command grammar follows this list).
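
To make the last point concrete, here is a minimal sketch of a command grammar for a voice-dialing scenario. The command words and contact names are invented for illustration; the point is only that a strict grammar reduces recognition to choosing among a handful of allowed phrases.

```python
# Minimal sketch of a strict command grammar (hypothetical commands and names).
# A recognizer constrained by such a grammar only has to pick one of a few
# allowed phrases, which is far easier than unconstrained dictation.
CONTACTS = ["alice", "bob", "carol"]        # assumed contact list
DIAL_VERBS = ["call", "dial"]

# Enumerate every utterance the grammar admits.
ALLOWED_PHRASES = {f"{verb} {name}" for verb in DIAL_VERBS for name in CONTACTS}
ALLOWED_PHRASES.add("redial")               # a command that takes no argument

def accept(hypothesis: str) -> bool:
    """Return True if the recognized word sequence satisfies the grammar."""
    return hypothesis.lower().strip() in ALLOWED_PHRASES

print(accept("Call Alice"))    # True: matches the grammar
print(accept("call dentist"))  # False: rejected, "dentist" is not in the grammar
```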
Differences in speech recognition methods
When building a speech recognition system, one must choose the level of abstraction adequate to the task, the sound wave parameters to be used for recognition, and the methods for recognizing those parameters. Let us consider the main differences in the structure and operation of various speech recognition systems.
- By type of structural unit. When analyzing speech, the base unit can be an individual word or a part of a spoken word, such as a phoneme, a diphone or triphone, or an allophone. The choice of structural unit changes the structure, universality, and complexity of the dictionary of recognizable elements (see the first sketch after this list).
- By feature extraction. The raw sequence of sound pressure samples is highly redundant for speech recognition systems and contains a lot of information that is unnecessary for recognition, or even harmful. To represent the speech signal, we therefore need to extract from it parameters that represent the signal adequately for recognition.
- By mechanism of operation. Modern systems employ various approaches to how recognition works. In the probabilistic-network approach, the speech signal is split into parts (frames, or by phonetic attribute), and then a probabilistic estimate is made of which element of the recognition dictionary a given part, and/or the whole input signal, corresponds to (the second sketch after this list illustrates such frame-level features). In the approach based on solving the inverse problem of sound synthesis, the movements of the vocal tract articulators are inferred from the input signal, and the pronounced phonemes are determined using a special dictionary.
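
As a first sketch, here is how the choice of structural unit changes the inventory of dictionary elements. The phoneme transcription of the word "speech" is taken as an assumed example, and the triphone notation is one common convention, not the only one.

```python
# Sketch: the same word described with different structural units.
# Assumed phoneme transcription of the English word "speech".
phones = ["s", "p", "iy", "ch"]

# Monophones: one dictionary element per phoneme; a small, context-blind inventory.
monophones = phones

# Triphones: each phoneme modeled together with its left and right neighbours;
# a much larger but more context-aware inventory.
triphones = [
    f"{phones[i - 1]}-{phones[i]}+{phones[i + 1]}"
    for i in range(1, len(phones) - 1)
]

print(monophones)  # ['s', 'p', 'iy', 'ch']
print(triphones)   # ['s-p+iy', 'p-iy+ch']
```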
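
As a second sketch, here is one common way to go from raw pressure samples to compact per-frame feature vectors, which a probabilistic model can then score against dictionary elements. The file path, sampling rate, and frame parameters are assumptions, librosa is just one convenient library, and MFCCs are only one of many possible feature sets.

```python
# Sketch of frame-based feature extraction (assumed file path and parameters).
import librosa

# Load a mono recording, resampled to 16 kHz.
signal, sr = librosa.load("utterance.wav", sr=16000)

# Slice the waveform into short overlapping frames (25 ms window, 10 ms step)
# and compute 13 MFCC coefficients per frame: each frame is now represented by
# a compact feature vector instead of hundreds of raw pressure samples.
mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=13,
    n_fft=int(0.025 * sr),       # 400-sample analysis window
    hop_length=int(0.010 * sr),  # 160-sample frame step
)

print(mfcc.shape)  # (13, number_of_frames)
```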
UPD: Moved to the "Artificial Intelligence" hub. If there is interest, I will continue publishing the series there.