Why do we need it
When the conversation turns to speech recognition, it is impossible to remain solely in the field of "signal analysis" (there are separate works and whole branches of science for that). Always remember that in speech analysis we deal with a special kind of signal, produced by a specific biological system. On the one hand, it is constrained by the amplitude-frequency characteristics (AFC) of that system, and on the other hand, by the language itself and the standard set of sounds its native speakers can pronounce (for example, when analyzing Russian we will not account for the possibility of clicks or whistles). Based on the task at hand, one can therefore determine the characteristics and basic properties of the speech signal quite accurately.

On the other hand, nature has developed a near-ideal receiver for this signal: our auditory tract. So far, no other system has been invented or discovered that can recognize speech equally well and accurately. It would be a sin to pass up the opportunity to learn from nature here. Once you become better acquainted with how the auditory tract works, you begin to understand that wavelets and the Fourier transform did not appear in these tasks out of thin air. And systems that decompose a signal into a frequency spectrum existed long before the first cave painting...
Vocal tract
The voice signal is created by air waves emitted from the mouth and nasal openings of the speaker. In most languages of the world, phonemes can be divided into two main classes:
- consonants - pronounced when the throat is constricted or the airflow is obstructed in the speaker's mouth (by the tongue, teeth, or lips);
- vowels - pronounced in the absence of any obstruction in the vocal tract.
Sounds can be classified further into smaller classes on the basis of various articulatory properties. These properties derive from the anatomy of the human articulators and their points of contact within the vocal tract. The lungs, trachea, larynx, pharyngeal cavity (throat), and the oral and nasal cavities all make a significant contribution to speech production.

- The lungs: the source of air during speech.
- The vocal cords: when the vocal cords are close together and vibrate against each other during speech, the sound is said to be voiced. If the cords do not vibrate, the sound is said to be unvoiced.
- Soft palate: works like a flap that opens the air passage to the nasal cavity.
- Hard palate: the long, relatively hard surface of the roof of the mouth; in combination with the tongue it allows consonant sounds to be pronounced.
- Tongue: a flexible articulator. Held away from the palate, it allows vowels to be pronounced; brought close to the palate, consonants.
- Teeth: in combination with the tongue, used to pronounce some consonant sounds.
- Lips: can be rounded or stretched, changing the sound of vowels, or closed to stop the airflow when pronouncing some consonants.
The most important distinction between sounds is that between voiced and unvoiced sounds.
Voiced sounds have a quasi-periodic component in their frequency and time structure. It appears when the vocal cords take part in producing the sound, vibrating at different frequencies (from about 60 Hz in an adult man to 300 Hz or higher in a woman or child). The vibration frequency of the vocal cords is called the fundamental frequency of the sound, since it serves as the base frequency for the higher-frequency harmonics created in the laryngeal and oral cavities. The fundamental frequency, more than any other factor, determines the perceived pitch of speech.
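To make the notion of the fundamental frequency concrete, here is a minimal sketch of estimating it from a voiced frame with autocorrelation. This is a common textbook approach; the function name, frame length, and synthetic test signal are my own illustrative assumptions, not something from the original text:

```python
import numpy as np

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=300.0):
    """Crude autocorrelation-based estimate of the fundamental frequency."""
    frame = frame - frame.mean()              # remove the DC offset
    corr = np.correlate(frame, frame, mode="full")
    corr = corr[len(corr) // 2:]              # keep non-negative lags only
    # Search only lags corresponding to plausible F0 values
    # (60-300 Hz, the range mentioned above).
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / lag

# Synthetic "voiced" frame: a 120 Hz fundamental plus two harmonics.
sr = 16000
t = np.arange(0, 0.04, 1.0 / sr)              # a 40 ms frame
frame = (np.sin(2 * np.pi * 120 * t)
         + 0.5 * np.sin(2 * np.pi * 240 * t)
         + 0.25 * np.sin(2 * np.pi * 360 * t))
print(estimate_f0(frame, sr))                 # prints roughly 120 Hz
```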
The figure shows the stages of one cycle of the human vocal cords as the airflow passes through them. At stage (a), the glottis is closed and the airflow is stopped in front of the vocal cords. At some point (stage b), the air pressure below the cords overcomes the barrier and air bursts out through the glottis. The tissues and muscles of the vocal cords then return to their original state due to their natural elasticity, closing the glottis again (stage c). This creates a sequence of sound vibrations, which is the source of energy for all voiced sounds.
When unvoiced sounds are pronounced, the vocal cords are either relaxed or highly tensed, and as a result they produce no sound vibrations. Air flows freely from the lungs into the oral and/or nasal cavities of the vocal tract. Through its interaction with the various articulators, the airflow is transformed, producing a particular sound.

The figure shows an example of a signal corresponding to two sounds: the voiced "O" and the unvoiced "T". Obviously, they have completely different properties that must be taken into account in the analysis. A problem in speech recognition arises when a word starts or ends with an unvoiced sound. In that case, special algorithms are needed to distinguish this sound from extraneous noise and to determine the start (or end) point of the speech signal accurately. We will talk about such algorithms in the following sections.
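As a taste of what such algorithms look at, here is a minimal sketch of two classic frame-level features, short-time energy and zero-crossing rate. The frame and hop sizes are illustrative assumptions, and this is a textbook heuristic rather than the specific algorithms discussed later in the series:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Short-time energy and zero-crossing rate for each frame.

    Voiced segments (like "O") tend to show high energy and a low
    zero-crossing rate; unvoiced segments (like "T") tend to show
    low energy and a high zero-crossing rate.
    """
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len].astype(np.float64)
        energy = np.sum(frame ** 2) / frame_len
        # Each sign change between neighboring samples is one crossing.
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        feats.append((energy, zcr))
    return np.array(feats)
```

Thresholding these two features is one simple way to separate voiced speech, unvoiced speech, and silence before more sophisticated processing.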
Auditory tract
The speech perception system has two main components: the external auditory organs and the auditory part of the brain. The ear processes the signal carried by a sound wave by converting it into mechanical vibration of the eardrum and then mapping this vibration into a sequence of impulses transmitted along the auditory nerve. The useful information is extracted in different parts of the auditory region of the human brain.

The human ear consists of three sections: the outer ear, the middle ear, and the inner ear.
The outer ear consists of the visible part and the external auditory canal, which ends at the eardrum. Sound passing through the external auditory canal acts on the eardrum and makes it vibrate.
The middle ear is an air-filled cavity of approximately 6 cm³. The vibrations of the eardrum are transmitted by the ossicles (the hammer, anvil, and stirrup) to a membrane called the "oval window". This is the interface between the middle ear and the inner ear (the cochlea), since the rest of the inner ear consists of bone tissue.

The structure of the inner ear that matters most for the perception of sound is the cochlea, which communicates directly with the auditory nerve. A longitudinal membrane divides the spiral of the cochlea into two fluid-filled parts. The inner surface of the cochlea is covered with ciliated receptor cells, which are connected directly to the auditory nerve and sense the fluid pressure at a particular point in the cochlea. The inner ear is built so that, for different frequencies of the input signal, the maximum change in fluid pressure in the cochlea is registered at a different distance from its base (see the figure). Thus the cochlea can be viewed as a bank of filters whose output signals are ordered by distance from the base of the cochlea. Filters closer to the base of the cochlea are responsible for higher frequencies.
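To illustrate this filter-bank view, here is a minimal sketch using scipy. The choice of 24 log-spaced bands and fourth-order Butterworth filters is an illustrative assumption, not a faithful cochlear model:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def cochlea_like_filterbank(signal, sample_rate, n_bands=24):
    """Split a signal into log-spaced frequency bands, loosely mimicking
    how different places along the cochlea respond to different
    frequencies (high bands ~ near the base, low bands ~ near the apex).
    """
    edges = np.logspace(np.log10(100.0),
                        np.log10(0.95 * sample_rate / 2), n_bands + 1)
    outputs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass",
                     fs=sample_rate, output="sos")
        outputs.append(sosfilt(sos, signal))
    return np.array(outputs)       # shape: (n_bands, len(signal))
```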
The auditory nerve is a collection of frequency channels. Each frequency channel consists of a group of neurons connected to one cochlear filter or to adjacent ones, that is, to filters with the same or similar characteristic frequencies. This set of features is delivered to the brain as an instantaneous image of the signal, from which a complex neural network extracts the useful information. Unfortunately, there is no exact data on how this information is extracted inside the human brain; there are only a number of theories that describe, in different ways, the possible neural structures inside the brain and their interactions.

Scales
Many elements of speech recognition systems are modeled on the human auditory tract and try to imitate the mechanisms of its work. For instance, the most popular speech-signal feature set today, the MFCC coefficients, is based on studies of how the signal is transformed in the human inner ear. Likewise, the emergence and development of neural network algorithms are tied to studies of the human brain.
Studies have been conducted to derive a frequency scale that would model the natural response of the human speech perception system, in which the cochlea acts as a spectral analyzer. The complexity of the inner ear and the auditory nerve suggests that the perception of sounds at different frequencies can hardly be simple or linear. It is widely known that in modern Western culture musical pitch is divided into octaves and semitones.
The frequency f1 is an octave above the frequency f2 if and only if f1 = 2·f2. An octave contains 12 semitones, so f1 is a semitone above f2 if and only if

f1 = 2^(1/12) · f2 ≈ 1.0595 · f2
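A quick numeric check of these definitions (the 440 Hz reference pitch A4 is my example, not the author's):

```python
A4 = 440.0                          # concert pitch A4, in Hz
semitone_up = A4 * 2 ** (1 / 12)    # ~466.16 Hz, one semitone higher (A#4)
octave_up = A4 * 2                  # 880.0 Hz, one octave higher (A5)
```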
As a result of various studies of how humans perceive sounds of different frequencies, a series of scales was derived that made it possible to express the frequency of a sound in values closer to human perception. One of the first attempts of this kind produced the Bark scale. The expectation was that processing spectral energy on the Bark scale would match the information actually heard by a person more closely.
The Bark scale is divided into 24 main bands of hearing; the audible resolution at low frequencies is higher than at high frequencies. A frequency can be converted from Hz to the Bark scale by the following formula (a commonly used approximation due to Zwicker and Terhardt):

b = 13·arctan(0.00076·f) + 3.5·arctan((f / 7500)²)

where f is the frequency of the sound in Hz, and b is the frequency of the sound in Bark.
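A minimal sketch of this conversion in Python, assuming the Zwicker-Terhardt approximation given above:

```python
import numpy as np

def hz_to_bark(f):
    """Approximate Hz -> Bark mapping (Zwicker & Terhardt)."""
    f = np.asarray(f, dtype=np.float64)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

print(hz_to_bark([100, 1000, 4000]))   # roughly [1.0, 8.5, 17.3]
```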
But another scale became more common in human speech recognition: the mel scale, linear at frequencies below 1 kHz and logarithmic at frequencies above 1 kHz. The mel scale was obtained from experiments with reference tones (sinusoids) in which subjects were asked either to divide frequency ranges into 4 equal intervals or to adjust the frequency of a tone until it sounded half as high as the original. 1 mel is defined as one thousandth of the pitch of a 1 kHz tone. As with other scales of this kind, it is believed that the mel scale models the sensitivity of the human ear more accurately. The mel value can be approximated by the following formula:

B = 2595 · log10(1 + f / 700)

where f is the frequency of the sound in Hz, and B is the frequency of the sound in mel.
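And the corresponding sketch for the mel scale; the inverse mapping is included because it is what one typically needs when placing mel-spaced filter banks for MFCC features:

```python
import numpy as np

def hz_to_mel(f):
    """Approximate Hz -> mel mapping: roughly linear below 1 kHz,
    logarithmic above it."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, handy for building mel-spaced filter banks."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

print(hz_to_mel(1000.0))   # ~1000 mel: a 1 kHz tone sits near 1000 mel
```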
A number of modern speech signal processing techniques are based on the use of such scales.
Links for home reading
- Huang X., Acero A., Hon H.-W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. New Jersey: Prentice Hall PTR, 2001. 910 p. (The reference book for anyone who wants to work on speech recognition. Much of what is contained in this series of notes is taken from this book. A must-have.)
- Chistovich L. A., Ventsov A. V., Granstrem M. P. Speech Physiology. Human Speech Perception. Leningrad: Nauka, 1976. (Unfortunately, books on speech recognition in Russian stopped being published as far back as the 1980s, but even those that were released are worth studying. From this book I gathered the information about the auditory tract and the structure of the cochlea. If anyone is interested in the detailed characteristics of the auditory tract, you are welcome.)
- DongSuk Yuk. Robust Speech Recognition Using Neural Networks and Hidden Markov Models: Adaptations Using Non-linear Transformations. New Jersey: The State University of New Jersey, 1999. (Many American scholars post the texts of their dissertations in open access; they deserve heartfelt thanks for that.)