Mathematical model of the phoneme of the human voice
Most modern speech recognition systems split a voice recording into phonemes and analyze their amplitude-frequency characteristics, identifying the phonemes of individual letters by classifying them against specific sets of frequency characteristics. Such methods treat each phoneme as a single indivisible unit of the audio signal with quasi-stationary frequency characteristics. As a result, phoneme characteristics that change dynamically over time are not taken into account.
However, the same approach to speech analysis can be used not only for recognition but also for deriving an analytical description of phonemes, building a mathematical model from the resulting data, and synthesizing a sound that is practically indistinguishable from the original.
Analysis of the components of human speech
As everyone remembers from school, a word consists of one or more syllables, which in turn consist of one or more phonemes. A phoneme is the minimal unit of a language, and its key property is that it is distinctive: it has no lexical or grammatical meaning of its own, but it lets us tell apart the meaningful units of the language, that is, words.
Here is the amplitude-time characteristic of the phoneme of the letter “O”.
For convenience, I have marked three time intervals on it:
a - the excursion (onset) phase, with which every phoneme begins
b - the hold phase (the steady part of the phoneme, which is the part that needs to be described)
c - the recursion (offset) phase (roughly speaking, you have finished speaking and the sound dies away :))
I analyzed the length of time during which the phoneme (its amplitude-time characteristic) remains in a quasi-stationary state. We can assume that during this interval the components of the sound spectrum remain (almost) unchanged. For further analysis and description, the sound segments must be decomposed into spectral components.
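The decomposition into spectral components can be illustrated with a minimal sketch. It assumes a naive discrete Fourier transform over one quasi-stationary segment; the article does not say which transform or tool was actually used, and the function name `magnitude_spectrum`, the sample rate and the two-tone test segment are all hypothetical.

```python
import cmath
import math

def magnitude_spectrum(samples, sample_rate):
    """Return (frequency_hz, magnitude) pairs for the lower half of a naive DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        # Correlate the segment with a complex exponential at bin k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        # Scale so that a sinusoid of amplitude A yields a magnitude close to A
        # (the DC bin k = 0 is over-scaled by 2, which is ignored in this sketch).
        spectrum.append((k * sample_rate / n, abs(coeff) * 2 / n))
    return spectrum

# Hypothetical test segment: 200 Hz and 400 Hz tones, not real speech data.
sample_rate = 8000
n = 400
segment = [1.0 * math.sin(2 * math.pi * 200 * i / sample_rate)
           + 0.5 * math.sin(2 * math.pi * 400 * i / sample_rate)
           for i in range(n)]
spectrum = magnitude_spectrum(segment, sample_rate)
```

The naive transform is quadratic in the segment length, which is acceptable for a short segment; for real recordings an FFT implementation would be the usual choice.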
A phoneme, like an atom, seems indivisible, so splitting it into components would appear impossible. But this is not so: each peak on the graph above corresponds to one harmonic component of the phoneme, a formant. Thus, each phoneme can be described by describing its simplest components, and describing those should not cause anyone any trouble. If you look closely at the graph, you can see that each formant is described by two parameters at once: frequency and relative amplitude. Mathematically, these two parameters form a vector, and the set of such vectors, one for each significant formant, forms a matrix of parameters.
Then the phoneme (as a quasi-stationary process) can be characterized by the following set of parameters:
Here are the parameters for some other vowels. A denotes the amplitude and v the frequency. It is fair to note that the most "complex" letters are "E" and "I": the spectra of their phonemes are wider, and their significant frequencies fall into two separate intervals.
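The formant description above (one vector of relative amplitude and frequency per peak, stacked into a matrix) could be extracted from a spectrum roughly as follows. This is only a sketch: the local-maximum rule, the `threshold` parameter for what counts as "significant", and the sample spectrum are my assumptions for illustration, not the procedure or the measured values behind the table.

```python
def formant_matrix(spectrum, threshold=0.1):
    """Pick spectral peaks and return rows [relative_amplitude, frequency_hz]."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        freq, mag = spectrum[i]
        # A peak is a bin larger than its left neighbor and not smaller than its right one.
        if mag > spectrum[i - 1][1] and mag >= spectrum[i + 1][1]:
            peaks.append((mag, freq))
    if not peaks:
        return []
    strongest = max(mag for mag, _ in peaks)
    # Keep only peaks whose amplitude relative to the strongest one reaches the threshold.
    return [[mag / strongest, freq] for mag, freq in peaks
            if mag / strongest >= threshold]

# Hypothetical spectrum as (frequency_hz, magnitude) pairs; not measured values.
spectrum = [(0, 0.0), (100, 0.2), (200, 2.0), (300, 0.3),
            (400, 1.0), (500, 0.1), (600, 0.05), (700, 0.0)]
matrix = formant_matrix(spectrum)
```

Each row of `matrix` plays the role of one formant vector (A, v), so a phoneme with nine significant formants would give a matrix of eighteen numbers.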
Phoneme synthesis
To assess the quality of the described method, a model was proposed for reconstructing human speech phonemes from the obtained parameter matrices. In the model's formula, the expression under the summation sign is the formal representation of a single formant. Accordingly, using the data from the table above, you can build a sound model of, for example, the letter "U" and synthesize it.
The number of parameters in the matrix depends on the properties of the phoneme. For example, a matrix of eighteen numerical parameters describing nine significant formants is used for a realistic reconstruction of the recorded vowel "U". To build a more accurate model, all significant formants of the phoneme must be taken into account. Another condition for an accurate comparison of the original and synthesized signals is that they have equal duration.
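As an illustration of such a reconstruction, here is a minimal sketch that sums one sinusoid per formant. It assumes each formant contributes a term of the form A * sin(2 * pi * v * t) with zero initial phase, which may differ from the exact expression used in the model; the function name `synthesize`, the sample rate and the parameter values are hypothetical and are not taken from the table.

```python
import math

def synthesize(formants, duration_s, sample_rate=8000):
    """Sum one sinusoid per formant row [relative_amplitude, frequency_hz]."""
    n = round(duration_s * sample_rate)
    samples = []
    for i in range(n):
        t = i / sample_rate
        # Assumed formant term: A * sin(2 * pi * v * t), zero phase.
        samples.append(sum(a * math.sin(2 * math.pi * v * t) for a, v in formants))
    # Normalize to a peak of 1.0 so the result fits a typical audio range.
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else samples

# Hypothetical parameter matrix; the article's table values are not reproduced here.
formants = [[1.0, 300.0], [0.6, 600.0], [0.2, 2400.0]]
signal = synthesize(formants, duration_s=0.05)
```

Passing the duration explicitly makes it easy to give the synthesized signal the same length as the original recording before comparing the two.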
Conclusions
As you can see, the phoneme is not such an indivisible unit in the analysis of human speech. I also showed a simple way of analytically describing the formants of human speech phonemes. In the last section, we saw that a mathematical model of a phoneme can be constructed from the parameters obtained, and that the resulting model, in turn, can be used to synthesize phonemes. I hope you liked this material. In the next article we will look at the complexity of the emotional coloring of the voice and how it could be used to build mathematical models empirically.