The effect of speech co-articulation and how to overcome it in recognition. A handbook for the neural special forces

What is co-articulation?

Creepy beast named "allophone"

When we pronounce words and the sounds they are made of, we never think about what they are like physically. How many of Earth's intelligent creatures, speaking their different languages, have tried to record their own speech and examine it on graphs and spectrograms? To understand and study its features, find patterns and, in general, learn more about speech? Very few, I think - as a percentage.

We just use it! And we use it unconsciously.

We intuitively divide speech into sounds that we write down as letters, and it seems to us that the sound “a” is always “a”, and that the word “mama” (Russian for “mom”) contains two absolutely identical “a” sounds.

But no! Try an experiment: record the word “mama” and then, using an audio editor, swap its syllables ... It is very hard to call the result “mama”. I listened to the recording for a long time and tried to find a spoken equivalent, but it is difficult. The closest match, to my ear, is probably “mo-ha”.

In other words, what came out is a word from another language altogether! An unknown, invented one ...

That is why speech is hard to synthesize, and why computers do not yet greet us at boot time in the purest Russian: “Good morning to you, Boss!” ...

So. What we call sounds, the highbrow professors of linguistics call “phonemes”. Understandably so: a sound can be anything from the squeak of a door to the meowing of a cat, and the sounds of Homo sapiens speech need a name of their own (although I have always had some doubts about how sapiens that Homo really is ...)

Well, phonemes, as the example of the word “mama” has already shown, can differ significantly from one occurrence to the next, and on a spectrogram they look, to put it mildly, different.

And here the cunning beast called the “allophone” comes in handy.

An allophone is a specific phoneme in a specific place. Returning to the word “mama”: the second and fourth sounds are the same phoneme “a”, but different allophones. The second sound is a stressed “a” surrounded by “m” phonemes. The fourth sound is an unstressed “a” after the phoneme “m” and before the end of the word (a short pause, silence).

That is, an allophone is the realization of a phoneme in a specific sound environment.

So what about co-articulation, then?

So, the reason why a phoneme in different places of a word does not resemble itself is simple and banal.

Sounds have no clear boundaries, and it is impossible to say: here the phoneme “a” ends, and here the phoneme “m” begins.

The phonemes of speech flow into one another smoothly, and the sound environment strongly distorts the shape of each phoneme. This mutual influence of neighboring sounds is what is called co-articulation.

For example, the spectrogram of the second sound “a” in the word “mama” is strongly affected by the two “m” sounds next to it, and it differs from the fourth sound, also “a”, which has “m” on one side and nothing on the other.

Fig. 1. Spectrogram of the word “mama” broken down into sounds

The figure shows that the spectrograms of the same phoneme are significantly different.

As already mentioned, the phoneme, ~~bored by life~~ changed by the given sound environment, is called an allophone.

Big boss - the main allophone

Among the many allophones of one phoneme, one variant is taken as the reference. This reference is called the “main allophone”.

For vowels, the reference is the isolated pronunciation.

For consonants, it is the pronunciation before a stressed “a”.

And how do we recognize this?

And now let us set the task: to recognize a given word automatically (that is, without human intervention), and moreover to do it phoneme by phoneme.

But how do we do this if the allophones of one phoneme differ from each other?

The standard method is as follows: instead of single phonemes, pairs and triples of phonemes are taken, called “diphones” (a pair of phonemes) and “triphones” (a triple of phonemes).

Triphones work better, so their use is preferable.

The split into triphones is done with overlap, so that each phoneme ends up in the center of a triphone at least once.

For example, the already familiar word “mama” (phonemes M A M A) is split into triphones as follows:

sil M A
M A M
A M A
M A sil

Here sil marks the beginning or end of a word (from “silence”).

Sometimes the following notation is used: M (sil, A). It denotes a triphone with “M” in the center, silence (sil) on the left and “A” on the right.
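The overlapping split described above can be sketched in a few lines of Python. This is only an illustration: the function names `to_triphones` and `as_context_notation` are made up for this example, and the input is assumed to be an already known list of phoneme labels for one word.

```python
from typing import List, Tuple

SIL = "sil"  # word-boundary marker (silence), as in the text

Triphone = Tuple[str, str, str]  # (left neighbor, center, right neighbor)


def to_triphones(phonemes: List[str]) -> List[Triphone]:
    """Split one word into overlapping triphones."""
    # Pad the word with silence on both sides, then slide a window of
    # three labels over it. Each phoneme of the word becomes the center
    # of one triphone.
    padded = [SIL] + list(phonemes) + [SIL]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def as_context_notation(triphone: Triphone) -> str:
    """Render a triphone as CENTER (LEFT, RIGHT), e.g. M (sil, A)."""
    left, center, right = triphone
    return f"{center} ({left}, {right})"


if __name__ == "__main__":
    for t in to_triphones(["M", "A", "M", "A"]):
        print(" ".join(t), "->", as_context_notation(t))
```

Note that the two notations order the labels differently: the plain listing goes left to right, while the bracket notation puts the central phoneme first.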

So what's the problem?

There is no problem. Just a remark: this “construction” for fighting co-articulation was developed for Hidden Markov Models, and it was introduced out of necessity rather than by choice.

We will try to use a more advanced speech recognition technology - neural networks.

And this “construction” suits neural networks worse, because neural networks suffer from the “curse of dimensionality”.

A neural network learns from examples, so the higher the dimensionality of the data, the more examples it needs for training. And the number of examples required grows very quickly, much faster than the dimensionality :)

Well, there are dozens of phonemes, hundreds of allophones, and as for triphones, about 6000 basic ones alone.

In theory a neural network can learn all of this, but it is hard: the training database must be large, and the training time will be huge.
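A rough back-of-the-envelope count shows how quickly the number of classes grows once context is taken into account. The sketch below assumes a hypothetical inventory of 40 phonemes (the text only says “dozens”) and counts every combination in which the center is a phoneme and each neighbor is either a phoneme or sil. The name `max_triphones` is illustrative and does not come from any library.

```python
def max_triphones(n_phonemes: int) -> int:
    """Upper bound on the number of distinct triphones.

    The center is one of n_phonemes phonemes; the left and the right
    neighbor are each either a phoneme or the 'sil' marker.
    """
    return n_phonemes * (n_phonemes + 1) ** 2


if __name__ == "__main__":
    n = 40  # hypothetical inventory size, an assumption for illustration
    print("outputs with one class per phoneme: ", n)
    print("upper bound with one class per triphone:", max_triphones(n))
```

This is only an upper bound. The figure of about 6000 basic triphones quoted above is much smaller, yet it is still far larger than the number of phonemes.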

Is there any way around this?

I suggest the following method:

Introduce a function that measures how similar a sound is to the main allophone.
Then the artificial neural network (ANN) only has to find the degree of similarity between the input fragment and the main allophone. And the number of ANN outputs equals not the number of possible allophone combinations, but simply the number of phonemes.

This lets word recognition be split across a small number of phonemes (each of which, if necessary, can be given a neural network of its own).

This approach also makes good contextual analysis possible (for those not in the know: this is analysis based on how often particular phoneme combinations occur in the language, and it makes it possible to correct recognition errors).

So, from advertising to explanations. What does this look like in practice? Look at the next figure.

Fig. 2. Similarity functions for the phonemes B, A, K, using a recording of the word “bak” (Russian for “tank”) as an example

As already mentioned, phonemes blend into each other and influence one another. This influence weakens towards the center of a phoneme and grows closer to its nominal edges. Thus the center of any allophone almost completely coincides with the main allophone (and with the centers of all other allophones), while, because of the neighboring sounds, the similarity to the main allophone drops closer to the zones where phonemes join. This can be clearly seen in the figure above.

Now only a small matter remains: split this function into an arbitrary number of samples (I recommend from 1 to 32) and “feed” it to the neural network.

It is not at all necessary to build a clever algorithm that compares sounds with reference “main allophones”; it is enough to draw, “on the knee”, an arbitrary function that equals one at the center of the desired allophone, decreases towards its edges and is zero for all other phonemes and sounds.
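One possible way to draw such a function “on the knee” is sketched below. Everything specific here is an assumption made for illustration: the segmentation is given by hand as (phoneme, first frame, end frame) triples, the shape is a simple triangle, and the names `target_similarity` and `sample_evenly` are invented for this example. Any other shape that peaks at the center and falls off towards the edges would fit the description equally well.

```python
from typing import List, Tuple

# (phoneme label, first frame, end frame exclusive): a hand-made
# segmentation, assumed to be available for the training recordings.
Segment = Tuple[str, int, int]


def target_similarity(segments: List[Segment], phoneme: str,
                      n_frames: int) -> List[float]:
    """Triangular target: highest at the center of each segment of
    `phoneme`, falling linearly towards its edges, zero elsewhere."""
    target = [0.0] * n_frames
    for label, start, end in segments:
        if label != phoneme:
            continue  # other phonemes and sounds stay at zero
        center = (start + end - 1) / 2.0
        half_width = (end - start) / 2.0
        for frame in range(start, end):
            # Exactly 1.0 is reached only when the segment has an odd
            # number of frames, so that a frame sits on the center.
            target[frame] = 1.0 - abs(frame - center) / half_width
    return target


def sample_evenly(values: List[float], n_samples: int) -> List[float]:
    """Pick n_samples values at evenly spaced positions of `values`."""
    if n_samples == 1:
        return [values[len(values) // 2]]
    step = (len(values) - 1) / (n_samples - 1)
    return [values[round(i * step)] for i in range(n_samples)]


if __name__ == "__main__":
    # Hypothetical frame boundaries for a word with phonemes B, A, K.
    segments = [("B", 0, 5), ("A", 5, 14), ("K", 14, 19)]
    target_a = target_similarity(segments, "A", n_frames=19)
    print([round(v, 2) for v in target_a])
    print([round(v, 2) for v in sample_evenly(target_a, 8)])
```

With one such target per phoneme, the number of curves to learn equals the number of phonemes, which matches the number of network outputs proposed above.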

PS

Criticism is not only expected but welcome.
I will be especially grateful for comments:

1. On the logic of the presentation and how to improve it
2. On the substance :)

Many thanks in advance if you find something similar or close in the literature - but only with a reference (not necessarily an electronic one) :)

Source: https://habr.com/ru/post/105512/