There are two main approaches to spoken language understanding: the classic three-component approach (a speech recognition component, a natural language understanding component, and a component responsible for the business logic) and the End2End approach, which comes in four implementation variants: direct, joint, multistage, and multitask. We will consider the pros and cons of both approaches, drawing on Google's experiments, and analyze in detail why the End2End approach solves the problems of the classical one.

The floor goes to Nikita Semenov, lead developer at the MTS AI Center.
Hello! As a preface, I want to quote the well-known scientists Yann LeCun, Yoshua Bengio and Geoffrey Hinton, the three pioneers of artificial intelligence who recently received one of the most prestigious awards in information technology, the Turing Award. In 2015 they published a very interesting article in Nature, "Deep learning", which contains a thought that is hard to translate precisely, but whose meaning is roughly this: "Deep learning came with the promise of being able to cope with raw signals without the need to handcraft features." In my opinion, for a developer this is the strongest motivation there is.
Classic approach
So let's start with the classic approach. When we talk about spoken language understanding, we mean that there is a person who wants to control some services with their voice, or needs some system to respond to their voice commands with certain logic.
How is this problem solved? In the classic version, a system is used that, as mentioned above, consists of three major components: a speech recognition component, a natural language understanding component, and a component responsible for the business logic. At the start the user produces a sound signal, which goes to the speech recognition component and is turned from sound into text. The text then goes to the natural language understanding component, which extracts from it a certain semantic structure needed by the component responsible for the business logic.
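A minimal sketch of that pipeline in Python; the function names and stub bodies are hypothetical placeholders, not a real API:

```python
# Classical three-component pipeline: ASR -> NLU -> business logic.
# All names and stub return values below are illustrative placeholders.

def asr_transcribe(audio: bytes) -> str:
    """Speech recognition: raw audio -> the most likely word sequence."""
    return "please play some song of an artist"  # stub

def nlu_parse(text: str) -> dict:
    """Natural language understanding: text -> semantic structure."""
    return {"domain": "music", "intent": "play_song", "slots": {}}  # stub

def execute_action(frame: dict) -> str:
    """Business logic: act on the filled semantic frame."""
    return f"executing {frame['intent']} in domain {frame['domain']}"  # stub

def handle_request(audio: bytes) -> str:
    text = asr_transcribe(audio)   # stage 1: audio -> text
    frame = nlu_parse(text)        # stage 2: text -> semantic structure
    return execute_action(frame)   # stage 3: business logic
```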

What is a semantic structure? It is a kind of generalization/aggregation of several tasks into one, for ease of understanding. The structure includes three important parts: domain classification (determining the subject area), intent classification (understanding what actually needs to be done), and named entity extraction to fill the slots required by the business logic at the next step. To understand what a semantic structure is, consider a simple example that Google most often gives. We have a simple request: "Please play some song of an artist".

The domain, the subject area of this query, is music; the intent is to play a song; the slots of the "play a song" frame are which song and which artist. Such a structure is the result of natural language understanding.
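As an illustration, this semantic frame could be represented in Python roughly like this (the field names are assumptions, not a fixed schema):

```python
# A hypothetical semantic frame for "Please play some song of an artist".
semantic_frame = {
    "domain": "music",                # subject area
    "intent": "play_song",            # what needs to be done
    "slots": {                        # named entities filling the frame
        "song": "some song",
        "artist": "an artist",
    },
}
```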
If we talk about solving the complex, multi-stage task of spoken language understanding, then, as I said, it consists of two stages: the first is speech recognition, the second is natural language understanding. The classical approach implies a complete separation of these stages. In the first stage we have a model that receives an acoustic signal at the input and, using acoustic and language models and the lexicon, determines the most likely verbal hypothesis for that signal. This is a purely probabilistic story: it can be decomposed using the well-known Bayes formula into a form that lets us write down the likelihood and apply the maximum likelihood method. We end up with the conditional probability of the signal X given the word sequence W, multiplied by the prior probability of that word sequence.
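Written out, the decoding rule of the first stage described here is:

$$\hat{W} = \arg\max_{W} p(W \mid X) = \arg\max_{W} p(X \mid W)\, p(W)$$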

We have passed the first stage: we obtained a verbal hypothesis from the sound signal. Then comes the second component, which takes this verbal hypothesis and tries to extract from it the semantic structure described above.
That is, we model the probability of the semantic structure S given the word sequence W at the input.
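In the same notation, the second stage solves:

$$\hat{S} = \arg\max_{S} p(S \mid \hat{W})$$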

What is so bad about the classical approach consisting of these two elements/stages that are trained separately (i.e., we first train the model of the first stage and only then the model of the second)?
- The natural language understanding component works with the high-level verbal hypotheses that the ASR generates. This is a big problem, because the first component (the ASR itself) works with the low-level raw data and produces a high-level verbal hypothesis, while the second component takes as input not the raw data from the original source but the hypothesis produced by the first model, and builds its own hypothesis on top of it. This is quite problematic: everything becomes too "conditional", and errors of the first stage propagate into the second.
- The next problem: we cannot build a link between the importance of the words needed to construct the semantic structure and the words the first component prefers when building its verbal hypothesis. In other words, the hypothesis arrives already constructed. It is built from three components, as I have already said: the acoustic part (modeling whatever arrived at the input), the language part (modeling language n-grams, i.e. the probability of word sequences), and the lexicon (the pronunciation of words). These are three big parts that have to be combined in order to find a single hypothesis. But there is no way to influence the choice of that hypothesis so that it is the one that matters for the next stage, precisely because the two stages are trained completely separately and do not influence each other in any way.
End2End approach
We have seen what the classic approach is and what problems it has. Let's try to solve these problems with the End2End approach.
By End2End we mean a model that combines the separate components into a single one. We will model it with encoder-decoder architectures containing attention modules. Such architectures are often used in speech recognition and in natural language processing tasks, in particular machine translation.
There are four ways to implement such an approach that could solve the problems of the classical approach posed above: the direct, joint, multistage, and multitask models.
Direct model
The direct model takes raw, low-level features as input, i.e. the low-level audio signal, and at the output we immediately get the semantic structure. That is, we have a single module: the input of the first module of the classical approach and the output of the second module of that same approach. Just such a "black box". Hence its pluses and minuses. The model does not learn to fully transcribe the input signal, which is an obvious plus: we do not need to collect a huge labeled corpus, we do not need to collect a lot of audio and then hand it over to assessors for transcription. We only need the audio signal and the corresponding semantic structure, and that is all. This greatly reduces the labor costs of data labeling. Probably the biggest disadvantage of this approach is that the task is too complicated for such a "black box", which effectively tries to solve two problems at once: first it has to build some transcription internally, and then extract the semantic structure from that transcription. Here a rather difficult problem arises: learning to ignore parts of the transcription, which is very hard. This is a major disadvantage of the approach.
In terms of probabilities, this model solves the problem of finding the most likely semantic structure S given the acoustic signal X, with model parameters θ.
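As a formula:

$$\hat{S} = \arg\max_{S} p(S \mid X; \theta)$$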

Joint model
What is the alternative? The joint model. It is very similar to the direct one, with one exception: the output already consists of the word sequence with the semantic structure simply concatenated to it. That is, we have the sound signal at the input and a neural network model that outputs both the verbal transcription and the semantic structure.

On the plus side: we have one simple encoder and one simple decoder. Training is easier, because the model does not try to solve two tasks implicitly at once, as in the case of the direct model. Another advantage is that the dependence of the semantic structure on the low-level acoustic features is preserved, because, again, there is one encoder and one decoder. Accordingly, the prediction of the semantic structure is linked to the transcription itself and can influence it, which is exactly what we were missing in the classical approach.
Again, we look for the most likely word sequence W and the corresponding semantic structure S given the acoustic signal X, with parameters θ.
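As a formula:

$$(\hat{W}, \hat{S}) = \arg\max_{W,\, S} p(W, S \mid X; \theta)$$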
Multitask model
The next approach is the multitask model. Again an encoder-decoder, but with one exception.

For each task, that is, generating the word sequence and generating the semantic structure, there is a separate decoder; both use one common hidden representation produced by a single shared encoder. This is a very well-known machine learning trick, used in many works. Solving two different tasks at once helps the model find dependencies in the source data much better, and as a consequence gives better generalization, since the parameters are optimized for several tasks at once. This approach is best suited for tasks with little data. The decoders work in the single hidden vector space into which the encoder maps the input.

It is important to note that the probability now depends on the parameters of the shared encoder and of both decoders, and these parameters matter.
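A minimal PyTorch-style sketch of this layout, assuming a shared recurrent encoder and two simple task heads (module choices and dimensions are illustrative, not the architecture from Google's paper):

```python
import torch
import torch.nn as nn

class MultiTaskSLU(nn.Module):
    """One shared encoder, two task-specific heads over the same hidden space."""

    def __init__(self, feat_dim=80, hidden=256, vocab_size=1000, n_labels=50):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # shared encoder
        self.asr_head = nn.Linear(hidden, vocab_size)   # decoder 1: word sequence
        self.nlu_head = nn.Linear(hidden, n_labels)     # decoder 2: semantic labels

    def forward(self, features):
        encoded, _ = self.encoder(features)                # (batch, time, hidden)
        word_logits = self.asr_head(encoded)               # per-frame word logits
        frame_logits = self.nlu_head(encoded.mean(dim=1))  # pooled -> frame labels
        return word_logits, frame_logits

model = MultiTaskSLU()
x = torch.randn(4, 100, 80)            # 4 utterances, 100 frames, 80 acoustic features
word_logits, frame_logits = model(x)   # the two task losses are summed during training,
                                       # so the shared encoder serves both tasks at once
```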
Multistage model
Let us move on to what is, in my opinion, the most interesting approach: the multistage model. If you look closely, you can see that it is essentially the same two-component classical approach, with one exception.

Here a connection can be established between the modules, turning them into a single model. The semantic structure is thus treated as conditionally dependent on the transcription. There are two ways to work with this model: we can train the two mini-blocks (the first and the second encoder-decoder) separately, or combine them and train both tasks at the same time.
In the first case, the parameters of the two tasks are not linked, so we can train them on different data. Suppose we have a large corpus of audio with the corresponding word sequences (transcriptions). We feed it in and train only the first part, getting good transcription modeling. Then we take the second part and train it on another corpus. Connecting them, we get a solution that is fully equivalent to the classical approach, because we trained the first part and the second part separately. After that we can train the connected model on a corpus that contains data triples: the audio signal, the corresponding transcription, and the corresponding semantic structure. If we have such a corpus, we can fine-tune the model that was pre-trained separately on large corpora for our specific small task and, in this tricky way, get the maximum accuracy gain. This approach lets us take into account the importance of different parts of the transcription and their impact on the prediction of the semantic structure, by propagating the error of the second stage back into the first.
It is important to note that the final objective is very similar to the classical approach, with one big difference: the second term of the function, the log probability of the semantic structure given the input acoustic signal X, also depends on the parameters of the first-stage model.

It is also important to note here that the second term depends on the parameters of both the first and the second model.
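One way to write the joint objective sketched above, under the assumption that the second stage conditions on the transcription W produced by the first stage, is:

$$\mathcal{L}(\theta_1, \theta_2) = \log p(W \mid X;\, \theta_1) + \log p(S \mid W;\, \theta_1, \theta_2)$$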
How to evaluate the accuracy of the approaches
Now we need to define a methodology for evaluating accuracy. How, in fact, do we measure accuracy in a way that accounts for the things that do not suit us in the classical approach? There are classic metrics for the separate tasks. To evaluate the speech recognition component we can take the classic WER metric, Word Error Rate: by a fairly simple formula we count the number of insertions, substitutions, and deletions of words and divide by the total number of words, obtaining a characteristic of recognition quality. For the semantic structure we can simply compute the F1 score component-wise. This is also a classical metric for classification problems, and it is more or less clear: there is recall, there is precision, and F1 is simply their harmonic mean.
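For reference, a minimal Python implementation of WER via word-level edit distance (the example strings are just an illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("set the alarm for five", "set alarm for nine"))  # 0.4
```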
But the question arises of how to measure accuracy when the recognized transcription and the required argument do not match, i.e. how to account for the effect of recognition errors on the semantic structure. Google proposed a metric that takes into account the importance of the predictions of the first, speech recognition component by assessing the effect of its errors on the second component. They called it ArgWER: WER weighted over the entities (arguments) of the semantic structure.
Take the query "Set the alarm for 5 o'clock." Its semantic structure contains the argument "five o'clock" of type "datetime". It is important to understand that if the speech recognition component reproduces this argument exactly, then the error metric for this argument, its WER, is 0%. If the recognized value does not correspond to "five o'clock", the argument's WER is 100%. We then simply take the weighted average over all arguments and obtain an aggregated metric that evaluates the importance of the transcription errors made by the speech recognition component.
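A sketch of how such an argument-weighted score could be computed; the exact-match rule and field names here are simplifying assumptions for illustration, not Google's precise definition:

```python
def arg_wer(reference_slots: dict, asr_hypothesis: str) -> float:
    """Average per-argument error: 0.0 if the slot value appears verbatim in the
    ASR hypothesis, 1.0 otherwise (a simplified, illustrative scoring rule)."""
    if not reference_slots:
        return 0.0
    errors = [0.0 if value in asr_hypothesis else 1.0
              for value in reference_slots.values()]
    return sum(errors) / len(errors)

slots = {"datetime": "five o'clock"}
print(arg_wer(slots, "set the alarm for five o'clock"))  # 0.0
print(arg_wer(slots, "set the alarm for nine o'clock"))  # 1.0
```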
As an example I will cite Google's experiments from one of their studies on this topic. They used data from five domains: Media, Media_Control, Productivity, Delight, None, with the corresponding split into training and test sets. It is important to note that all models were trained from scratch. Cross-entropy loss was used, the beam search width was 8, and the optimizer was, of course, Adam. Training was, of course, done on a large cloud of their TPUs. What was the result? Here are the interesting figures:

For reference: Baseline is the classical two-component approach we described at the very beginning. Below it are the direct, joint, multitask, and multistage models.
Why are there two multistage models? They simply use different layers at the junction of the first and second parts: ArgMax in the first case and SampledSoftmax in the second.
What should you pay attention to? The classical approach loses on all three metrics that evaluate the joint work of the two components. We are not interested in how good the transcription is in itself; we only care about how well the part that predicts the semantic structure works. It is evaluated by three metrics: F1 for the domain, F1 for the intent, and the ArgWER metric computed over the entity arguments. F1 is the harmonic mean of precision and recall, so the ideal value is 100. ArgWER, on the contrary, measures error rather than success, so the ideal value is 0.
It is worth noting that the joint and multitask models outperform all other models on domain and intent classification, while the multistage model gives a very large gain in the final ArgWER. Why is this important? Because in spoken language understanding tasks what matters is that the final action is correctly performed by the component responsible for the business logic. That does not depend directly on the transcriptions produced by the ASR, but on the quality of the ASR and NLU components working together. Therefore a difference of almost three points in the ArgWER metric is a very strong indicator of the success of this approach. It is also worth noting that all the approaches show comparable values for domain and intent classification.
Let me give a couple of examples of where such spoken language understanding algorithms are used. Google, when talking about spoken language understanding tasks, first of all mentions human-computer interfaces, that is, all kinds of virtual assistants such as Google Assistant, Apple Siri, Amazon Alexa, and so on. As a second example, there is the pool of tasks known as Interactive Voice Response, that is, the automation of call centers.
So, we have looked at approaches that allow joint optimization, which helps the model focus on the errors that matter most for SLU. Such an approach greatly reduces the overall complexity of the spoken language understanding task.
We get the ability to perform inference, that is, to obtain a result, without additional resources such as a lexicon, language models, parsers, and so on (all the components inherent in the classical approach). The problem is solved "directly".
In fact, there is no need to stop there. If we have now combined two components of the overall pipeline, we can aim further: combine three components, then four, continuing this logical chain and "dragging" the importance of errors one level lower, taking into account how critical they are there. This will let us increase the accuracy of the solution even further.