Problems of interpreting voice input: how it works for us

Following the publications “Recognition of Russian language for call centers and paranoids” and “Elena, the ‘electronic girl’ from the support service”, as well as the comments on them, I decided to describe how we solve the problem of interpreting voice input in our interactive system.

First, let me show you a short video from our interface for prototyping and developing dialogues, made specifically for this article. It was recorded in response to comments on the MegaFon publication (watch it in 720p or higher if possible):

I would like to note that the system does not require any preliminary preparation. I hope the video makes it obvious that I am creating and testing the dialogue on the fly.

The main differences of our approach to creating dialogs

First, we do not force the dialogue designer to list every possible rewording of a phrase. The search is carried out in a semantic space, in which words and expressions close in meaning are located near each other. The video demonstrates this well: I entered the words “Non-working”, “Roaming” and “Ireland”, although I said “there is no roaming in Dublin”. Or I wrote "Your name", although I asked "what is your name". The system also recognizes set expressions well; for example, “cellular operator” is very close to “mobile operator”. For morphologically rich languages, we normalize the expressions being compared.

Secondly, the system is pure WYSIWYG. The client hears exactly what the dialogue designer hears; the code is the same for all versions. Time to market for any change is literally seconds. For example, if a provider's router goes down, a new question-answer pair explaining the situation to clients can be added on the fly.

Thirdly, questions can be asked both “in depth” of the dialogue and in the opposite direction. That is, we try to imitate live communication, where there is no unambiguous "current state" (say "yes" or "no") and you can ask any question at any time.

Fourth, we maintain the current context. That is, if two branches contain identical questions, say, about the cost of SMS, and I asked something like “the cost of SMS in the Rain tariff”, then I can ask a follow-up question, “what about the Rainbow tariff?”, which will in essence be equivalent to the question "The cost of SMS in the Rainbow tariff".
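The context carry-over from the fourth point can be illustrated with a small sketch. It is not our production code: plain bag-of-words counts stand in for semantic vectors, and the names `CANONICAL`, `CONTEXT_WEIGHT`, `bag` and `interpret` are invented for the example. The idea is only that the previously matched question is mixed into the new query with a lower weight.

```python
import math
from collections import Counter

# Hypothetical canonical questions, written as normalized keywords.
CANONICAL = [
    "sms cost rain tariff",
    "sms cost rainbow tariff",
    "call cost rain tariff",
    "call cost rainbow tariff",
]

# Assumed weight of the previous question relative to the new utterance.
CONTEXT_WEIGHT = 0.5


def bag(text, weight=1.0):
    """Weighted bag of words; a stand-in for a real semantic vector."""
    counts = Counter()
    for token in text.lower().split():
        counts[token.strip(".,?!")] += weight
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def interpret(utterance, previous_question=None):
    """Pick the closest canonical question, optionally blending in context."""
    query = bag(utterance)
    if previous_question is not None:
        query = query + bag(previous_question, CONTEXT_WEIGHT)
    return max(CANONICAL, key=lambda question: cosine(query, bag(question)))


first = interpret("the cost of SMS in the Rain tariff")
follow_up = interpret("what about the Rainbow tariff?", previous_question=first)
print(first)
print(follow_up)
```

Without the previous question, the follow-up scores the two Rainbow questions equally in this toy setup; the blended context is what tips it toward the SMS one.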

How does it work?

For each supported language, we build a vector semantic model, that is, an unsupervised word representation. This is a very common and well-proven approach. Set expressions are trained alongside individual words. At the moment we do not split words into multiple prototypes by sense; we simply have not gotten around to it, and in conversational systems the number of domains in use is usually much smaller than in the wild.
The graph of questions specified by the user (we call them "canonical questions") is converted into a set of vectors.
For each question asked, we look for a suitable combination of vectors using a relatively simple priority heuristic. At this stage we take into account the current context built from the previous questions, and we also compare the question against the hierarchy both “in depth” and back.
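As a rough illustration of these steps, here is a minimal sketch of matching an utterance against canonical questions in a vector space. The four-dimensional vectors in `TOY_VECTORS` are made up for the example (a real model is trained on a corpus and has far more dimensions), and summing word vectors followed by cosine similarity is an assumption of the sketch, not a description of our actual heuristic.

```python
import math

# Toy vectors invented for this sketch; the dimensions loosely mean
# (roaming, place, broken/negation, name).
TOY_VECTORS = {
    "roaming": (1.0, 0.0, 0.0, 0.0),
    "ireland": (0.0, 1.0, 0.0, 0.0),
    "dublin": (0.1, 0.9, 0.0, 0.0),
    "non-working": (0.0, 0.0, 1.0, 0.0),
    "no": (0.0, 0.0, 0.8, 0.1),
    "your": (0.0, 0.0, 0.1, 0.6),
    "name": (0.0, 0.0, 0.0, 1.0),
}


def embed(phrase):
    """Sum the vectors of the known words; unknown words are skipped."""
    total = [0.0, 0.0, 0.0, 0.0]
    for token in phrase.lower().split():
        vector = TOY_VECTORS.get(token.strip(".,?!"))
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(utterance, canonical_questions):
    """Return (similarity, question) for the closest canonical question."""
    query = embed(utterance)
    return max((cosine(query, embed(q)), q) for q in canonical_questions)


CANONICAL = ["non-working roaming ireland", "your name"]
print(best_match("there is no roaming in Dublin", CANONICAL)[1])
print(best_match("what is your name?", CANONICAL)[1])
```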

Even this relatively simple scheme works well enough for large graphs. In general, with the right approach to structuring, it is quite easy to build stable dialogues this way.
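To give a feel for comparing a question with the hierarchy “in depth” and back, here is a hedged sketch over a small question tree. Everything in it is an assumption made for illustration: the `Node` class, the `DECAY` and `THRESHOLD` constants, and the use of simple word overlap in place of vector similarity. Questions directly below the current node get full weight, and each step back toward the root lowers the weight.

```python
DECAY = 0.8       # assumed penalty per step back toward the root
THRESHOLD = 0.3   # assumed minimum score for accepting a match


class Node:
    """One canonical question with its answer and position in the tree."""

    def __init__(self, question, answer, parent=None):
        self.question = question
        self.answer = answer
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def overlap(a, b):
    """Jaccard word overlap; a stand-in for semantic similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def candidates(current):
    """Yield (node, weight): deeper questions first, then levels further back."""
    weight = 1.0
    node = current
    while node is not None:
        for child in node.children:
            yield child, weight
        node = node.parent
        weight *= DECAY


def interpret(current, utterance):
    """Return the best-scoring node, or None if nothing passes THRESHOLD."""
    best_node, best_score = None, THRESHOLD
    for node, weight in candidates(current):
        score = weight * overlap(utterance, node.question)
        if score > best_score:
            best_node, best_score = node, score
    return best_node


# A tiny hypothetical dialogue tree with placeholder answers.
root = Node("", "(greeting)")
tariffs = Node("tariff information", "(tariff overview)", root)
rain = Node("sms cost rain tariff", "(SMS cost on Rain)", tariffs)
rainbow = Node("sms cost rainbow tariff", "(SMS cost on Rainbow)", tariffs)
roaming = Node("roaming not working", "(roaming help)", root)

print(interpret(tariffs, "sms cost in the rain tariff").answer)
print(interpret(rain, "roaming is not working").answer)
```

In the second call the user is deep inside the tariff branch but still reaches the roaming question, because the search walks back up to the root with a reduced weight.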

Among the main problems, I would single out speech recognition. For dialogue systems, in our opinion, decoding voice into text and then processing it as text is flawed in principle, since important details such as pauses and intonation are lost.

That is why we are now developing an ASR that will work as a plug-in to our semantic processor and decode sound directly into semantic vectors. One advantage of this scheme will be the ability to use adaptive grammars. After all, a conversational system always knows which vocabulary it understands. A natural way to limit the growth of the hypothesis tree, and with it recognition errors, is therefore to prioritize a specific vocabulary (grammar). But for any more or less practical dialogue system such grammars would obviously still contain thousands of words, and building them by hand is difficult. And if the language is also morphologically rich...

An adaptive grammar uses the semantic language model for this purpose. If one of the expected words is, say, “price”, then the word “cost” will obviously do as well. Or the word “fare”. In general, this is the same approach we use to find a suitable “canonical question”.
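One way to picture an adaptive grammar is as a rescoring step over recognizer hypotheses. The sketch below is purely illustrative: the vectors, the acoustic scores, the `FLOOR` constant and the `rescore` function are all invented, and multiplying the acoustic score by semantic similarity is just one simple way to express the prioritization. Words close in meaning to an expected word keep most of their score, while unrelated words are penalized but not ruled out.

```python
import math

# Toy vectors invented for this sketch.
TOY_VECTORS = {
    "price": (1.0, 0.1, 0.0),
    "cost": (0.9, 0.2, 0.0),
    "fare": (0.8, 0.3, 0.1),
    "coast": (0.1, 0.0, 0.9),
}

# Assumed minimum multiplier, so out-of-grammar words are never fully excluded.
FLOOR = 0.05


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_fit(word, expected_words):
    """Highest similarity between a hypothesis and any expected word."""
    vector = TOY_VECTORS.get(word)
    if vector is None:
        return 0.0
    return max(
        (cosine(vector, TOY_VECTORS[e]) for e in expected_words if e in TOY_VECTORS),
        default=0.0,
    )


def rescore(hypotheses, expected_words):
    """Re-rank (word, acoustic_score) pairs by acoustic score times semantic fit."""
    rescored = [
        (word, acoustic * max(semantic_fit(word, expected_words), FLOOR))
        for word, acoustic in hypotheses
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# The recognizer slightly prefers "coast", but the dialogue expects "price".
ranked = rescore([("coast", 0.55), ("cost", 0.45)], {"price"})
print(ranked[0][0])
```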

Is language processing the future?

We have to admit that our current approach to language is not “understanding” language in the sense in which people understand it, although vector semantic models, it seems to us, simulate some parts of that process quite successfully.

We have recently begun experimenting with the mathematical approach to a general theory of language developed by Zellig Harris. Perhaps some of you will be inspired by his theory of operator grammar. I find it very, very interesting, especially in the context of interactive systems.

Another area in which we are experimenting is a single semantic space for all supported languages. In practice, this means semantic models that are compatible with one another. That is, once a text expression has been transformed into vector form, its original language no longer matters at all. Thus I can, for example, search a document in whichever language is convenient for me rather than in the language of the document.

From the point of view of dialogue system development, a single semantic space will make it possible to specify “canonical questions” only once. For example, a system that answers in Russian will still be able to understand a question from a Belarusian speaker.
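A toy version of the shared space might look like the sketch below. The per-language vocabularies in `SHARED_SPACE`, their three-dimensional vectors (roughly: cost, tariff, roaming) and the normalized word forms are all invented for the example; building compatible models for real languages is the hard part and is not shown here. The canonical questions are given once, in Russian, and a Belarusian query is matched against them directly.

```python
import math

# Invented vectors: both languages are assumed to share one space.
SHARED_SPACE = {
    "ru": {
        "стоимость": (1.0, 0.0, 0.0),
        "тариф": (0.0, 1.0, 0.0),
        "роуминг": (0.0, 0.0, 1.0),
    },
    "be": {
        "кошт": (0.9, 0.1, 0.0),
        "тарыф": (0.1, 0.9, 0.0),
        "роўмінг": (0.0, 0.1, 0.9),
    },
}

# Canonical questions specified once, as normalized Russian keywords.
CANONICAL_RU = ["стоимость роуминг", "стоимость тариф"]


def embed(phrase, language):
    """Sum the vectors of known words from the given language's vocabulary."""
    total = [0.0, 0.0, 0.0]
    for token in phrase.lower().split():
        vector = SHARED_SPACE[language].get(token)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(utterance, language):
    """Match an utterance in any supported language to a Russian canonical question."""
    query = embed(utterance, language)
    return max(CANONICAL_RU, key=lambda q: cosine(query, embed(q, "ru")))


print(best_match("кошт роўмінг", "be"))
```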

Practice

As for practice: if you would like to play with our beta, drop me an email at tridemax@sapiensapi.com and I will give you access to the system. The only tool you need is the Chrome browser. Apologies in advance if the reply is delayed: we will have to set up a queue so as not to kill our beloved server.

Source: https://habr.com/ru/post/235763/
