
Asterisk + UniMRCP + VoiceNavigator. Synthesis and speech recognition in Asterisk. Part 4

Tags: VoiceNavigator, Asterisk, UniMRCP

Part 1
Part 2
Part 3

In this fourth part I will, as promised, cover the limitations and disadvantages of Asterisk as a voice platform and the peculiarities of its interaction with VoiceNavigator.
A voice platform is a hardware and software complex with speech synthesis and speech recognition functions that allows building solutions aimed at optimizing contact center operations, improving customer service, and creating voice self-service systems.
Asterisk is not a voice platform in the pure sense: working with synthesis and recognition servers is not its main task and is implemented via UniMRCP. Therefore, some of the shortcomings and limitations stem precisely from the UniMRCP implementation.

Among the large, well-known voice platforms used in call centers to build voice self-service systems, it is worth mentioning Avaya Voice Portal, Genesys Voice Platform, Siemens OpenScape CCVP, Cisco Unified CCX, Voxeo, and Voxpilot.
There is very little information about them in the Russian-speaking internet, and just as few specialists with the appropriate level of competence in these products. They are used in large call centers, priced accordingly, and affordable only for large companies.
If there is interest in any of these platforms, I am ready to write about them separately.

The two main drawbacks of the Asterisk + UniMRCP combination are the lack of VoiceXML support and the non-working barge-in for speech synthesis.


Lack of VoiceXML support

The industry standard for creating voice applications is the VoiceXML language, which is supported by neither UniMRCP nor Asterisk.

VoiceXML (Voice Extensible Markup Language) is a language designed for creating audio dialog scenarios that can synthesize speech, digitize voice, and recognize DTMF input and SRGS grammars.
VXML is a lot like HTML. A VoiceXML page may contain text to be read by a synthesized voice and pre-recorded sound files, as well as forms that capture and recognize spoken input. Depending on the command spoken by the user or the DTMF digits entered, a transition to another VoiceXML page may occur.

VXML is what all major voice platforms use, and it greatly simplifies the creation of applications.
Here is an example of an application that asks the caller to name a number from zero to ten and then repeats the named number. Try to do the same in Asterisk and compare) A rough dialplan sketch for comparison follows the VXML example.
  <?xml version="1.0" encoding="UTF-8"?>
 <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
        <form id="digits">
            <property name="bargein" value="false"/>
              <field name="digitfield">
                 <prompt bargein="false">Please state a number from zero to ten.</prompt>
                 <noinput>You did not give any number. Please state a number from zero to ten.</noinput>
                 <nomatch>Sorry, the number could not be recognized, please repeat.</nomatch>
                 <grammar type="application/srgs+xml" version="1.0" mode="voice" root="boolean" xml:lang="en-US">
                    <rule id="boolean" scope="public">
                       <one-of>
                          <item>zero</item>
                          <item>one</item>
                          <item>two</item>
                          <item>three</item>
                          <item>four</item>
                          <item>five</item>
                          <item>six</item>
                          <item>seven</item>
                          <item>eight</item>
                          <item>nine</item>
                          <item>ten</item>
                          <item><ruleref special="GARBAGE"/></item>
                       </one-of>
                    </rule>
                 </grammar>
                 <filled>
                    <prompt>You named the number <value expr="digitfield"/>.</prompt>
                    <goto next="#digits"/>
                 </filled>
              </field>
        </form>
 </vxml>
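
For comparison, here is a very rough sketch of the same dialogue in the Asterisk dialplan. It is only an illustration: it assumes a digits.xml SRGS grammar equivalent to the one above, the NLSML.agi result parser used elsewhere in this article, and that the parser puts the recognized text into RECOG_UTR0; the names and paths are placeholders, not working code from the original.

  ; Ask for a number, parse the NLSML result, voice it back, repeat.
  exten => digits,1(ask),MRCPRecog(${GRAMMARS_PATH}/digits.xml,ct=0.1&b=0&f=${SND_PATH}/ask_digit)
  exten => digits,n,Set(RECOG_UTR0=error)
  exten => digits,n,AGI(${AGI_PATH}/NLSML.agi,${QUOTE(${RECOG_RESULT})})
  ; Nothing recognized: ask again (the noinput/nomatch prompts are left out for brevity)
  exten => digits,n,GotoIf($["${RECOG_UTR0}" = "error"]?ask)
  ; Repeat the recognized number back via synthesis, then loop
  exten => digits,n,MRCPSynth(You named the number ${RECOG_UTR0}.)
  exten => digits,n,Goto(ask)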

For Asterisk, there is the VXI* project from i6net. It is a VoiceXML Browser: a VoiceXML interpreter for Asterisk based on OpenVXI.
My attempts to make it work were not crowned with success. It uses UniMRCP for recognition, while synthesis goes through a custom HTTP-TTS connector that has to be developed for each specific engine.
In addition, the product is commercial.
If someone has experience with VXI* or OpenVXI, I will be very glad to see your comments.

No barge-in for synthesis

In the previous articles I talked about barge-in and the f parameter of the MRCPRecog function. This parameter takes a sound file that is played as a greeting. If the parameter b=1 is also set, playback of the file is interrupted as soon as the caller starts speaking and the recognition session begins (see the sketch below).
Barge-in support is a mandatory feature when building voice applications. For example, in the case of a long message, or when a subscriber is not calling for the first time and already knows the system prompts, there is no need to force him to listen to the message to the end. And in general, we love to interrupt others, especially a robot))
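
A minimal illustration of this behavior; the grammar and sound file names are placeholders, not from the original:

  ; b=1 enables barge-in: the greeting (f=...) stops playing as soon as
  ; the caller starts speaking, and recognition starts immediately.
  exten => demo,1,MRCPRecog(${GRAMMARS_PATH}/menu.xml,ct=0.1&b=1&f=${SND_PATH}/greeting)
  ; With b=0 the caller would have to listen to the greeting to the end.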

The MRCPSynth function does not support barge-in. That is, a dynamically generated message, for example the system's response to the subscriber's previous choice, cannot be interrupted by voice.
Synthesis can be interrupted by DTMF, but in this case that is not a solution.
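
For completeness, a sketch of the DTMF interruption mentioned above. To my knowledge, app_unimrcp's MRCPSynth accepts an i= option listing the digits allowed to interrupt (check this against your version; the phrase is a placeholder):

  ; i=any lets any DTMF key cut the synthesis short; voice still cannot interrupt it.
  exten => demo,n,MRCPSynth(You have chosen the sales department.,i=any)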

As I see it, with the implementation chosen in UniMRCP, where synthesis and recognition run as separate sessions, barge-in for synthesis would not completely solve the problem anyway. Even if the synthesis were interrupted by voice, starting the recognition session that follows it takes time, and during that time the first few seconds of the subscriber's phrase would be lost, which can degrade recognition quality. VoiceXML does not have this problem: there, barge-in works both for pre-recorded phrases and for synthesis. The recognition session starts simultaneously with the synthesis session, or more precisely even before it, so the voice platform is already prepared to receive voice data and pass it to the MRCP server (see the example above).

In the application above, it is enough to change bargein="false" to bargein="true" for the synthesized speech to become interruptible.
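
For instance, the corresponding fragment of the example above would become:

  <property name="bargein" value="true"/>
  ...
  <prompt bargein="true">Please state a number from zero to ten.</prompt>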

A way to partially work around this flaw and trick barge-in

The main way around this drawback is to minimize the amount of synthesis and, wherever static phrases are used, to use pre-recorded files, especially at the end of a phrase.
Since people often start speaking before they have listened to a phrase to the end, the phrase should be built so that its last part is pre-recorded and can be interrupted, even if it is just one word.
For example, when confirming a phrase said by the caller (the value in quotation marks is the phrase recognized at the previous step):
- You said "sales department"... (synthesis)
- Is that right? (pre-recorded phrase)

- You have chosen "account operations"... (synthesis)
- Confirm? (pre-recorded phrase)

Here is an example of an application that quotes the fare from point A to point B.
  ; Use a pre-recorded file. Ask: "Say the city of departure."
  exten => app,n(level1),MRCPRecog(${GRAMMARS_PATH}/towns.xml,ct=0.1&b=1&f=${SND_PATH}/level1)
  exten => app,n,Set(RECOG_HYP_NUM=0)
  exten => app,n,Set(RECOG_UTR0=error)

  exten => app,n,AGI(${AGI_PATH}/NLSML.agi,${QUOTE(${RECOG_RESULT})})

  ; If the city is not recognized, go to the error_start priority, otherwise go to check_start
  exten => app,n,GotoIf($["${RECOG_UTR0}" = "error"]?error_start:check_start)

  ; Run the AGI script, where we validate the city, do some processing, etc.
  exten => app,n(check_start),AGI(${AGI_PATH}/check_start.agi)

  ; The first part of the phrase uses synthesis with the recognition result
  exten => app,n,MRCPSynth(<?xml version=\"1.0\"?><speak version=\"1.0\" xml:lang=\"ru-ru\" xmlns=\"http://www.w3.org/2001/10/synthesis\"><voice name=\"Maria8000\"><p>You chose ${Start_Town}.</p></voice></speak>)

  ; The second part is a pre-recorded file: "Name the destination city."
  exten => app,n(level2),MRCPRecog(${GRAMMARS_PATH}/towns.xml,ct=0.1&b=1&f=${SND_PATH}/level2)
  exten => app,n,Set(RECOG_HYP_NUM=0)
  exten => app,n,Set(RECOG_UTR0=error)

  exten => app,n,AGI(${AGI_PATH}/NLSML.agi,${QUOTE(${RECOG_RESULT})})

  ; If the city is not recognized, go to the error_finish priority, otherwise go to check_finish
  exten => app,n,GotoIf($["${RECOG_UTR0}" = "error"]?error_finish:check_finish)

  ; Run the AGI script, where we validate the city and calculate the fare
  exten => app,n(check_finish),AGI(${AGI_PATH}/check_finish.agi)

  ; Report the results of the calculation
  exten => app,n,MRCPSynth(<?xml version=\"1.0\"?><speak version=\"1.0\" xml:lang=\"ru-ru\" xmlns=\"http://www.w3.org/2001/10/synthesis\"><voice name=\"Maria8000\"><p>The fare from ${Start_Town} to ${Finish_Town} is ${Price}.</p></voice></speak>)

256 character variable length limit

Asterisk limits the length of a variable to 256 characters. If a value exceeds this size, the excess is simply discarded.
So far I have encountered this limitation in two cases.

1) You need to pass a string longer than 256 characters to VoiceNavigator for synthesis.
In a voice menu all phrases should be short and succinct, but sometimes you need, for example, to voice reference information that can exceed this limit. In that case you have to break the text into chunks of no more than 256 characters and voice them one by one (see the sketch below).
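
A minimal sketch of such chunking, with placeholder text; since the full text cannot fit into a single dialplan variable, splitting it at sentence boundaries is better done in an AGI script:

  ; Each Set() value stays under the 256-character limit; the parts are voiced in turn.
  exten => info,n,Set(PART1=First half of a long reference announcement kept under 256 characters)
  exten => info,n,Set(PART2=Second half of the announcement also kept under the limit)
  exten => info,n,MRCPSynth(${PART1})
  exten => info,n,MRCPSynth(${PART2})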

Perhaps Asterisk can be recompiled with a larger value of this parameter.
So far I have only found in the source code how to change the size for variables passed in call files, but not for Asterisk as a whole. Again, if there are craftsmen out there, I will be glad to hear about your experience.

2) You need to process an N-Best list larger than 2.
N-Best is the number of recognition results returned; the default is N-Best=1. You can increase it, and VoiceNavigator will then return several recognition results with the highest Confidence Level (recognition confidence).
For example, you can take the two most reliable recognition results and, if the difference in Confidence Level between them is very small, ask the subscriber to clarify what was said:
"Excuse me, did you say the sales department or the space research department?"
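
A rough sketch of such a check. The nb= option sets the N-Best list length in app_unimrcp's MRCPRecog; the RECOG_CL0/RECOG_CL1 confidence variables and the 0.1 threshold are assumptions about what the NLSML.agi parser exposes, not from the original:

  ; Request two hypotheses and compare their confidence levels.
  exten => app,n,MRCPRecog(${GRAMMARS_PATH}/depts.xml,ct=0.1&b=1&nb=2)
  exten => app,n,AGI(${AGI_PATH}/NLSML.agi,${QUOTE(${RECOG_RESULT})})
  ; If the two best results are too close, ask the caller to clarify.
  exten => app,n,GotoIf($[${RECOG_CL0} - ${RECOG_CL1} < 0.1]?clarify:proceed)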

If N-Best > 2 and the MRCP server returns more than two results, the NLSML response exceeds 256 characters and its end gets truncated. As a result, the parser cannot parse the NLSML, because it is no longer valid.

This case is very rare, and usually N-Best=2 is enough, but the problem is worth keeping in mind.

DTMF recognition does not work

Today, VoiceNavigator cannot receive DTMF signals from UniMRCP and, accordingly, cannot work with DTMF grammars. The developers promise to solve the problem in the near future.

DTMF is recognized by Asterisk itself (this is its standard functionality), but it is impossible to prompt the user to either say a phrase OR enter an extension number, i.e. DTMF and voice cannot be recognized in the same menu branch at the same time.
This is a serious limitation and must be taken into account when building a voice menu.
When VoiceNavigator is paired with other platforms, this problem does not exist.

These are the main problems and limitations that I have encountered while working with the Asterisk + UniMRCP + VoiceNavigator combination.

I am ready to answer your questions.

PS: STC (Speech Technology Center, the developer of VoiceNavigator) has a test VoiceNavigator instance accessible from the Internet. If you are interested in the technologies described in these articles, send me a private message and I will share the contacts and instructions on where and what to write to be given access for testing.

Source: https://habr.com/ru/post/128898/

