Asterisk + UniMRCP + VoiceNavigator. Speech synthesis and recognition in Asterisk. Part 2


The previous article described the overall architecture, installed UniMRCP, connected Asterisk to VoiceNavigator, and built a simple voice application.
This part looks more closely at what synthesis and recognition can do. The first half covers the SSML markup language, the second covers building grammars.

Using the SSML Markup Language

Speech synthesis is controlled at the linguistic and acoustic level with control tags in SSML format.
With tags you can define pronunciation and control intonation, speed, volume, and so on.
I will describe only the most commonly used tags.
Detailed information on all tags is available in the description of the standard.

Voice tag

Allows you to change the voice.
VoiceNavigator currently has 5 voices: Maria, Anna, Lydia, Alexander, Vladimir.
Example:

 <voice name="8000">  .</voice>
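
As a minimal sketch, the tag can also be generated from application code, which takes care of XML escaping. The helper name `voice_fragment` and the voice identifier `Maria8000` are assumptions made for illustration; the exact `name` values VoiceNavigator accepts should be taken from its documentation.

```python
import xml.etree.ElementTree as ET


def voice_fragment(text: str, voice_name: str) -> str:
    """Return text wrapped in an SSML <voice> element, XML-escaped."""
    voice = ET.Element("voice", {"name": voice_name})
    voice.text = text
    return ET.tostring(voice, encoding="unicode")


# "Maria8000" is a hypothetical voice identifier, not taken from the article.
print(voice_fragment("Hello & welcome.", "Maria8000"))
```

Building the element with a library rather than by string concatenation means characters such as `&` or `<` in the prompt text cannot break the markup.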

Say-as tag

Determines how the expression enclosed in the tag is read.

 <say-as   >  </say-as>

The tag's attributes are given as a list of the form:
attribute="value"
No more than one say-as tag may apply to a word. Nesting say-as tags is prohibited.

Stress attribute

Sets the number of the vowel that carries the main stress in the word.
Vowels are numbered from the start of the word, beginning with 1.
Format:

 <say-as stress=" ">  </say-as>

If the attribute value does not match the number of vowels in the word, the attribute is ignored.
Example:

 <say-as stress="2">  </say-as>

Instead of “kómpas” (stress on the first vowel), the word is read as “kompás” (stress on the second vowel).
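
Because an out-of-range value is silently ignored, it can be worth checking the vowel number before sending the prompt. The following Python sketch is only an illustration of that rule: the function name `say_as_stress`, the Russian vowel set, and the sample word компас ("compass") are assumptions, and the real engine may count vowels differently.

```python
from xml.sax.saxutils import escape

# Russian vowel letters; assumed here to be what the synthesizer counts.
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def say_as_stress(word: str, stress: int) -> str:
    """Wrap word in <say-as stress="N"> only if vowel number N exists."""
    vowel_count = sum(1 for ch in word.lower() if ch in RUSSIAN_VOWELS)
    if not 1 <= stress <= vowel_count:
        # The attribute would be ignored anyway, so emit the plain word.
        return escape(word)
    return f'<say-as stress="{stress}">{escape(word)}</say-as>'


print(say_as_stress("компас", 2))  # stress on the second vowel
print(say_as_stress("компас", 3))  # no third vowel: plain word
```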

Interpret-as attribute

“Date” value
Specifies a date in the Gregorian calendar. The text inside the tag consists of numeric fields separated by delimiters. The separator can be a period, hyphen, colon, or slash. The format attribute is required; its value is one of the following strings:
"mdy" - month, day, year
"dmy" - day, month, year
"ymd" - year, month, day
"md" - month, day
"dm" - day, month
"ym" - year, month
"my" - month, year
"m" - month
"d" - day
"y" - year
Examples:

 <say-as interpret-as="date" format="my"> 3/02 </say-as>

- “March two thousand and two”

 <say-as interpret-as="date" format="mdy"> 3/6/02 </say-as>

- “the sixth of March, two thousand and two”
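
The format rules above lend themselves to a small pre-check before the prompt is synthesized. This Python sketch is an assumption-laden illustration rather than VoiceNavigator's own validation: the name `say_as_date` is hypothetical, and the check simply requires one numeric field per letter of the format string.

```python
import re

# Format strings listed in the article.
DATE_FORMATS = {"mdy", "dmy", "ymd", "md", "dm", "ym", "my", "m", "d", "y"}


def say_as_date(value: str, fmt: str) -> str:
    """Build a <say-as interpret-as="date"> tag after basic validation."""
    if fmt not in DATE_FORMATS:
        raise ValueError(f"unsupported date format: {fmt!r}")
    value = value.strip()
    # Allowed separators: period, hyphen, colon, slash.
    fields = re.split(r"[.\-:/]", value)
    if len(fields) != len(fmt) or not all(f.isdigit() for f in fields):
        raise ValueError(f"{value!r} does not match format {fmt!r}")
    return f'<say-as interpret-as="date" format="{fmt}">{value}</say-as>'


print(say_as_date("3/6/02", "mdy"))
print(say_as_date("3/02", "my"))
```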

“Time” value
Sets the time value. The text inside the tag is specified as numeric fields with or without delimiters, in the sequence: hours, minutes, seconds. The separator can be a period, hyphen, colon or slash.
Examples:

 <say-as interpret-as="time">2230</say-as>

- “twenty-two thirty”

 <say-as interpret-as="time"> 9:21:30 </say-as>

- “nine hours twenty-one minutes thirty seconds”

“Telephone” value
Reads the given word or group of words as digits or as a phone number. A phone number may contain the “+” sign and parentheses. Numbers are read as cardinal numerals in the nominative case, split into two- and three-digit groups. Non-numeric words inside the tag are processed in the usual way.
Example:

 <say-as interpret-as="telephone"> +7 (812) 1234567  2345</say-as>

“Characters” value
Spells out the given word or group of words. Letters are read by their alphabet names, digits as cardinal numerals in the nominative case, and special characters and punctuation marks are replaced by the corresponding words. Uppercase and lowercase letters are not distinguished.
Example:

 <say-as interpret-as="characters"> 2a24-B!Z?#7X </say-as>

- “BE two a two four hyphen bi exclamation mark zet X question mark hash seven X”

Break tag

Adds a pause of a specified duration or type.
Attributes:
strength - the “expressiveness” of the pause. Valid values are “none”, “x-weak”, “weak”, “medium” (default), “strong”, “x-strong”.
time - pause duration in milliseconds.
strength and time can be specified together; strength then affects only the intonation, and time only the duration of the pause.
Example of a 3.6-second pause:

    . <break time="3600"/>   ?
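
A pause usually sits between two pieces of text, which is easy to get wrong when concatenating strings by hand. The Python sketch below shows one way to build such a prompt; the helper name `with_pause` and the sample sentences are assumptions. It wraps the result in `speak`, the root element defined by the SSML standard, which the fragments in this article omit.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Strength values listed in the article.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}


def with_pause(before: str, after: str, time_ms: int,
               strength: Optional[str] = None) -> str:
    """Join two pieces of text with a <break> element between them."""
    if time_ms < 0:
        raise ValueError("time must not be negative")
    if strength is not None and strength not in STRENGTHS:
        raise ValueError(f"unknown strength: {strength!r}")
    speak = ET.Element("speak")
    speak.text = before                 # text before the pause
    attrs = {"time": str(time_ms)}      # milliseconds, as in the article
    if strength is not None:
        attrs["strength"] = strength
    pause = ET.SubElement(speak, "break", attrs)
    pause.tail = after                  # text after the pause
    return ET.tostring(speak, encoding="unicode")


print(with_pause("Please wait.", " Are you still there?", 3600))
```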

Prosody tag

Controls the pitch, rate, and volume of speech.
Attributes (all optional):
pitch - the average pitch. Valid values are from 0.5 to 2.
rate - speech rate. Valid values: a relative change or one of “x-slow”, “slow”, “medium”, “fast”, “x-fast”, “default”.
volume - volume. Valid values: a relative change or one of “silent”, “x-soft”, “soft”, “medium”, “loud”, “x-loud”, “default”.
Example:

 <prosody volume="25" rate="x-slow">  !</prosody>
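
Since all three attributes are optional and pitch has a fixed range, a small builder can keep prompts within the documented limits. This Python sketch is illustrative: the function name `prosody` and the sample text are assumptions, and rate and volume are passed through unchecked because the article allows both keywords and relative values for them.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def prosody(text: str, pitch: Optional[float] = None,
            rate: Optional[str] = None, volume: Optional[str] = None) -> str:
    """Wrap text in <prosody>, emitting only the attributes that were given."""
    attrs = {}
    if pitch is not None:
        if not 0.5 <= pitch <= 2:
            raise ValueError("pitch must be between 0.5 and 2")
        attrs["pitch"] = str(pitch)
    if rate is not None:
        attrs["rate"] = rate
    if volume is not None:
        attrs["volume"] = volume
    element = ET.Element("prosody", attrs)
    element.text = text
    return ET.tostring(element, encoding="unicode")


print(prosody("Attention!", volume="25", rate="x-slow"))
```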

Phoneme tag

Specifies a phonemic transcription.
Attributes:
ph - the transcription (required);
alphabet - the alphabet used to write the phonemes. VoiceNavigator supports the IPA alphabet.
Example:

 <phoneme alphabet="ipa" ph="   IPA (    )">

Building SRGS Grammars

SRGS (Speech Recognition Grammar Specification) is a W3C standard that describes the structure of grammars used in speech recognition. An SRGS grammar specifies the words or phrases that the speech engine can recognize.

The basic structure of an SRGS grammar is shown below:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" mode="voice" xml:lang="ru-RU" root="velo">
   <rule id="velo">  </rule>
 </grammar>

The entire SRGS document is enclosed in the grammar tag.
The grammar contains recognition rules, each described by a rule tag and given a name, unique within the grammar, by its id attribute. The rule from which recognition starts is named by the root attribute of the grammar tag.

Alternatives

Alternatives let the engine recognize one word from a given set.
They are specified with the one-of and item tags.
For example:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" mode="voice" xml:lang="ru-RU" root="velo">
   <rule id="velo">
     <one-of>
       <item></item>
       <item></item>
       <item></item>
     </one-of>
   </rule>
 </grammar>

This “extended” grammar recognizes the utterances “red bicycle”, “green bicycle”, or “blue bicycle”.
An item element can contain anything allowed in an SRGS rule, including word sequences or another one-of element.
A more complete grammar can also assign a weight to each alternative branch of recognition. The weight is set with the weight attribute of the item element.
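
When the list of alternatives comes from a database or config file, the grammar can be generated instead of written by hand. The following Python sketch is one way to do that under stated assumptions: the function name `build_grammar` is hypothetical, and the English words are stand-ins for the Russian words a real `ru-RU` grammar would contain.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"


def build_grammar(root_id, alternatives, suffix, lang="ru-RU"):
    """Build a one-rule SRGS grammar: one of the alternatives, then suffix.

    alternatives is a list of (word, weight) pairs; weight may be None.
    """
    grammar = ET.Element("grammar", {
        "version": "1.0", "xmlns": SRGS_NS, "mode": "voice",
        "xml:lang": lang, "root": root_id,
    })
    rule = ET.SubElement(grammar, "rule", {"id": root_id})
    one_of = ET.SubElement(rule, "one-of")
    for word, weight in alternatives:
        attrs = {} if weight is None else {"weight": str(weight)}
        item = ET.SubElement(one_of, "item", attrs)
        item.text = word
    one_of.tail = " " + suffix  # word that follows the alternatives
    return ET.tostring(grammar, encoding="unicode")


# Stand-in English words; the weight marks "red" as the likelier choice.
print(build_grammar("velo", [("red", 2.0), ("green", None), ("blue", None)],
                    "bicycle"))
```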

Rule references

Rules may contain references to other rules. A reference is made with the ruleref element and lets the same rule be reused in different places in the grammar.
In the following example, the bicycle colour is moved into a separate sub-rule:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" mode="voice" xml:lang="ru-RU" root="velo">
   <rule id="velo">
     <ruleref uri="#color"/>
   </rule>
   <rule id="color">
     <one-of>
       <item weight="50"></item>
       <item></item>
       <item></item>
     </one-of>
   </rule>
 </grammar>
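
A misspelled rule name in a ruleref or in the root attribute is an easy mistake to make. As a rough sketch, the Python check below parses a grammar and reports local references that do not resolve; the function name `check_rule_references` and the sample grammar with English stand-in words are assumptions, and external references (URIs not starting with `#`) are deliberately skipped.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

GRAMMAR = """<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         mode="voice" xml:lang="en-US" root="velo">
  <rule id="velo"><ruleref uri="#color"/> bicycle</rule>
  <rule id="color">
    <one-of><item>red</item><item>green</item><item>blue</item></one-of>
  </rule>
</grammar>"""


def check_rule_references(grammar_xml: str) -> list:
    """Return a list of problems with root and local ruleref targets."""
    root = ET.fromstring(grammar_xml)
    rule_ids = {rule.get("id") for rule in root.iter(f"{{{SRGS_NS}}}rule")}
    problems = []
    if root.get("root") not in rule_ids:
        problems.append(f"root rule {root.get('root')!r} is not defined")
    for ref in root.iter(f"{{{SRGS_NS}}}ruleref"):
        uri = ref.get("uri")
        # Only local references of the form "#rule-id" are checked.
        if uri and uri.startswith("#") and uri[1:] not in rule_ids:
            problems.append(f"ruleref {uri!r} has no matching rule")
    return problems


print(check_rule_references(GRAMMAR))
print(check_rule_references(GRAMMAR.replace("#color", "#colour")))
```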

Special rules

SRGS has special rules with reserved names: NULL and GARBAGE.
Special rules are referenced through the special attribute of the ruleref element.
The NULL rule is triggered automatically if the user says nothing.
The GARBAGE rule (which uses an “average speech” model) lets you create so-called “open grammars”, i.e. separate the words in the grammar from "junk".
For example:

 <?xml version="1.0" encoding="UTF-8"?>
 <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" mode="voice" xml:lang="ru-RU" root="velo">
   <rule id="velo"> <ruleref special="GARBAGE"/>  <ruleref special="GARBAGE"/>  </rule>
 </grammar>

If the user says “I want a mountain red bike”, the system returns “mountain bike” as the recognition result.

The VoiceNavigator grammar parser also supports its own special rule, SYLLABLES, which plugs in a “syllable model” of speech.
In many cases the syllable model reduces the likelihood of false positives and more reliably rejects “garbage” utterances that the grammar does not cover.

Semantic interpretation

Semantic interpretation is a mechanism that assigns a value to a recognized word so that the value can be used in the logic of a voice application.
For example, the user can say “yes”, “good”, or “agree”, but the semantic result of all these words is the same.
In an SRGS grammar, semantics is specified with the tag element. The content type of this element is set by the tag-format attribute of the grammar element. The Semantic Interpretation for Speech Recognition specification defines standard values for tag-format: semantics/1.0-literals and semantics/1.0.
With semantics/1.0-literals, the content of the tag element is a simple string.
The semantics/1.0 type is a more powerful tool in the form of a scripting language: the tag element contains ECMAScript code.
Example:

 <?xml version="1.0" encoding="utf-8"?>
 <!-- / -->
 <grammar xml:lang="ru-RU" root="da-net" mode="voice" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0-literals">
   <rule id="da-net">
     <one-of>
       <item><tag>yes</tag></item>
       <item><tag>yes</tag></item>
       <item><tag>yes</tag></item>
       <item><tag>yes</tag></item>
       <item><tag>yes</tag></item>
       <item><tag>no</tag></item>
       <item><tag>no</tag></item>
       <item><tag>no</tag></item>
       <item><tag>no</tag></item>
       <item><tag>back</tag></item>
     </one-of>
   </rule>
 </grammar>
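
To see what a literals grammar gives the application, it helps to list which spoken phrase maps to which semantic value. The Python sketch below does this for a small sample grammar; the function name `literal_semantics` and the English phrases are assumptions standing in for the Russian words of a real grammar, and it only handles the simple case of one phrase followed by one tag per item.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

GRAMMAR = """<grammar xml:lang="en-US" root="yes-no" mode="voice" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals">
  <rule id="yes-no">
    <one-of>
      <item>yes<tag>yes</tag></item>
      <item>good<tag>yes</tag></item>
      <item>agree<tag>yes</tag></item>
      <item>no<tag>no</tag></item>
      <item>go back<tag>back</tag></item>
    </one-of>
  </rule>
</grammar>"""


def literal_semantics(grammar_xml: str) -> dict:
    """Map each item's phrase to the literal value in its <tag> child."""
    root = ET.fromstring(grammar_xml)
    mapping = {}
    for item in root.iter(f"{{{SRGS_NS}}}item"):
        tag = item.find(f"{{{SRGS_NS}}}tag")
        phrase = (item.text or "").strip()  # words spoken by the caller
        if tag is not None and phrase:
            mapping[phrase] = (tag.text or "").strip()
    return mapping


for phrase, value in literal_semantics(GRAMMAR).items():
    print(f"{phrase!r} -> {value!r}")
```

The voice application then only has to branch on the values ("yes", "no", "back"), however many synonyms the grammar accepts.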

This completes the basic theory. I hope it was not too tedious. In the next part I will walk through a practical case and show how synthesis and recognition make life easier when building voice menus.

Source: https://habr.com/ru/post/125147/
