
ASR and TTS technologies for an application programmer: theoretical minimum

Introduction


In the past few years, voice interfaces have been surrounding us more and more closely. What was once shown only in films about the distant future has turned out to be quite real. It has come to the point where speech synthesis (Text-To-Speech, TTS) and speech recognition (Automatic Speech Recognition, ASR) engines are embedded in mobile phones. Moreover, quite accessible APIs have appeared for embedding ASR and TTS into applications.

Nowadays anyone can create programs with a voice interface (well, anyone who can afford to license an engine). Our review will focus on using existing engines (Nuance, for example) rather than on building them. We will also cover the general background that every programmer encountering speech interfaces for the first time should know. The article may also be useful to project managers trying to assess the feasibility of integrating voice technologies into their products.
So, let's begin…

But first, as a teaser, an anecdote:
A Russian language lesson in a Georgian school.
The teacher says: "Children, remember: the words sol ('salt'), fasol ('beans') and vermishel ('vermicelli') are written with a soft sign, while vilka ('fork'), bulka ('bun') and tarelka ('plate') are written without one. Remember this, children, because it is impossible to understand!"
This joke used to strike me as merely funny. Now it feels more like everyday life. Why? I will try to explain.

1. Phonemes


Speaking about speech (which is already funny), we first of all have to deal with the notion of a phoneme. Put simply, a phoneme is a distinct sound that a person can pronounce and recognize. But such a definition is of course not enough, because one can utter a great many sounds, while the set of phonemes in any language is limited; we would like a stricter definition. So we have to go cap in hand to the philologists. Alas, philologists themselves cannot agree on what a phoneme is (and they do not really need to), but they have several approaches. One ties phonemes to meaning: the English Wikipedia, for example, calls the phoneme "the smallest contrastive linguistic unit". Others tie it to perception: N. Trubetzkoy wrote that "phonological units which, from the standpoint of a given language, cannot be decomposed into shorter successive phonological units, we call phonemes". Both definitions contain clarifications that matter to us. On the one hand, changing a phoneme can (but does not have to) change the meaning of a word: "code" and "cat" will be perceived as two different words. On the other hand, you can pronounce "museum" with a slightly different vowel and the meaning will not change; at most your listeners will place your accent. The indivisibility of phonemes is also important, but, as Trubetzkoy rightly noted, it may depend on the language: where a speaker of one language hears a single sound, a speaker of another may hear two sounds in succession. It would be desirable, though, to have phonetic invariants suitable for all languages at once, not just for one.

2. Phonetic alphabet


To bring some order to these definitions, the International Phonetic Alphabet (IPA) was created back in 1888. It is good precisely because it does not depend on any particular language: it is designed, as it were, for a "superman" able to pronounce and recognize the sounds of nearly all living (and even dead) languages. The IPA kept evolving right up to recent times (the latest revision dates from 2005). Since it was created mostly in the pre-computer era, philologists drew the symbols for sounds more or less as the spirit moved them. They were loosely guided by the Latin alphabet, but only very loosely. As a result, IPA characters now exist in Unicode, yet typing them from a keyboard is anything but easy. The reader may ask: why would an ordinary person need the IPA, and where can one at least see phonetically written words? My answer: an ordinary person does not need to know the IPA, but it is very easy to run into - many Wikipedia articles on place names, surnames and proper names include it. Knowing the IPA, you can always check how a name in an unfamiliar language is really pronounced. Want to say "Paris" like a Frenchman? Here you are: [paʁi].

3. Phonetic transcription


An attentive Wikipedia user may notice that sometimes the strange phonetic symbols sit inside square brackets - [mɐˈskva] - and sometimes inside slashes - /ˈlʌndən/. What is the difference? Square brackets hold the so-called narrow transcription (in Russian literature it is called phonetic); slashes hold the broad, or phonemic, one. The practical meaning is this: narrow transcription records an extremely precise pronunciation, in some sense an ideal one, independent of the speaker's accent. Given a narrow transcription, we can say "this is exactly how a Cockney would pronounce the word". Phonemic transcription allows variation: the same record between slashes may be pronounced differently by an Australian and by a Canadian. In truth, even a narrow transcription is far from unambiguous - that is, still pretty far from a wav file. Male, female and children's voices render the same phoneme differently, and the overall tempo of speech, its loudness and the base pitch of the voice are not captured at all. It is precisely these differences that make speech synthesis and recognition non-trivial. From here on I will use IPA in narrow transcription unless stated otherwise, and I will try to keep direct use of the IPA to a reasonable minimum.

4. Languages


Every living natural language has its own set of phonemes. Strictly speaking, this is a property of speech, since one can know a language without being able to pronounce its words at all (this is how deaf people learn languages). The phonetic composition of languages differs in roughly the same way as their alphabets do, and so does their phonetic complexity. That complexity has two components: first, how hard it is to convert graphemes into phonemes (remember the joke that the English write "Manchester" and read "Liverpool"), and second, how hard the sounds (phonemes) themselves are to pronounce. How many phonemes does a language typically have? A few dozen. Since childhood we were taught that Russian pronunciation is dead simple and that everything is read exactly as written, unlike in those tricky European languages. Of course we were deceived! If you read Russian words literally as they are spelled, you will be understood, though not always correctly - but nobody will take you for a Russian. On top of that comes something as terrifying for a European as free stress. Instead of sitting at the beginning of the word (as in English) or at the end (as in French), it wanders around the word as it pleases and changes the meaning: доро́ги ("roads") and дороги́ ("are expensive") are two different words and even different parts of speech. How many phonemes are there in Russian? Nuance counts 54. For comparison, English has only 45 and French even fewer - 34. No wonder the aristocrats of a couple of centuries ago considered French an easy language to learn! Russian is of course not the most difficult language in Europe, but it is one of the harder ones (and note that I am still silent about grammar).

5. X-SAMPA and LH+


People wanted to type phonetic transcriptions from the keyboard long before Unicode became widespread, so notations were developed that make do with the characters of the ASCII table. The two most common are X-SAMPA, the work of Professor John Wells, and LH+, the internal format of Lernout & Hauspie, whose technologies were later acquired by Nuance Communications. There is a rather significant difference between them. Formally, X-SAMPA is just a notation that, following certain rules, writes the very same IPA phonemes using only ASCII. LH+ is a different matter: in a sense, it is an analogue of broad (phonemic) transcription, because in each language the same LH+ symbol can stand for different IPA phonemes. On the one hand this is good - the record is shorter and there is no need to encode every possible IPA character; on the other hand, it introduces ambiguity, and to convert to IPA you have to keep a per-language correspondence table at hand. The saddest part, however, is that a string written in LH+ can be pronounced correctly only by a "voice" for that specific language.
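
To make the idea of an ASCII notation more tangible, here is a toy snippet (my own illustration, not taken from any engine) that converts a handful of X-SAMPA symbols to their IPA counterparts; real correspondence tables are of course far larger.

#include <iostream>
#include <map>
#include <string>

// A tiny illustrative subset of the X-SAMPA -> IPA correspondence.
static const std::map<std::string, std::string> kXSampaToIpa = {
    {"S", "ʃ"},   // voiceless postalveolar fricative
    {"Z", "ʒ"},   // voiced postalveolar fricative
    {"N", "ŋ"},   // velar nasal
    {"@", "ə"},   // schwa
    {"{", "æ"},   // near-open front vowel
    {"R", "ʁ"}    // voiced uvular fricative (the French 'r')
};

int main() {
    // "p a R i" is "Paris" in X-SAMPA; symbols without a table entry map to themselves.
    const std::string word[] = {"p", "a", "R", "i"};
    for (const auto& sym : word) {
        auto it = kXSampaToIpa.find(sym);
        std::cout << (it != kXSampaToIpa.end() ? it->second : sym);
    }
    std::cout << std::endl;  // prints: paʁi
    return 0;
}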

6. Voices


No, this will not be about the voices that programmers who have written too much bad code hear in their heads. Rather, about the ones that owners of navigators and other mobile devices so often hunt for on trackers and file-sharing sites. These voices even have names: the words "Milena" and "Katerina" say a lot to an experienced user of voice interfaces. What are they? Roughly speaking, data sets prepared by companies such as Nuance that let a computer turn phonemes into sound. Voices come in female and male variants and cost real money: depending on the platform and the vendor, you may be asked for two to five thousand dollars per voice. So if you want an interface in even the five most common European languages, the bill can run into tens of thousands (we are, of course, talking about the programmatic interface). A voice is thus language-specific, and this is where its binding to a phonetic transcription comes from. It is not obvious at first, but the anecdote at the beginning of the article is true to life: people usually simply cannot pronounce phonemes of another language that do not exist in their own - and, even worse, not just individual phonemes but also certain combinations of them. If in your language a word never ends in a soft "l", you will not be able to pronounce one (at least not right away).

The same goes for voices. A voice is tuned to pronounce only the phonemes present in its language - and, more than that, in a particular dialect of that language. Voices for Canadian French and for the French of France will not only sound different, they will have different sets of pronounceable phonemes. This, by the way, suits the makers of ASR and TTS engines, since each language can be sold separately. On the other hand, one can understand them: creating a voice is laborious and expensive, which is probably why there is still no broad market of open-source voices for most languages.

It would seem that nothing prevents creating a "universal" voice able to pronounce every IPA phoneme, which would solve the problem of multilingual interfaces at a stroke. Yet nobody does it, and most likely it would not work anyway: such a voice would speak, but every native listener would be unhappy with how unnatural it sounds - like Russian in the mouth of an Englishman who is still practicing, or English in the mouth of a Frenchman. So if you want multilingualism, get ready to pay up.

7. Example of using TTS API


To give the reader an idea of what working with TTS looks like at the lower level (in C++), here is an example of speech synthesis based on the Nuance engine. It is, of course, incomplete: it can be neither run nor even compiled, but it gives a feel for the process. All functions except TTS_Speak() are merely the plumbing around it.

TTS_Initialize() - initializes the engine
TTS_Cleanup() - deinitializes it
TTS_SelectLanguage() - selects the language and sets the voice parameters (frequency, voice model, rate, volume)
TTS_Speak() - actually generates the sound samples
TTS_Callback() - called when the next portion of audio data is ready for playback, as well as on other engine events

TTS and the plumbing around it
static const NUAN_TCHAR * _dataPathList[] = { __TEXT("\\lang\\"), __TEXT("\\tts\\"), }; static VPLATFORM_RESOURCES _stResources = { VPLATFORM_CURRENT_VERSION, sizeof(_dataPathList)/sizeof(_dataPathList[0]), (NUAN_TCHAR **)&_dataPathList[0], }; static VAUTO_INSTALL _stInstall = {VAUTO_CURRENT_VERSION}; static VAUTO_HSPEECH _hSpeech = {NULL, 0}; static VAUTO_HINSTANCE _hTtsInst = {NULL, 0}; static WaveOut * _waveOut = NULL; static WaveOutBuf * _curBuffer = NULL; static int _volume = 100; static int _speechRate = 0; // use default speech rate static NUAN_ERROR _Callback (VAUTO_HINSTANCE hTtsInst, VAUTO_OUTDEV_HINSTANCE hOutDevInst, VAUTO_CALLBACKMSG * pcbMessage, VAUTO_USERDATA UserData); static const TCHAR * _szLangTLW = NULL; static VAUTO_PARAMID _paramID[] = { VAUTO_PARAM_SPEECHRATE, VAUTO_PARAM_VOLUME }; static NUAN_ERROR _TTS_GetFrequency(VAUTO_HINSTANCE hTtsInst, short *pFreq) { NUAN_ERROR Error = NUAN_OK; VAUTO_PARAM TtsParam; /*-- get frequency used by current voicefont --*/ TtsParam.eID = VAUTO_PARAM_FREQUENCY; if (NUAN_OK != (Error = vauto_ttsGetParamList (hTtsInst, &TtsParam, 1)) ) { ErrorV(_T("vauto_ttsGetParamList rc=0x%1!x!\n"), Error); return Error; } switch(TtsParam.uValue.usValue) { case VAUTO_FREQ_8KHZ: *pFreq = 8000; break; case VAUTO_FREQ_11KHZ: *pFreq = 11025; break; case VAUTO_FREQ_16KHZ: *pFreq = 16000; break; case VAUTO_FREQ_22KHZ: *pFreq = 22050; break; default: break; } return NUAN_OK; } int TTS_SelectLanguage(int langId) { NUAN_ERROR nrc; VAUTO_LANGUAGE arrLanguages[16]; VAUTO_VOICEINFO arrVoices[4]; VAUTO_SPEECHDBINFO arrSpeechDB[4]; NUAN_U16 nLanguageCount, nVoiceCount, nSpeechDBCount; nLanguageCount = sizeof(arrLanguages)/sizeof(arrLanguages[0]); nVoiceCount = sizeof(arrVoices) /sizeof(arrVoices[0]); nSpeechDBCount = sizeof(arrSpeechDB)/sizeof(arrSpeechDB[0]); int nVoice = 0, nSpeechDB = 0; nrc = vauto_ttsGetLanguageList( _hSpeech, &arrLanguages[0], &nLanguageCount); if(nrc != NUAN_OK){ TTS_ErrorV(_T("vauto_ttsGetLanguageList rc=0x%1!x!\n"), nrc); return 0; } if(nLanguageCount == 0 || nLanguageCount<=langId){ TTS_Error(_T("vauto_ttsGetLanguageList: No proper languages found.\n")); return 0; } _szLangTLW = arrLanguages[langId].szLanguageTLW; NUAN_TCHAR* szLanguage = arrLanguages[langId].szLanguage; nVoice = 0; // select first voice; NUAN_TCHAR* szVoiceName = arrVoices[nVoice].szVoiceName; nSpeechDB = 0; // select first speech DB { VAUTO_PARAM stTtsParam[7]; int cnt = 0; // language stTtsParam[cnt].eID = VAUTO_PARAM_LANGUAGE; _tcscpy(stTtsParam[cnt].uValue.szStringValue, szLanguage); cnt++; // voice stTtsParam[cnt].eID = VAUTO_PARAM_VOICE; _tcscpy(stTtsParam[cnt].uValue.szStringValue, szVoiceName); cnt++; // speechbase parameter - frequency stTtsParam[cnt].eID = VAUTO_PARAM_FREQUENCY; stTtsParam[cnt].uValue.usValue = arrSpeechDB[nSpeechDB].u16Freq; cnt++; // speechbase parameter - reduction type stTtsParam[cnt].eID = VAUTO_PARAM_VOICE_MODEL; _tcscpy(stTtsParam[cnt].uValue.szStringValue, arrSpeechDB[nSpeechDB].szVoiceModel); cnt++; if (_speechRate) { // Speech rate stTtsParam[cnt].eID = VAUTO_PARAM_SPEECHRATE; stTtsParam[cnt].uValue.usValue = _speechRate; cnt++; } if (_volume) { // Speech volume stTtsParam[cnt].eID = VAUTO_PARAM_VOLUME; stTtsParam[cnt].uValue.usValue = _volume; cnt++; } nrc = vauto_ttsSetParamList(_hTtsInst, &stTtsParam[0], cnt); if(nrc != NUAN_OK){ ErrorV(_T("vauto_ttsSetParamList rc=0x%1!x!\n"), nrc); return 0; } } return 1; } int TTS_Initialize(int defLanguageId) { NUAN_ERROR nrc; nrc = vplatform_GetInterfaces(&_stInstall, &_stResources); if(nrc 
!= NUAN_OK){ Error(_T("vplatform_GetInterfaces rc=%1!d!\n"), nrc); return 0; } nrc = vauto_ttsInitialize(&_stInstall, &_hSpeech); if(nrc != NUAN_OK){ Error(_T("vauto_ttsInitialize rc=0x%1!x!\n"), nrc); TTS_Cleanup(); return 0; } nrc = vauto_ttsOpen(_hSpeech, _stInstall.hHeap, _stInstall.hLog, &_hTtsInst, NULL); if(nrc != NUAN_OK){ ErrorV(_T("vauto_ttsOpen rc=0x%1!x!\n"), nrc); TTS_Cleanup(); return 0; } // Ok, time to select language if(!TTS_SelectLanguage(defLanguageId)){ TTS_Cleanup(); return 0; } // init Wave out device { short freq; if (NUAN_OK != _TTS_GetFrequency(_hTtsInst, &freq)) { TTS_ErrorV(_T("_TTS_GetFrequency rc=0x%1!x!\n"), nrc); TTS_Cleanup(); return 0; } _waveOut = WaveOut_Open(freq, 1, 4); if (_waveOut == NULL){ TTS_Cleanup(); return 0; } } // init TTS output { VAUTO_OUTDEVINFO stOutDevInfo; stOutDevInfo.hOutDevInstance = _waveOut; stOutDevInfo.pfOutNotify = TTS_Callback; // Notify using callback! nrc = vauto_ttsSetOutDevice(_hTtsInst, &stOutDevInfo); if(nrc != NUAN_OK){ ErrorV(_T("vauto_ttsSetOutDevice rc=0x%1!x!\n"), nrc); TTS_Cleanup(); return 0; } } // OK TTS engine initialized return 1; } void TTS_Cleanup(void) { if(_hTtsInst.pHandleData){ vauto_ttsStop(_hTtsInst); vauto_ttsClose(_hTtsInst); } if(_hSpeech.pHandleData){ vauto_ttsUnInitialize(_hSpeech); } if(_waveOut){ WaveOut_Close(_waveOut); _waveOut = NULL; } vplatform_ReleaseInterfaces(&_stInstall); memset(&_stInstall, 0, sizeof(_stInstall)); _stInstall.fmtVersion = VAUTO_CURRENT_VERSION; } int TTS_Speak(const TCHAR * const message, int length) { VAUTO_INTEXT stText; stText.eTextFormat = VAUTO_NORM_TEXT; stText.szInText = (void*) message; stText.ulTextLength = length * sizeof(NUAN_TCHAR); TraceV(_T("TTS_Speak: %1\n"), message); NUAN_ERROR rc = vauto_ttsProcessText2Speech(_hTtsInst, &stText); if (rc == NUAN_OK) { return 1; } if (rc == NUAN_E_TTS_USERSTOP) { return 2; } ErrorV(_T("vauto_ttsProcessText2Speech rc=0x%1!x!\n"), rc); return 0; } static NUAN_ERROR TTS_Callback (VAUTO_HINSTANCE hTtsInst, VAUTO_OUTDEV_HINSTANCE hOutDevInst, VAUTO_CALLBACKMSG * pcbMessage, VAUTO_USERDATA UserData) { VAUTO_OUTDATA * outData; switch(pcbMessage->eMessage){ case VAUTO_MSG_BEGINPROCESS: WaveOut_Start(_waveOut); break; case VAUTO_MSG_ENDPROCESS: break; case VAUTO_MSG_STOP: break; case VAUTO_MSG_OUTBUFREQ: outData = (VAUTO_OUTDATA *)pcbMessage->pParam; memset(outData, 0, sizeof(VAUTO_OUTDATA)); { WaveOutBuf * buf = WaveOut_GetBuffer(_waveOut); if(buf){ VAUTO_OUTDATA * outData = (VAUTO_OUTDATA *)pcbMessage->pParam; outData->eAudioFormat = VAUTO_16LINEAR; outData->pOutPcmBuf = WaveOutBuf_Data(buf); outData->ulPcmBufLen = WaveOutBuf_Size(buf); _curBuffer = buf; break; } TTS_Trace(_T("VAUTO_MSG_OUTBUFREQ: processing was stopped\n")); } return NUAN_E_TTS_USERSTOP; case VAUTO_MSG_OUTBUFDONE: outData = (VAUTO_OUTDATA *)pcbMessage->pParam; WaveOutBuf_SetSize(_curBuffer, outData->ulPcmBufLen); WaveOut_PutBuffer(_waveOut, _curBuffer); _curBuffer = NULL; break; default: break; } return NUAN_OK; } 



As the reader may notice, the code is rather cumbersome, and seemingly simple functionality requires a lot of setup. Alas, that is the flip side of the engine's flexibility. Of course, the APIs of other engines, and bindings for other languages, can be considerably simpler and more compact.
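
To tie the pieces together, here is roughly how these wrappers might be driven from application code. This is only a sketch under my own assumptions: the language index 0 and the phrase are arbitrary, and real code would also keep the platform's audio loop running while TTS_Callback() fills the output buffers.

// Hypothetical usage of the wrappers above (not part of the original listing).
int main()
{
    if (!TTS_Initialize(0))                 // 0 = index of the desired language (assumption)
        return 1;                           // engine or audio device failed to start

    const TCHAR msg[] = __TEXT("Turn left in two hundred meters");
    TTS_Speak(msg, (int)(sizeof(msg) / sizeof(msg[0]) - 1));   // length in characters

    // ... wait here until playback finishes (the callback keeps supplying buffers) ...

    TTS_Cleanup();                          // stop synthesis and release the engine
    return 0;
}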

8. Phonemes again


Looking at the API, the reader may ask: why do we need phonemes at all if TTS (Text-To-Speech) can convert text to speech directly? It can, but there is one "but". Familiar words are converted into speech well; with "unfamiliar" words - toponyms, proper names and the like - things are much worse. This is particularly visible in multinational countries such as Russia. The names of cities and villages across one sixth of the world's land were given by different peoples, in different languages and at different times. The need to write them in Russian letters played a cruel joke on the national languages: the phonemes of Tatars, Nenets, Abkhazians, Kazakhs, Yakuts and Buryats were crammed into the Procrustean bed of Russian, which, for all its richness, is still not enough to convey every language of the former Soviet Union. And even where the written form is reasonably close to the original, a TTS engine reading out a name like "Kyuchuk-Kaynardzhi" provokes nothing but laughter.

It would be naive, however, to think that this is only a problem of Russian. Similar difficulties exist in far more homogeneous countries. In French, for instance, the letters p, b, d, t, s at the end of a word are usually not pronounced, but for toponyms local tradition takes over: the final "s" in "Paris" is indeed silent, while in "Vallauris" it is pronounced. The difference is that Paris lies in the north of France, whereas Vallauris is in the south, in Provence, where the rules of pronunciation are somewhat different. That is why it is still desirable to have phonetic transcriptions for words; they usually ship with the map data, although there is no unity of format: NavTeq traditionally uses X-SAMPA transcriptions, while TomTom uses LH+. Fine if your TTS engine accepts both - and if not? Then you have to contrive something, for example convert one transcription into the other, which is itself non-trivial. If there is no phonetic information at all, the engine has its own ways of deriving it; in the Nuance engine these are "Data Driven Grapheme To Phoneme" (DDG2P) and the "Common Linguistic Component" (CLC). Falling back on them, however, is a last resort.

9. Special sequences


Nuance lets you not only pronounce either plain text or a phonetic transcription, but also switch between the two dynamically within a single string. This is done with an escape sequence.

More generally, escape sequences let you set quite a few parameters. The general form is:
                                          <ESC>\<param>=<value>\

For example,

\x1b\rate=110\ - sets the speaking rate
\x1b\vol=5\ - sets the volume
\x1b\audio="beep.wav"\ - inserts data from a wav file into the audio stream

Similarly, you can make the engine spell a word out letter by letter, insert pauses, change the voice (for example, from male to female) and much more. Not every sequence will be useful to you, but on the whole this is a very handy feature.
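
For illustration, such control sequences are simply embedded in the text handed to the engine. The snippet below reuses TTS_Speak() from section 7 and the parameter values from the examples above; note the doubled backslashes required in C string literals.

// Hypothetical example: set the rate, play a beep, then speak the phrase.
const TCHAR msg[] =
    __TEXT("\x1b\\rate=110\\")              // speaking rate
    __TEXT("\x1b\\audio=\"beep.wav\"\\")    // insert a wav file into the stream
    __TEXT("Turn right ahead");

TTS_Speak(msg, (int)(sizeof(msg) / sizeof(msg[0]) - 1));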

10. Dictionaries


Sometimes a certain set of words (acronyms, abbreviations, proper names, etc.) needs to be pronounced in a particular way, yet you do not want to replace the text with a phonetic transcription at every occurrence (and that is not always possible anyway). In such cases dictionaries come to the rescue. What is a dictionary in Nuance terminology? A file with a set of pairs <text> <transcription>. The file is compiled and then loaded by the engine. When speaking, the engine checks whether a word or phrase is present in the dictionary and, if it is, replaces it with the given transcription. Here, for example, is a dictionary with the names of the streets and squares of the Vatican.

 [Header]
 Name = Vaticano
 Language = ITI
 Content = EDCT_CONTENT_BROAD_NARROWS
 Representation = EDCT_REPR_SZZ_STRING
 [Data]
 "Largo del Colonnato" // 'lar.go_del_ko.lo.'n:a.to
 "Piazza del Governatorato" // 'pja.t&s:a_del_go.ver.na.to.'ra.to
 "Piazza della Stazione" // 'pja.t&s:a_de.l:a_sta.'t&s:jo.ne
 "Piazza di Santa Marta" // 'pja.t&s:a_di_'san.ta_'mar.ta
 "Piazza San Pietro" // 'pja.t&s:a_'sam_'pjE.tro
 "Piazzetta Châteauneuf Du Pape" // pja.'t&s:et:a_Sa.to.'nef_du_'pap
 "Salita ai Giardini" // sa.'li.ta_aj_d&Zar.'di.ni
 "Stradone dei Giardini" // stra.'do.ne_dej_d&Zar.'di.ni
 "Via dei Pellegrini" // 'vi.a_dej_pe.l:e.'gri.ni
 "Via del Fondamento" // 'vi.a_del_fon.da.'men.to
 "Via del Governatorato" // 'vi.a_del_go.ver.na.to.'ra.to
 "Via della Posta" // 'vi.a_de.l:a_'pOs.ta
 "Via della Stazione Vaticana" // 'vi.a_de.l:a_sta.'t&s:jo.ne_va.ti.'ka.na
 "Via della Tipografia" // 'vi.a_de.l:a_ti.po.gra.'fi.a
 "Via di Porta Angelica" // 'vi.a_di_'pOr.ta_an.'d&ZE.li.ka
 "Via Tunica" // 'vi.a_'tu.ni.ka
 "Viale Centro del Bosco" // vi.'a.le_'t&SEn.tro_del_'bOs.ko
 "Viale del Giardino Quadrato" // vi.'a.le_del_d&Zar.'di.no_kwa.'dra.to
 "Viale Vaticano" // vi.'a.le_va.ti.'ka.no

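Conceptually, what the engine does with a loaded dictionary is a substitution pass before synthesis: if the phrase is listed, its hand-crafted transcription wins, otherwise the ordinary letter-to-sound rules apply. The snippet below is only my illustration of that idea, not the engine's actual code.

#include <string>
#include <unordered_map>

// Toy model of dictionary lookup: phrase -> LH+ transcription.
static const std::unordered_map<std::string, std::string> kVaticanDict = {
    {"Piazza San Pietro", "'pja.t&s:a_'sam_'pjE.tro"},
    {"Viale Vaticano",    "vi.'a.le_va.ti.'ka.no"}
};

// Returns the hand-made transcription if the phrase is in the dictionary,
// or an empty string meaning "let the engine's G2P rules handle it".
std::string TranscriptionFor(const std::string& phrase)
{
    auto it = kVaticanDict.find(phrase);
    return it != kVaticanDict.end() ? it->second : std::string();
}
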

11. Recognition


Speech recognition is an even more challenging task than synthesis. Synthesizers of some sort worked back in the good old days, whereas sensible recognition has become available only recently. There are several reasons for this; the first is very much like the problem an ordinary person has with an unfamiliar language, the second is like running into a text from an unfamiliar domain.

Perceiving sound vibrations that resemble a voice, we first try to split them into phonemes - to isolate familiar sounds that can then be assembled into words. If the language is familiar, this is easy; if not, we most likely will not even split the speech into phonemes "correctly" (remember the jokes about hearing "Alla, I'm off to the bar!" in a foreign-language song). Where we hear one thing, the speaker means something entirely different. This happens because over the years our brain has been trained on a particular set of phonemes and has grown used to perceiving only them. Meeting an unfamiliar sound, it picks the closest phoneme of our native language - in a sense, this resembles the vector quantization used in speech codecs such as CELP - and there is no guarantee that the approximation will be a good one. That is why "native" phonemes will always be more "comfortable" for us.

Remember how, back in the USSR, when we met foreigners at school, we tried to "transliterate" our own names, saying something like:
"May neim iz Petroff"
The teachers scolded us for it: why are you mangling your surname? Do you think it becomes any clearer to him that way? Speak Russian!

Alas, they deceived us, or were themselves mistaken. If you can say your name by the rules of English/German/Chinese phonetics, a native speaker really will find it easier to understand. The Chinese grasped this long ago, and for dealing with Western partners they adopt special "European" names. In machine recognition, a particular language is described by a so-called acoustic model. Before recognition starts, we have to load the acoustic model of the specific language, thereby telling the program which phonemes to expect at its input.

The second problem is no less serious. Let us return to the analogy with a living person. Listening to an interlocutor, we subconsciously build a model of what he is going to say next; that is, we create the context of the conversation. If words from outside that context are SUDDENLY inserted into the narrative (say, "involute" in a conversation about football), the listener experiences cognitive dissonance. Roughly speaking, a computer lives in that state of dissonance all the time, because it never knows what to expect from a human. A human has it easier: he can simply ask. What is a computer to do? To solve this problem and give the computer the right context, grammars are used.

12. Grammar


A grammar (usually defined in BNF form) gives the computer - more precisely, the ASR engine - an idea of what to expect from the user at this particular moment. Usually it is several alternatives combined with "or", although more complex grammars are possible. Here is an example of a grammar for choosing a Kazan metro station:

 #BNF+EM V1.0;
 !grammar test;
 !start <metro_KAZAN_stations>;
 <metro_KAZAN_stations>:
 "Ametyevo" !id(0) !pronounce("^.m%je.t%jjI.vo-") |
 "Aviastroitelnaya" !id(1) !pronounce("^v%jI'astro-'it%jIl%jno-j^") |
 "Gorki" !id(2) !pronounce("'gor.k%jI") |
 "Kozya Sloboda" !id(3) !pronounce("'ko.z%jj^_slo-.b^.'da") |
 "Kremlyovskaya" !id(4) !pronounce("kr%jIm.'l%jof.sko-.j^") |
 "Ploshchad Gabdully Tukaya" !id(5) !pronounce("'plo.S%jIt%j_go-.bdu.'li0_'tu.ko-.j^") |
 "Prospekt Pobedy" !id(6) !pronounce("pr^.'sp%jekt_p^.'b%je.di0") |
 "Severny Vokzal" !id(7) !pronounce("'s%je.v%jIr.ni0j_v^g.'zal") |
 "Sukonnaya Sloboda" !id(8) !pronounce("'su.ko-.no-.j^_slo-.b^.'da") |
 "Yashlek" !id(9) !pronounce("ja.'Sl%jek");


As you can see, each line is one alternative, consisting of the text itself, an integer id and a phonetic transcription. The transcription is optional, but with it recognition is more accurate.
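
What is the id for? Once the engine reports which alternative won, application code usually branches on that number instead of parsing the recognized string. A hypothetical handler for the grammar above might look like this (the listing in section 13 only prints the hypotheses; how the winning id is extracted depends on the engine and is not shown there):

#include <cstdio>

// Hypothetical handler: the ids follow the !id(...) tags in the grammar above.
void OnStationRecognized(int id)
{
    switch (id) {
    case 0:  std::printf("Routing to Ametyevo\n");        break;
    case 4:  std::printf("Routing to Kremlyovskaya\n");   break;
    /* ... the remaining stations ... */
    default: std::printf("Station not in the grammar\n"); break;
    }
}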

How big can a grammar be? Quite big: in our experiments, for example, 37 thousand alternatives were still recognized at an acceptable level. Things are much worse with complex, heavily branched grammars - recognition time grows, quality drops, and the dependence on grammar size is non-linear. So my advice is to avoid complex grammars. At least for now.

Grammars (and their contexts) can be static or dynamic. You have already seen an example of a static grammar: it is compiled in advance and stored in the engine's internal binary format. Sometimes, however, the context changes while the user interacts with the program, for example when the list of alternatives is only known at run time. For such cases dynamic contexts are used: the application fills them with words (and, when available, transcriptions) on the fly and hands them to the recognizer alongside the static ones.

13. ASR API


To give the reader an idea of what recognition looks like at the lower level, here is an (even more abridged) example based on the Nuance VoCon engine. As with the TTS example, it can be neither run nor compiled, but it shows the overall flow: initialize the engine, load the acoustic model, create the recognizer and its contexts, then handle events and fetch the results. The main functions are:

ConstructRecognizer() - creates the recognizer objects: the acoustic model, the single-threaded recognizer and the lexicon/G2P objects used for dynamic contexts
DestroyRecognizer() - closes all of those objects
ASR_Initialize() - initializes the ASR engine and its components
ASR_UnInitialize() - deinitializes the ASR engine
evt_HandleEvent() - a callback invoked from the engine's thread on events (begin of speech, signal problems, ready results)
ProcessResult() - fetches and prints the N-best recognition hypotheses

ASR and the plumbing around it
 typedef struct RECOG_OBJECTS_S { void *pHeapInst; // Pointer to the heap. const char *acmod; // path to acmod data const char *ddg2p; // path to ddg2p data const char *clc; // path to clc data const char *dct; // path to dct data const char *dynctx; // path to empty dyn ctx data LH_COMPONENT hCompBase; // Handle to the base component. LH_COMPONENT hCompAsr; // Handle to the ASR component. LH_COMPONENT hCompPron; // Handle to the pron component (dyn ctx) LH_OBJECT hAcMod; // Handle to the AcMod object. LH_OBJECT hRec; // Handle to the SingleThreadedRec Object LH_OBJECT hLex; // Handle to lexicon object (dyn ctx) LH_OBJECT hDdg2p; // Handle to ddg2p object (dyn ctx) LH_OBJECT hClc; // Handle to the CLC (DDG2P backup) LH_OBJECT hDct; // Handle to dictionary object (dyn ctx) LH_OBJECT hCache; // Handle to cache object (dyn ctx) LH_OBJECT hCtx[5]; // Handle to the Context object. LH_OBJECT hResults[5]; // Handle to the Best results object. ASRResult *results[5]; // recognition results temporary storage LH_OBJECT hUswCtx; // Handle to the UserWord Context object. LH_OBJECT hUswResult; // Handle to the UserWord Result object. unsigned long sampleFreq; // Sampling frequency. unsigned long frameShiftSamples; // Size of one frame in samples int requestCancel; // boolean indicating user wants to cancel recognition // used to generate transcriptions for dyn ctx LH_BNF_TERMINAL *pTerminals; unsigned int terminals_count; unsigned int *terminals_transtype; // array with same size as pTerminals; each value indicates the type of transcription in pTerminal: user-provided, from_ddg2p, from_dct, from_clc SLOT_TERMINAL_LIST *pSlots; unsigned int slots_count; // reco options int isNumber; // set to 1 when doing number recognition const char * UswFile; // path to file where userword should be recorded char * staticCtxID; } RECOG_OBJECTS; // store ASR objects static RECOG_OBJECTS recogObjects; static int ConstructRecognizer(RECOG_OBJECTS *pRecogObjects, const char *szAcModFN, const char * ddg2p, const char * clc, const char * dct, const char * dynctx) { LH_ERROR lhErr = LH_OK; PH_ERROR phErr = PH_OK; ST_ERROR stErr = ST_OK; LH_ISTREAM_INTERFACE IStreamInterface; void *pIStreamAcMod = NULL; LH_ACMOD_INFO *pAcModInfo; LH_AUDIOCHAINEVENT_INTERFACE EventInterface; /* close old objects */ if(!lh_ObjIsNull(pRecogObjects->hAcMod)){ DestroyRecognizer(pRecogObjects); } pRecogObjects->sampleFreq = 0; pRecogObjects->requestCancel = 0; pRecogObjects->pTerminals = NULL; pRecogObjects->terminals_count = 0; pRecogObjects->pSlots = NULL; pRecogObjects->slots_count = 0; pRecogObjects->staticCtxID = NULL; pRecogObjects->acmod = szAcModFN; pRecogObjects->ddg2p = ddg2p; pRecogObjects->clc = clc; pRecogObjects->dct = dct; pRecogObjects->dynctx = dynctx; EventInterface.pfevent = evt_HandleEvent; EventInterface.pfadvance = evt_Advance; // Create the input stream for the acoustic model. stErr = st_CreateStreamReaderFromFile(szAcModFN, &IStreamInterface, &pIStreamAcMod); if (ST_OK != stErr) goto error; // Create the AcMod object. lhErr = lh_CreateAcMod(pRecogObjects->hCompAsr, &IStreamInterface, pIStreamAcMod, NULL, &(pRecogObjects->hAcMod)); if (LH_OK != lhErr) goto error; // Retrieve some information from the AcMod object. 
lhErr = lh_AcModBorrowInfo(pRecogObjects->hAcMod, &pAcModInfo); if (LH_OK != lhErr) goto error; pRecogObjects->sampleFreq = pAcModInfo->sampleFrequency; pRecogObjects->frameShiftSamples = pAcModInfo->frameShift * pRecogObjects->sampleFreq/1000; // Create a SingleThreadRec object lhErr = lh_CreateSingleThreadRec(pRecogObjects->hCompAsr, &EventInterface, pRecogObjects, 3000, pRecogObjects->sampleFreq, pRecogObjects->hAcMod, &pRecogObjects->hRec); if (LH_OK != lhErr) goto error; // cretae DDG2P & lexicon for dyn ctx if (pRecogObjects->ddg2p) { int rc = InitDDG2P(pRecogObjects); if (rc<0) goto error; } else if (pRecogObjects->clc) { int rc = InitCLCandDCT(pRecogObjects); if (rc<0) goto error; } else { // TODO: what now? } // Return without errors. return 0; error: // Print an error message if the error comes from the private heap or stream component. // Errors from the VoCon3200 component have been printed by the callback. if (PH_OK != phErr) { printf("Error from the private heap component, error code = %d.\n", phErr); } if (ST_OK != stErr) { printf("Error from the stream component, error code = %d.\n", stErr); } return -1; } static int DestroyRecognizer(RECOG_OBJECTS *pRecogObjects) { unsigned int curCtx; if (!lh_ObjIsNull(pRecogObjects->hUswResult)){ lh_ObjClose(&pRecogObjects->hUswResult); pRecogObjects->hUswResult = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hUswCtx)){ lh_ObjClose(&pRecogObjects->hUswCtx); pRecogObjects->hUswCtx = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hDct)){ lh_ObjClose(&pRecogObjects->hDct); pRecogObjects->hDct = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hCache)){ lh_ObjClose(&pRecogObjects->hCache); pRecogObjects->hCache = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hClc)){ lh_ObjClose(&pRecogObjects->hClc); pRecogObjects->hClc = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hLex)){ lh_LexClearG2P(pRecogObjects->hLex); lh_ObjClose(&pRecogObjects->hLex); pRecogObjects->hLex = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hDdg2p)){ lh_DDG2PClearDct (pRecogObjects->hDdg2p); lh_ObjClose(&pRecogObjects->hDdg2p); pRecogObjects->hDdg2p = lh_GetNullObj(); } for(curCtx=0; curCtx<sizeof(recogObjects.hCtx)/sizeof(recogObjects.hCtx[0]); curCtx++){ if (!lh_ObjIsNull(pRecogObjects->hCtx[curCtx])){ lh_RecRemoveCtx(pRecogObjects->hRec, pRecogObjects->hCtx[curCtx]); lh_ObjClose(&pRecogObjects->hCtx[curCtx]); pRecogObjects->hCtx[curCtx] = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hResults[curCtx])){ lh_ObjClose(&pRecogObjects->hResults[curCtx]); pRecogObjects->hResults[curCtx] = lh_GetNullObj(); } } if (!lh_ObjIsNull(pRecogObjects->hRec)){ lh_ObjClose(&pRecogObjects->hRec); pRecogObjects->hRec = lh_GetNullObj(); } if (!lh_ObjIsNull(pRecogObjects->hAcMod)){ lh_ObjClose(&pRecogObjects->hAcMod); pRecogObjects->hAcMod = lh_GetNullObj(); } return 0; } int ASR_Initialize(const char * acmod, const char * ddg2p, const char * clc, const char * dct, const char * dynctx) { int rc = 0; size_t curCtx; LH_HEAP_INTERFACE HeapInterface; // Initialization of all handles. 
recogObjects.pHeapInst = NULL; recogObjects.hCompBase = lh_GetNullComponent(); recogObjects.hCompAsr = lh_GetNullComponent(); recogObjects.hCompPron = lh_GetNullComponent(); recogObjects.hAcMod = lh_GetNullObj(); for(curCtx=0; curCtx<sizeof(recogObjects.hCtx)/sizeof(recogObjects.hCtx[0]); curCtx++){ recogObjects.hCtx[curCtx] = lh_GetNullObj(); recogObjects.hResults[curCtx] = lh_GetNullObj(); } recogObjects.hRec = lh_GetNullObj(); recogObjects.hLex = lh_GetNullObj(); recogObjects.hDdg2p = lh_GetNullObj(); recogObjects.hClc = lh_GetNullObj(); recogObjects.hCache = lh_GetNullObj(); recogObjects.hDct = lh_GetNullObj(); recogObjects.hUswCtx = lh_GetNullObj(); recogObjects.hUswResult = lh_GetNullObj(); recogObjects.sampleFreq = 0; recogObjects.requestCancel = 0; recogObjects.pTerminals = NULL; recogObjects.terminals_count= 0; recogObjects.pSlots = NULL; recogObjects.slots_count = 0; recogObjects.staticCtxID = NULL; // Construct all components and objects needed for recognition. // Connect the audiochain objects. if (acmod) { // initialize components // Create a base and an ASR component. (+pron for dyn ctx) if(LH_OK != lh_InitBase(&HeapInterface, recogObjects.pHeapInst, LhErrorCallBack, NULL, &recogObjects.hCompBase)) goto error; if(LH_OK != lh_InitAsr(recogObjects.hCompBase, &HeapInterface, recogObjects.pHeapInst, &recogObjects.hCompAsr)) goto error; if(LH_OK != lh_InitPron(recogObjects.hCompBase, &HeapInterface, recogObjects.pHeapInst, &recogObjects.hCompPron)) goto error; rc = ConstructRecognizer(&recogObjects, acmod, ddg2p, clc, dct, dynctx); if (rc<0) goto error; } return rc; error: // An error occured. Close the engine. CloseOnError(&recogObjects); return -1; } int ASR_UnInitialize(void) { int rc; // Disconnects the audiochain objects. // Closes all objects and components of the vocon recognizer. rc = DestroyRecognizer(&recogObjects); // Close the PRON component. lh_ComponentTerminate(&recogObjects.hCompPron); // Close the ASR and Base component. lh_ComponentTerminate(&recogObjects.hCompAsr); lh_ComponentTerminate(&recogObjects.hCompBase); return 0; } int evt_HandleEvent(void *pEvtInst, unsigned long type, LH_TIME timeMs) { RECOG_OBJECTS *pRecogObjects = (RECOG_OBJECTS*)pEvtInst; if ( type & LH_AUDIOCHAIN_EVENT_BOS ){ // ask upper level for beep printf ("Receiving event LH_AUDIOCHAIN_EVENT_BOS at time %d ms.\n", timeMs); } if ( type & LH_AUDIOCHAIN_EVENT_TS_FX ) { printf ("Receiving event LH_AUDIOCHAIN_EVENT_TS_FX at time %d ms.\n", timeMs); } if ( type & LH_AUDIOCHAIN_EVENT_TS_REC ) { printf ("Receiving event LH_AUDIOCHAIN_EVENT_TS_REC at time %d ms.\n", timeMs); } if ( type & LH_AUDIOCHAIN_EVENT_FX_ABNORMCOND ) { LH_ERROR lhErr = LH_OK; LH_FX_ABNORMCOND abnormCondition; printf ("Receiving event LH_AUDIOCHAIN_EVENT_FX_ABNORMCOND at time %d ms.\n", timeMs); // Find out what the exact abnormal condition is. 
lhErr = lh_FxGetAbnormCondition(pRecogObjects->hRec, &abnormCondition); if (LH_OK != lhErr) goto error; switch (abnormCondition) { case LH_FX_BADSNR: printf ("Abnormal condition: LH_FX_BADSNR.\n"); break; case LH_FX_OVERLOAD: printf ("Abnormal condition: LH_FX_OVERLOAD.\n"); break; case LH_FX_TOOQUIET: printf ("Abnormal condition: LH_FX_TOOQUIET.\n"); break; case LH_FX_NOSIGNAL: printf ("Abnormal condition: LH_FX_NOSIGNAL.\n"); break; case LH_FX_POORMIC: printf ("Abnormal condition: LH_FX_POORMIC.\n"); break; case LH_FX_NOLEADINGSILENCE: printf ("Abnormal condition: LH_FX_NOLEADINGSILENCE.\n"); break; } } // LH_AUDIOCHAIN_EVENT_FX_TIMER // It usually is used to get the signal level and SNR at regular intervals. if ( type & LH_AUDIOCHAIN_EVENT_FX_TIMER ) { LH_ERROR lhErr = LH_OK; LH_FX_SIGNAL_LEVELS SignalLevels; printf ("Receiving event LH_AUDIOCHAIN_EVENT_FX_TIMER at time %d ms.\n", timeMs); lhErr = lh_FxGetSignalLevels(pRecogObjects->hRec, &SignalLevels); if (LH_OK != lhErr) goto error; printf ("Signal level: %ddB, SNR: %ddB at time %dms.\n", SignalLevels.energy, SignalLevels.SNR, SignalLevels.timeMs); } // LH_AUDIOCHAIN_EVENT_RESULT if ( type & LH_AUDIOCHAIN_EVENT_RESULT ){ LH_ERROR lhErr = LH_OK; LH_OBJECT hNBestRes = lh_GetNullObj(); LH_OBJECT hCtx = lh_GetNullObj(); printf ("Receiving event LH_AUDIOCHAIN_EVENT_RESULT at time %d ms.\n", timeMs); // Get the NBest result object and process it. lhErr = lh_RecCreateResult (pRecogObjects->hRec, &hNBestRes); if (LH_OK == lhErr) { if (LH_OK == lh_ResultBorrowSourceCtx(hNBestRes, &hCtx)){ int i; int _ready = 0; for(i=0; i<sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); i++){ if(!lh_ObjIsNull(pRecogObjects->hCtx[i])){ if(hCtx.pObj == pRecogObjects->hCtx[i].pObj){ if(!lh_ObjIsNull(pRecogObjects->hResults[i])){ lh_ObjClose(&pRecogObjects->hResults[i]); } pRecogObjects->hResults[i] = hNBestRes; hNBestRes = lh_GetNullObj(); _ready = 1; break; } } else { break; } } if (_ready) { for (i=0; i<sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); i++) { if(!lh_ObjIsNull(pRecogObjects->hCtx[i])){ if(lh_ObjIsNull(pRecogObjects->hResults[i])){ _ready = 0; } } } } ASSERT(lh_ObjIsNull(hNBestRes)); if (_ready) { ProcessResult (pRecogObjects); for(i=0; i<sizeof(pRecogObjects->hResults)/sizeof(pRecogObjects->hResults[0]); i++){ if(!lh_ObjIsNull(pRecogObjects->hResults[i])){ lh_ObjClose(&pRecogObjects->hResults[i]); } } } } // Close the NBest result object. } } return 0; error: return -1; } static int ProcessResult (RECOG_OBJECTS *pRecogObjects) { LH_ERROR lhErr = LH_OK; size_t curCtx, i, k, count=0; size_t nbrHypothesis; ASRResult *r = NULL; long lid; // get total hyp count for(curCtx=0; curCtx<sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); curCtx++){ if(!lh_ObjIsNull(pRecogObjects->hResults[curCtx])){ if(LH_OK == lh_NBestResultGetNbrHypotheses (pRecogObjects->hResults[curCtx], &nbrHypothesis)){ count += nbrHypothesis; } } } // traces printf ("\n"); printf (" __________RESULT %3d items max_______________\n", count); printf ("| | |\n"); printf ("| result | confi- | result string [start rule]\n"); printf ("| number | dence |\n"); printf ("|________|________|___________________________\n"); printf ("| | |\n"); if (count>0) { r = ASRResult_New(count); // Get & print out the result information for each hypothesis. 
count = 0; curCtx = sizeof(pRecogObjects->hCtx)/sizeof(pRecogObjects->hCtx[0]); for(; curCtx>0; curCtx--){ LH_OBJECT hNBestRes = pRecogObjects->hResults[curCtx-1]; if(!lh_ObjIsNull(hNBestRes)){ LH_HYPOTHESIS *pHypothesis; if(LH_OK == lh_NBestResultGetNbrHypotheses (hNBestRes, &nbrHypothesis)){ for (i = 0; i < nbrHypothesis; i++) { char *szResultWords; // Retrieve information on the recognition result. if (LH_OK == lh_NBestResultFetchHypothesis (hNBestRes, i, &pHypothesis)){ // Get the result string. if (LH_OK == lh_NBestResultFetchWords (hNBestRes, i, &szResultWords)){ printf ("| %6lu | %6lu | '%s' [%s]\n", i, pHypothesis->conf, szResultWords, pHypothesis->szStartRule); // Return the fetched data to the engine. lh_NBestResultReturnWords (hNBestRes, szResultWords); } lh_NBestResultReturnHypothesis (hNBestRes, pHypothesis); } } } } } } // traces printf ("|________|________|___________________________\n"); printf ("\n"); return 0; } 



As with TTS, the code is verbose and consists mostly of setup. But it works! You feed audio into the recognizer and, for each loaded context, get back a list of hypotheses with confidence scores; what to do with the "recognized" text is then up to the application.
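
For symmetry with the TTS example, here is roughly how the wrappers above could be used. The data file names are placeholders, and StartRecognition() is a hypothetical helper standing in for the audio-feeding and context-loading code that the listing does not show.

// Hypothetical usage of the ASR wrappers (not part of the original listing).
int main()
{
    // Paths to the acoustic model and linguistic data are placeholders.
    if (ASR_Initialize("ru_RU.acmod", "ddg2p.dat", "clc.dat", "dct.dat", "dynctx.dat") < 0)
        return 1;

    // StartRecognition() would load a compiled context (grammar), start the
    // audio chain and block until evt_HandleEvent() receives
    // LH_AUDIOCHAIN_EVENT_RESULT, after which ProcessResult() prints the N-best list.
    // StartRecognition("<compiled grammar>");

    ASR_UnInitialize();
    return 0;
}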

14. Dictation (stream recognition)


The latest word in this field is streaming recognition, or dictation. It is already available on modern Android and iOS smartphones, including in the form of an API. Here the programmer no longer needs to define the recognition context by writing grammars: speech goes in, recognized words come out. Unfortunately, the details of how this works are not yet available to me; the recognition runs not on the device itself but on a server, to which the voice is sent and from which the result comes back. One would like to believe, though, that in a few years the technology will be available on the client side as well.

Conclusion


That is probably all I wanted to say about ASR and TTS technologies. I hope it was informative and not too boring. In any case, questions are welcome.

Source: https://habr.com/ru/post/264531/

