
In search of an optimal closed-source audio speech recognition system with an open API for integration

Instead of an introduction


I decided to adapt a small report I wrote back when I was still a student. Time has passed and, as they say, progress does not stand still: speech recognition technologies are evolving rapidly; some appear, some disappear. Here I present the best-known speech engines that a developer can use in a product under a license agreement. Comments and additions are welcome.

Contents:


1. Search for and analysis of the optimal color space for picking out objects of a given class in images
2. Determination of the dominant classification features and development of a mathematical model of facial expressions
3. Synthesis of an optimal facial recognition algorithm
4. Implementation and testing of the facial recognition algorithm
5. Creation of a test database of images of users' lips in various states to increase the accuracy of the system
6. Search for the best open-source audio speech recognition system
7. Search for an optimal closed-source audio speech recognition system with an open API for integration
8. Experiment on integrating the video extension into an audio speech recognition system, with a test report

Goals:


Determine the optimal closed-source audio speech recognition system (speech engine), that is, one whose license does not meet the definition of open-source software.

Tasks:


Identify audio speech recognition systems that fall under the concept of closed source code. Consider the best-known voice-to-text conversion systems with a view to integrating a video module into the most suitable voice library, one that exposes an open API for this operation. Draw conclusions about the feasibility of using closed-source audio speech recognition systems for our goals and objectives.

Introduction


Implementing one's own speech recognition system is a very complex, time-consuming, and resource-intensive task that is beyond the scope of this work. The plan is therefore to integrate the video identification technology presented here into speech recognition systems that provide facilities for this. Since closed-source speech recognition systems are generally implemented to a higher standard and achieve higher recognition accuracy, integrating our video work into them should be considered a more promising direction than open-source audio speech recognition systems. Keep in mind, however, that closed-source systems often lack adequate documentation for integrating third-party solutions, or this capability is paid, i.e., you need to buy a special license to use the licensor's speech technologies.

Closed source code (Proprietary software)


As for the definition of closed source code: it means that only binary (compiled) versions of the program are distributed, and the license implies no access to the program's source code, which makes it difficult to create modifications. Access to the source code is usually granted to third parties only upon signing a non-disclosure agreement [1].
Closed-source software is proprietary software. Bear in mind, however, that the phrase "closed source code" can be interpreted in different ways: it may refer to licenses under which the source code of a program is simply unavailable, but if treated as the antonym of open source, it refers to software that does not meet the definition of an open-source license, which is a slightly different meaning. One such contested question was how to interpret the concept of the application programming interface.

Application Programming Interface (API)


The definition of an API for closed-source programs took shape on March 24, 2004, as the result of a lawsuit, on the basis of a decision of the European Commission. API stands for application programming interface: a set of ready-made classes, procedures, functions, structures, and constants provided by an application (library, service) for use in external software products. Programmers use it to write all kinds of applications.
The API defines the functionality that a program (module, library) provides, while allowing you to abstract away from exactly how that functionality is implemented.
If a program (module, library) is viewed as a black box, then the API is the set of "handles" available to the user of that box, which he can turn and pull.
Software components interact with one another through APIs. The components usually form a hierarchy: high-level components use the APIs of low-level components, which in turn use the APIs of components at an even lower level.
Data transfer protocols on the Internet are built on this principle. The standard protocol stack (the OSI network model) contains 7 layers, from the physical bit-transfer layer to the application protocol layer (protocols like HTTP and IMAP). Each layer uses the functionality of the layer below and, in turn, provides the necessary functionality to the layer above.
It is important to note that the concept of a protocol is close in meaning to the concept of an API. Both are abstractions of functionality; in the first case it is a matter of data transmission, in the second of application interaction [2].
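The layered "black box" picture above can be sketched in a few lines of Python. All class and method names here are purely illustrative (not taken from any real protocol stack or speech library): each component exposes a small public API and uses only the public API of the layer below it.

```python
# Illustrative sketch of components interacting only through public APIs.
# The caller sees the "handles" (methods), never the implementation.

class TransportLayer:
    """Low-level component: moves raw bytes."""
    def send_bytes(self, data: bytes) -> int:
        # Implementation detail hidden from the layers above.
        return len(data)

class SessionLayer:
    """Mid-level component: frames messages using the transport's API."""
    def __init__(self, transport: TransportLayer):
        self._transport = transport
    def send_message(self, text: str) -> int:
        payload = text.encode("utf-8")
        # Uses only the transport's public API, not its internals.
        return self._transport.send_bytes(payload)

class Application:
    """High-level component: uses the session's API."""
    def __init__(self, session: SessionLayer):
        self._session = session
    def say(self, phrase: str) -> int:
        return self._session.send_message(phrase)

app = Application(SessionLayer(TransportLayer()))
print(app.say("hello"))  # 5 (bytes sent)
```

Swapping `TransportLayer` for a different implementation would not affect the upper layers, which is exactly the abstraction the API provides.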

Dragon Mobile SDK


The toolkit itself is called NDEV. To obtain the necessary code and documentation, you need to register on the site under the "cooperation program". Site:
dragonmobile.nuancemobiledeveloper.com/public/index.php [5].
The toolkit (SDK) contains the components of both the client and the server. The diagram illustrates their interaction at the top level:
image
Fig. 1. The principle of operation of the technology Dragon Mobile SDK

The Dragon Mobile SDK bundle consists of various code samples and project templates, documentation, and a software platform (framework) that simplifies the integration of speech services into any application.

The Speech Kit framework allows you to quickly and easily add speech recognition and synthesis (TTS, Text-to-Speech) services to your applications. This platform also provides access to speech processing components located on the server through asynchronous “clean” network APIs, minimizing overhead and resource consumption.

The Speech Kit platform is a full-featured high-level framework that automatically manages all low-level services.
image
Fig. 2. Speech Kit Architecture

The platform runs several processes in concert:
1. It takes full control of the audio system for recording and playback.
2. The network component manages connections to the server and automatically re-establishes connections that have timed out with each new request.
3. The end-of-speech detector determines when the user has finished speaking and automatically stops recording if necessary.
4. The encoding component compresses and decompresses the streaming audio, reducing bandwidth requirements and average latency.
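The end-of-speech detector from step 3 can be illustrated with a simple energy-based scheme: assume speech has ended once the short-term energy of the signal stays below a threshold for enough consecutive frames. This is only a sketch under that assumption; the function and thresholds are illustrative, not Speech Kit's actual algorithm.

```python
# Illustrative energy-based endpointing: real detectors are more sophisticated.

def speech_ended(frame_energies, energy_threshold=0.01, silent_frames_needed=5):
    """frame_energies: per-frame RMS energies. Returns True once
    `silent_frames_needed` consecutive frames fall below the threshold."""
    silent_run = 0
    for energy in frame_energies:
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= silent_frames_needed:
                return True   # long enough pause: stop recording
        else:
            silent_run = 0    # speech resumed: reset the counter
    return False

# Speech followed by a long pause -> detector fires.
print(speech_ended([0.5, 0.4, 0.3, 0.005, 0.004, 0.003, 0.002, 0.001]))  # True
# Speech with only brief pauses -> keep recording.
print(speech_ended([0.5, 0.005, 0.4, 0.004, 0.3]))  # False
```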

The server system is responsible for most of the operations involved in the speech processing cycle. The process of recognition or speech synthesis is performed entirely on the server, processing or synthesizing the audio stream. In addition, the server performs authentication in accordance with the configuration of the developer.

The Speech Kit platform is a network service and requires some basic setup before you use the recognition or speech synthesis classes.

This setup performs two basic operations:
First, it identifies and authorizes your application.
Second, it establishes a connection to the speech server, which enables fast voice-processing requests and therefore improves the user experience.

Speech recognition

Recognition technology lets users dictate instead of typing wherever text input is required. The speech recognizer provides a list of text results. It is not tied to any user interface (UI) object, so choosing the most appropriate result and offering alternative results is left to each application's UI.

image
Fig. 3. The process of speech recognition

In our Android application we managed to integrate the solution from the Dragon Mobile SDK. This pioneer of the speech recognition industry showed excellent results, especially for English. However, its major drawback is the limited free tier: only 10,000 requests per day, which very soon became insufficient for our application. Broader access is paid.

Google Speech Recognition API


image
Fig. 4. Google Voice Search logo

This Google product lets you perform voice search using speech recognition technology. It is integrated into mobile phones and computers, where you can enter information by voice. On June 14, 2011, Google announced the integration of the speech engine into Google Search, and it has worked stably ever since. On personal computers the technology is supported only by the Google Chrome browser. The feature is enabled by default in dev-channel builds but can also be enabled manually with a command-line flag. There is also a voice control function for issuing voice commands on Android phones.

Initially, Google Voice Search supported short search queries of 35-40 words, and you had to send a request to turn the microphone on and off, which was not very natural to use (this function still remains in the Google search bar; you only need to click the microphone). At the end of February 2013, however, continuous speech recognition was added to the Chrome browser, and Google Voice Search effectively turned into Speech Input (you can try the technology by dictating in Google Translate). Full documentation is available online. Note that while many developers used to sin by tapping Google's Speech API recognition channel illegally through various tricks, since May 2014, amid frequent API changes, access to the API has effectively been legalized: to use the speech recognition system it is enough to register a Google Developers account, after which you can work with the system entirely within the legal field.

Voice Search ships with the following default services: Google, Wikipedia, YouTube, Bing, Yahoo, DuckDuckGo, Wolfram|Alpha, and others. You can also add your own search engines. The extension also adds a voice input button to all sites that use HTML5 search forms. The extension requires a microphone. Speech input is still quite experimental, so don't be surprised if it doesn't work [3].

To use Google Voice Search technology, you do the following:
Send a POST request to the endpoint (it currently changes frequently; in May alone there were three changes, so be prepared for this) with audio data in FLAC or Speex format. We implemented a demonstration of WAVE-file recognition in C#. We noticed no limit on the number of requests per day. There was a suspected cap of 10,000, as in many other speech recognition systems, but we established experimentally that this value can be exceeded daily.
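As a rough illustration of such a request, here is a minimal Python sketch that only constructs the POST (it does not send anything). The URL, query parameters, and the YOUR_DEVELOPER_KEY placeholder are assumptions based on the historically documented v2 endpoint; as noted above, the address changed frequently, so verify the current one before use.

```python
import urllib.request

# Assumed historical endpoint and parameters; both may be out of date.
API_URL = "https://www.google.com/speech-api/v2/recognize"
PARAMS = "output=json&lang=en-us&key=YOUR_DEVELOPER_KEY"  # key from a Google Developers account

def build_recognize_request(flac_bytes: bytes, sample_rate: int = 16000):
    """Construct (but do not send) the POST request described above:
    FLAC audio in the body, format and sample rate in the Content-Type."""
    return urllib.request.Request(
        url=f"{API_URL}?{PARAMS}",
        data=flac_bytes,
        headers={"Content-Type": f"audio/x-flac; rate={sample_rate}"},
        method="POST",
    )

req = build_recognize_request(b"fLaC...")  # stand-in for real FLAC bytes
print(req.get_method())  # POST
# Sending would be urllib.request.urlopen(req); the JSON response carries
# transcript hypotheses with confidence scores.
```

Note that `urllib.request` normalizes header names internally, so the header is retrievable as `req.get_header("Content-type")`.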

I will not dwell on how this technology works; there are plenty of articles on the web, including on Habr. I will only note that speech recognition systems operate on almost the same principle presented above using Nuance as the example.

Yandex Speech Kit


image
Fig. 5. Yandex Speech Kit logo

I should note right away that I did not work with this library personally; I can only relate the experience of a programmer who worked with us. He said the documentation was very hard to digest, and the system has a limit of 10,000 requests per day, so in the end we did not use Yandex's engine. According to its developers, though, this toolkit is number one for the Russian language, and the company's research teams (one working in Switzerland, the other in Moscow) achieved a technological breakthrough in this area. Still, entering the international market with such a solution is rather difficult; as Grigory Bakunov put it, "much in the field of speech recognition, from the patenting point of view, belongs to the well-known Nuance, and Yandex was one of the last to manage to jump onto the departing train of speech recognition systems."

Brief description of the technology: api.yandex.ru/speechkit/
Documentation for Android: api.yandex.ru/speechkit/generated/android/html/index.html
Documentation for iOS: api.yandex.ru/speechkit/generated/ios/html/index.html

You can download the library on the Yandex Technologies portal: api.yandex.ru/speechkit/downloads/

Microsoft Speech API


image
Fig. 6. Logo Microsoft Speech API

Microsoft has also recently begun to develop speech technology actively, especially after announcing the Cortana voice assistant and developing automatic simultaneous interpretation from English to German and back for Skype.
At the moment there are 4 use cases :

1. Windows and Windows Server 2008. You can add speech capabilities to Windows applications using managed or native code that calls the API and drives the speech engine built into Windows and Windows Server 2008.
2. Speech Platforms. Embedding the platform into applications that use redistributable packages from Microsoft (language packs with speech recognition or text-to-speech tools).
3. Embedded. Built-in solutions that let a person interact with devices by voice commands; for example, voice control of a Ford car running Windows Automotive.
4. Services. Developing applications with speech features that run in real time, freeing you from building, maintaining, and upgrading a speech infrastructure of your own.

Microsoft Speech Platform (there is an SDK)

After installation, help is available along the installation path.
You also need to install the Runtime (link),
as well as the Runtime Languages (version 11). That is, for each language you need to download and install a dictionary. I saw two versions of the dictionary: for English and for Russian.

System Requirements (for SDK)
OS support
Windows 7, Windows Server 2008, Windows Server 2008 R2, Windows Vista

Development and support
• Windows Vista or later
• Windows 2003 Server or later
• Windows 2008 Server or later

Deployment supported on:
• Windows 2003 Server or later
• Windows 2008 Server or later
Pros:
1) Ready technology, take it and use it! (there is an SDK)
2) Support from Microsoft

Minuses:
1) no differentiation from potential competitors
2) as far as I understand, it can be deployed only on Windows Server (Windows Server 2003, Windows Server 2008 or later)
3) development for Windows 8 has not been announced; only Windows 7 and earlier versions of Windows are available so far

Using Microsoft Speech API 5.1 for speech synthesis and recognition

Article how to work with API

Installation (for Windows XP only): as I understand it, Speech API 5.1 is now included in the Microsoft Speech Platform (v11), so it makes sense to read the article.

Examples of projects for working with Microsoft Speech API

MSDN links about working with Microsoft.Speech.dll

1. How to start working with speech recognition system (Microsoft.Speech)

2. Speech recognition speech engine. Download grammar. Methods

Examples:

C #, Talk with a computer or System.Speech
A brief article on how to use System.Speech. The author points to the need for an English version of Windows Vista or 7.

Speech Recognition with C # - Dictation and User Grammar
A tutorial on using Microsoft's system classes for audio recognition tasks (voice to text); the author has also posted on his blog about the inverse, text-to-speech problem.
The tutorial project (WinForms) builds and runs. It recognizes a 20-second interval and also performs narrow-dictionary recognition for launching software: Choices("Calculator", "Notepad", "Internet Explorer", "Paint"). If you say the phrase "start calculator", the corresponding program is launched.
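The narrow command grammar described above (the original uses .NET's Choices class in C#) can be sketched in Python. The function and names below are illustrative, and recognition itself is stubbed out; the point is how a small fixed phrase list maps to actions, which is what makes narrow-dictionary recognition far more accurate than open dictation.

```python
# Illustrative command-grammar matcher: only "start <known program>" phrases
# from a small fixed vocabulary are accepted, everything else is rejected.

COMMANDS = {"calculator", "notepad", "internet explorer", "paint"}

def handle_utterance(text: str):
    """Return the program to launch, or None if the phrase is not
    'start <known program>'."""
    phrase = text.lower().strip()
    if not phrase.startswith("start "):
        return None
    target = phrase[len("start "):]
    return target if target in COMMANDS else None

print(handle_utterance("Start Calculator"))  # calculator
print(handle_utterance("start the music"))   # None (outside the grammar)
```

In the real tutorial the launch step would call the OS to start the matched program; here the function just returns its name.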

C # Speech to Text
Client on WPF.
The purpose of this article is to give you a small idea of the system's capabilities and to examine in detail how the speech engine classes work. All the MSDN documentation can also be found there.

Creating grammar in .NET

Examples of working with the GrammarBuilder class
The Microsoft .NET Speech API lets you quickly and easily create applications that benefit from the work of a Microsoft research center specializing in speech recognition. You can build grammars and forms for your tasks; this article shows an example of how to implement all of this in the C# programming language.

Speech for Windows Phone 8

Here we consider aspects of audio recognition programming under Windows Phone 8.

Conclusion


Thus, having reviewed the most common closed-source speech recognition systems, we note that, judging by its data library, the product based on Dragon NaturallySpeaking should be considered the most accurate. It is the best fit for recognition tasks involving our visual mobile extension (it has good documentation and a simple API for embedding). However, this toolkit has a very complex system of licenses, procedures, and rules governing the use of the technology, which makes implementing a custom product on the Dragon Mobile SDK difficult.

Therefore, for our goals and objectives, the more correct choice is Google's speech tools, which are easier to embed and faster thanks to greater computing power compared with the Dragon Mobile SDK. Another advantage of Google's speech recognition was the absence of a limit on the number of requests per day (many closed-source systems cap it at 10,000). The company has also begun actively developing its speech engine for use under a license agreement. As of May 2014 the corporation's API began changing frequently again, and to keep up with the process you need a Google Developers account.

The great advantage of closed-source recognition systems (with an open API for developers) over open-source audio speech recognition systems is their high accuracy (thanks to huge database libraries) and recognition speed, so using them to solve our task is the right direction.

Bibliography


1) Frequently Asked Questions (and Answers) about Copyright: www.chillingeffects.org/copyright/faq.cgi#QID805
2) Stoughton, Nick (April 2005). Update on Standards (PDF). USENIX. Retrieved 2009-06-04.
3) Kai Fu Li, Speech Input API Specification. Editor's Draft 18 October 2010 Latest Editor's Draft: dev.w3.org ... Editors: Satish Sampath, Google Inc. Bjorn Bringert, Google Inc.
4) Voice search in Google Chrome: habrahabr.ru/post/111201
5) Official Dragon Mobile SDK page: dragonmobile.nuancemobiledeveloper.com/public/index.php

To be continued

Source: https://habr.com/ru/post/231629/

