
Finding the best open source audio speech recognition system

Content:


1. Search for and analysis of the optimal color space for detecting salient objects in a given class of images
2. Determination of the dominant classification features and development of a mathematical model of facial expressions
3. Synthesis of an optimal facial recognition algorithm
4. Implementation and testing of the facial recognition algorithm
5. Creation of a test database of images of users' lips in various states, to increase the accuracy of the system
6. Search for the best open-source audio speech recognition system
7. Search for the optimal closed-source audio speech recognition system with an open API, for the possibility of integration
8. An experiment on integrating the video extension into an audio speech recognition system, with a test report

In lieu of an introduction


I decided to omit the story of how the lip database was compiled, which was begun in the previous research paper. I will note that the choice of a database for collecting information, and the way it is administered, should be made individually, depending on the goals and objectives you face, as well as on the available resources and your personal skills. Let us now turn directly to testing the developed algorithm against current open-source speech recognition systems. First, we will analyze the speech engines that have a free license.


Goals:


Determine the optimal open-source audio speech recognition system (speech engine) that can be integrated into the developed system for video detection of the user's lip movements.

Tasks:


• Identify audio speech recognition systems that qualify as open source or public domain.
• Review the best-known voice-to-text conversion systems with a view to integrating a video module into the most suitable speech library.
• Draw conclusions on the feasibility of using open-source audio speech recognition systems for our goals and objectives.

Introduction


Owing to the linguistic features of human speech, additional articulation data can help identify the speaker's speech more accurately and automatically segment the sound wave into separate fragments. Furthermore, joint analysis of the audiovisual speech signal over time opens the prospect of detecting open and closed syllables, voiced and sibilant sounds, stressed and unstressed vowels/consonants, and other speech units. That is why, for high-quality speech recognition, it is extremely important to build a data library that takes these indicators into account together. This direction can only be pursued if there is open access to the language units. Hence, to solve our problem (adding a video extension to increase the accuracy of speech recognition programs), it is essential to consider open-source audio speech recognition systems.
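As a toy illustration of automatically splitting a sound wave into fragments, a simple frame-energy segmenter might look like the sketch below. The frame size and threshold are illustrative assumptions, not values from any of the systems discussed here:

```python
import math

def frame_energies(samples, frame_len=160):
    """Split a signal into fixed-size frames and compute per-frame energy."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def segment_speech(samples, frame_len=160, threshold=0.01):
    """Return (start_frame, end_frame) pairs covering high-energy regions."""
    energies = frame_energies(samples, frame_len)
    segments, start = [], None
    for i, e in enumerate(energies):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

# Synthetic signal: silence, a loud 440 Hz burst, silence again (8 kHz rate).
silence = [0.0] * 320
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(320)]
signal = silence + tone + silence
print(segment_speech(signal))  # → [(2, 4)]: the two frames covering the tone
```

Real recognizers use far more robust features (e.g. MFCCs) for this, but the principle of scanning frame statistics against a threshold is the same.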

Types of licenses


Most modern products are distributed under one of two common types of license:
• Proprietary: the product is the private property of its authors and rightsholders and does not meet the criteria of free software (in particular, the source code is not open). The copyright holder of proprietary software retains a monopoly on its use, copying and modification, in full or in essential respects. The term usually covers any non-free software, including semi-free software.
• Free (open-source) licenses: the source code of such programs is available for inspection, study and modification, which lets users take part in improving the program itself, reuse the code to create new programs, and fix bugs in them, either by borrowing the source code where the license permits, or by studying the algorithms, data structures, technologies, techniques and interfaces used (since the source code can significantly supplement the documentation, and in its absence serves as documentation itself).

Among the open-source speech recognition systems under consideration, we encountered two license types: BSD and GPL. Let us consider them in more detail.

BSD license


BSD (Berkeley Software Distribution) was a system for distributing software in source-code form, created for the exchange of experience between educational institutions. A distinctive feature of BSD software packages was the special BSD license, which can be summarized briefly as follows: the original source code remains the property of BSD, while all modifications remain the property of their authors.

GPL license


GPL, the GNU General Public License (sometimes rendered as the GNU general public license or the GNU open license agreement), is a free software license created under the GNU project in 1989. It is abbreviated GNU GPL, or simply GPL when it is clear from context that this particular license is meant (there are quite a few other licenses containing the words "general public license" in their titles). The second version of the license was released in 1991; the third, after many years of work and long discussion, in 2007. The GNU Lesser General Public License (LGPL) is a weakened version of the GPL intended for certain software libraries. The GNU Affero General Public License is a strengthened version of the GPL for programs designed to be accessed over a network [1].

The purpose of the GNU GPL is to grant the user the rights to copy, modify and distribute programs (including on a commercial basis), which copyright law prohibits by default, and also to guarantee that users of all derivative programs receive the same rights. This principle of "inheritance" of rights is called "copyleft" and was coined by Richard Stallman. By contrast, proprietary software licenses "very rarely grant the user such rights and usually, on the contrary, seek to restrict them, for example by prohibiting recovery of the source code" [2].

By licensing under the terms of the GNU GPL, the author retains the authorship.

Among the most common speech recognition systems, two license types occur. Under the BSD license are the following audio speech recognition products: CMU Sphinx and Julius. Under the GPL and GPL-like licenses are: the Simon software, iATROS, RWTH ASR (under a QPL-derived license, the Q Public License), SHoUT, and VoxForge (as a corpus, that is, a speech model distributed as a corpus). Let us consider them in more detail:
Fig. 1. CMU Sphinx Emblem
CMU Sphinx, also simply called Sphinx for short, was written mainly by the speech recognition group at Carnegie Mellon University. It includes a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain).

In 2000, the Sphinx group at Carnegie Mellon University released its speech recognition components as open source, including Sphinx 2 and, later (in 2001), Sphinx 3. The speech decoders came with acoustic models and sample applications. The available resources also included software for acoustic model training, language model compilation, and a public-domain pronunciation dictionary (cmudict).

Sphinx is a continuous-speech recognizer that uses hidden Markov models and an n-gram statistical language model. It was developed by Kai-Fu Lee. Sphinx featured continuous, large-vocabulary, speaker-independent recognition, capabilities that in 1986 were still the subject of great controversy in the speech recognition community. Sphinx is historically notable for surpassing all earlier systems in performance. The system is described in detail in the archived paper [3].
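The n-gram statistical language model mentioned above can be illustrated with a minimal maximum-likelihood bigram sketch. This is purely illustrative (a toy corpus, no smoothing), not Sphinx code:

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigram histories and bigrams over sentences padded with markers."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])           # histories only
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood P(word | prev); 0.0 for unseen histories."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

corpus = [["open", "the", "door"], ["open", "the", "window"]]
uni, bi = train_bigram(corpus)
print(bigram_prob(uni, bi, "open", "the"))   # → 1.0: "open" is always followed by "the"
print(bigram_prob(uni, bi, "the", "door"))   # → 0.5
```

Production systems add smoothing and backoff so that unseen word sequences still receive non-zero probability; the counting scheme, however, is exactly this.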

Sphinx 2 is a fast, performance-oriented speech recognizer, originally developed by Xuedong Huang at Carnegie Mellon University and released as open source under a BSD-style license on SourceForge by Kevin Lenzo at LinuxWorld in 2000. Sphinx 2 focuses on real-time speech recognition and is well suited to building mobile applications. It includes functionality such as end-pointing, partial hypothesis generation, dynamic switching of language models, and so on. It is used in dialogue systems and language-learning systems, and can run on telephony platforms such as Asterisk. The Sphinx 2 code has been incorporated into numerous commercial products. However, Sphinx 2 has not been actively developed for a long time (beyond routine maintenance); the current real-time decoder development has moved to PocketSphinx. The description of the system can be found here.

Sphinx 3 initially used a semi-continuous acoustic model (a shared codebook of Gaussian mixtures is used by all models, each with its own weights over those Gaussians) and later adopted the prevalent continuous HMM representation; it was used primarily for high-accuracy, offline (non-real-time) recognition. Recent developments (in algorithms and software) have brought Sphinx 3 close to real-time operation, although it is still not well suited to critical interactive applications. After a period of active development and reunification with SphinxTrain, Sphinx 3 gained access to a number of modern modeling techniques, such as LDA/MLLT, MLLR and VTLN, that improve recognition accuracy.
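As an illustration of the Gaussian-mixture scoring that such acoustic models perform for every feature frame, here is a minimal, self-contained sketch. It is not Sphinx code, and the component parameters are made up:

```python
import math

def log_gauss(x, mean, var):
    """Log density of vector x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture (stable log-sum-exp)."""
    comps = [math.log(w) + log_gauss(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))

# Two made-up 2-dimensional components standing in for a shared codebook.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
print(gmm_loglik([0.1, -0.2], weights, means, variances))
```

In a semi-continuous model the `means`/`variances` codebook is shared across all HMM states, and only the `weights` differ per state, which is what makes that design compact.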

Sphinx 4 is a complete rewrite of the Sphinx speech engine, whose main goal is to provide a flexible framework for speech recognition research. Sphinx 4 is written entirely in the Java programming language. Sun Microsystems made a significant contribution to its development and provided software engineering expertise for the project. Individual contributors came from MERL, MIT and CMU.
Current development goals include:
• development of new acoustic model training methods;
• implementation of speaker adaptation (MLLR);
• configuration management improvements;
• implementation of ConfDesigner, a graphical configuration tool.

PocketSphinx is a version of Sphinx that can be embedded in other systems, including those built on ARM processors. PocketSphinx is actively developed and incorporates features such as fixed-point arithmetic and efficient algorithms for Gaussian mixture computation.
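The fixed-point arithmetic that makes PocketSphinx viable on processors without floating-point units can be illustrated with a minimal Q15 sketch (this is illustrative, not PocketSphinx code; PocketSphinx's actual fixed-point format is an internal detail):

```python
Q = 15  # Q15 fixed point: 1 sign bit, 15 fractional bits

def to_fix(x):
    """Convert a float in [-1, 1) to a Q15 integer."""
    return int(round(x * (1 << Q)))

def fix_mul(a, b):
    """Multiply two Q15 numbers, rescaling the product back to Q15."""
    return (a * b) >> Q

a, b = to_fix(0.5), to_fix(0.25)
print(fix_mul(a, b) / (1 << Q))  # → 0.125
```

All signal-processing and model-scoring math can be expressed in such integer operations, trading a little precision for large speed gains on embedded hardware.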
Fig. 2. Julius emblem

Julius is a high-performance, large-vocabulary continuous speech recognition (LVCSR) decoder for speech-related research and development. It can perform near-real-time decoding on most current computers with a 60,000-word dictionary, using word trigram language models and context-dependent hidden Markov models. Its main features are full modularity and embeddability: the decoder is independent of model structure, and various HMM types are supported, such as shared-state triphones and tied-mixture models, with any number of mixtures, states or phones. Standard model formats are used, for compatibility with free modeling toolkits. The main platform is Linux and other Unix-like systems; the system also runs on Windows. Julius is open source and distributed under a BSD-style license.

Julius has been developed as free software for Japanese LVCSR research since 1997; the work was continued under the Continuous Speech Recognition Consortium (CSRC) in Japan from 2000 to 2003.

Starting from version 3.4, a grammar-based recognition parser named Julian has been integrated into Julius. Julian is a modified version of Julius that uses a hand-designed deterministic finite-state grammar as its language model. It can be used to build small-vocabulary voice command systems and other spoken dialogue systems.
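The idea of a finite-state grammar as a language model can be sketched in a few lines. This is an illustration of the concept only, not Julian's actual grammar file format:

```python
def make_fsm(transitions, start, accept):
    """Build an acceptor from {(state, word): next_state} transitions."""
    def accepts(words):
        state = start
        for w in words:
            key = (state, w)
            if key not in transitions:
                return False   # word not allowed in this state
            state = transitions[key]
        return state in accept
    return accepts

# Grammar: ("open" | "close") followed by ("door" | "window")
transitions = {
    (0, "open"): 1, (0, "close"): 1,
    (1, "door"): 2, (1, "window"): 2,
}
accepts = make_fsm(transitions, start=0, accept={2})
print(accepts(["open", "window"]))  # → True
print(accepts(["open", "open"]))    # → False
```

Constraining the search to such a grammar is what makes small-vocabulary command recognition fast and robust: hypotheses outside the grammar are never scored at all.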

To run the Julius speech recognizer, you need to provide a language model and an acoustic model for your language. Julius accepts acoustic models in HTK ASCII format, pronunciation dictionaries in HTK format, and word trigram language models in the ARPA standard format (a forward 2-gram and a reverse 3-gram trained on the speech corpus with reversed word order).
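The ARPA format mentioned above is a plain-text listing of n-grams with base-10 log probabilities. A minimal parser sketch (illustrative only; it ignores backoff weights, which real ARPA files also carry) might look like this:

```python
def parse_arpa(text):
    """Parse a minimal ARPA-format n-gram file into {order: {ngram: logprob}}."""
    models, order = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        if line.endswith("-grams:"):
            order = int(line[1])          # e.g. "\2-grams:" -> 2
            models[order] = {}
            continue
        if order is not None:
            parts = line.split()
            logprob = float(parts[0])     # base-10 log probability
            ngram = tuple(parts[1:1 + order])
            models[order][ngram] = logprob
    return models

sample = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.30103 the
-0.60206 door

\\2-grams:
-0.30103 the door
\\end\\
"""
lm = parse_arpa(sample)
print(lm[2][("the", "door")])  # → -0.30103
```

Because the format is so simple, the same language model file can be shared between Julius, Sphinx and other decoders.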

Although Julius is distributed only with a Japanese language model, the VoxForge project is working on an acoustic model for English for use with the Julius speech recognition engine.

Fig. 3. RWTH ASR logo

RWTH ASR (RASR for short) is an open-source speech recognition toolkit. It contains the speech recognition technology needed to build automatic speech recognition systems, and is developed by the Human Language Technology and Pattern Recognition Group at RWTH Aachen University.

RWTH ASR includes tools for acoustic model development and decoders, as well as components for speaker adaptation, speaker-adaptive training, unsupervised training, discriminative training, and word lattice processing [4]. The software runs on Linux and Mac OS X. The project's home page offers ready-to-use models for research tasks, training recipes and extensive documentation.

The toolkit is published under an open-source license called the "RWTH ASR license", derived from the Q Public License (QPL). This license permits free use, including redistribution and modification, for non-commercial purposes.

Fig. 4. Simon emblem

Simon is a speech recognition system built on the Julius and HTK speech engines. The Simon system is designed to work comfortably with different languages and various dialects. The reaction to recognized speech is fully customizable: Simon is not limited to recognizing single voice commands and can be configured to the needs of its users.

For convenient use, the system runs predefined "scenarios": Simon packages configured for specific tasks. Available scenarios include, for example, "Firefox" (launching and controlling the Firefox browser) or the "window management pack" (closing, moving and resizing windows), and so on. Scenarios can easily be created by users and shared with the community through the Get Hot New Stuff system. At present, more than 39 scenarios in 3 languages have been published in the opendesktop.org repository.

Simon also supports generic GPL base models from VoxForge, available for English, German and Portuguese, so there is no need to train the system before it starts working. A demonstration of Simon 0.3.0 can be found at the link (http://www.youtube.com/watch?v=bjJCl72f-Gs). The speech in the demonstration includes technical terminology, which is the main feature of this deployment; it shows how Simon can be used and how users can integrate it to enhance their own developments [5].

Fig. 5. iATROS emblem
iATROS is a new implementation of the previous-generation ATROS recognition system, suitable for both speech recognition and handwritten text recognition. iATROS has a modular structure that can be used to build different recognition systems whose core performs a Viterbi search over hidden Markov models. iATROS provides standard recognition tools for both offline and online operation (the latter based on ALSA modules).
iATROS consists of two preprocessing modules (for the speech signal and for handwritten images) and a recognition core. The preprocessing and feature extraction modules supply feature vectors to the recognition core, which uses hidden Markov models and language models to search for the best recognition hypotheses. All modules are written in the C programming language [6].
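The Viterbi search at the heart of such HMM-based recognizers can be sketched with a toy discrete HMM (this is an illustration of the algorithm, not iATROS code; the states and probabilities are made up):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence
    for an observation sequence under a discrete HMM."""
    # Each layer maps state -> (best probability so far, best path so far).
    layer = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        nxt = {}
        for s in states:
            prob, path = max(
                (layer[prev][0] * trans_p[prev][s] * emit_p[s][o],
                 layer[prev][1])
                for prev in states)
            nxt[s] = (prob, path + [s])
        layer = nxt
    return max(layer.values())

# Toy HMM: two phone-like states emitting coarse "acoustic" symbols.
states = ["sil", "speech"]
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.2, "loud": 0.8}}
prob, path = viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p)
print(path)  # → ['sil', 'speech', 'speech']
```

Real decoders run this dynamic program in the log domain over millions of states, with beam pruning to keep only promising hypotheses, but the recurrence is the same.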
Fig. 6. SHoUt emblem
SHoUT is a toolkit written by Marijn Huijbregts at the Netherlands Institute for Sound and Vision, suitable for recognizing long recordings with a large vocabulary. The toolkit consists of applications for training statistical models, for speech/non-speech detection, for speaker diarization, and for speech decoding [7].
Fig. 7. VoxForge emblem
VoxForge is a free speech corpus and acoustic model repository, available as open source. VoxForge was set up to collect transcribed speech under the GPL for use with open-source speech engines. The speech audio files can be compiled into acoustic models for use with open-source speech recognition systems such as Julius, Sphinx and HTK (note that HTK has distribution restrictions).
Fig. 8. HTK Emblem
HTK is a toolkit for building recognizers based on hidden Markov models. It is intended mainly for speech recognition, but it is also used for other recognition applications that employ hidden Markov models (including speech synthesis, character recognition and DNA sequencing).

Conclusion



Thus, having reviewed the most common open-source speech recognition systems, we note that the CMU Sphinx family, and PocketSphinx in particular, is the most mature; it is the best fit for recognition tasks combined with our visual extension. It should be noted, however, that this toolkit ships with comparatively small acoustic and language databases, so its recognition accuracy is considerably lower than that of audio speech recognition systems under proprietary licenses. Using open-source speech recognition systems, and some of their tools, may well prove worthwhile when implementing our own audiovisual speech recognition as a single multimodal system; at the present stage, however, such systems are of limited use for our purposes.

Bibliography


1) License type definitions from the Free Software Foundation: www.fsf.org/news/agplv3-pr
2) Asya Vlasova. How to steal Linux? (in Russian) (24.06.2008), on FOSS licenses and their application in Russia: www.osp.ru/cio/2008/06/4987902
3) Kai-Fu Lee, Hsiao-Wuen Hon. An Overview of the SPHINX Speech Recognition System: www.ri.cmu.edu/pub_files/pub2/lee_k_f_1990_1/lee_k_f_1990_1.pdf
4) Rybach, D .; C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, H. Ney (September 2009). "The RWTH Aachen University Open Source Speech Recognition System." Interspeech-2009: 2111–2114.
5) Peter Grasch: simon: Open Sourcing Speech Recognition with KDE technology: www.desktopsummit.org/program/sessions/simon-open-sourcing-speech-recognition-kde-technology
6) Interactive Analysis, Transcription and Translation of Old Text Documents: prhlt.iti.upv.es/page/projects/multimodal/idoc/iatros
7) SHOUT speech recognition toolkit: www.digibic.eu/techprofile.asp?slevel=0z84z101&parent_id=101&renleewtsapf=1255

To be continued

Source: https://habr.com/ru/post/230133/

