
A simple speech recognition algorithm with a short dictionary, based on MFCC

Greetings to all readers of Habrahabr!

Recently there has been a significant increase of interest in speech recognition technology. There are several reasons for this growth, in particular the substantial increase in available computing power and training material. On Habrahabr, user domage published a whole series of articles on the basics of speech recognition technology. Also worth noting are the article Mel-cepstral coefficients (MFCC) and speech recognition and the work done on its basis for identifying a person by voice: Who is it? - Identification of a person by voice.
This article proposes a simple algorithm (and its implementation in C++) for a speech recognition system with a short vocabulary, based on an analysis of the statistical distribution of mel-cepstral coefficients (Mel-frequency cepstrum coefficients, MFCC).


Formulation of the problem


There are many methods of speech recognition; in most cases they are based on methods of statistical analysis and probability theory (Hidden Markov Model, Gaussian Mixture Model, etc.). As you know, Google provides a free service for recognizing short voice messages. On the basis of this service, speech recognition on a microcontroller has even been suggested: Speech recognition on STM32F4-Discovery. However, the question arises: is it possible to build your own speech recognition system, even with a rather limited dictionary, without using "external" services, while still working quickly and with acceptable quality?

Main idea


So, for speech recognition we will use MFCC. Without going into details, I will say that they should be treated simply as a kind of filter whose input is a phonogram and whose output is a set of vectors (coefficients), from which we will recognize a word or set of words. In fairness, it is worth noting that there are many other acoustic features used for speech recognition: Perceptual Linear Prediction (PLP), Linear Prediction Cepstral Coefficients (LPCC), Linear Frequency Cepstral Coefficients (LFCC).
The basic idea is to use linear discriminant analysis (LDA) to identify a word. However, LDA is applicable only to vectors of the same dimension. Since words can have different lengths, the question arises: how to transform a sequence of an arbitrary number of MFCC vectors into a vector of fixed dimension?
One can proceed as follows: find the places of "thickening" in the distribution of these vectors and take the concatenation of the vectors that are the centers of these "thickenings" as the resulting vector. Such a concatenated vector will be called the supervector of means, and the centers themselves the means. As a "starting point" we will use the supervector of means obtained over all MFCC vectors of the entire training base. Having thus transformed a sequence of MFCC vectors into a single supervector of means of fixed dimension, we can apply various classification methods (a minimal sketch of this transformation follows below).
The principal disadvantage of this approach is obvious: the dynamics of the distribution of the MFCC features over time is not taken into account, so the system is a priori incapable of distinguishing, for example, the words "Glavryba" and "Abyrvalg" (the same word read backwards), because the overall distribution of the MFCC vectors of such words will be approximately the same (and, accordingly, the centers of the "thickenings" will coincide).
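
To make this idea concrete, below is a minimal sketch (not the author's actual code) of finding the "thickening" centers with K-means and concatenating them into a supervector of means. MFCC vectors are assumed to be plain std::vector<float> of equal dimension; the naive initialization and fixed iteration count are simplifications.

    #include <vector>
    #include <limits>

    using Vec = std::vector<float>;

    // Squared Euclidean distance between two vectors of equal dimension.
    static float dist2(const Vec& a, const Vec& b) {
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) {
            const float d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Plain K-means over all MFCC vectors of the training base
    // (assumes data is non-empty and data.size() >= K; initializing
    // from the first K points is a simplification).
    std::vector<Vec> kmeans(const std::vector<Vec>& data, size_t K, int iters) {
        std::vector<Vec> means(data.begin(), data.begin() + K);
        std::vector<size_t> label(data.size(), 0);
        for (int it = 0; it < iters; ++it) {
            // Assignment step: attach every vector to its nearest mean.
            for (size_t i = 0; i < data.size(); ++i) {
                float best = std::numeric_limits<float>::max();
                for (size_t k = 0; k < K; ++k) {
                    const float d = dist2(data[i], means[k]);
                    if (d < best) { best = d; label[i] = k; }
                }
            }
            // Update step: recompute each mean as the centroid of its cluster.
            std::vector<Vec> sum(K, Vec(data[0].size(), 0.0f));
            std::vector<size_t> cnt(K, 0);
            for (size_t i = 0; i < data.size(); ++i) {
                for (size_t j = 0; j < data[i].size(); ++j)
                    sum[label[i]][j] += data[i][j];
                ++cnt[label[i]];
            }
            for (size_t k = 0; k < K; ++k)
                if (cnt[k] > 0)
                    for (size_t j = 0; j < sum[k].size(); ++j)
                        means[k][j] = sum[k][j] / cnt[k];
        }
        return means;
    }

    // Concatenating the K means yields the fixed-dimension supervector.
    Vec supervector(const std::vector<Vec>& means) {
        Vec sv;
        for (const Vec& m : means)
            sv.insert(sv.end(), m.begin(), m.end());
        return sv;
    }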

Algorithm Description


As a training base, we will use a set of files, each of which contains the MFCC vectors obtained from a phonogram with a recording of one word or another. Files with recordings of the same word should be combined into one group.
Here is the distribution of the first two components of the MFCC vectors of the entire training base:


The algorithm consists of the following steps:
  1. We find the supervector of means for the entire training base using the K-means algorithm.
    An example of the operation of the K-means algorithm for K = 10 is shown in the figure:

    where the big red squares are the desired mean values.

  2. For each file of the base, we find its own mean values using the formula
    Mk = a · Mk0 + (1 − a) · Mk',  k = 1, …, K,
    where Mk0 is the mean value found in step 1,
    Mk' is the mean value obtained by applying one iteration of the K-means algorithm to the MFCC vectors of the file, using Mk0 as the initial value,
    and a = R / (R + Nk), where R is the "sensitivity" coefficient and Nk is the number of MFCC vectors assigned to the mean value Mk'.
    The mean values found in this way will be called the adapted mean values (a code sketch of this step is given after the list).
    An example of adapted mean values for a file is shown in the figure:


  3. Now, having the adapted supervectors of means instead of the original phonograms, we carry out LDA for N classes (each class corresponds to one word).
    As a result, we obtain a matrix consisting of the vectors of a new basis; when projected onto it, the adapted supervectors of means should be separated sufficiently well. Example for N = 4:


  4. We project all the adapted supervectors of means onto the new basis and find the mean value and standard deviation of the projections for each class.
  5. To determine whether a test phonogram belongs to a particular class (i.e., to recognize it), we perform steps 2 and 4 for it, then find the distances from the obtained projection to the mean values of all classes (optionally normalizing them by the corresponding standard deviations). The minimum distance corresponds to the class to which the test phonogram belongs (see the classification sketch after this list).
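
As an illustration of the adaptation formula from step 2, here is a minimal sketch, reusing Vec and dist2 from the K-means sketch above; the "sensitivity" coefficient R and the base means from step 1 are inputs.

    // One adaptation step (step 2 of the algorithm):
    //   Mk = a * Mk0 + (1 - a) * Mk',   a = R / (R + Nk),
    // where Mk' is the centroid of the file's MFCC vectors assigned
    // to the k-th base mean Mk0, and Nk is the size of that cluster.
    std::vector<Vec> adaptMeans(const std::vector<Vec>& fileData,
                                const std::vector<Vec>& baseMeans, float R) {
        const size_t K = baseMeans.size();
        const size_t dim = baseMeans[0].size();
        std::vector<Vec> sum(K, Vec(dim, 0.0f));
        std::vector<size_t> cnt(K, 0);
        // One K-means assignment pass with the base means as initial values.
        for (const Vec& x : fileData) {
            size_t best = 0;
            float bestD = std::numeric_limits<float>::max();
            for (size_t k = 0; k < K; ++k) {
                const float d = dist2(x, baseMeans[k]);
                if (d < bestD) { bestD = d; best = k; }
            }
            for (size_t j = 0; j < dim; ++j) sum[best][j] += x[j];
            ++cnt[best];
        }
        // Blend each file centroid with the corresponding base mean.
        std::vector<Vec> adapted = baseMeans;
        for (size_t k = 0; k < K; ++k) {
            if (cnt[k] == 0) continue;  // no data for this mean: keep the base mean
            const float a = R / (R + static_cast<float>(cnt[k]));
            for (size_t j = 0; j < dim; ++j) {
                const float mPrime = sum[k][j] / cnt[k];
                adapted[k][j] = a * baseMeans[k][j] + (1.0f - a) * mPrime;
            }
        }
        return adapted;
    }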

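For steps 4-5, here is a sketch of the recognition side: project a supervector onto the LDA basis (the basis itself comes from an external routine, ALGLIB in the author's case) and pick the class with the nearest sigma-normalized projected mean. The per-class means and standard deviations are assumed to be precomputed on the training base; Vec is reused from the sketches above.

    // Project a supervector onto the LDA basis
    // (each row of `basis` is one basis vector).
    Vec project(const std::vector<Vec>& basis, const Vec& sv) {
        Vec p(basis.size(), 0.0f);
        for (size_t r = 0; r < basis.size(); ++r)
            for (size_t j = 0; j < sv.size(); ++j)
                p[r] += basis[r][j] * sv[j];
        return p;
    }

    // Steps 4-5: choose the class whose projected mean is nearest,
    // normalizing each coordinate by that class's standard deviation
    // (sigmas are assumed to be strictly positive).
    size_t classify(const Vec& proj,
                    const std::vector<Vec>& classMeans,
                    const std::vector<Vec>& classSigmas) {
        size_t best = 0;
        float bestD = std::numeric_limits<float>::max();
        for (size_t c = 0; c < classMeans.size(); ++c) {
            float d = 0.0f;
            for (size_t j = 0; j < proj.size(); ++j) {
                const float z = (proj[j] - classMeans[c][j]) / classSigmas[c][j];
                d += z * z;
            }
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }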

Implementation


A complete implementation of the described algorithm, along with the source code and a base for testing, can be found here.
Creating your own word recognition system consists of the following steps:
  1. Recording of phonograms for training and testing
    For recording, you can use any program that can record sound and save it in WAVE format. I recommend using the free Audacity program.
    The developed system cannot detect speech segments on its own, so when recording you should try to ensure that only speech is present in the phonogram. The better the microphone used, the better the resulting system. You must record in mono mode with a sampling rate of 16000 Hz.
  2. Construction of MFCC vectors
    To build MFCC vectors, you can use the free SPro 5.0 library. I took the liberty of going through this library a little, fixed a couple of bugs, and built the sfbcep.exe program for Windows (see the ../spro-5.0 folder). The 32-bit build of this program is in the ../tools folder. To build the MFCC vectors, I used the following parameters:
    sfbcep.exe --format=wave --sample-rate=16000 --mel --freq-min=0 --freq-max=8000 --fft-length=256 --length=16.0 --shift=10.0 --num-ceps=13 [input WAVE file] [output MFCC file]

  3. Training and testing the system
    For training and testing the system, I wrote the wrsystem program in C++. The full source code is in the ../wrsystem folder. The 32-bit build of this program can be found in the ../tools folder.
    The implementation of the LDA algorithm was borrowed from the ALGLIB library.
    The wrsystem program has two modes of operation: training (when the --learn parameter is present) and testing. The program takes three basic parameters as input (an example invocation is given after this list):
    - the path to the file describing the training (testing) base (parameter --base); an example base description file is in the ../base folder, and a description of the format can be viewed by running the program with the --help parameter;
    - the path to the binary file storing the result of training the system (parameter --system); in training mode this file is created, in testing mode it is read;
    - the path to the file to which the results of testing the system on the specified base are written: the confusion matrix and the Word Error Rate value (parameter --test_results).
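
    For example, training and then testing the system might look like this (the file names here are hypothetical, and the --option=value syntax is assumed by analogy with the sfbcep.exe call above; run the program with --help for the authoritative description):
    wrsystem.exe --learn --base=../base/learn_base.txt --system=my_words.sys
    wrsystem.exe --base=../base/test_base.txt --system=my_words.sys --test_results=results.txt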


Experimental results


As an experiment, I created a system that recognizes 14 words recorded in my own voice. For training the system I recorded each word 4-5 times, and for testing, 7 times. In total, the training base contains 63 files and the testing base 98. The following parameters were used for training:

Testing on the training base showed a word error rate (WER) of 1.6%, and on the testing base, 5.1%.

What you should pay attention to


I would like to make a few remarks. First, for any system (including the one described here) to recognize the speech of an arbitrary person with good quality, it needs a huge training base with recordings of all the words spoken by different people in different emotional states, using different recording devices (telephone, microphone, surveillance device, etc.). That is, a system that you train using only your own voice and only your home headset will most likely not work for your friends, and not even for you if you use some other microphone. Second, the described system has very limited potential due to its triviality. Although it works, this approach was proposed only as an experiment and is not suitable for industrial use without modifications.

That's all, thank you for your attention!

Source: https://habr.com/ru/post/150251/

