Hello colleagues! In this article, I will briefly describe the particulars of building biometric verification and identification systems that our DATA4 team encountered while creating its own solution.
Identity authentication is needed wherever access control matters: banks, insurance companies, and other domains that handle confidential information.
Traditionally, authentication relies on knowledge of a "key," such as a password, a control word, or a passport number. This approach has a drawback: it confirms not the person, but the information the person knows.
Biometric solutions do not have this drawback.
A promising approach to the problem is voice authentication. Each person's voice is unique, and with a given accuracy it can be attributed to its owner. For identification tasks this approach is not yet applicable, because at the current level of technology the false-acceptance rate is 3-5%. The accuracy of the algorithms is 95-97%, which does allow the technology to be used for verification.
An additional advantage of voice verification is a shorter authentication time in the contact center, which yields an economic effect proportional to the number of operators (savings on wages and telephony). By our estimates, the achievable effect is up to 27 million rubles per year for a contact center with 100 operators (accounting for taxes, telephony costs, operators working in two shifts, etc.), although the figure depends heavily on the specific case.
The principles of the classical approach
A person's voice recording is a signal that needs to be processed in order to extract features and build a classifier.
Our solution consists of four subsystems: digital signal processing, feature extraction, speech detection (VAD), and the classifier [1].

Digital Signal Processing Subsystem
- The signal is filtered and the band of interest is isolated. The human ear hears frequencies from 20 Hz to 20 kHz, but biometric verification solutions typically work with the 300-3400 Hz telephone band.
- The signal is transformed into the frequency domain using the fast Fourier transform (FFT).
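These two steps can be sketched with NumPy alone. Here the 300-3400 Hz band is isolated crudely by zeroing FFT bins outside it (a production system would use a proper band-pass filter), and the 8 kHz telephone sample rate is my assumption:

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz; assumed telephone-quality audio

def bandlimit_spectrum(signal, low_hz=300.0, high_hz=3400.0, rate=SAMPLE_RATE):
    """FFT the signal and zero all bins outside the 300-3400 Hz band."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[~band] = 0.0
    return freqs, spectrum

# One second of a 1 kHz test tone (inside the band) plus 100 Hz hum (outside).
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
signal = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 100 * t)
freqs, spectrum = bandlimit_spectrum(signal)
# The 100 Hz component is removed; the 1 kHz component survives.
```

With a one-second signal at 8 kHz, each FFT bin is exactly 1 Hz wide, which makes the band-limiting easy to inspect.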
Feature selection subsystem
- The signal is divided into segments of 20-25 ms; below we call these segments frames.
- For each frame, mel-frequency cepstral coefficients (MFCC) and their first and second deltas are computed. The first 13 MFCC coefficients are used. [2]
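As a minimal sketch (assuming 8 kHz audio, 25 ms non-overlapping frames, and a placeholder MFCC matrix, all my assumptions), framing and delta computation can look like this; real MFCC extraction would additionally apply a mel filter bank and a DCT per frame:

```python
import numpy as np

def split_frames(signal, rate=8000, frame_ms=25):
    """Split a 1-D signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(rate * frame_ms / 1000)          # 200 samples at 8 kHz
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def delta(features):
    """First-order difference along the frame (time) axis, zero-padded at the start."""
    d = np.diff(features, axis=0)
    return np.vstack([np.zeros((1, features.shape[1])), d])

signal = np.random.randn(8000)           # one second of noise as a stand-in
frames = split_frames(signal)            # shape (40, 200)
mfcc = np.random.randn(len(frames), 13)  # placeholder for 13 MFCC per frame
d1 = delta(mfcc)                         # first delta
d2 = delta(d1)                           # second delta
features = np.hstack([mfcc, d1, d2])     # 39 features per frame
```

Stacking the 13 coefficients with their two deltas gives the common 39-dimensional per-frame feature vector.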
Speech detection subsystem
- The feature vector is fed into a pre-trained binary speech classifier that decides, frame by frame, whether speech is present. To maximize quality, gradient-boosted tree ensembles such as XGBoost are used; to maximize speed, logistic regression or a support vector machine (SVM) is used.
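The fast variant can be illustrated with a toy logistic regression trained by gradient descent. The single "log-energy-like" feature and the synthetic frames here are my stand-ins; the real classifier consumes the MFCC-plus-delta vectors and is trained on labeled speech data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-frame features: speech frames score higher than silence frames.
speech = rng.normal(loc=2.0, scale=1.0, size=(500, 1))
silence = rng.normal(loc=-2.0, scale=1.0, size=(500, 1))
X = np.vstack([speech, silence])
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = speech, 0 = silence

# Minimal logistic regression trained by full-batch gradient descent.
w, b = np.zeros(1), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)      # gradient step on weights
    b -= 0.1 * np.mean(p - y)                # gradient step on bias

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = np.mean(pred == y)
```

On this well-separated synthetic data the frame classifier reaches high accuracy; real VAD data is much noisier, which is why the article recommends boosted trees when quality matters.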
Classifier
- A mixture of Gaussian distributions over the selected features is fitted to the frames in which speech was detected [3]. Training the model requires at least 24-30 seconds of clean speech; testing requires 12-15 seconds.
- A total feature vector (an i-vector) of 100 values is derived from the mixture of distributions.
- The feature vector is fed to a binary classifier. In the traditional approach, SVM or boosting is used for classification. [4]
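The mixture-fitting step can be sketched with a bare-bones EM loop. This toy fits a two-component 1-D Gaussian mixture to synthetic data (real systems fit multivariate mixtures with many components over the 39-dimensional frame features; the data and component count here are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data drawn from two Gaussians, standing in for per-frame features.
data = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

def fit_gmm_1d(x, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture with a minimal EM loop."""
    means = np.linspace(x.min(), x.max(), k)
    stds = np.full(k, x.std())
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
               / (stds * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
        weights = nk / len(x)
    return means, stds, weights

means, stds, weights = fit_gmm_1d(data)
```

After convergence the estimated component means land near the true values of -3 and 3, and the mixture weights sum to one.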
For correct operation, the costs of Type I and Type II errors must be set. If the false-acceptance error is to be minimized, its "penalty" is set 100-1000 times higher than the "penalty" for a false rejection. We used a factor of 100.
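The effect of the 100x penalty can be shown by picking the decision threshold that minimizes the weighted error on held-out scores. The Gaussian score distributions below are synthetic stand-ins for real verification scores:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic verification scores: impostors score low, genuine speakers high.
impostor = rng.normal(-1.0, 1.0, 1000)
genuine = rng.normal(1.0, 1.0, 1000)
FA_PENALTY, FR_PENALTY = 100.0, 1.0      # the 100x factor from the text

def weighted_cost(threshold):
    false_accepts = np.mean(impostor >= threshold)   # impostor let through
    false_rejects = np.mean(genuine < threshold)     # genuine speaker rejected
    return FA_PENALTY * false_accepts + FR_PENALTY * false_rejects

thresholds = np.linspace(-4.0, 4.0, 801)
best = thresholds[np.argmin([weighted_cost(t) for t in thresholds])]
# With a 100x false-acceptance penalty, the chosen threshold sits well above
# the equal-error point (which would be near 0 for these symmetric scores).
```

Raising the false-acceptance penalty pushes the threshold toward the genuine-speaker side: the system rejects more legitimate users in exchange for letting almost no impostors through.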
Building a verification solution requires data labeled by speaker and by speech presence. It is recommended to use at least several hundred speakers recorded in varied acoustic conditions (different phone models, room types, etc.), with at least 5-10 hours of speech in total. We used our own dataset of more than 5 thousand audio files. This much data is necessary to avoid overfitting; to reduce it further, also use cross-validation and regularization.
As a VAD (voice activity detector), you can use the off-the-shelf solution from Google. But if you want to understand how it works, it is better to write your own solution based on XGBoost; an accuracy above 99% is achievable. In our experience, it is the quality of the VAD that is the "bottleneck" for the final quality of the system.
For digital signal processing tasks, the Bob toolkit is well known.
Summary
Building a voice verification solution requires data plus skills in digital signal processing and machine learning.
You can learn more about how verification solutions work, and about the basics of machine learning and DSP, in the literature below.
Literature:
1. A.V. Kozlov, O.Yu. Kudashev, Yu.N. Matveyev, T.S. Pekhovsky, K.K. Simonchik, A.K. Shulipa. "Speaker voice identification system for the NIST SRE competition." 2013.
2. Yu.N. Matveyev. "Investigation of the informativeness of speech features for automatic speaker identification systems." 2013.
3. D.V. Baker, S.G. Tikhorenko. "Algorithm of use of Gaussian mixtures for speaker identification by voice in technical systems."
4. N.S. Klimenko, I.G. Gerasimov. "Study of the effectiveness of boosting in the task of text-independent speaker identification." 2014.
Useful resources:
1. A machine learning course from MIPT on Coursera;
2. A DSP course from MIPT on the internal portal.