Speech technology. Voice biometrics for dummies, using the contact center as an example

Hello.
Recently I wrote an article about continuous speech recognition, and now I would like to write about voice biometrics, i.e. confirming a person's identity by voice and recognizing a person by voice.

Since my work is connected with contact centers (CC), I will again talk about them. Another reason is that contact centers are now actively interested in voice biometrics, which is not surprising, since the telephone channel is an ideal application for it:
- you cannot see the subscriber at the other end of the line;
- you cannot use other modalities to confirm identity, such as the face, the retina or a fingerprint;
- no additional scanning devices are needed, such as ones you put your finger on or show your eye to;
- it is the cheapest kind of biometrics, although slightly inferior in reliability to other methods. But since other modalities are not technically applicable over the phone at mass scale, there is in fact no choice.
Of course, you could argue for confirming the subscriber's identity “based on knowledge”: passwords, secret words, TPIN codes (banks), passport data, etc. But none of this is reliable from a security standpoint, and it requires the subscriber to memorize the information or always keep it at hand, which is inconvenient for the subscriber and inefficient (costly) for the CC.

To begin with, let us define what voice biometrics includes:
- Identification, i.e. establishing who a person is by voice. This is when an old buddy calls you from an unknown number and says: “Guess who it is?” You try to find the best match in your head among all the voices you know. When the memory scan is over and you have found a more or less suitable match, you can say: "Yeah, this is my classmate Seryoga, whom I have not spoken to for 10 years." But you have no guarantee that it is really him, and this is where verification comes in.
- Verification, i.e. confirming a person's identity, an unambiguous identification. To do this, we can ask Seryoga to prove that he is exactly who he claims to be. We can ask him: “Tell me where we were at 6 am at the prom.” This information lets us confirm Seryoga's identity, since only he can know it (similar to the password I wrote about above).
If you want a more formal definition:
Identification - checks a single voice sample against a database of voices. As a result, the system outputs a list of people with similar voices, each with a match percentage. A 100% match means that the voice sample completely coincides with a voice from the database and the identity is reliably established.
Verification - compares two voice samples: the voice of the person whose identity is to be confirmed and a voice stored in the system's database whose owner's identity has already been reliably established. As a result, the system shows the degree to which one voice matches the other, as a percentage.
There is also such a thing as authentication. It is hard to say unequivocally how it differs from verification. Some of our employees believe it is the process of confirming a biological (!) identity in cases where identification is hard to separate from verification, i.e. it is a generalized process.
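To make the difference between the two definitions concrete, here is a minimal sketch. It assumes each voice is already reduced to a short vector of numbers and uses cosine similarity rescaled to 0-100 as the "match percentage". Both the toy vectors and the scoring formula are my illustration, not the scoring used by a real biometrics server, and the names "Speaker B" and "Speaker C" are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_percent(a, b):
    """Rescale cosine similarity from [-1, 1] to a 0-100 score (illustrative only)."""
    return 50.0 * (cosine_similarity(a, b) + 1.0)

def identify(sample, database):
    """1:N search: rank every enrolled speaker by similarity to the sample."""
    scores = [(name, match_percent(sample, model)) for name, model in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

def verify(sample, database, claimed_name):
    """1:1 check: compare the sample only with the model of the claimed identity."""
    return match_percent(sample, database[claimed_name])

# Toy 3-number "voice models"; real models are far larger.
database = {
    "Yuri Gagarin": [0.9, 0.1, 0.3],
    "Speaker B": [0.1, 0.8, 0.5],
    "Speaker C": [0.4, 0.4, 0.7],
}
caller = [0.85, 0.15, 0.35]

ranking = identify(caller, database)              # list of (name, percent), best first
score = verify(caller, database, "Yuri Gagarin")  # single percent for the claimed name
```

Identification has to touch every record in the database, while verification touches exactly one, which is part of why verification is the easier fit for a live call.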

Voice verification.
I will talk about verification, because it is more interesting than identification for real use in a contact center.

What kinds of verification are there?

- Text-independent
Identity is confirmed from the subscriber's spontaneous speech, i.e. we do not care what the person says. This is the slowest method: at least 6-8 seconds of the subscriber's clean speech must accumulate. It is usually used during the subscriber's conversation with the CC operator, when the operator needs to make sure that the subscriber is who he claims to be. Most interestingly, this method can be applied covertly, without the subscriber knowing. The CC operator sees a tool like this at the workstation.

Figure 1. Part of the CC operator's workstation interface for client verification.

- Text-dependent, with a static passphrase
Identity is confirmed by the passphrase that the subscriber came up with at registration. The passphrase must be at least 3 seconds long. We usually suggest saying your name and your company's name. The passphrase is always the same.
- Text-dependent, with a dynamic passphrase
Identity is confirmed by a passphrase that the system itself proposes at the time of the verification call, i.e. the passphrase is different every time! We usually use a sequence of numbers as the dynamic passphrase. The subscriber repeats the numbers after the system until it can make an unambiguous decision. This can be a single number such as “32” or a whole set, “32 58 64 25”. Interestingly, different numbers yield different amounts of information for the comparison: the most "useful" digit is "eight", which carries the most useful speech information, and the most useless is "two".
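The dynamic passphrase dialog can be sketched roughly like this. Everything here is an assumption for illustration: the function names, the accept and reject limits, and the `score_after` callback, which stands in for whatever the biometrics server returns after each repeated number.

```python
import secrets

def make_digit_prompt(pairs=4):
    """Return a prompt such as '32 58 64 25' built from random two-digit numbers."""
    return " ".join(str(secrets.randbelow(90) + 10) for _ in range(pairs))

def run_dialog(score_after, accept_at=70.0, reject_at=40.0, max_prompts=4):
    """Keep prompting until the accumulated score leaves the uncertain zone.

    score_after is a hypothetical callback: it receives the list of prompts
    spoken so far and returns the current match percentage.
    """
    prompts = []
    for _ in range(max_prompts):
        prompts.append(make_digit_prompt(pairs=1))
        score = score_after(prompts)
        if score >= accept_at:
            return "accept", prompts
        if score <= reject_at:
            return "reject", prompts
    return "undecided", prompts

# Fake scorer for the demo: confidence grows by 10 points per spoken number.
decision, prompts = run_dialog(lambda spoken: 50 + 10 * len(spoken))
```

Because the numbers are drawn fresh for every call, a recording of a previous call does not contain the right phrase, which is the main point of the dynamic variant.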

How does voice verification work?

Step 1.
To verify a voice, our database must contain a voice sample (a voiceprint) whose owner is reliably known. So the first step is to accumulate a database of voiceprints, and for this we ask subscribers (clients) to go through registration in the system.
Registration means that the subscriber voluntarily leaves a voiceprint, which we will later use for verification. We usually ask for 3 voiceprints in a row, i.e. to say the passphrase three times, so that there is some variation. Later, each time verification succeeds, we replace the oldest voiceprint with the new one, so the voiceprints are constantly refreshed as long as the subscriber uses the system often. This is how we deal with voice aging.
If we use verification with a dynamic passphrase, we ask the subscriber to say the digits from 0 to 9 three times. As a result, we get 30 voice samples.
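The "replace the oldest sample after each successful verification" rule is easy to picture as a fixed-size queue. A minimal sketch, where the `ClientCard` class and its method names are hypothetical and the models are plain strings standing in for real voice models:

```python
from collections import deque

class ClientCard:
    """Keeps only the most recent voice models for one client."""

    def __init__(self, name, max_models=3):
        self.name = name
        # A deque with maxlen drops its oldest item when a new one is appended.
        self.models = deque(maxlen=max_models)

    def enroll(self, model):
        """Store a model captured during registration."""
        self.models.append(model)

    def on_successful_verification(self, new_model):
        """Refresh the card: the newest model pushes out the oldest one."""
        self.models.append(new_model)

card = ClientCard("Yuri Gagarin")
for model in ("model_1", "model_2", "model_3"):
    card.enroll(model)

card.on_successful_verification("model_4")
current = list(card.models)  # ['model_2', 'model_3', 'model_4']
```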

It is desirable that the client registers his voiceprint over the same communication channel that will later be used for verification; otherwise the probability of errors increases. There are cases when people register with a headset over Skype and are then verified from a home phone; here the channel factor strongly affects the reliability of the service. When building a service, you can account for the channels being different: this is worked out and tested separately for each specific case, and the channel effects can be neutralized almost completely. But if you deploy in a rush without thinking about it up front, there will be difficulties.

When should we offer the client registration? When we have already confirmed his identity in other ways, for example during a visit to the company's office or after the CC operator has asked a zillion different questions, such as the mother's maiden name.
We have a working demo service on the phone; how to implement the registration mechanism for bank customers in practice is described in this document.

It is important that the client goes through registration independently and consciously (knowing why it is needed and how it will help him later), because only a loyal subscriber who wants the result and accepts the “rules of the game” can then be verified.
If the client is forced to go through verification whether it is appropriate or not, he may subconsciously change his voice or fool around (be unfriendly to the service). This leads to errors, and customer loyalty falls, although the client himself is indirectly to blame.

How does subscriber registration in the system work? (static passphrase)

Figure 2. Scheme of registering a person in the biometric system.

1. The subscriber calls the biometric system, which asks him to come up with a passphrase and say it 3 times.
2. The voice is processed by the biometrics server, and at the output we get 3 voice models, one for each spoken passphrase.
3. On the server, we create a client card (Yuri Gagarin) and attach the 3 voice models to it.

What is a voice model?
It is the unique characteristics of a person's voice expressed as a matrix of numbers, i.e. a file 18 KB in size (for a static passphrase). It is like a fingerprint. These voice models are what we later compare. In total, a voice model records 74 (!) different voice parameters.

How do we get voice models?
We use 4 independent methods:
- analysis of pitch statistics;
- Gaussian mixture models (GMM) and SVM;
- the spectral-formant method;
- the total variability method.
I will not try to describe them in detail here: it is difficult even for me and definitely goes beyond a “for dummies” course. We teach all this at our RIS department at ITMO (St. Petersburg).

Step 2.
This is the verification itself. We have a subscriber on the line who claims to be Yuri Gagarin. Our database, in turn, contains the client card of Yuri Gagarin with his stored voiceprints. So all we need to do is compare the voice of the person claiming to be Yuri Gagarin with the voice of the real Yuri Gagarin.

How does subscriber verification in the system work? (static passphrase)

Figure 3. Scheme of verifying a person in the biometric system.

1. First we proceed as during registration: we take the passphrase spoken by the client, send it to the biometrics server and build the voice model of the “alleged” Yuri Gagarin.
2. Then we take the 3 voice models of the real Yuri Gagarin, combine them into an averaged model in a clever way and send it to the biometrics server too.
3. The server simply compares the 2 models. At the output, we get the percentage to which one model matches the other.
4. Next we need to do something with this number (92% in the figure). Is it a lot or a little? Can we say unequivocally that this is Yuri Gagarin, or is it an impostor?
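Steps 2 and 3 above can be sketched as follows. I do not describe the real "clever" averaging, so this sketch substitutes a plain element-wise mean, and the comparison is cosine similarity rescaled to 0-100. Both are stand-ins chosen for illustration, and the vectors are made-up toy data.

```python
import math

def average_model(models):
    """Element-wise mean of several equal-length models (stand-in for the real averaging)."""
    count = len(models)
    return [sum(values) / count for values in zip(*models)]

def match_percent(a, b):
    """Cosine similarity rescaled from [-1, 1] to 0-100 (illustrative score only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 50.0 * (dot / (norm_a * norm_b) + 1.0)

# Three enrolled models of the real client (toy vectors).
enrolled = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.9, 0.1, 0.1],
]
reference = average_model(enrolled)

alleged_client = [0.88, 0.12, 0.05]  # model built from the caller's passphrase
impostor = [0.0, 0.2, 0.9]           # a clearly different voice

genuine_score = match_percent(alleged_client, reference)
impostor_score = match_percent(impostor, reference)
```

The only thing the sketch is meant to show is the shape of the pipeline: many enrolled models become one reference, and one call becomes one number.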

Figure 4. The “own / stranger” trust threshold.

The system has a parameter called the “trust threshold”, which is a certain match percentage. Suppose we set it to 60%. If the match percentage for the voice model of the “alleged” Yuri Gagarin does not reach the trust threshold, an impostor has called us. If it is above the trust threshold, the real Yuri Gagarin has called us. We set the trust threshold ourselves; usually it is between 50 and 70%, depending on the verification task.
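In code, the decision is a single comparison. A minimal sketch, where the function name is mine and the thresholds are the 50, 60 and 70% values mentioned above:

```python
def is_own(match_percent, trust_threshold=60.0):
    """True if the score reaches the trust threshold, False for a suspected impostor."""
    return match_percent >= trust_threshold

figure_case = is_own(92.0)                     # the 92% example against the 60% threshold

borderline = 65.0
lenient = is_own(borderline, trust_threshold=50.0)  # accepted by a lenient service
strict = is_own(borderline, trust_threshold=70.0)   # rejected by a strict service
```

The same caller with the same 65% score is let through by one service and turned away by another, which is why the threshold is chosen per task rather than fixed once.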

Here I ought to tell you about type I errors (false rejection, FR) and type II errors (false acceptance, FA), as well as the equal error rate (EER), but I will not: it would make the text much longer and more complicated. If there is interest, I will try to persuade someone from the scientific department to describe it in plain language and post it here separately.

Let me just say that, depending on the verification task, it can be better for us to turn away one of our “own” than to let a “stranger” through. And vice versa: sometimes it is more important to let our “own” through than to keep every “stranger” out.
I am sure that none of you understood these two sentences the first time, and you had to reread them carefully to grasp the meaning.
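A small numeric sketch may help with those two sentences. The scores below are made up purely for illustration. Moving the threshold trades one error for the other: false rejection (FR) is an "own" turned away, false acceptance (FA) is a "stranger" let through.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FR rate, FA rate) for a given trust threshold."""
    false_rejects = sum(score < threshold for score in genuine_scores)
    false_accepts = sum(score >= threshold for score in impostor_scores)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))

# Made-up match percentages, not measurements from any real system.
genuine = [92, 85, 78, 66, 58]   # calls from the real client
impostor = [30, 41, 47, 55, 63]  # calls from other people

rates = {threshold: error_rates(genuine, impostor, threshold)
         for threshold in (50, 60, 70)}
# rates[50] -> (0.0, 0.4): nobody of our own is rejected, but 2 of 5 strangers get in
# rates[60] -> (0.2, 0.2): both errors are equal, which is the idea behind EER
# rates[70] -> (0.4, 0.0): no stranger gets in, but 2 of 5 of our own are rejected
```

A bank transfer would lean toward the strict end, while a service that only greets the caller by name can afford the lenient end.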

Integrating the biometrics server into the contact center.

Figure 5. VoiceKey product block diagram.

Honestly, everything is very simple here: at the input we send the voice in WAV or PCM format over HTTP, and at the output we get the result of the comparison. I do not want to dwell on this in more detail.
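To give a feel for "voice in over HTTP, result out", here is a sketch that only prepares such a request and does not send it. The URL, the `client_id` query parameter and the `Content-Type` choice are hypothetical placeholders, not the real VoiceKey interface. The 8 kHz mono 16-bit audio format is also just an assumption typical for telephony.

```python
import io
import urllib.parse
import urllib.request
import wave

def pcm_to_wav_bytes(pcm, sample_rate=8000):
    """Wrap raw 16-bit mono PCM samples into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()

def build_verify_request(base_url, client_id, wav_bytes):
    """Prepare (but do not send) an HTTP POST carrying the caller's audio."""
    query = urllib.parse.urlencode({"client_id": client_id})  # hypothetical parameter
    return urllib.request.Request(
        f"{base_url}?{query}",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

silence = b"\x00\x00" * 800  # 0.1 s of silence at 8 kHz, as placeholder audio
request = build_verify_request(
    "http://biometrics.example/verify",  # hypothetical endpoint
    "gagarin",
    pcm_to_wav_bytes(silence),
)
# Sending would be urllib.request.urlopen(request); the response format is not shown
# here because it depends on the actual server.
```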

The verification process takes an average of 0.8 seconds. It is possible to work simultaneously with many streams.

Everything is described in detail on our site, and most importantly, there are well-developed use cases for contact centers. In recent years I have talked a lot with various large CCs in Russia, primarily in the financial sector, and a clear understanding of their goals and objectives has formed.

Now let us touch on the following question: is voice biometric technology suitable for mass use at all? Is it reliable?

In short, YES, it works really well. Our company has telephone demo stands. If you are interested, each of you can call and try for yourself how it works. I give out the phone number and testing instructions on request from this page, simply to gather statistics on interest in the topic and to estimate the load on the server.

For reference: the work of Russian scientists in voice biometrics, if it does not hold first place in the world, certainly shares it with others. This is confirmed by independent evaluations, such as those of NIST (National Institute of Standards and Technology, USA), where our company ranked in the top three among commercial companies in all five tests. Another confirmation is that our product “VoiceKey” won the “Best Product of the Year for CC” nomination in 2013 in the international “Crystal Headset” competition.
It is also worth noting that our company implemented what is currently the world's largest voice biometrics project in the telephone channel.

In short, that was the crash course. I am ready to answer questions in the comments.

Source: https://habr.com/ru/post/205880/
