
Teaching the robot to listen


Controlling all communications manually is laborious and, on top of that, ineffective, so we decided to automate it. To do this, we had to teach our Virtual PBX new tricks. We introduced Text-to-Speech technology a long time ago, and now we have started on the reverse process.


Choosing a robot


From the very beginning we understood that we did not have the resources to develop our own speech recognition platform. So we decided to take an existing solution from one of the vendors and integrate it into the Virtual PBX.


Making the choice was fairly simple. After Dragon offered to sell us not their Software Development Kit but a ready-made service (on completely different terms, of course), four candidates remained: Google, Yandex, Microsoft, and STC (Speech Technology Center).


The main criterion by which we evaluated all these solutions was the quality of recognition of telephone speech, which has its own specifics. Keep in mind that these are recordings of conversations, not short voice queries addressed to a search assistant, and the recordings have characteristics dictated by the codecs, the channel, and so on. That is why we settled on the product from STC.


Implementation difficulties


Integrating the recognition engine with our platform took about two months. The main problem we had to solve was making it work faster, or more precisely, finding a balance between the convenience of working with the SDK and performance.


Submitting a large batch for recognition is beneficial in terms of resources (no need to spend a minute or two on preparation each time), but you can neither pause processing nor change the priority of the conversations already sent for transcription.


Strictly speaking, what STC calls an SDK is not quite that. It is a package of binary files plus a bat file to launch them, in which you specify:



The choice of output formats is not rich either: either something ini-like, or a proprietary format that marks the boundaries of each word but lacks the punctuation present in the first one. Because the bat file has to be launched for every run, we faced a trade-off: the larger the batch, the better the performance, but the lower the flexibility.
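
To make the trade-off concrete, here is a minimal sketch of how such a batch run might be wrapped from Python. The bat-file path, its arguments and the output layout are assumptions for illustration, not STC's actual interface.

```python
import subprocess
from pathlib import Path

# Hypothetical paths: the real bat file and its arguments are vendor-specific.
RECOGNIZER_BAT = Path(r"C:\stc_sdk\recognize.bat")
RESULTS_DIR = Path(r"C:\stc_sdk\results")


def recognize_batch(wav_files):
    """Run the vendor bat file on one batch of call recordings.

    Once the process is started, the batch can neither be paused nor
    reprioritized, so the batch size is the only knob that trades
    throughput against flexibility.
    """
    list_file = RESULTS_DIR / "batch.lst"
    list_file.write_text("\n".join(str(p) for p in wav_files), encoding="utf-8")

    # One process launch per batch: a large batch amortizes the start-up cost.
    subprocess.run([str(RECOGNIZER_BAT), str(list_file), str(RESULTS_DIR)], check=True)

    # Assumed output: one <name>.txt per recording with "start end word" lines
    # (the word-boundary format without punctuation mentioned above).
    transcripts = {}
    for wav in wav_files:
        out = RESULTS_DIR / (Path(wav).stem + ".txt")
        words = [line.split()[2] for line in out.read_text(encoding="utf-8").splitlines() if line.strip()]
        transcripts[Path(wav).name] = " ".join(words)
    return transcripts
```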


All parts of our platform (queues, dispatchers, switch management modules, etc.) run as Python microservices under RHEL, but STC provided a solution only for Windows. So another task the developers had to solve was implementing a microservice on Windows.
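
For reference, a minimal sketch of what a Python worker looks like when wrapped as a Windows service with pywin32; the service name and the process_next_batch() stub are placeholders, not our actual code.

```python
import servicemanager
import win32event
import win32service
import win32serviceutil


def process_next_batch():
    """Placeholder for the actual recognition work loop."""


class RecognizerService(win32serviceutil.ServiceFramework):
    # Hypothetical names; only the structure matters here.
    _svc_name_ = "PbxRecognizer"
    _svc_display_name_ = "Virtual PBX speech recognition worker"

    def __init__(self, args):
        super().__init__(args)
        self.stop_event = win32event.CreateEvent(None, 0, 0, None)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32event.SetEvent(self.stop_event)

    def SvcDoRun(self):
        servicemanager.LogMsg(servicemanager.EVENTLOG_INFORMATION_TYPE,
                              servicemanager.PYS_SERVICE_STARTED,
                              (self._svc_name_, ""))
        # Poll for work until the stop event is signalled.
        while win32event.WaitForSingleObject(self.stop_event, 5000) == win32event.WAIT_TIMEOUT:
            process_next_batch()


if __name__ == "__main__":
    # Handles install / start / stop / remove from the command line.
    win32serviceutil.HandleCommandLine(RecognizerService)
```

Note that such a service runs under its own account (typically LocalSystem) rather than the interactive user, which is directly relevant to the next problem.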


Everything would have been fine, but it turned out that in Windows service mode (which is how we launch the recognizer) it is impossible to access the NFS network shares where our call recordings are stored: what worked from the console refused to work as a service. Fighting this also cost us development resources.


Compressed audio cannot be recognized


The implementation difficulties described above are not the only ones. Managing the recordings of conversations also turned out to be quite difficult: we let our clients download recordings in a compressed format. This saves resources, both disk space and download time, which means customers get their recordings faster.


But compressed recordings are of almost no use for recognition, so we have to keep uncompressed copies of the conversations, and for them a separate algorithm for managing the retention window, which additionally imposes its own reliability requirements. We set up backup storage that is protected against failures and serves as a buffer in case of a major outage.
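
As a rough sketch of the retention-window idea (the paths and the window length are invented for illustration; the real values are configuration):

```python
import time
from pathlib import Path

# Hypothetical settings: uncompressed copies live only long enough to be
# recognized, plus a safety margin; compressed originals stay in long-term storage.
UNCOMPRESSED_DIR = Path("/storage/uncompressed")
RETENTION_DAYS = 3


def purge_uncompressed_copies(now=None):
    """Delete uncompressed WAV copies that have aged out of the retention window."""
    now = now if now is not None else time.time()
    cutoff = now - RETENTION_DAYS * 24 * 3600
    removed = 0
    for wav in UNCOMPRESSED_DIR.glob("*.wav"):
        if wav.stat().st_mtime < cutoff:
            wav.unlink()
            removed += 1
    return removed
```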


How much to hang in grams?


It was this question from the old commercial that popped into my head when we started testing the classification algorithm. It uses regression analysis for automatic labeling, and it turned out that it needs a significant amount of input data to work well.
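
The article does not describe the model itself, so purely as an illustration of the approach, here is a toy sketch that labels recognized transcripts with TF-IDF features and logistic regression in scikit-learn; the example phrases and the target / non-target labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: recognized transcripts with manually assigned labels.
transcripts = [
    "hello I would like to order a cleaning for tomorrow",
    "you have dialed the wrong number",
    "how much does a monthly subscription cost",
    "please remove me from your call list",
]
labels = ["target", "non-target", "target", "non-target"]

# Bag-of-words features plus logistic regression as the "regression analysis" step.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(transcripts, labels)

print(model.predict(["what is the price of your service"]))
```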


When trained on a sample of 500 labeled conversations, accuracy rises above 80%. But to reach values close to 95%, which is currently considered an acceptable level for this technology, the sample needs to be an order of magnitude larger.


Accuracy can exceed 80% on a smaller sample as well, but then it is hard to speak about the reliability of the result: there is too little data left for self-testing, and whether the result can be trusted is something only practice can verify.


Understanding perfectly well that it will not be easy for some of our clients to collect 500 conversations for training, we did not limit the sample size when setting up the system.
But this means that on small samples the quality of training is unpredictable, even when the self-testing algorithm reports a good learning outcome. In other words, on a small sample only practical use of the classification algorithm will show its real quality.
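
To make the self-testing point concrete: a hedged sketch of estimating reliability on a small labeled sample with cross-validation, reusing the hypothetical model from the sketch above. With only a few hundred conversations, the spread between folds is exactly what makes the score hard to trust.

```python
from sklearn.model_selection import cross_val_score

# `model`, `transcripts` and `labels` are the hypothetical objects from the earlier
# sketch, imagined here at a few hundred examples rather than four.
scores = cross_val_score(model, transcripts, labels, cv=5)
print("mean accuracy: %.2f, fold-to-fold spread: %.2f" % (scores.mean(), scores.std()))
```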


And that quality of automatic classification is the most important parameter determining whether the algorithm is applicable in practice. Our customers expect quality close to 100%; otherwise the tool cannot help produce a reliable report on the quality of incoming calls.


Unfortunately, the technology has not matured to that level yet, but we are sure that a bright future is not far off. Still, even at the quite achievable accuracy of 80–85%, the tool is useful for operational monitoring and troubleshooting.


For example, suppose the target label is automatically assigned to 40–50 calls on an average day. If by the evening there are barely 30 such conversations, that is a reason to figure out what is going on. By selectively listening to the calls you can tell whether the classifier did a sloppy job, whether advertising is bringing in the wrong customers, or whether there is some other reason, say, competitors have cut their prices.
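
A minimal sketch of that kind of operational check; the counts and the threshold are illustrative, and in reality they would come from the PBX reports.

```python
def needs_attention(todays_target_calls, recent_daily_counts, threshold=0.7):
    """Flag a day when the number of 'target' calls drops well below the recent average.

    recent_daily_counts: per-day counts of target calls, e.g. for the last two weeks.
    """
    average = sum(recent_daily_counts) / len(recent_daily_counts)
    return todays_target_calls < threshold * average


# Example: 40-50 target calls on a normal day, barely 30 today.
print(needs_attention(30, [42, 47, 44, 50, 41, 45, 48]))  # True -> go listen to the calls
```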


Once you know about the problem, it is easy to take action and eliminate it before the consequences become critical: say, change the sales script or reconfigure an advertising campaign that has started bringing in junk calls.


If I had to do such a project again


We are now testing the classifier and preparing to present it to customers. If I were given a similar task again, I would devote more time to analysis.


I would advise developers to look more closely at possible complications when choosing a recognition engine. Yes, we got a working solution, but it is foreign to our environment, which is why we spent a lot of resources taming Windows services and getting the recognition algorithm to run in our pipeline, while also fighting for access to NFS.



Source: https://habr.com/ru/post/329456/

