"The subscriber is not a subscriber - please leave your message after the beep!" We hear this automated reply all the time and are used to hanging up, knowing that no one ever checks voicemail. I, like everyone I asked, could not even check mine without Googling how! So why do operators need this strange thing? To charge for calls that would otherwise be free. And it is not only ordinary subscribers who pay, but also companies that place automated calls. Imagine a store that confirms orders not through a call center within half an hour, but with a robot within ten seconds. Some of those calls "reach" voicemail, spending the company's money and skewing its statistics. Below the cut is a detective story about Early Media, big data, machine learning and TensorFlow.
What kind of "free calls"?
Telephony is a fairly old field, with plenty of "historical" quirks and technical decisions made twenty years ago. Take monetization: operator "A" pays operator "B" for the duration of a call to a number served by operator "B". That is where "All incoming calls are free!" comes from: operators are paid for calls to their subscribers. I remember there even used to be tariffs that paid the subscriber extra for incoming calls!
This scheme has pros and cons. If incoming and outgoing traffic is roughly equal, then "no one owes anyone". More incoming calls, and the operator earns. More outgoing calls, and it spends. Operators want to earn, so they try as hard as they can to maximize incoming traffic and minimize outgoing. One of the mechanisms for keeping costs down is the agreement on "Early Media".
Early Media - when the subscriber is not a subscriber
What happens when subscriber "A" calls subscriber "B" from one cell phone to another? A lot of things, but simplified as far as possible: operator "A" sends operator "B" a call request using the text-based SIP protocol, and operator "B" starts looking for subscriber "B" through its towers (in reality via SS7 over PRI, but let's not dwell on sad things). So that subscriber "A" does not sit in silence in the meantime, and so that operators can sell all kinds of "replace your ringback tone" services, the operators agreed on the "Early Media" state: while operator "B" is looking for its subscriber, it can reply over SIP with "early media" and start transmitting audio over RTP. Ringback tones, music, or "sorry, the subscriber is not a subscriber".
The operators also agreed that early media is not billed as an incoming call: operator "A" does not pay operator "B" for the music or the beeps. And so that no one cheats, they also agreed that in the early media state audio flows only toward the caller and that such a call is cut off after 60 seconds. Even with these limits there are craftsmen who manage to do something useful in early media on "free" 8-800 numbers, but that is a separate story. Ours is about voicemail.
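The billing rule above can be sketched in a few lines. This is a minimal illustration, not any operator's actual billing logic: it assumes early media is signalled with the standard SIP `183 Session Progress` response and an answered call with `200 OK`; the `SipEvent` type, the `status == 0` "call ended" marker and the sample timelines are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

EARLY_MEDIA_LIMIT_S = 60  # cut-off the operators agreed on, per the text


@dataclass
class SipEvent:
    t: float     # seconds since the call request (INVITE) was sent
    status: int  # SIP response code; 0 is a hypothetical "call ended" marker


def billable_seconds(events):
    """Seconds operator A pays for: only the time after the call is answered."""
    answered_at = None
    for ev in events:
        if ev.status == 200 and answered_at is None:
            answered_at = ev.t            # call accepted, billing starts
        elif ev.status == 0:              # call ended
            return 0.0 if answered_at is None else ev.t - answered_at
    return 0.0


def early_media_seconds(events):
    """Seconds of free audio (183 until answer or hang-up), capped at the limit."""
    start = next((ev.t for ev in events if ev.status == 183), None)
    if start is None:
        return 0.0
    end = next(ev.t for ev in events if ev.status in (200, 0))
    return min(end - start, EARLY_MEDIA_LIMIT_S)


# Hypothetical timelines: the caller hears "unavailable" and hangs up,
# versus a voicemail box that "answers" and records 20 seconds of silence.
unreachable = [SipEvent(0.5, 183), SipEvent(12.0, 0)]
voicemail = [SipEvent(0.5, 183), SipEvent(6.0, 200), SipEvent(26.0, 0)]
```

In the first timeline nothing is billed because the call never leaves early media; in the second, everything after the `200 OK` is charged even though no human was on the line.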
Voicemail as an "honest" way to take money
If the operator does not find its subscriber, it earns nothing on the incoming call. Telecom operators, like any commercial organization, love to earn money, so the brilliant idea of voicemail was invented. The phrase "leave a message after the tone" lets the receiving operator "answer" the call even when the subscriber is unavailable: dutifully record 20 seconds of silence somewhere and, most importantly, charge the calling operator for it. The most cunning ones do not even wait for the beep and answer the call immediately. Why leave money on the table?
What a human shrugs off is trouble for a robot
Mobile subscribers, as a rule, do not care about voicemail. For me personally it makes no difference whether the phone says "the subscriber is temporarily unavailable" or "the subscriber is temporarily unavailable, leave your message after the tone". Like all my friends, I hang up at the word "unavailable". And the fact that one operator will pay another a few cents for such a call does not interest me much.
It is quite another matter if I am Voximplant and an online store runs automatic order confirmation on our platform. Early media is free with us too, but for voicemail money is debited from the client's account at the rates of the operator whose number was called. Each amount is small, but multiply it by thousands or tens of thousands of calls per day and it is not so small anymore.
And automation is not limited to "call the buyer after they press the Buy button on the retailer's web page and ask them to press one or say 'I confirm' to confirm the order". There are automatic notifications about, for example, a concert ticket. The statistics show that the subscriber was reached and listened to the message, when in fact the message was "listened to" by voicemail. Or even worse: the automation calls customers to, say, discuss the details of a house cleaning they ordered. It plays the client a synthesized "hello, this is the robot of such-and-such company, calling about the cleaning you ordered, connecting you with an operator", plays the operator a synthesized "reached client so-and-so" and shows the order card in the CRM, and then the operator spends 20 seconds talking to the silence of a voicemail box.
First attempts to detect voicemail
We have been automating phone and video calls for a long time, so we started working on voicemail detection several years ago. What do all voicemail systems have in common? They all have a "beeeep" that comes between "leave your message after the tone" and the call switching from "early media" to "answered". The bad news is that everyone's "beeeep" is different. One beep or several, at one frequency or at two, with varying duration and pitch. On top of that, operators love to change this "beeeep" from time to time. I wonder why...
Our first implementation used the Goertzel algorithm to measure the "carrier" frequency, plus a heuristic that flagged voicemail when that frequency appeared in the audio stream. Alas, although this method worked, it had serious flaws. If an operator changed the beep pattern, the heuristic "broke" and we had to update it by hand for the new beep. False positives were much worse: "tricky" two-frequency signals were hard to tell apart from a human voice, and the detector reported voicemail where a real person had actually answered. Customers wanted reliability.
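For readers who have not met it, the Goertzel algorithm computes the signal power at a single target frequency, which is much cheaper than a full FFT when only one or two tones matter. Below is a minimal sketch of the idea, not our production detector: the 1000 Hz beep frequency, the 0.5 threshold and the `has_tone` heuristic are assumptions made up for the illustration.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Squared magnitude of the DFT bin closest to target_hz."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)   # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2     # second-order recurrence
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def has_tone(samples, sample_rate, target_hz, threshold=0.5):
    """Heuristic: does the target bin hold most of the signal's energy?"""
    total = sum(x * x for x in samples)
    if total == 0:
        return False
    n = len(samples)
    # For a pure tone on a bin, bin power == total * n / 2, so ratio ~ 1.
    ratio = goertzel_power(samples, sample_rate, target_hz) / (total * n / 2)
    return ratio >= threshold


def tone(freq_hz, sample_rate=8000, duration_s=0.1):
    """Synthetic test signal: a pure sine wave."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


beep = tone(1000)    # hypothetical voicemail beep
other = tone(440)    # some other sound on the line
```

Here `has_tone(beep, 8000, 1000)` is true and `has_tone(other, 8000, 1000)` is false. The weakness described above is also visible: the target frequency and the threshold are hard-coded, so any change in the operator's beep means editing the heuristic by hand.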
Deep Learning. Deep Learning Everywhere
Having failed with ordinary math, we decided to try multiplying matrices. After all, that is not just math, it is Deep Learning and Artificial Intelligence! We installed TensorFlow and the work got going: recordings of conversations and of voicemail greetings were fed to various models in the hope that they would find patterns invisible to us, such as characteristic pauses, flat intonation or a particular set of words.
The very first problem was the data: even a few seconds of voice at the "telephone" sampling rate of 8 kilohertz is tens of thousands of values. And the more complex the data a neural network is trained on, the more of that data it needs for an adequate result. To train a network on "raw" audio we would have needed labeled recordings of millions of calls.
So the data had to be preprocessed. We hooked up to Python specialized telecom libraries written in C/C++ that implement voice-processing logic: noise reduction, echo cancellation, carrier extraction and much more. After processing, a recording turned into a set of parameters, and the neural network was trained on those.
The results immediately became much more encouraging, and for the next six months we played IT alchemists: we tuned the model, the preprocessing of the input data and the post-processing of the model's output so that voicemail could be detected from the first few seconds of a recording. The result turned out very well: now it is enough to answer a call with a flat, emotionless "The subscriber is temporarily unavailable" to trigger a notification that there is most likely voicemail on the other end. What to do with that information is up to each client, who decides in cloud-based JavaScript. For a programmer, using the detector looks like this:
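To give a feel for how much preprocessing shrinks the input, here is a minimal sketch. It does not reproduce our pipeline or the C/C++ libraries mentioned above; as a stand-in it uses two textbook per-frame features, RMS energy and zero-crossing rate, and the 20 ms frame length and the 440 Hz test signal are assumptions for the illustration.

```python
import math

SAMPLE_RATE = 8000  # "telephone" sampling rate, samples per second


def frame_features(samples, frame_ms=20):
    """Turn raw samples into one (rms, zero_crossing_rate) pair per frame."""
    frame_len = SAMPLE_RATE * frame_ms // 1000   # 160 samples at 20 ms
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Loudness of the frame.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Fraction of neighbouring sample pairs where the sign flips.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((rms, crossings / (frame_len - 1)))
    return features


# Three seconds of synthetic audio standing in for a call recording.
three_seconds = [
    math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(3 * SAMPLE_RATE)
]
feats = frame_features(three_seconds)

raw_values = len(three_seconds)       # 3 s * 8000 samples/s
feature_values = len(feats) * 2       # two numbers per 20 ms frame
```

Three seconds of audio is 24,000 raw values, but only 150 frames, that is 300 feature values. A model that sees a few hundred meaningful numbers per call needs far fewer labeled examples than one that has to discover structure in tens of thousands of raw samples.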
Machine learning is a good tool when a task is hard to formalize with "ordinary" math and if statements. But be ready to play the alchemist: preparing the data, choosing a model for the neural network and interpreting the results are areas with few established best practices, and you can spend months, if not years, finding a solution that works.
And you need labeled data. Lots of labeled data. A whole lot of labeled data. But that is a topic for a separate post.