Anyone who has posted their phone number online, filled out a dubious offline questionnaire, or simply had the bad luck to end up in one of the many databases is familiar with telephone spam. Today we will tell Habrahabr readers how, with the help of user feedback and machine learning, we taught the Yandex app to warn about unwanted calls.

A call from an unknown number is always a difficult choice. Is it the long-awaited courier or yet another operator with a “unique” promotional offer? There are mobile apps that address this using directories of well-known organizations, and they solve part of the problem. But the most aggressive spammers, dubious debt collectors and scammers never end up in such databases. What to do?
The idea of building our own number identifier came to us by chance. One of our colleagues caught our attention because he carried two phones. When an unknown number called his main phone, he typed that number into a search engine on the second device and looked for reviews online. This method can hardly be called convenient, but it inspired us, and we decided to automate it a little. We built a first prototype for Android that did the following: during an incoming call, it opened a window with a webview that loaded the search results for the caller's number. Great! We had managed to save one phone. But seriously, although it simplified the routine, it was of little real use.
Try typing any phone number into a search engine. You are guaranteed to find sites that hint they have reviews of the number. But if you click through, in most cases it turns out that the site simply generated pages for every possible number and has no reviews at all. Searching for information about an incoming call under these conditions takes too long and is ineffective. The only way to do it well is to give the answer right away. But that requires data.

Yandex has the Directory, a knowledge base about organizations that is kept up to date by both companies and users. It is the source of the organization information shown in Search and Maps. When our internal prototype of the mobile identifier first moved from plain search results to a verdict, the data came from the Directory. But this was not enough: too often calls come from numbers whose affiliation with a particular company is not advertised. To overcome this, we needed to collect additional feedback from the users who had received calls from these numbers.
We started simple. Since last summer, Yandex Search has invited users to leave feedback on a phone number they search for. It is a plain text field for free-form feedback. We did not restrict the answer to fixed options because we did not yet fully understand the variety of sources of unwanted calls. The problem is that analyzing free-form reviews is quite difficult to automate. We got around this with the help of the Toloka crowdsourcing platform, whose users helped us parse and classify the reviews.

This way we began to collect data not only about well-known organizations with a relatively good reputation, but also about spammers, fraudsters, aggressive debt collectors, pranksters, and even people who call and stay silent on the line. Not every category could simply be filed under unwanted calls, though. For example, calls from courier services are usually useful.
The Directory data and the first user reviews formed the basis of Yandex's number identifier, which launched last year in the web version of Search. Yandex began to answer many queries containing phone numbers with a verdict.
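As an illustration only, here is a minimal sketch of how a verdict could be assembled from a directory of organizations plus reviews that have already been classified. The data, the thresholds, the rule that the Directory takes priority, and the name `verdict_for` are all assumptions made for this example and do not describe the production logic.

```python
from collections import Counter
from typing import Optional

# Hypothetical stand-ins for the Directory and for reviews already
# classified by crowd workers; all numbers and labels are made up.
DIRECTORY = {"+74950000001": "Example Pizza Delivery"}
CLASSIFIED_REVIEWS = {
    "+74950000002": ["spam", "spam", "debt_collector", "spam"],
    "+74950000003": ["courier"],
}

MIN_REVIEWS = 3      # assumed: fewer reviews than this give no verdict
MIN_AGREEMENT = 0.6  # assumed: share of reviews that must agree


def verdict_for(number: str) -> Optional[str]:
    """Return a short verdict for a number, or None if nothing is known."""
    # Assumption: a known organization takes priority over user reviews.
    if number in DIRECTORY:
        return f"organization: {DIRECTORY[number]}"

    labels = CLASSIFIED_REVIEWS.get(number, [])
    if len(labels) < MIN_REVIEWS:
        return None

    # Take the most frequent label and require enough agreement on it.
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) < MIN_AGREEMENT:
        return None
    return f"users report: {label}"
```

Requiring a minimum number of agreeing reviews is one simple way to avoid issuing a verdict on the strength of a single opinion.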

Soon an early version of the number identifier was built into the Yandex.Maps app. It relied only on the Directory, since there were still too few reviews in the other categories for it to work well. This brought us to the next stage in the identifier's development: reviews had to be collected on the mobile device, right after calls from unknown numbers, rather than waited for on the web. But how? The first internal attempts to ask for feedback after every call ran into problems. Requests that come too often annoy users. Moreover, if any user can leave feedback on any incoming call, that invites and simplifies review manipulation. We needed to act smarter.
Yandex specializes in machine learning. With its help, Search ranks results, the Browser detects malicious sites, and Music recommends tracks. Machine learning lets us find non-obvious patterns across a large number of heterogeneous factors. So we applied it in the new version of the number identifier, which now works in the Yandex app for Android. Our technology, based on the CatBoost library, analyzes more than two hundred factors when deciding whether to ask for a review, for example the frequency and duration of calls. For obvious reasons we will keep quiet about the other factors, but this solution has made the requests less intrusive and made it as hard as possible to manipulate the reviews.
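To make the idea of a factor-based decision concrete, here is a deliberately simplified sketch. It does not use CatBoost: a hand-written logistic score stands in for the trained gradient-boosted model, and it looks at only the two factors named above. The weights, the threshold, the field names and the function names are all invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class CallFactors:
    # Only call frequency and duration are named in the article; the
    # exact definitions here are assumptions, and the real model uses
    # more than two hundred undisclosed factors.
    calls_from_number_last_week: int
    duration_seconds: float


def feedback_score(f: CallFactors) -> float:
    """Stand-in for the trained model: returns a score in (0, 1).

    The weights are invented and carry no real meaning.
    """
    z = 0.4 * f.calls_from_number_last_week - 0.02 * f.duration_seconds - 0.5
    return 1.0 / (1.0 + math.exp(-z))


def should_request_review(f: CallFactors, threshold: float = 0.5) -> bool:
    """Ask for a review only when the score reaches the threshold."""
    return feedback_score(f) >= threshold
```

With these made-up weights, a short call from a number that rings often scores high, while a single long conversation scores low. A real model would learn such relationships from data rather than have them written by hand.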
A few words about how it works now. If a user of the Yandex app has enabled the identifier in the settings, then on a call from an unknown number a request is sent to our cloud, which returns the verdict.
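The flow just described can be sketched roughly as follows. This is a simplified model rather than the app's code: it assumes that "unknown" means "not in the user's contacts", and `handle_incoming_call`, `fake_cloud` and the sample data are hypothetical names introduced for the example.

```python
from typing import Callable, Optional, Set


def handle_incoming_call(
    number: str,
    identifier_enabled: bool,
    contacts: Set[str],
    lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the text to show on the call screen, or None to show nothing."""
    if not identifier_enabled:
        return None  # the user has not enabled the identifier
    if number in contacts:
        return None  # known number (assumed rule): no request is sent
    return lookup(number)  # hypothetical cloud call; may return None


def fake_cloud(number: str) -> Optional[str]:
    # Stand-in for the network request; the data is made up.
    return {"+74950000002": "Possibly unwanted call"}.get(number)
```

Passing the lookup in as a function keeps the decision logic separate from the network layer, so the flow can be exercised without a real backend:

```python
handle_incoming_call("+74950000002", True, {"+74951111111"}, fake_cloud)
```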

By the way, the verdict is also available for missed calls. This is convenient when you do not know whether to call back.
If Yandex does not know exactly who is calling, the user may see a request for feedback when the call ends. Whether this request appears depends on the analysis of all the factors in the cloud.

We are now collecting new reviews, which will inevitably shape how the number identification technology develops. If you have experience building such systems, or you see an alternative solution to the problem of telephone spam and other unwanted calls, we would be glad to discuss it. Thank you.