From blacklist to machine learning. Antiphishing in Yandex Browser

Criminals who specialize in stealing passwords, bank card numbers and other personal information appeared back in the last century, and their number has only grown since. According to a Kaspersky Lab report, between 9% and 13% of its users in Russia encounter phishing. Microsoft estimates that phishing and other forms of identity theft cause $5 billion in damage each year. This is broadly consistent with our own observations and explains why every more or less popular browser has phishing protection based on “blacklists”. Yandex Browser has it too. So why invent anything else?

Safe Browsing

The most obvious way to protect users is to take a ready-made database of phishing sites, check visited pages against this "blacklist" and warn the user if there is a match. This idea underlies protection based on Safe Browsing technology, which has been part of Yandex Browser since its first release.

A little about how it works. The browser regularly downloads an updated list of bad sites, several megabytes in size. There are a great many dangerous sites and compression only goes so far, so instead of explicit addresses we store locally only the prefixes (that is, the initial part) of their hashes. Each visited site is checked against this local database. If a match is found, the prefix is sent to the server, which returns the full hashes; we check again, and if there is still a match, we show a warning. The chain looks long, but it runs in a fraction of a second, does not send a request for every page visited and, most importantly, protects the user.
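The lookup chain described above can be sketched in a few lines of Python. This is only an illustration under assumptions: SHA-256 and a 4-byte prefix are assumed values, `FakeServer` is a hypothetical in-memory stand-in for the remote service, and URL canonicalization is skipped entirely.

```python
import hashlib

PREFIX_LEN = 4  # bytes of each hash kept locally; an assumed value for this sketch


def full_hash(url: str) -> bytes:
    """SHA-256 of a URL (assumed to be already canonicalized)."""
    return hashlib.sha256(url.encode("utf-8")).digest()


class FakeServer:
    """Hypothetical stand-in for the remote service: prefix -> full hashes."""

    def __init__(self, bad_urls):
        self.requests = 0  # counts how often the "network" is used
        self._by_prefix = {}
        for url in bad_urls:
            h = full_hash(url)
            self._by_prefix.setdefault(h[:PREFIX_LEN], set()).add(h)

    def prefixes(self):
        """The compact list the browser downloads and stores locally."""
        return set(self._by_prefix)

    def full_hashes(self, prefix):
        self.requests += 1
        return self._by_prefix.get(prefix, set())


def is_dangerous(url, local_prefixes, server):
    h = full_hash(url)
    prefix = h[:PREFIX_LEN]
    if prefix not in local_prefixes:
        return False  # the common case: no request leaves the browser
    # Prefix matched: ask the server for full hashes and recheck.
    return h in server.full_hashes(prefix)


server = FakeServer(["phish.example/login"])
local = server.prefixes()
print(is_dangerous("phish.example/login", local, server))
print(is_dangerous("bank.example/login", local, server))
```

Only the first lookup touches the server; for the second one the local prefix check is enough to decide.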

Safe Browsing lists are updated using Yandex search and antivirus technologies, the details of which we cannot disclose for obvious reasons. Third-party developers can nevertheless use the results in their own products (including browsers) through our Safe Browsing API.

Protection based on lists of bad sites (whether Yandex Safe Browsing, Google's, or other equivalents) was for a long time the only method used in the browser industry. The problem is that modern phishers are not as slow as they used to be. Creating fake websites, publishing them and spreading spam through social networks were all automated long ago. By the time a new phishing page makes it into the database, it may well have already harmed someone. We needed to learn how to deal with the problem without definite knowledge about the site.

Password protection

Attackers actively use phishing to steal passwords for banks, payment systems, social networks and even server administration panels. How do you protect them if the browser does not yet know whether the open site is good or bad? Warn on every password entry and ask the user to confirm that this is the right site? That is not just intrusive but also useless in the long run. If a user confirms 100 times that the real Sberbank site is in front of him and not a fake, he simply will not check the site the 101st time, and by Murphy's law that will be the fraudulent one.

By the way, there is a common misconception that two-factor authentication on banking websites will keep your money safe even if you have fallen for phishing. It helps, of course, but not always. In our practice we have come across dangerous sites that, once the login and password were entered, managed to trigger an SMS from the real bank. The user typed the code from the SMS into the phishing page that was still open, and the attackers used it to gain full access to the account. But we digress.

The initial idea was quite simple: watch over the passwords already stored in the browser. If the user enters a password on a site that is clearly not the one it is saved for in the browser's password manager, stop and warn. The problem is that not everyone uses the built-in password manager. Even ordinary users who have never heard of LastPass, KeePass or 1Password are in no hurry to save their passwords, often preferring to type them from memory or from a notebook (a paper one, not Windows Notepad). And it is precisely this category of users that is most vulnerable to phishing, so such a simple solution would not do.

Relying on saved passwords made no sense, but instead of abandoning the whole idea we taught the Browser to remember hashes of the passwords users enter. Why hashes? Because they are enough to compare passwords, and storing hashes is safer than storing the passwords themselves. Of course, we provided an option to disable the feature for those who do not trust hashes either. Now, if a user has logged in at least once to, say, the real Alfa-Bank site, the Browser warns him when he tries to enter the same password on a phishing copy. It would seem we could go and drink champagne, but it is not that simple.
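To make the idea concrete, here is a minimal sketch of comparing password hashes across sites. It is not the Browser's actual implementation: the `PasswordGuard` class, the choice of PBKDF2 with a per-profile salt, the iteration count and the example domains are all assumptions made for illustration.

```python
import hashlib
import hmac
import os


class PasswordGuard:
    """Hypothetical helper: remembers password hashes per protected domain."""

    def __init__(self, salt=None):
        # A random per-profile salt (an assumption), so stored hashes
        # cannot be compared across installations.
        self._salt = salt or os.urandom(16)
        self._known = {}  # domain -> password hash

    def _hash(self, password):
        return hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), self._salt, 100_000
        )

    def remember(self, domain, password):
        """Called after a login on a protected site; only the hash is kept."""
        self._known[domain] = self._hash(password)

    def reused_from(self, domain, password):
        """Return the protected domain whose password is being typed on a
        different site, or None if there is nothing to warn about."""
        candidate = self._hash(password)
        for known_domain, known_hash in self._known.items():
            if known_domain != domain and hmac.compare_digest(candidate, known_hash):
                return known_domain
        return None


guard = PasswordGuard()
guard.remember("bank.example", "correct horse")
print(guard.reused_from("bank-login.example", "correct horse"))
print(guard.reused_from("bank.example", "correct horse"))
```

The first call reports the protected domain because the same password is typed on another site; the second returns nothing because the password is used where it belongs.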

Users' memory does not obey Moore's law, so many people prefer to come up with a single password for all sites. This is terrible for security, but it is the reality. If we had enabled password protection for all users on all sites, we would have invented not only good phishing protection but also a great way to scare away our audience. So by default the protection is enabled only for the sites most popular among fraudsters. For any other site, you can turn it on manually.

This feature was introduced about a year ago, and ever since it has not only protected against phishing but also drawn people's attention to password security. But passwords are not the only kind of confidential data that people love to steal.

Card protection

To steal money, you do not have to steal online banking passwords and work out how to get around two-factor authentication. You can simply steal bank card details. There is no need to worry about 3-D Secure either, since it is optional, and the user will not forget to enter the CVV code on the phishing page. Once the card data is stolen, all that remains is to figure out how to cash out. There are various ways. For example, someone sells tourists tickets at a 50% discount while actually buying them at full price with a stolen card. Such transactions can sometimes be disputed in time through your bank, with varying success, but it is better not to let things get that far and to protect your card details.

Unlike password protection, where we could track unambiguous password-site pairs, a bank card can be used anywhere. We can keep track of large sites, but we cannot cover the long tail of online stores. And what does “keep track” even mean? Forbid entering the card number? Warn, and then what? Realizing that it is hardly possible to reach an unambiguous verdict about a site's bad intentions at the browser level, we looked at the situation from a different angle: encryption.

An SSL certificate is a must for any website that handles confidential user data, bank data above all. If a resource asks for a card number but does not support encryption and works over HTTP, two different problems arise at once. First, someone can intercept your data from the unencrypted traffic along the way, for example through an unprotected Wi-Fi access point in a cafe. Second, the owner of such a resource at the very least does not care about the safety of its visitors, and may simply be stealing data. Either way, entering a card number on such a site is a bad idea. We already handle interception to some extent with the Wi-Fi protection feature, but channel encryption will not save you from a scammer. More precisely, it will protect the data from interceptors and deliver it intact to the phishers. Something had to be done about that.

So, we have narrowed the problem down. If a user visits an HTTP site that asks for a bank card number, that is a reason to warn. But to show a message, you first need to recognize that a card is being entered. Nobody has yet invented a dedicated bank-card type for the input tag, and few sites use the relatively new autocomplete="cc-number" attribute for browser autofill. The Chromium team, of course, has not given up on teaching the browser to fill in card numbers by itself and is even introducing a heuristic that guesses by field names and some other data, but it does not work everywhere. All in all, analyzing input fields is not an option. What we can do is watch the digits being entered. For example, if the user has entered 16 digits, we can assume this is a bank card. The problem is that this is not always so. Fortunately, there is the Luhn algorithm.

I think many people know that the last digit of a card number is there to verify the correctness of the whole number. The check itself is easily done with the Luhn algorithm, which is quite simple. In each pair of digits of a 16-digit card number, the first digit is multiplied by 2 (more generally, every second digit counting from the right, starting with the one next to the last). If the product is greater than 9, its digits are added together. Then everything is summed up. If the total is a multiple of 10, we have a bank card number. With about a 10% chance of error, since one in ten arbitrary digit strings passes the check by accident.
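For reference, the check described above fits into a few lines of Python. This is a generic sketch of the Luhn algorithm, not code from the Browser; the function name is made up, and the sample numbers are widely published test values, not real cards.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the right: position 0 is the check digit and is not doubled,
    # every second digit after it is.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True: a well-known test number
print(luhn_valid("4111 1111 1111 1112"))  # False: last digit changed
```

Subtracting 9 from a two-digit product is a common shortcut: for products from 10 to 18 it gives the same result as adding the two digits.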

The Luhn algorithm reduces the likelihood of false positives roughly tenfold. But there is a cheap way to reduce the error a little further: check the first digits of the number, which encode the card's payment system. If the number starts with 4, it is VISA. Anything in the range 51-55 is MasterCard. 34 or 37 is American Express. The same goes for some other systems. The probability of error never goes away completely, of course, but it drops to an acceptable level.

We taught the Browser to recognize the entry of a certain number of digits (from 15 to 19) and to check them with the Luhn algorithm and against the prefixes of known payment systems. All of this works completely locally: the browser neither sends nor stores the card number. If all the conditions are met, the user sees a warning.
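Put together, the local check might look roughly like the sketch below. It is an illustration, not the Browser's code: the function names are invented, separators are assumed to be spaces or hyphens, and only the three payment systems named in the text are covered (American Express numbers begin with 34 or 37).

```python
import re


def luhn_valid(digits: str) -> bool:
    """Luhn check for a string that contains only ASCII digits."""
    total = 0
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:  # every second digit from the right is doubled
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def payment_system(digits: str):
    """Guess the payment system from the leading digits, or return None."""
    if digits.startswith("4"):
        return "VISA"
    if 51 <= int(digits[:2]) <= 55:
        return "MasterCard"
    if digits[:2] in ("34", "37"):
        return "American Express"
    return None


def looks_like_card_number(text: str) -> bool:
    """Length, prefix and Luhn checks; everything happens locally."""
    digits = re.sub(r"[ -]", "", text)  # drop the assumed separators
    if not re.fullmatch(r"[0-9]{15,19}", digits):
        return False
    return payment_system(digits) is not None and luhn_valid(digits)


print(looks_like_card_number("4111 1111 1111 1111"))  # True
print(looks_like_card_number("1234 5678 1234 5678"))  # False: unknown prefix
```

Each filter is cheap on its own, and combining them is what brings the false positive rate down.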

We show the same message in a number of other dangerous situations, for example if the site itself is protected by HTTPS but the number is entered in an HTTP frame, or if the site's certificate is not valid.

There are also situations that are so widespread and relatively safe that a warning is not warranted, but users should still be able to find out what is going on. For example, the card form may sit in a frame on another domain (with both the site and the frame served over HTTPS). This happens all the time, because there are many online stores and not all of them can build their own payment module, so they embed frames from popular payment systems instead. Another example: the site does not use encryption but accepts the card through an HTTPS frame on its own domain. In such situations the Browser does not show a warning but adds a card icon to the address bar. Click it to find out exactly whom you are entrusting your data to.
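The cases listed in this section can be summarized as a small decision function. This is a sketch with invented names (`CardEntry`, `Frame`, `indication_for`), and one case is an assumption on my part: the text does not say what happens for an HTTP page with an HTTPS frame on another domain, so it is treated here like the other frame cases and gets the icon.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Indication(Enum):
    NONE = "none"
    ICON = "icon"        # card icon in the address bar
    WARNING = "warning"  # explicit warning message


@dataclass
class Frame:
    https: bool
    same_domain: bool


@dataclass
class CardEntry:
    page_https: bool
    certificate_valid: bool = True
    frame: Optional[Frame] = None  # None: the field is in the page itself


def indication_for(entry: CardEntry) -> Indication:
    # Is the place where the number is typed actually encrypted?
    field_https = entry.page_https if entry.frame is None else entry.frame.https
    if not field_https or not entry.certificate_valid:
        return Indication.WARNING
    # Encrypted frame, but on another domain or inside an unencrypted page.
    if entry.frame is not None and (
        not entry.frame.same_domain or not entry.page_https
    ):
        return Indication.ICON
    return Indication.NONE


print(indication_for(CardEntry(page_https=False)))
print(indication_for(CardEntry(page_https=True, frame=Frame(https=True, same_domain=False))))
```

The first call models a plain HTTP page and yields a warning; the second models an HTTPS store with a third-party HTTPS payment frame and yields only the icon.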

All of the protection described above revolves around the presence of an SSL certificate. This is justified for now: most users have not yet got into the habit of paying attention to the lock in the address bar, so phishers have little motivation to use certificates. But things are gradually changing. Installing a free certificate from, say, Let's Encrypt is no longer a problem. So sooner or later we will be back in a situation where users have to be protected somehow but there is not enough data on the client. In order not to miss phishing sites in the future, we have started preparing now.

Machine learning

Any site on the Internet has a set of characteristics by which it can be assessed: audience size, age, the SSL certificate and how trustworthy it is, or even the uniqueness of the address (phishers like addresses that closely resemble well-known ones). Our trust, and yours, in a particular site is largely determined by them. An experienced user looking at an unfamiliar site can decide for himself whether it is credible. For a computer it is much harder. The task of determining "suspiciousness" is hard to formalize and does not fit into simple algorithms. Clearly, a gross HTTPS error is a strong criterion, but I am talking about far less obvious cases. And here we cannot do without machine learning.
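One of the factors mentioned above, how closely an address resembles a well-known one, is easy to illustrate in isolation. The sketch below uses plain edit distance; this is my own choice for the example and says nothing about how the factor is really computed. The function names, the threshold and the domains are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character
                current[j - 1] + 1,            # insert a character
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def lookalike_of(domain, popular_domains, max_distance=2):
    """Return the popular domain this one imitates, or None.

    An exact match (distance 0) is the real site, not a lookalike.
    """
    for known in popular_domains:
        if 0 < edit_distance(domain, known) <= max_distance:
            return known
    return None


popular = ["bank.example", "shop.example"]
print(lookalike_of("bamk.example", popular))  # bank.example
print(lookalike_of("bank.example", popular))  # None
```

A single signal like this produces plenty of false alarms, which is why it makes sense only as one factor among many.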

Yandex has been using machine learning for years. Our technologies are used not only within the company (Search, Music, Market, Zen) but are also available to external customers through Yandex Data Factory. Machine learning is what allows a computer to exhibit behavior that was not explicitly programmed into it. And it fits our task, warning users who are about to pay on suspicious sites, perfectly.

To teach the machine to find suspicious sites, we have to show it examples of sites known to be bad. That is no problem for us thanks to Safe Browsing technology. We also point it to the characteristics (factors) mentioned above that are worth paying attention to. From there our machine learning method, Matrixnet, learns to derive patterns and build formulas that take a site address as input and return a verdict as output.

Among all the factors, I would like to single out one. Ordinary users, who fall victim to phishing more often than others, go mainly by the appearance of a site and do not always look at its address, the lock and other details. Attackers take advantage of this. A distinctive feature of most phishing sites is that they copy the design of popular resources. So, using computer vision, we taught the machine to compare the appearance of pages with samples of popular sites. If it finds a match, this is a strong signal of a possible threat.

The results of machine learning and computer vision are available to users of Yandex Browser starting with version 16.9.1. When the user enters a card number on a site, the browser sends the server a request identifying the page. If there is a risk, the user sees a warning.

It may seem strange that we only show a warning, and only in response to card entry, rather than a full-screen block as soon as the page loads. The reason is that the machine learns to identify sites where there is a risk of losing money, and it would be wrong to block access to all the information on them. Besides, the probability of a false positive verdict is never zero.

If you have read this far, you already know that phishing can (and should) be fought with a range of entirely different technologies. Unfortunately, not everything depends on them. Users' own knowledge and experience largely determine how vulnerable they are to attackers. But we believe that if you draw attention to the problem, talk about threats right in the warnings, and use the Browser to explain why it is important to use different passwords and not to enter card numbers on sites without encryption, then people will eventually become more careful online.

Source: https://habr.com/ru/post/309808/
