Modern spam filters and end-to-end encryption

Hello

Trevor (translator's note: as far as I understand, this is Trevor Perrin) asked me to write down my thoughts on spam filters and end-to-end encryption, so that everything is gathered in one message rather than scattered across the forum. In particular, he asked me to dump my knowledge on the following topics:

How do spam filters now work in large mail services?
How will the widespread adoption of end-to-end (E2E) encryption affect this?
What can be moved to the client (and what are the resulting pros and cons)?
Is it possible to do this with e-mail?
What changes when moving from e-mail to other asynchronous systems (for example, chat) or to new protocols? That is, is spam a problem of the e-mail protocols or a more fundamental flaw?

I will briefly describe my experience in this field to establish my credentials: I (translator's note: page on Google+) worked at Google for seven and a half years. Of these, I spent 4.5 years on the Gmail security team, which is very closely tied to the anti-spam team (they use the same applications and the same alerting systems).
Around 2010 we dealt spammers a serious blow, and as a result they could no longer make money with their old methods. Some of them switched to hijacking accounts on an industrial scale using compromised passwords, and then sent spam from the hijacked accounts. I was the tech lead of the new team formed to fight account hijacking. We spent 2.5 years fighting it. In early 2013 we declared victory, and a few months later Edward Snowden revealed that the NSA / GCHQ had been tapping the security system we had developed.
Since then things seem to have calmed down. It is fair to say that, from Gmail's point of view, the war on spam has been won ... at least for the time being.
If you prefer video, a few years ago I gave a talk at the RIPE64 conference in Ljubljana: ripe64.ripe.net/archives/video/25
In January I left Google to devote all my time to Bitcoin. I am now working on a P2P crowdfunding application that will make it possible to raise funds through a decentralized structure.
So let's go.

Brief history of the spam war

In the beginning was ... the regular expression. Gmail does support regular-expression filtering, but only as a last resort, because regular expressions are easy to get wrong. We once blocked mail from an unlucky Italian named “Oli via Gra dina” (translator's note: spot the hidden Viagra, i.e. sildenafil?). On top of that, this approach handles internationalization poorly and is easy to defeat with randomization.
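The false positive described above is easy to reproduce. The following is a minimal sketch, assuming a hypothetical pattern that tolerates separators between letters; it is not the expression Gmail actually used, which the source does not give.

```python
import re

# Hypothetical pattern: match "viagra" even when the letters are separated
# by spaces or punctuation, a common obfuscation trick.
VIAGRA = re.compile(r"v\W*i\W*a\W*g\W*r\W*a", re.IGNORECASE)


def looks_like_spam(text: str) -> bool:
    """Return True if the naive pattern fires anywhere in the text."""
    return VIAGRA.search(text) is not None


# The pattern catches the obfuscated spam it was written for...
print(looks_like_spam("Cheap V-i-a-g-r-a here"))  # True
# ...but also an innocent name that happens to contain the same letters...
print(looks_like_spam("From: Olivia Gradina"))    # True
# ...and it misses a trivial character substitution.
print(looks_like_spam("Cheap V1agra here"))       # False
```

One pattern produces both a false positive and a false negative, which is why such rules only make sense as a last resort.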
Then the e-mail community began compiling lists of bad IP addresses and sharing them. That is how Spamhaus appeared. The approach paid off because it devalued the resources spammers had paid for. But fierce battles were fought over the lists, because the keepers of the blacklists became judge, jury and executioner over the flow of mail. It turned out that the question “what is spam and what is not” is highly contentious. Many bulk mailers did not consider themselves spammers, but in the absence of a clear definition they were sometimes blacklisted.
To get around RBLs (Realtime Blackhole Lists), spammers began using botnets. In response, the list operators mapped the Internet and created a Policy Block List (PBL): ranges of IP addresses that belong to residential subnets and therefore should not, in principle, be sending mail directly. Botnets generate incredible volumes of spam, but it is the easiest spam to filter. During my time on the Gmail spam and abuse team, botnets took up very little of our time.
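The idea behind a policy block list can be sketched in a few lines. This is an illustration only: the ranges below are reserved documentation networks, and `POLICY_BLOCK_LIST` and `is_policy_blocked` are hypothetical names, not part of any real list or service.

```python
import ipaddress

# Hypothetical policy list: ranges declared as residential / dynamic.
# These are reserved documentation networks, not real listings.
POLICY_BLOCK_LIST = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_policy_blocked(sender_ip: str) -> bool:
    """Return True if the connecting IP falls inside a listed range."""
    addr = ipaddress.ip_address(sender_ip)
    return any(addr in network for network in POLICY_BLOCK_LIST)


print(is_policy_blocked("192.0.2.44"))      # inside a listed range
print(is_policy_blocked("198.51.100.200"))  # outside the /25
```

The check needs no message content at all, which is part of why botnet spam is cheap to filter: the sending machine gives itself away before the first byte of the message is read.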
Then came webmail services such as Gmail. The very first version of Gmail simply used SpamAssassin, but this was quickly recognized as not good enough, and we built our own filter. The architecture of the Gmail filter was described in a 2006 paper: Sender Reputation in a Large Webmail Service.
I will briefly summarize the paper. The new filter's main technique was to heuristically guess the domain of the sender (domains are harder to obtain and more stable than IP addresses) and then compute a reputation for it. A reputation is a score from 0 to 100, where 100 is perfectly good and 0 is always spam. For example, a reputation of 70 means that we consider about 30% of that sender's mail to be spam and let the rest through. Reputations are moving averages computed from manual feedback via the “Report Spam / Not Spam” buttons and from the automatic verdicts of the filter itself. Naturally, manual reports carry far more weight, which lets the filters self-correct.
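A reputation of this kind can be sketched as an exponential moving average. The weights and the neutral starting score below are assumptions made for illustration; the source says only that manual reports count for much more than the filter's own verdicts.

```python
from dataclasses import dataclass

# Assumed step sizes: a manual click moves the score ten times further
# than an automatic verdict. The real weights are not given in the source.
ALPHA_MANUAL = 0.20
ALPHA_AUTOMATIC = 0.02


@dataclass
class Reputation:
    # 0 = always spam, 100 = perfectly good; 50 is an assumed neutral start.
    score: float = 50.0

    def update(self, is_spam: bool, manual: bool) -> None:
        """Move the score towards 0 (spam) or 100 (good) by a weighted step."""
        alpha = ALPHA_MANUAL if manual else ALPHA_AUTOMATIC
        target = 0.0 if is_spam else 100.0
        self.score += alpha * (target - self.score)


rep = Reputation()
rep.update(is_spam=True, manual=True)     # a user clicked "Report Spam"
rep.update(is_spam=False, manual=False)   # the filter judged a message clean
print(round(rep.score, 2))                # 41.2
```

Because each update is a convex step towards 0 or 100, the score always stays inside the 0 to 100 range, and recent feedback outweighs old feedback.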
This approach has another advantage: it sidesteps all the arguments over the precise definition of “spam”. The new definition is simply this: spam is whatever our users say is spam. It is hard to argue with that definition. It is also very easy to implement in practice, and it adapts quite flexibly to spammers' new tricks.
It is worth noting a few points:

The reputation system must be able to read all mail. Seeing only the spam is not enough, because then reputations cannot self-correct. The “Not Spam” button is just as important as the “Report Spam” button. Most “not spam” signals are implicit: they occur when a message is simply not marked as spam.
Reputation must be calculated quickly. If a message arrives with an unknown reputation, you have no choice but to let it through. This encourages spammers to try to outrun the learning system. The first version of the reputation system used MapReduce and calculated reputations in batches, so the delay was measured in hours. It was eventually replaced by an online system that calculates scores on the fly. That system is an incredibly impressive piece of engineering: in effect, a global, peer-to-peer, real-time learning system. There are no central nodes. The filter is distributed around the world and can survive the loss of several data centers.
I dread to think how one would build such a system outside a well-controlled environment. Even within a proprietary, centralized environment it took a lot of head-scratching ...
Reputation propagates between features (domains, links, IP addresses). If we know that a specific link is bad and it appears in a message from an IP address with an unknown reputation, that IP address also gets a bad reputation, and vice versa. This turned out to be important. As the number of features that feed into reputation grows, it becomes harder and harder for spammers to change them all at once. This is especially true of botnets, where precise control of the sending machines is difficult. If a spammer fails to randomize even one tiny aspect across all their messages at the same time, all their links and IP addresses are automatically tainted and they lose money.
Reputation has inherent problems. It needs a large number of users, so accounts must be free. If accounts are free, spammers can register many of them, mark their own messages as “not spam” and mount a Sybil attack. This is not a hypothetical problem.
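The way a bad score spreads from a known link to an unknown IP address, and from there to the next new link, can be sketched as follows. The averaging rule and the `ip:` / `link:` key format are assumptions chosen for brevity, not the rule the real system used.

```python
from typing import Dict, List


def propagate(reputations: Dict[str, float], features: List[str]) -> None:
    """Give unscored features of a message the mean score of its known ones."""
    known = [reputations[f] for f in features if f in reputations]
    if not known:
        return  # nothing to learn from; the message passes as "unknown"
    inherited = sum(known) / len(known)
    for feature in features:
        reputations.setdefault(feature, inherited)


reputations = {"link:bad-pills.example": 5.0}

# A known-bad link arrives from an IP address we have never seen.
propagate(reputations, ["ip:203.0.113.7", "link:bad-pills.example"])

# The spammer rotates the link but forgets to rotate the IP address.
propagate(reputations, ["ip:203.0.113.7", "link:new-shop.example"])

print(reputations["ip:203.0.113.7"])         # 5.0
print(reputations["link:new-shop.example"])  # 5.0
```

A single feature left unrandomized is enough to taint everything that appears alongside it, which is the effect described in the list above.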

The reputation system was designed to compute reputations over a number of message features besides the sender's domain. One such feature is the domains of clickable links in the text. Links became a critical battleground that was actively fought over for several years. The reason is clear: spammers need to sell something, so they need to bring the user to their store. Whatever they call their product, the link to the final site has to work. The battles went as follows:

It all started with simple links in the HTML code of letters. Filters began to block emails with such links.
Spammers began to obfuscate links, asking users to assemble the link by hand and type it into the address bar. This worked poorly: most users would not or could not do it, and revenues fell.
Spammers began to buy and create random domains in bulk. Domains under top-level domains such as .com are expensive, but others are cheaper, and the reputation of entire top-level domains (for example .cc) fell through the floor.
When registrars began to tighten the screws, spammers ran out of cheap top-level domains and turned to reputation hijacking. For example, they created blogs on sites that hand out subdomains: *.blogspot.com, *.livejournal.com and others. URL shorteners became spammers' best friends. Literally every URL shortening service became a battlefield between its operators and spammers over the domain's reputation.
Spammers started hacking web sites. This approach did not always work well, because few web sites send legitimate mail and have a good reputation. Hacked sites are, however, also a good source of passwords.
Large content hosting providers such as Google tie the spam filter into the hosting engine: as soon as the reputation of a user's URL drops, hosting for it is shut down automatically. The first versions of such systems were too slow. One of my projects at Google was building a real-time system for the automatic removal of such content.
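Using link domains as a reputation feature starts with extracting them from the message body. The sketch below uses only the Python standard library; scoring a message by its least reputable link, and the default of 50 for unknown domains, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkDomainExtractor(HTMLParser):
    """Collect the host names of all <a href="..."> links in an HTML body."""

    def __init__(self):
        super().__init__()
        self.domains = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                host = urlsplit(value).hostname  # None for relative links
                if host:
                    self.domains.add(host)


def link_domains(html_body):
    parser = LinkDomainExtractor()
    parser.feed(html_body)
    parser.close()
    return parser.domains


def worst_link_reputation(html_body, reputations, unknown=50.0):
    """Score a message by its least reputable link domain (0-100 scale)."""
    scores = [reputations.get(d, unknown) for d in link_domains(html_body)]
    return min(scores) if scores else unknown


body = '<p><a href="http://pills.example/buy">buy</a> <a href="#top">top</a></p>'
print(link_domains(body))
print(worst_link_reputation(body, {"pills.example": 3.0}))
```

Scoring by the worst link reflects the point made above: however the product is described, the link to the final site has to work, so one bad destination is enough to condemn the message.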

Between 2006 (when registration was opened) and 2010, a spam filter was built for account registration. We did a very good job, if I do say so myself. Look at the prices for “free” webmail accounts on buyaccs.com (a Russian account store). Note that hotmail / outlook.com accounts cost $10 per thousand, while Gmail accounts are much more expensive. When we started, Gmail accounts cost $25 per thousand, and we managed to raise the price fourfold. Improving on that is difficult, because all large web sites use phone number verification to deal with false positives, and at the current price level it becomes profitable to buy SIM cards in bulk.
A great deal of magic went into fighting mass registrations. For example, I created a system that generates randomly encrypted JavaScript designed to resist reverse engineering. The script can detect automated registration tools and block them [1].

How will end-to-end encryption affect all of this?

From my stories above we can draw the following conclusions:

Large volumes of data really matter, both for blocking spam and for recognizing good mail.
The response speed of the system matters. Many spam battles boiled down to “who is faster”. If it takes you 3 minutes to compute a reputation, you have already been outrun.
It is important to police your users. Reputation cannot be calculated without trust in user actions. This creates a seemingly paradoxical situation: free accounts still cost money (if you need a lot of them).

The first problem with E2E cryptography is that the reputation database needs data from all mail. We can imagine an e-mail client that decrypts and analyzes a message and then sends a “good / bad” report to some hypothetical central repository. But then that central repository learns not only who you communicate with but also which links appear in your mail. This is extremely valuable information. The more features you need to analyze, the more acute this problem becomes.
The second problem is that if the central repository cannot read your mail, it cannot verify that your reports are truthful. With unencrypted mail this problem does not arise, because the spam filter itself extracts the information it needs from the messages. If spammers want to game the system, they still have to send real mail to themselves, which raises their costs. In a world where spam filters cannot read mail, spammers can freely submit entirely fabricated reports of “good mail”. It gets worse, because spammers can also start competing with one another by submitting false negative reports. We saw something similar with our AdWords system.
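A toy calculation shows how quickly fabricated reports swamp honest ones when the server cannot check that a report corresponds to a real message. The report counts and the simple "share of good verdicts" scoring rule are assumptions made for illustration.

```python
def reputation_from_reports(reports):
    """Score 0-100: the share of "not_spam" verdicts among all reports."""
    good = sum(1 for verdict in reports if verdict == "not_spam")
    return 100.0 * good / len(reports)


# Assumed numbers: 100 real recipients, 90 of whom report the sender as spam.
honest = ["spam"] * 90 + ["not_spam"] * 10

# Sybil accounts submit 900 reports for mail that was never sent or read.
forged = ["not_spam"] * 900

print(reputation_from_reports(honest))           # 10.0
print(reputation_from_reports(honest + forged))  # 91.0
```

With readable mail, each of those 900 reports would have to be backed by a message the filter itself observed; with encrypted mail, each one costs the spammer nothing but an account.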
The third problem is that spam filters rely heavily on security through obscurity, because it works well. Some of the features used in the analysis are widely known (for example, the sender's IP address and links), but many others are kept secret. If the filtering logic moves to the client, spammers will be able to see exactly what they need to randomize to confuse the reputation system.
Perhaps these two problems can be solved with trusted computing. It lets you run encrypted programs over private data, with the hardware “proving” to the central server that the program was actually run. But combining security through obscurity with end-to-end encryption will be difficult: if your mail passes through a black box, that box can in theory steal its contents. You would have to trust something that computes secret features over your messages. In that case, why not simply trust Gmail today?
The fourth problem: anonymity and spam filters do not mix well. Ideally, spam should be cut off at the root, at the point where it is sent. Terminating accounts is a fundamental tool in the fight against spam. All major webmail providers and social services force users to pass phone number verification if the security filter raises the alarm. Usually a random code is sent by SMS, or a phone call is made to check that the user is real. This works because phone numbers cost money and almost all of us have at least one. But in many countries anonymous phone numbers are illegal, and operators must check identity documents before selling a SIM card. The fact that you can be de-anonymized with plausible deniability (translator's note: “plausible deniability” is a legal term) means that even if you provide no personal data at registration, the government can force you to reveal your location and / or identity at any time. They do not need to do anything special for this. If they can intercept your password, they can deliberately trigger the site's security system, wait for the user to enter a phone number, and extract all the metadata they need (I have never come across this in practice, but it is theoretically possible).
And the last problem: spam filters are demanding in CPU and disk storage. Many users today read mail exclusively on mobile phones. Smartphone resources are limited, and the harder they are loaded, the faster the battery drains. Merely powering up the radio and downloading a message already costs some battery. Even running the outdated anti-spam techniques of the 90s on a phone would probably be too much for it. Only some revolutionary breakthrough in battery technology could change that.
As a result, I see no realistic way to move spam filtering entirely back to the client side.

What happens if everyone moves from email to other messaging systems?

SMS spam is a good example here. There is not much of it, because the phone companies act as spam filters. Governments also try to help by introducing penalties for SMS spam to deter future offenders: sending a message, so to speak, to would-be criminals. E-mail spam boomed long before governments began to respond to it, so it is interesting to compare how the two systems approach the problem.
Applications like WhatsApp do not appear to suffer from spam. But I think that is more a sign of the good work of their anti-spam / abuse team. They are in the best possible position: fighting is a million times easier when there is a single center from which everything can be controlled and changed at any moment. You can terminate accounts and control the flow of registrations. Without a single control center, you have to rely on inbound filtering alone and suffer in silence if spammers find a way around your defenses. Besides, you usually do not control the clients.

General thoughts and conclusions

If you look at how the war against spammers was won, you will see an incredible effort sustained over several years. The analogy with war is apt: there were two opposing sides and many interesting battles, with clashes of tactics and weapons. I could go on telling war stories all day, but then this message would become very long.
Trying to replay this war in a world of total encryption would be like fighting blindfolded and handcuffed. You would be crushed within a minute.
That is why I think we need a fundamentally new approach. The first idea that comes to mind is to charge a fee for sending mail. But this is a bad idea for several reasons. The most obvious is that free global communication is one of humanity's greatest achievements, comparable to landing a man on the moon. A person in rural China can send me a message in a few seconds, for free, and I can reply, for free! Think about that for a second.
Another reason it would fail is that a fee erases the difference between spammers and legitimate bulk senders. Many companies send large volumes of mail that users actually want. Take Facebook, for example. If every message cost money, some honest and useful companies could not operate.
Another approach is some form of cash deposit. There is a protocol that lets you sacrifice bitcoins to miners as fees. You can then prove that you spent the money by signing a challenge with the key that did so. This lets you establish, very precisely, anonymous mailboxes from which you can then send as much mail as you like. There is also a way to compute reputation: spam / not-spam reports need to store only these proofs, and reputation values are then derived from the reports. Mail from senders who do not yet have a reputation can be held until volunteers check it. Another option is to allow cross-signing: a member with a good reputation can temporarily vouch for a message to raise its reputation, gaining reputation in return. Such a trusted member can verify the sender's authenticity in any way they like.
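The delivery rules in this proposal can be sketched as a small routing function. Everything concrete here is hypothetical: the threshold, the three outcomes and the name `route` are illustrative, and the deposit proof itself (a signed challenge) is left out because the source does not specify the protocol.

```python
from typing import Dict, Optional

# Assumed threshold, used both for delivery and for the right to vouch.
GOOD_REPUTATION = 80.0


def route(sender: str,
          reputations: Dict[str, float],
          voucher: Optional[str] = None) -> str:
    """Decide what to do with a message from a deposit-backed mailbox."""
    score = reputations.get(sender)
    if score is not None:
        # Known sender: the reputation decides ("reject" is an assumption).
        return "deliver" if score >= GOOD_REPUTATION else "reject"
    if voucher is not None and reputations.get(voucher, 0.0) >= GOOD_REPUTATION:
        # Cross-signing: a well-reputed member vouches for the message.
        return "deliver"
    # No reputation yet and nobody vouching: hold for volunteer review.
    return "hold_for_review"


reputations = {"alice": 95.0, "bulk-pills": 4.0}
print(route("alice", reputations))                    # deliver
print(route("newcomer", reputations))                 # hold_for_review
print(route("newcomer", reputations, voucher="alice"))  # deliver
```

Nothing in these rules requires reading the message body, which is what makes the scheme a candidate for encrypted mail.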
For this reason I am interested in projects at the intersection of Bitcoin and E2E messaging. I think the two are fundamentally related.
To summarize: I am known in the Bitcoin community for my radical ideas. For example, I have suggested that there is a trade-off between privacy and abuse. Many people in the cryptography community passionately reject this idea and (unfortunately) the person who dared to voice it. I hope the stories above show how I arrived at these conclusions. I think that striving for perfect privacy without considering how that privacy can be abused is a bad path for any system that wants to achieve widespread adoption.

Source: https://habr.com/ru/post/237745/
