
Unfinished article on spam

It so happened that I had to deal with the problem of spam. Here is what I managed to figure out. It is a lot of text, mostly of a general nature.

Spam. Many people have unpleasant associations with this short word, and some system administrators even shudder at it. I think that by now every computer user has encountered spam and knows firsthand what it is.

So what do we call spam?
Spam is the sending of messages that users do not expect to receive, to a large number of recipients at once.

Mass scale is the defining feature of spam. Tellingly, spammers are not original in this approach. They merely follow a path that nature has worked out over millions of years of evolution.

A simple calculation shows that each cedar tree gives an average of one million viable seeds in its lifetime. However, the number of trees on our planet is not increasing, which means that the level of natural selection is about one in a million. It is this level that ensures full reproduction and stability of the species. (http://kedr.forest.ru/culture.html)

For example, in mackerel (Scomber scombrus), about 99.9996% of individuals perish during the first 50-70 days of larval life. Thus, out of a million spawned eggs (and one female mackerel produces up to half a million small eggs at a time, which float in the water column), only a few individuals live to maturity. Nevertheless, mackerel remains a very common fish, as its presence on the shelves of fish stores shows. (http://elementy.ru/news/430696)


Spammers can be placed in the same row: in trying to get a letter to a recipient, they too rely on sheer numbers. And, most sadly, this approach pays off.

According to reports, the percentage of users who made purchases advertised in spam messages (for example, by following the "Click here" link found in most such messages) was 11% in 2005 (http://www.technewsworld.com/story/44655.html), about 6% in 2006 (http://www.yale.edu/its/email/spam/whyspam.html), and 29% in 2008 (http://www.marshal.com/pages/newsitem.asp?article=748&thesection=news).
The last figure is absolutely unbelievable, but given that only 622 people were surveyed in that study, the results are probably far from accurate. Even if the result is inflated, it still means hundreds and thousands of people who prove every day that spam generates income for its senders.

On the other hand, there are numbers of a completely different order (http://habrahabr.ru/blogs/spam/44353/): "The real CTR of spam is 0.000008%". That is probably not the whole truth either. But the income is still there.

From this point of view, the only way to beat spam is not to respond to it. Or to make sure that spam messages do not reach those irresponsible recipients who still respond to them. Which, at this stage in the development of our civilization, is alas impracticable.

What is spam (http://ru.wikipedia.org/wiki/Spam)



Every spam mailing has a specific purpose; otherwise the money spent on it would be pointless. The content of the letters changes depending on that purpose.

The vast majority of spam messages are advertisements (for comparison, phishing emails or emails with viruses add up to no more than two percent of the total number of spam).

Among the most popular kinds of advertising (statistics for September): spam "for adults" - 28%, "Medicines; goods and services for health" - 19%, "Education" - 12%, "Replicas of elite goods" - 6%, "Leisure and travel" - 6%.

Advertising of spam services themselves has recently hovered around the 5% mark.

See what spammers themselves say about spam:
"Some people think that spam is an unethical method of advertising, but as practice shows, many people look through letters of this kind, at least out of curiosity. And surely among them is a potential client interested in the offer. Since spam mailing is a fairly inexpensive service, many customers, having assessed the effectiveness of this lever of influence on potential consumers, become regular customers and recommend this method of advertising to their colleagues." (http://www.direct-mail-reklama.ru)


Spread



Technically, spam is distributed mainly by email. The weaknesses of mail protocols designed at the end of the last century, the ease of writing mass-mailing software, and the availability of address databases leave plenty of room for activity here. For example, any beginner can read the spam FAQ at forum.antichat.ru/thread58130.html or get a more or less tolerable answer to a question of interest at forum.antichat.ru/thread72829.html .

The mailing process itself has also changed with technical progress: from direct mailings by the spammers themselves to hacking users' computers and building specialized botnets (http://www.viruslist.com/en/spam/info?chapter=156608519).

Recently, spam in instant messengers, on forums, and on blogs has become widespread. For example, www.xakep.ru has a note titled "Microsoft ranks 5th in the list of the most spam-tolerant providers."

An example with classmates. Personally, I received a message from a classmate.
Hi, help please, there is no money in the account (vote for me please, send SMS to number 3649 with the text "XX 222761" (10 rubles worth)


In addition to using various carriers, spammers show an astounding wealth of imagination in the form of their messages: from sending the message as a picture to littering the text with junk in order to throw off automatic filters.

Spam issues



The main types of harm caused by spam (http://beskov.ru/2006/05/16/spam-harm/).

1. Load on network channels: according to the latest data, about 80% of the emails sent over the Internet are spam and viruses. The increased load raises the risk of failures and the cost of transferring obviously unnecessary data.
2. Clogging of space on mail servers: in many cases spam can be detected and removed, but this does not always happen.
3. Load on the computing power of the mail servers that perform spam filtering.
4. The cost of staff who configure servers, clean out spam, and tune anti-spam filters (lost working time).
5. Clogging of space on users' machines: if the user collects mail with a client program, in many cases spam lands directly on the machine and is stored there until it is deleted.
6. The time users spend viewing and deleting spam from their inboxes.

In particular, for the period September 15-21, 2008, the share of spam in mail traffic was 80.5% (www.spamtest.ru/document.html?context=15946&pubid=208050461).

In addition, do not forget about the moral side of spam. When every third letter in the mailbox is spam, it can have a depressing effect on the psyche, or at the very least it spoils users' mood. A poor filter forces them to regularly clean out the spam that has leaked into the mailbox, and also to look through the junk folder for false positives.

For Russia, the total damage to all victims of spam exceeds $200 million a year, while the income of spam companies, even by the most immodest estimates, may reach several million dollars a year.

Spam Fighting



How can spam be resisted? So far, progressive humanity has not come up with very many ways. By the way they "work" with spam, methods are divided into blocking and filtering. Blocking means ignoring any message from a blocked host. Filtering means discarding messages that match the definition of spam after their content has been analyzed (that is, the recipient still has to accept at least part of the message).

By how they are deployed, methods can be divided into local and distributed. Distributed methods let you "learn from the mistakes of others."

Any method of countering spam can be described by a number of characteristics, which we will use to evaluate and compare the methods.

- Efficiency. The most important parameter, expressed as a percentage. It is defined as the share of spam messages that are correctly identified and hidden from the user. If we subtract this value from one hundred, we get the rate of "false negatives", that is, the percentage of spam messages that still reach the user.
- False positives. "Clean" letters that were classified as spam or otherwise did not reach the user.
- Impact on network channels (how far the method reduces the load by reducing the number of spam messages transferred).
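
As an illustration of the first two characteristics, here is a minimal sketch of how they could be computed from the counts of a test run. The function name `filter_metrics` and its four parameters are hypothetical, introduced only for this example; the false-positive rate is taken here as a share of all clean letters, which is an assumption about the exact definition.

```python
def filter_metrics(spam_caught, spam_missed, ham_blocked, ham_delivered):
    """Return (efficiency, false_negative, false_positive) as percentages.

    All four arguments are plain message counts from a hypothetical test run.
    """
    total_spam = spam_caught + spam_missed
    total_ham = ham_blocked + ham_delivered
    # Efficiency: share of spam that was correctly hidden from the user.
    efficiency = 100.0 * spam_caught / total_spam
    # False negatives: spam that still reached the user (100 minus efficiency).
    false_negative = 100.0 - efficiency
    # False positives: clean letters that were wrongly withheld (assumed definition).
    false_positive = 100.0 * ham_blocked / total_ham
    return efficiency, false_negative, false_positive


# Made-up counts, only to show the call.
efficiency, false_negative, false_positive = filter_metrics(
    spam_caught=960, spam_missed=40, ham_blocked=5, ham_delivered=995
)
```

With these made-up counts the filter has an efficiency of 96%, lets 4% of spam through, and withholds 0.5% of clean mail.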

So, let us briefly list the main methods of combating spam. Blocking methods include:

"Black" lists. They radically address the traffic problem: traffic can be cut by about half. However, the effectiveness of this method is far from ideal: if the filter removes about 50 percent of spam, the share of legitimate messages deleted is about 30 percent. This hurts honest companies.

The other extreme is whitelists, which ignore all mail from senders not on the list. This does not solve the traffic problem. The efficiency is 100% :) The false positive rate, unfortunately, tends to the same limit.

There are also "gray" lists, the next step in the evolution of blacklists:
The greylisting method is based on the fact that the "behavior" of software designed for sending spam differs from the behavior of regular mail servers: spam software does not try to resend a letter after a temporary error, as the SMTP protocol requires. More precisely, when such programs try to get around the protection, on subsequent attempts they use a different relay, a different return address, and so on, so to the receiving side these look like attempts to send different letters. www.redcom.ru/isp/ispNews/netNews/ni1170203597
The rate of false positives drops to a few percent, which is pretty good.
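
The greylisting idea can be sketched in a few lines. This is a simplified illustration, not the code of any real mail server: the class name `Greylist`, the `min_delay` parameter, and the choice of (client IP, sender, recipient) as the key are assumptions made for the example.

```python
class Greylist:
    """Temporarily reject the first delivery attempt from an unknown triplet."""

    def __init__(self, min_delay=300):
        # Minimum number of seconds a sender must wait before retrying (assumed value).
        self.min_delay = min_delay
        # Maps (client_ip, sender, recipient) to the time of the first attempt.
        self.first_seen = {}

    def check(self, client_ip, sender, recipient, now):
        key = (client_ip, sender, recipient)
        first = self.first_seen.get(key)
        if first is None:
            # First attempt: remember it and answer with a temporary error.
            self.first_seen[key] = now
            return "451 try again later"
        if now - first < self.min_delay:
            # Retry came too soon: still a temporary error.
            return "451 try again later"
        # A proper retry of the same letter after the delay: accept it.
        return "250 ok"
```

A regular mail server retries with the same triplet after some minutes and gets through. Spam software that switches relay or return address on each attempt produces a new key every time and is rejected again.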

All the methods listed below are filtering methods.

The next group of methods has no official name. These methods answer the question of whether a letter is spam indirectly, from the behavior of the letters.

There are the following varieties:

Vaccination: the server first delivers a letter to only one user, who can report that it is spam (for example, by clicking a button in the interface of their email client). The server then "trains" its main filter, and the harmful letter does not make it into the second wave of delivery. This is sometimes called the "voting" method.

Trap (honeypot). By the way, this method (I do not know whether for the first time) was proposed by K. Kaspersky in his book "Notes of the Computer Virus Researcher". A certain number of "random" mailboxes are created that do not belong to real users. Spammers find such an address either because it was deliberately planted on some forum, or simply by brute-forcing letter combinations (abc@mosglavprodsnab.com). Since real mail never arrives at such a mailbox, anything that does arrive can be considered spam with almost absolute certainty. In practice, "In PC Magazine tests, SkyScan showed spam detection rate of 96% with a false positive rate of 0.48%" (http://www.lexa.ru/articles/distributed-antispam-2.html)

The checksum method. All mail passing through the mail system is scanned, and checksums of the letters are sent to a central server. If the flow of letters with the same checksum exceeds a certain threshold, the server considers this a sign of spam, which it duly reports to mail servers in response to their requests.
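
A minimal sketch of such a central server, assuming SHA-256 as the checksum and a fixed threshold. The class name `ChecksumServer` and both method names are hypothetical; real services such as Razor and DCC use their own protocols and signatures.

```python
import hashlib
from collections import Counter


class ChecksumServer:
    """Counts how many times each letter checksum has been reported."""

    def __init__(self, threshold=100):
        # Number of identical letters tolerated before they count as a mass mailing.
        self.threshold = threshold
        self.counts = Counter()

    @staticmethod
    def checksum(body):
        # SHA-256 is an arbitrary choice for this sketch.
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def report(self, body):
        """Called by a mail server for every letter that passes through it."""
        self.counts[self.checksum(body)] += 1

    def is_mass_mailing(self, body):
        """Answers a mail server's query about a particular letter."""
        return self.counts[self.checksum(body)] > self.threshold
```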

By the way, all these methods are effective only with a distributed architecture. In particular, Spamorez uses services such as Razor and DCC to determine whether letters are part of a mass mailing.

The main way of counteracting such systems is "randomization": sending copies of the original, each of which differs very slightly in word order or word set. This can be handled quite successfully by computing the checksum not over the entire content but over some part of it, such as randomly selected words, or by using one of the numerous algorithms for fuzzy text comparison.
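
As one possible illustration of computing a checksum over part of the content, the sketch below hashes only the few longest distinct words, ignoring their order, case, and punctuation. This particular selection rule is an assumption made for the example and is much cruder than real fuzzy-comparison algorithms; the function name `fuzzy_signature` and the `keep` parameter are hypothetical.

```python
import hashlib
import re


def fuzzy_signature(body, keep=5):
    """Checksum over the `keep` longest distinct words, independent of word order."""
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    # Longest words first; ties are broken alphabetically so the choice is stable.
    chosen = sorted(words, key=lambda w: (-len(w), w))[:keep]
    # Sort the chosen words so that reordering the text does not change the hash.
    canonical = " ".join(sorted(chosen))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = "Buy cheap replica watches today"
shuffled = "today buy replica watches, cheap!!"
same = fuzzy_signature(original) == fuzzy_signature(shuffled)
```

Here `same` is true, because both texts contain the same set of words. A spammer who swaps the long words themselves would still get past such a signature, which is why real systems use more elaborate schemes.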

Filtering based on the content of the letter.

Heuristic method. A set of templates is created, against which the content is compared. This is mostly done with regular expressions. Keeping the templates up to date is incredibly costly. Efficiency is about 70-80%, with many false negatives.
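
A heuristic filter of this kind can be sketched as a list of regular expressions with weights. The patterns, weights, and threshold below are invented for illustration and are not taken from any real rule set.

```python
import re

# (pattern, weight) pairs; all of them are made-up examples.
PATTERNS = [
    (re.compile(r"\bv[i1]agra\b", re.IGNORECASE), 3.0),
    (re.compile(r"click\s+here", re.IGNORECASE), 1.5),
    (re.compile(r"100%\s+free", re.IGNORECASE), 2.0),
]


def heuristic_score(text):
    """Sum the weights of all templates that match the text."""
    return sum(weight for pattern, weight in PATTERNS if pattern.search(text))


def is_spam(text, threshold=3.0):
    """Treat the letter as spam once the total score reaches the threshold."""
    return heuristic_score(text) >= threshold
```

The maintenance cost mentioned above is visible even here: every new spelling trick, such as "V-i-a-g-r-a", needs a new or updated pattern.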

Methods of artificial intelligence. All I know so far is that SpamAssassin assigns a rating to a letter using neural network algorithms.

Statistical methods, as the name already suggests, do not look at the meaning of words in emails. That is, it does not matter in what context the word "sex" is used: in an advertisement for intimate goods or in the "sex" field of an application form. The methods are based on calculating the probability that a letter is spam. The main tool for the calculation is the Bayes formula (http://ru.wikipedia.org/wiki/%D0%A2%D0%B5%D0%BE%D1%80%D0%B5%D0%BC%D0%B0_%D0%91%D0%B0%D0%B9%D0%B5%D1%81%D0%B0).

The fundamental work in the field of filtering is Paul Graham's article "A Plan for Spam" (http://paulgraham.com/spam.html), as well as its sequel, paulgraham.com/better.html . (Virtually any article that contains the word "spam" refers to Graham.)

In general terms, detecting spam requires the following steps:
- The text is broken into words (for an email message this should include the subject and some of the headers). Strictly speaking, words are a special case of breaking the text into parts called "tokens". Different ways of cutting the text can, for example, resist "littering" of the text.
- For each token, the frequency of its occurrence in spam messages is looked up (it is stored in a database built while the filter is being trained).
- A representative sample of tokens is selected (Graham chose the 15 most interesting ones, that is, those whose spam probability lies farthest from a neutral 0.5).
- The Bayes formula is used to calculate the probability that the letter is spam.
- If the probability exceeds a certain threshold (80-90%), the letter is filtered and automatically goes to the Spam folder.

Example 1. The word X occurs in messages marked as spam in 95% of cases, the word Y in 60%, so P(X) = 0.95 and P(Y) = 0.6.

The probability that a letter containing both words is spam:

$$P(\mathrm{SPAM}) = \frac{P(X)\,P(Y)}{P(X)\,P(Y) + (1 - P(X))\,(1 - P(Y))} = \frac{0.95 \cdot 0.6}{0.95 \cdot 0.6 + 0.05 \cdot 0.4} = \frac{0.57}{0.57 + 0.02} \approx 0.966$$

Example 2. The word X occurs in messages marked as spam in 50% of cases, the word Y in 60%, so P(X) = 0.5 and P(Y) = 0.6.

The probability that the letter is spam:

$$P(\mathrm{SPAM}) = \frac{P(X)\,P(Y)}{P(X)\,P(Y) + (1 - P(X))\,(1 - P(Y))} = \frac{0.5 \cdot 0.6}{0.5 \cdot 0.6 + 0.5 \cdot 0.4} = \frac{0.3}{0.3 + 0.2} = 0.6$$
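
The two calculations above can be reproduced with a short sketch that generalizes the formula to any number of tokens. The function names `combine` and `most_interesting` are hypothetical; the selection rule in `most_interesting` follows the description in Graham's article (tokens farthest from a neutral 0.5), and the sketch assumes Python 3.8 or later for `math.prod`.

```python
from math import prod


def combine(probabilities):
    """Combine per-token spam probabilities into one probability for the letter."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


def most_interesting(probabilities, n=15):
    """Pick the n probabilities that lie farthest from a neutral 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:n]


first = combine([0.95, 0.6])   # the first worked example
second = combine([0.5, 0.6])   # the second worked example
```

The results are approximately 0.966 and 0.6, matching the manual calculation. Note that a token with probability 0.5 does not change the outcome at all, which is why neutral tokens can safely be left out of the sample.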


In addition to the Bayes formula, the chi-squared distribution is also used. This distribution is convenient because it depends only on the bins (segments of the domain), and it lets us estimate the deviation not only from a uniform distribution but from a distribution of any kind. (This is something I myself do not understand very well. It would be great to hear a clear explanation.)

Hidden Markov models are a method based on a substantially more complex mathematical apparatus. They are used as an auxiliary method for finding "almost" identical texts. Unfortunately, this is a double-edged sword: besides recognition tasks, the method also copes with text generation. Besides spam, generated messages are actively used, for example, in LiveJournal.

The effectiveness of statistical methods depends strongly on what is fed to the input. For example, the naive approach, in which all words separated by spaces or line breaks are passed to the algorithm, breaks down badly if all spaces are replaced with underscores. Or if the spammer writes "Viagra", then a phone number, and then three pages of text from "War and Peace" in small print.

"Cutting" the text into parts (they are called "tokens") can be done in different ways, but the common idea is the same: the text is cut into parts that are combinations of words. For example, the OSB (Orthogonal Sparse Bigram) algorithm will cut the phrase "You are the Internet of the future", taken as the three tokens "you", "internet", "future", like this:
- you internet *
- you * future
- * internet future

You can read a little about comparing classifiers here: www.esi.uem.es/~jmgomez/papers/sigir07.pdf . It is also useful to learn about a project entirely devoted to the task of text classification, CRM114: crm114.sourceforge.net/wiki/doku.php?id=documents
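
The cutting of a phrase into sparse bigrams can be sketched as follows. This is a simplified illustration of the idea, not the CRM114 implementation: the function name `osb_tokens`, the window size of 5, and the use of "*" as a skip marker are assumptions for the example, and the output notation differs slightly from the list above.

```python
def osb_tokens(words, window=5):
    """Pair each word with each of the following words inside the window.

    Skipped positions between the two words are marked with "*".
    """
    features = []
    for i, word in enumerate(words):
        for gap in range(1, window):
            j = i + gap
            if j >= len(words):
                break
            skipped = ["*"] * (gap - 1)
            features.append(" ".join([word, *skipped, words[j]]))
    return features


features = osb_tokens(["you", "internet", "future"])
```

For the three tokens above this yields "you internet", "you * future", and "internet future". Each such pair is then fed to the statistical filter as a single token, so a spammer has to disturb word pairs rather than single words.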

What's next?



It is necessary to take the best from all methods, getting rid of the shortcomings. That is, for example, combine vaccination and automatic filtration.

From the point of view of the classification task itself, development can be seen along two perpendicular axes. On the one hand, there is the improvement of hardware and the acceleration of calculations, which will let letters pass more quickly through the set of existing filters and, accordingly, let the system respond to them sooner.
On the other hand, there is the development of the algorithms themselves. For example, adding OSB tokenization to a conventional Bayesian filter increases filter efficiency several times over.

Conclusion



Spam is ineradicable because it works. And as long as it works, that spurs the invention of new methods. There is no silver bullet. There is only an equilibrium that swings one way or the other. It is still up to us to maintain it...

Source: https://habr.com/ru/post/46235/

