
A do-it-yourself IP blacklist

Recently I posted an article about site security and, in particular, about the problem of captchas and the big question: is it possible to get rid of them, and if so, how.

The discussion was lively and very productive. As often happens with me, after analyzing the readers' comments I changed my opinion on a wide range of the issues raised in the article.

Here I would like to sum up that earlier, rather heated topic and outline the next steps I am going to take. They concern building your own blacklist of IP addresses. As always, I do not claim anything definitive; I only propose options.

Error Analysis


There was just one mistake, but a significant one. It turned out that it is not only theoretically but also practically possible to build a robot that is hard, or even impossible, to distinguish from a human. Such a robot can download the client-side JS code and click any buttons on the site. Since I had never encountered such creatures in my own practice, I was not entirely sure they did not exist at all; let us put it more softly, I wanted to test their effect on myself (on my website). After the publication of the article in question, anonymous but undoubtedly kind people gave me that opportunity. Although I did not manage to experience the full effect.
What can I conclude after getting acquainted with these robots? I cannot say that everything is good or everything is bad. Those who said that everything depends on the spammer's motivation and on the value of the site being spammed were definitely right. If the spammer does not mind spending time on the site, that is, if the spammer's resources (time, money, desire) are practically endless, then fighting such spam without a captcha is impossible or extremely unprofitable. What do I mean?

Attacks hit my site. I changed the name of the JS function that was executed when the ad's submit button was pressed, and the attacks stopped as if cut off. Could that be because the robot was programmed to call the script by name? It could. Could it then be reprogrammed to something else, for example to find the button and parse it? Yes, it could. Then, to fight that robot, I would have to close off more and more things, hide the button, resort to other labor-intensive tricks, and they could still be overcome. On the one hand this is sad, but on the other hand it would be much easier for me to install a tough, impenetrable captcha and close the question reliably, albeit at the users' expense.

But note: the robot was not reprogrammed! That means spamming my "spam collector" is not profitable. This clearly confirmed the thesis that the attacked site must be valuable enough to justify the effort.

Conclusion 1. There is no point in putting very sophisticated protection on sites that are still gaining popularity. It is better to try the "Ajax button" method I described in that article, or any similar method suggested in the comments. At the very least, such an approach will not scare away the users you do have and will not be a brake on conversion. Only once sophisticated attacks begin should you analyze the spammer's motivation and, with that in mind, look for countermeasures, the very last of which, in my view, is a difficult captcha that is unkind to people with poor eyesight.

Conclusion 2. My "allow me to enter" method has, by and large, proved useless. The same functionality can be implemented with far simpler and cheaper means, with nothing lost.

Conclusion 3. I now understand why Yandex puts a captcha on every data request in its keyword selection tool! I have to withdraw my resentment over that captcha and apologize virtually (since my resentment was also virtual).

What else would I like to say about the virtual-browser attack on my spam collector described above? There is something, and it falls into the category of "good news". The thing is that all the requests came from different IP addresses, and all of them were bad! What does "bad" mean? Some were addresses that had already visited my site and were marked as suspicious or dangerous; others I spot-checked on my favorite site, www.projecthoneypot.org, and many (most) of them were marked as dangerous there.

Conclusion 4. Marking IP addresses as dangerous or suspicious can help in the fight against spam. There are services that provide this data for free or for money. Free data will most likely not save anyone, because it is limited in volume, but paid data could be of significant benefit. Those who for various reasons do not want to spend money on such services could implement the service on their own. That is what I would like to reflect on below.

For which sites can such a service be useful?




What to do with the marked IP addresses?


IP addresses found to be harmful can have their rights restricted. For example, they can be shown a captcha when filling out forms: with a high degree of probability this is not a human. Posts from such IPs can be sent to moderation, and their moderation can be postponed to the very end of the queue. When moderating, you can show a link that marks the client's address as a spam sender. You can even include, directly in the notification e-mail to the site administrator about a received comment, a one-time link that marks the client.
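A minimal sketch of what such rights restriction could look like, assuming a hypothetical ip_marks table, a PDO connection in $db and the status names "dangerous" and "suspicious" (all of these names are my assumptions, not taken from the article):

function ip_status(PDO $db, $ip)
{
    // Look up the stored mark for this IP, or null if it has none.
    $stmt = $db->prepare('SELECT status FROM ip_marks WHERE ip = ? LIMIT 1');
    $stmt->execute(array($ip));
    $status = $stmt->fetchColumn();
    return $status === false ? null : $status;
}

$status            = ip_status($db, $_SERVER['REMOTE_ADDR']);
$showCaptcha       = ($status === 'dangerous');                                   // captcha on forms
$holdForModeration = in_array($status, array('dangerous', 'suspicious'), true);   // queue posts last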

Requests from bad IP addresses can simply not be processed: not normalized and not redirected to the correct addresses, which can noticeably speed up the site.

Let me explain the last statement with an example. Imagine a site of articles that has existed for many years. Long ago some of its articles were deliberately placed by the owner on certain other resources; a number of materials were also lifted with a link back to the original location. Those articles have been hanging on those resources ever since, it is not possible to come to an agreement with their owners, and the links to the articles have since changed. Meanwhile, requests keep coming to the site at the old addresses, and they have to be analyzed and redirected. At some point, while analyzing who exactly has to be redirected, very reasonable suspicions arose that the site is not working entirely for people: a great many visitors are robots, including robots you would by and large rather not let in at all. Here, too, marked IP addresses can help. After all, we can mark addresses not only as "bad" and "suspicious", but also as "desirable bot", "unwanted bot" and so on.


The accumulated IP address base can be shared for free, or you could even make a small business out of it.

On what grounds should IP addresses be marked?


Everything I write below is based on my personal experience, which, as it turns out, is not all that universal; it reflects the specifics of my experimental site. Still, I hope interested readers can use the ideas presented here as a "source of inspiration".

Requests for critical files of engines that are not used on the site


For example, I do not use the WordPress engine on my website, yet I receive requests for its configuration file or for the administrator login page.

Requests to server system folders such as "/../../../.." or to files such as "passwd"


Below is the real list of patterns I use to catch bad IPs. It is a very small list, but it works quite well.

$patterns = array(
    '#/wp-#',
    '#/browser#',
    '#/includ#',
    '#/engin#',
    '#admin#i',
    '#system#',
    '#/bitrix#',
    '#/cat_newuser#',
    '#/forum#',
    '#/common#',
    '#/plugins#',
    '#\.mdb/?#',
    '#\.aspx?/?#',
    '#^/BingSiteAuth#',
    '#passwd#',
);
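For illustration, here is a minimal sketch of how such a list could be applied to the requested URI, for example inside the 404 handler. The mark_ip_dangerous() helper is hypothetical; it would record the mark in the IP base:

$uri = $_SERVER['REQUEST_URI'];
foreach ($patterns as $pattern) {
    if (preg_match($pattern, $uri)) {
        // One match is enough: record the address and stop checking.
        mark_ip_dangerous($_SERVER['REMOTE_ADDR'], 'bad uri: ' . $uri);
        break;
    }
}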

Requests to CGI scripts if they are not used


Unfortunately, I cannot catch such requests via .htaccess (I do not yet know how), and there may be no access to edit the virtual host configuration. But these requests do end up in the error log, and the IP addresses that requested those files can be harvested from there.

Requests for pages that fell out of use during a site reorganization or for some other reason


Here is an interesting technique that I first used by accident and then, after watching the results, decided might be worth using intentionally. Suppose there is a registration page. It does not matter what it is called; let its URI be, for example, "/reg-new-user/". Robots started using this page: either trying to register, if there were no defenses such as the "Ajax button", or simply hitting the page far more often than real registrations happened. Then we change the URI of this page and set up no redirect from the old one to the new one. Requests to the old URI keep coming and coming, and they keep coming for years; looking back, about 8 years have passed. Logically, all IPs that knock on that address can immediately be flagged as dangerous. The result is a trap for harmful robots. By the way, the Bing search engine robot also keeps hitting addresses of this kind, and it is not a fake robot but the real one. Where does it crawl to pick up such addresses? Maybe it crawls and indexes secret hacker forums? A good question, to which I, unfortunately, do not know the answer.
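A minimal sketch of such a trap, assuming the retired URI is known in advance; mark_ip_dangerous() is again a hypothetical helper:

// The old registration URI no longer exists and deliberately has no redirect.
$trapUris = array('/reg-new-user/');   // the example URI from the text above
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (in_array($path, $trapUris, true)) {
    mark_ip_dangerous($_SERVER['REMOTE_ADDR'], 'hit retired registration uri');
    http_response_code(404);
    exit;
}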

An IP trap based on analyzing submitted form data


To set up this trap, the form needs a field that is filled in by a client-side script. That is, a real user working in a real browser loads a script which, when the form's button is pressed, fills the field with a very specific value. The field name should mean nothing to the robot, and the value should be encoded and preferably one-time; it can be generated from a timestamp processed by some algorithm. A robot will then either leave the field empty or fill it with the wrong value. An unexpected value in this field immediately flags the IP as dangerous. The method I described in the previous article under the codename "allow me to enter" works for me as such a trap.

One clarification is in order: this method does not work against virtual JS browsers, which can also be used to visit pages automatically and send spam.
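A minimal sketch of one possible token scheme: the client-side JS is assumed to put "timestamp-transform(timestamp)" into the hidden field when the button is pressed, and the server repeats the transform. The field name, the particular transform and mark_ip_dangerous() are all illustrative assumptions, and, as noted above, this only stops robots that do not execute JS:

function form_token_ok($value, $maxAgeSec = 3600)
{
    // Expect "1700000000-<hex of (timestamp XOR constant)>" produced by the client script.
    if (!preg_match('#^(\d+)-([0-9a-f]+)$#', (string)$value, $m)) {
        return false;
    }
    $ts       = (int)$m[1];
    $expected = dechex($ts ^ 0x5A17BEEF);   // the client JS applies the same XOR-and-hex transform
    return $m[2] === $expected && abs(time() - $ts) < $maxAgeSec;
}

$token = isset($_POST['xq_note']) ? $_POST['xq_note'] : '';   // the field name is arbitrary
if (!form_token_ok($token)) {
    mark_ip_dangerous($_SERVER['REMOTE_ADDR'], 'form token missing or wrong');
}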


Special traps


By this I mean traps created specifically for robots. For example, at some point I built a bulletin board. The tool did not justify itself and was disabled. But for a while it had been listed in bulletin-board catalogs and had gained a broad and loyal "clientele" that has kept working for years. So as not to "offend" these customers, I switched the tool back on, turning it into a kind of pier for robots. Whoever goes there is most likely not a person and can safely be marked as a spam-sending robot.

Unnatural requests


We are gradually getting to marking methods that are more theoretical than practical. Yes, there are obviously short bursts of requests from the same IP to different pages at high speed (up to several per second). Such bursts load the server heavily. They can be produced by harmful robots as well as by bots and spiders whose usefulness I have not yet figured out. Such strange bots identify themselves in the user agent and do not stand on ceremony on your site. Quite recently I witnessed a real situation where a robot identifying itself as "ahrefs" got into the URL for searching products by filters and brought a fairly large online store to a halt for a day, because that query was not optimized from the MySQL point of view. If the robot had not been blocked via .htaccess, it probably would never have stopped.

But how can such unnatural request patterns be caught at minimal cost? I have thought about it a lot and have not yet come up with anything better than manual marking and blocking based on access log analysis. Since the logs of a decent site can run to tens of megabytes, this too drifts into the realm of fantasy. Worst of all, the requested URLs may be perfectly real and work honestly, so they never land in any log of bad URIs. What remains is marking IPs by their user agent; I will cover that method as a separate point.
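For what it is worth, a crude per-IP counter could look roughly like this, using APCu as a cheap shared store. This is my own sketch, the threshold is arbitrary and mark_ip_suspicious() is hypothetical, so it only flags and does not block anything:

if (function_exists('apcu_inc')) {
    $ip  = $_SERVER['REMOTE_ADDR'];
    $key = 'req:' . $ip . ':' . (int) floor(time() / 60);   // one counter bucket per minute
    apcu_add($key, 0, 120);                                  // create the bucket with a two-minute TTL
    $count = apcu_inc($key);
    if ($count !== false && $count > 120) {                  // sustained ~2+ requests per second
        mark_ip_suspicious($ip, 'request burst: ' . $count . ' requests per minute');
    }
}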

Marking IPs by the content of the client's user agent header


HTTP_USER_AGENT is a header that is hard to rely on when catching bad clients. It is much easier to use it to mark good clients, for example the bots of the search engines you want, although fakes do occur. In any case, the IP addresses of search engine robots are best marked in bulk and by hand; to load pools of such IPs into the database you can use whois information. After that, any client presenting a Googlebot user agent from an IP that does not belong to Google can safely be marked dangerous.
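Besides whois, a common way to check a Googlebot claim is a reverse-then-forward DNS lookup; here is a rough sketch of that idea (mark_ip_dangerous() remains a hypothetical helper):

function looks_like_real_googlebot($ip)
{
    $host = gethostbyaddr($ip);                   // reverse DNS
    if ($host === false || $host === $ip) {
        return false;                             // no PTR record at all
    }
    if (!preg_match('#\.(googlebot|google)\.com$#i', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip;          // the forward lookup must point back
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'googlebot') !== false && !looks_like_real_googlebot($_SERVER['REMOTE_ADDR'])) {
    mark_ip_dangerous($_SERVER['REMOTE_ADDR'], 'fake googlebot');
}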

Then there are real but unwanted or unclear bots. That is, while investigating some incident, or analyzing access logs, or in some other way, we discover a bot that names itself in the user agent. Depending on how such a bot behaves on the site, we mark all of its IPs automatically or by hand, and act accordingly. For example, it can be denied access to certain pages altogether. On a store's product search page with a complex filter, no robot whatsoever, in my opinion, has any business being there at all.

There is also the idea of marking the IPs of all clients that have no HTTP_USER_AGENT header at all. The probability that an ordinary honest user will delete this header or otherwise tamper with it is quite small. There remains, however, the possibility of inept configuration or a technical failure on the client side, so I do not mark IPs with a missing user agent and do not plan to. But I do think about paying attention to them somehow, all the same.

Have there been attempts to plant a shell through the user agent in my practice? Yes! There was once an agent containing code for planting a shell. But again, most likely it was a one-off experiment by some specific individual from a normal IP address. There have also been agents with strange contents, for example: "() { :;}; echo Content-type: text/plain; echo; echo; echo M`expr 1330 + 7`H; /bin/uname -a; echo @" (judging by the "() { :;};" prefix, it looks like a probe for the Shellshock bash vulnerability). It is hard to say, though, what exactly it is and how it could be used.


Detecting a robot from the contents of other headers


For robots, HTTP_REFERER is usually synthetic. It may be absent, or it may coincide with the requested URI. Analyzing this header in the general case is hard, but there can be situations on the site where it too comes into play. For example, it looks suspicious when, on the very first visit to a page with a form, all the parameter arrays are empty yet HTTP_REFERER equals the URI of the page itself, as if it had been reloaded.

Virtual JS browsers usually send perfectly normal headers that give you nothing to find fault with; their headers are probably indistinguishable from those of real people.


Calling a script with the .php extension


Moving completely, 100%, away from extensions in URLs seems useful. Then any page request with a .php extension signals that a robot is knocking, and most likely an undesirable one.
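If the site's public URLs really never carry the extension, the check is trivial; a sketch with the same hypothetical helper:

// Flag direct requests for *.php on a site whose visible URLs are extension-free.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (is_string($path) && preg_match('#\.php$#i', $path)) {
    mark_ip_suspicious($_SERVER['REMOTE_ADDR'], 'direct .php request: ' . $path);
}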

Corrupted or incomplete requests, including requests without a trailing slash where one is required


When a required trailing slash is missing, the custom is simply to add it and issue a 301 redirect. Should clients with such an error be marked? In my opinion, if at all, then only for individual pages at the discretion of the site administrator. For example, calling an AJAX backend page without the trailing slash looks very suspicious.

But there are genuinely distorted requests containing a double slash. I thought for a long time about what to do with them and in the end decided neither to fix them nor to redirect. It seems to me this is unlikely to be a human. As an option, such requests could be answered with a special 404 page suggesting that the visitor check the address bar for a double slash and correct it.

There are also requests that are cut short, for example consisting of half the path: "/some-category/arti" instead of "/some-category/article/12345/". Unfortunately, "good" bots do this too, so if such clients are marked "suspicious" at all, it should only be done by hand.

As the owner of one second-level domain and a large number of related third-level domains, I have noticed an interesting feature of robot behavior: having encountered a certain URI on one third-level domain, they automatically try it on all the subdomains. Such clients, I think, can be marked right away. It is unlikely that a person would manually substitute a URI across all subdomains, and the motivation for doing so is far from obvious.

Requests where the script name is URL-encoded


If a script is called with its name fully or partially URL-encoded, that is extremely suspicious and clearly not aimed at anything good. In such cases the client can be marked dangerous right away. There have been no such cases in my practice yet.
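A sketch of one way to detect this, assuming the site never legitimately uses percent-encoded characters in its paths (for example, no non-ASCII slugs); the helper is hypothetical:

// If decoding the path changes it, the client percent-encoded characters
// that a normal browser would have sent as-is.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (is_string($path) && rawurldecode($path) !== $path) {
    mark_ip_dangerous($_SERVER['REMOTE_ADDR'], 'percent-encoded path: ' . $path);
}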

Access Method Mismatch


A site usually has cases where the method used to access a script must be strictly defined. For example, calling a form-processing script with GET when it expects parameters passed via POST can hardly be a human error. The same goes for calling an AJAX backend with the wrong method.

Such cases are suspicious, but I do not recall ever having met one of them in reality.
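Catching the mismatch itself costs almost nothing; a minimal sketch for a POST-only handler (the marking helper is hypothetical):

// At the top of a form-processing script that should only ever receive POST.
if ($_SERVER['REQUEST_METHOD'] !== 'POST') {
    mark_ip_suspicious($_SERVER['REMOTE_ADDR'], 'wrong method for POST-only handler');
    http_response_code(405);   // Method Not Allowed
    exit;
}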

Direct calls to the AJAX backend


The point is that a site may have scripts that are only ever called by other scripts, and a person never calls them by typing the script's address into the browser's address bar. To catch anything with this method, the AJAX backend scripts must be separate; they must not be one index.php with a pile of parameters. Unfortunately, separate AJAX backend scripts are not implemented always and everywhere. I have implemented this approach and I am watching for the described cases, but I have not caught a single one yet.
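One rough heuristic of my own: many AJAX wrappers, jQuery among them, send an X-Requested-With header, so its absence on a backend that the site's own JS always calls via XMLHttpRequest is at least worth noting. The header can be forged and is absent with plain fetch(), so this can only mark, not block; the helper is hypothetical:

// Inside a standalone AJAX backend script that is never opened by hand.
$xrw = isset($_SERVER['HTTP_X_REQUESTED_WITH']) ? $_SERVER['HTTP_X_REQUESTED_WITH'] : '';
if (strcasecmp($xrw, 'XMLHttpRequest') !== 0) {
    mark_ip_suspicious($_SERVER['REMOTE_ADDR'], 'direct call to ajax backend');
}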

Requests resembling injections or attempts to open files through parameters


With the move to human-readable (SEO-friendly) URLs, or simply to URL rewriting that hides page parameters, such attacks have become exotic. It is rather unlikely that injecting something instead of "12345" in a URL like "some-site.com/12345/" will yield anything sensible; perhaps that is why such experiments have become rarer and rarer. Besides, in my practice it was individual but very real people, from their real IP addresses, for example via a 3G modem, who did such things, so whether to mark them is a big question. By the way, not so long ago, just three years back, the IP addresses of mobile networks were badly compromised (I do not know how things stand now); it was simply hard to work with them. I suffered with this myself on the Yota and MTS networks.

I tried to implement analysis of POSTed page parameters for suspicious words and characters and in the end abandoned it. The analysis became too complex and therefore resource-intensive. I eventually whittled the list of suspicious words in parameters down to quotation marks alone, which already struck me as anecdotal. Besides, the page parameters are filtered anyway, and for an analysis that has to be instantaneous it is hard to tell dangerous quotes from harmless ones.

Performance issues



The question of marking clients must be approached with great care. In the pursuit of request analysis you can end up not speeding the site up but grinding it to a halt. I do perform analysis, but only for "bad URIs", that is, those that do not correspond to any real page and cannot be shown. In other words, client analysis is part of serving the 404 page. Even then I try not to load the server too much. When using regular expressions, especially if they are collected into an array and applied in a loop, preference should be given to the simplest and fastest ones. When accessing the database, try to do all the work in a single query. And so on with the other usual code optimization techniques. If an operation cannot be made fast enough, I leave it for manual review.

If clients' IP addresses are not only marked, but also blocked


Separately, I would like to mention checking whether a client's IP is blocked. This check happens on every request from every user.

When checking whether an IP address is blocked, it is better not to search for it in the common table with thousands of addresses but to keep a separate table containing only blocked IPs and search there. The speed of a MySQL lookup depends noticeably on the number of records. When querying the database, try to write the query so that the search is satisfied by the index alone rather than going from the index to the row data.
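A sketch of what such a dedicated table and lookup might look like (the table and column names are my own):

// CREATE TABLE blocked_ip (ip VARBINARY(45) NOT NULL, PRIMARY KEY (ip));
// Since only the indexed column is touched, MySQL can answer from the index alone.
function ip_is_blocked(PDO $db, $ip)
{
    $stmt = $db->prepare('SELECT 1 FROM blocked_ip WHERE ip = ? LIMIT 1');
    $stmt->execute(array($ip));
    return (bool) $stmt->fetchColumn();
}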

One could also try looking for blocked IPs outside the database altogether. The idea is to keep a file on disk with a serialized array of blocked IP addresses. Each script then reads the file, turns the serialized string back into an array and works with that. I suspect this kind of access would be faster than connecting to and searching the database, but it would have to be measured; for my site it is not that important because of its modest traffic.
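A sketch of the file-based variant; the file path is arbitrary, the addresses are examples, and the IPs are stored as array keys so the check is a hash lookup rather than a scan:

// Writing side (run whenever the block list changes).
$blocked = array('203.0.113.7' => 1, '198.51.100.23' => 1);   // example addresses only
file_put_contents('/var/data/blocked_ips.ser', serialize($blocked), LOCK_EX);

// Reading side, at the start of every request.
$raw     = @file_get_contents('/var/data/blocked_ips.ser');
$blocked = $raw === false ? array() : unserialize($raw);
if (is_array($blocked) && isset($blocked[$_SERVER['REMOTE_ADDR']])) {
    http_response_code(403);
    exit;
}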

A list of accumulated bad URIs and marked IPs


On my programming website I have made a service that shows the accumulated bad URIs. It is really a piece of the admin panel, but it is open for viewing. Here is what you need to keep in mind when using it.



Please do not visit my website at the bad URIs listed in the service! If you do, your IP may be marked as bad. If you want to test the service, request a neutral non-existent URI such as /adfsadf/. Then your IP will appear in the list but will not be marked bad.

Source: https://habr.com/ru/post/265075/

