
Setting nets, catching robots





It all started, as usual, with a suspicion. Far too many visitors were arriving at my sites, which host application distributions, through direct download links with no referrer. That seemed strange: referrer-stripping blockers can't be that popular. I noted down some of the addresses, and the same users kept coming back to download again without following any link, often for a different program entirely, unrelated to the first one. So I went through the logs to see what was actually there. It turned out that the vast majority of such visits come from strange users with empty HTTP_ACCEPT_ENCODING and HTTP_ACCEPT_LANGUAGE. HTTP_USER_AGENT sometimes points to java, javascript, wget, perl, php and the like, but most of them are normal browser strings. All the more or less respectable search engines have long been accounted for on my sites, and this, of course, was not them.
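Purely as an illustration, a check along those lines could look like the sketch below. The LogRecord structure, its field names and the list of User-Agent markers are my own assumptions for this example, not the analyzer's actual data model.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical parsed log record; field names are illustrative only.
struct LogRecord {
    std::string ip;
    std::string user_agent;       // HTTP_USER_AGENT
    std::string accept_encoding;  // HTTP_ACCEPT_ENCODING
    std::string accept_language;  // HTTP_ACCEPT_LANGUAGE
    std::string referer;
};

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Heuristic: no Accept-Encoding and no Accept-Language, or a
// scripting-tool User-Agent, means the visit is likely a robot.
bool looks_like_robot(const LogRecord& r) {
    if (r.accept_encoding.empty() && r.accept_language.empty())
        return true;
    const std::string ua = to_lower(r.user_agent);
    const char* markers[] = {"java", "wget", "curl", "perl", "php", "python"};
    for (const char* marker : markers)
        if (ua.find(marker) != std::string::npos)
            return true;
    return false;
}
```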



Then it got even more interesting: what exactly is this? It is clear that these are robots, but why? Why come once a day, or every couple of days, or once a week, and download distributions of completely different kinds? I still don't have even a tentative answer. But after watching the logs for a while, I began to notice that many of the IPs are nearly identical, that is, they come from the same subnet, so sorting by the number of hits from a single IP shows nothing interesting except in clinical cases. I had to look for a log analyzer that could group by subnets, and, not finding one right away, I reinvented the wheel and wrote my own, as usual.
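To show what grouping by subnet means in the simplest case, here is a minimal sketch that buckets IPv4 addresses into /24 networks and counts hits per network. The real analyzer groups by registry subnets, as described further down; the function names here are only for this example.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parse a dotted-quad IPv4 address into a 32-bit integer (0 on failure).
uint32_t parse_ipv4(const std::string& s) {
    std::istringstream in(s);
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned part;
        char dot;
        if (!(in >> part) || part > 255) return 0;
        if (i < 3 && (!(in >> dot) || dot != '.')) return 0;
        result = (result << 8) | part;
    }
    return result;
}

// Count hits per /24 network rather than per individual address,
// so requests from neighbouring IPs collapse into one row.
std::map<uint32_t, std::size_t> hits_by_net24(const std::vector<std::string>& ips) {
    std::map<uint32_t, std::size_t> counts;
    for (const auto& ip : ips)
        counts[parse_ipv4(ip) & 0xFFFFFF00u] += 1;
    return counts;
}
```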



And what did I find? Well, not much of interest, and unfortunately 90% of the robots cannot be identified at all. There are plenty of security companies that check web pages (and the files themselves) for their antivirus products (Kaspersky, Symantec, InfoWatch, Bitdefender and others), but they are not the main culprits; they stand out only by their regularity. The largest number of robots live in the networks of the French cloud provider OVH, in Amazon's Asian subnets, naturally in Hetzner, simply somewhere in China, in the DigitalOcean cloud, and increasingly in the Alibaba cloud. Why, I never figured out. But it is clear why so many sit in clouds: incoming traffic there is free. Amazon, for example, says so outright: come to us and run your web robots. There are also plenty of bots probing for holes in popular CMSes. Requests for wp-login.php alone already run to a thousand a day for me. By the way, anything requesting it can be written down as a robot immediately.


The question is what to do with them. Well, you can do nothing: traffic is cheap these days, and so are servers. Or you can block them wholesale by subnet, since you will practically never see live visitors coming from there anyway. Besides, outgoing traffic in the clouds is paid for, even if only pennies. And most importantly for me, they spoil the statistics, which become much harder to analyze because of them.



In general, I am publishing the log analyzer that collects statistics by subnet; perhaps someone will manage to find something more interesting in it. I did not find the answers myself.



Here is the source code of the analyzer (C++, STL). Don't be put off by the fact that it is for Windows: the analyzer's core is decoupled from the interface, and there are even two project flavors, a console version and a GUI one. For porting to other platforms the C++11 STL alone is not quite enough; ideally, <filesystem> from C++17 is needed to make it 100% portable. On the other hand, only one function needs replacing: the one that traverses a directory.
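For reference, here is roughly what that single directory-traversal function could look like if rewritten on top of C++17's <filesystem>; the name and signature are mine for this sketch, not the project's.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// List all regular files in a directory (non-recursive). With C++17
// <filesystem> this is the only platform-specific piece left to swap out.
std::vector<std::string> list_files(const std::string& dir) {
    std::vector<std::string> files;
    for (const auto& entry : std::filesystem::directory_iterator(dir))
        if (entry.is_regular_file())
            files.push_back(entry.path().string());
    return files;
}
```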



The analyzer understands files with subnet data from the official regional NICs (Network Information Centers) (links are on GitHub) or from the more precise but messier db-ip.com (whose CIDRs often don't parse cleanly). To work, it needs three folders: where the logs live, where the subnet files live (to build the subnet database), and where to put the compiled subnet database (so it doesn't have to be reparsed every time) and the report. After analysis, an HTML report is generated with subnets sorted by the number of hits and by the volume downloaded. Clicking an address in the report opens a third-party service that shows the owner of the subnet; that is where all the actual information comes from. Clicking a subnet opens the list of individual addresses from it (also sorted).
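The registry dumps and the db-ip.com exports differ in layout, but the matching itself boils down to CIDR arithmetic. A minimal sketch of that idea follows; it is not the analyzer's actual parser, and the names are mine.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

struct Subnet {
    uint32_t network;  // network address
    uint32_t mask;     // netmask derived from the prefix length
};

// Parse "a.b.c.d/len" into a Subnet; returns {0, 0} on malformed input.
Subnet parse_cidr(const std::string& cidr) {
    unsigned a, b, c, d, len;
    if (std::sscanf(cidr.c_str(), "%u.%u.%u.%u/%u", &a, &b, &c, &d, &len) != 5)
        return {0, 0};
    if (a > 255 || b > 255 || c > 255 || d > 255 || len > 32)
        return {0, 0};
    uint32_t addr = (a << 24) | (b << 16) | (c << 8) | d;
    uint32_t mask = (len == 0) ? 0 : 0xFFFFFFFFu << (32 - len);
    return {addr & mask, mask};
}

// True if the numeric IPv4 address falls inside the subnet.
bool contains(const Subnet& net, uint32_t ip) {
    return (ip & net.mask) == net.network;
}
```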



The report looks like this:







If you have a lot of visitors, set the inclusion thresholds for the report high right away: there is a minimum number of hits per subnet and a minimum amount of traffic.



P.S. If you don't want to build it yourself, a compiled Windows build is available.

Source: https://habr.com/ru/post/323118/


