Passive fingerprinting to detect synthetic traffic

For quite a long time I have wanted to look at the clients of a public web service whose browser sends a User-Agent header typical of a browser on Windows, but whose network stack shows all the signs of a *nix system. Presumably, this group contains a high concentration of bots running on low-cost hosting to inflate traffic or scan the site.

Briefly about the subject

Different implementations of the TCP/IP stack in operating systems have different default parameter values. This makes it possible to determine, with a good degree of reliability, which operating system generated a packet.
In this context, the set of operating-system-specific packet parameters is called an OS fingerprint. Since this method only observes passing traffic without sending any requests, it is called passive OS fingerprinting.
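
As a rough illustration of the idea, the sketch below guesses the initial TTL of a packet from the TTL observed on arrival and maps it to an OS family. The initial values 64 (Linux and most *nix systems), 128 (Windows) and 255 (some network equipment and older Unix systems) are commonly seen defaults and are an assumption of this sketch, not something measured in this article; the function names are hypothetical.

```python
# Commonly seen default initial TTL values (an assumption of this sketch).
COMMON_INITIAL_TTLS = (64, 128, 255)

# Rough mapping from initial TTL to OS family (also an assumption).
OS_FAMILY_BY_INITIAL_TTL = {64: "nix", 128: "windows", 255: "other"}


def guess_initial_ttl(observed_ttl):
    """Return the smallest common initial TTL that is >= the observed TTL.

    Every router on the path decrements TTL by one, so the observed value
    is the initial value minus the number of hops.
    """
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial


def guess_os_family(observed_ttl):
    """Map an observed TTL to a coarse OS family guess."""
    return OS_FAMILY_BY_INITIAL_TTL[guess_initial_ttl(observed_ttl)]


if __name__ == "__main__":
    for ttl in (57, 118, 250):
        print(ttl, guess_initial_ttl(ttl), guess_os_family(ttl))
```

A real fingerprint uses more than TTL (window size, TCP options and their order, and so on), so this is only the simplest possible criterion.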

I use nginx as the front-end server, and it has no equivalent of Apache's mod_p0f, so marking requests based on a fingerprint is not trivial there, but it can be done. Below is the solution that got me the result.

Solution

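As mentioned above, the group I am interested in is *nix machines that pretend to be Windows. So nginx needs to know which OS a connection comes from. I decided to mark the relevant connections by redirecting them to a separate nginx port based on a TTL criterion: Linux and most other *nix systems send packets with an initial TTL of 64 by default, while Windows uses 128, so a packet that arrives with a TTL below 64 most likely came from a *nix host.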

iptables -A PREROUTING -t nat -p tcp -m tcp --dport 80 -m ttl --ttl-lt 64 -j REDIRECT --to-ports 8123

After that, everything in nginx becomes quite simple.
Add the additional port:

  listen 80; listen 8123;

Set a variable to mark requests that arrived on this dedicated port.

  map $server_port $is_specialport { default 0; 8123 1; }

Mark proxied requests. There are many of them because of Opera Turbo and the like.

  map $http_x_forwarded_for $is_proxy { default 0; ~^. 1; }

Flag for a Windows User-Agent.

  map $http_user_agent $is_windows { default 0; "~Windows" 1; }

And finally, define a flag variable for the case when a request has a Windows User-Agent, is not proxied, and has a low TTL:

  map $is_windows$is_specialport$is_proxy $is_suspected { default ""; "110" is_suspicious; }
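
To try the flag logic offline, the same chain of map blocks can be mirrored in a few lines of Python. This is only a sketch: the function and argument names are hypothetical, and it assumes the three inputs nginx uses here (the User-Agent header, the X-Forwarded-For header and the server port).

```python
import re

SPECIAL_PORT = 8123  # the port that low-TTL connections are redirected to


def is_suspected(user_agent, x_forwarded_for, server_port):
    """Mirror the nginx map chain: Windows UA + special port + no proxy."""
    # map $http_user_agent $is_windows { default 0; "~Windows" 1; }
    is_windows = 1 if re.search(r"Windows", user_agent or "") else 0
    # map $server_port $is_specialport { default 0; 8123 1; }
    is_specialport = 1 if server_port == SPECIAL_PORT else 0
    # map $http_x_forwarded_for $is_proxy { default 0; ~^. 1; }
    is_proxy = 1 if re.search(r"^.", x_forwarded_for or "") else 0
    # map $is_windows$is_specialport$is_proxy $is_suspected { ... "110" ... }
    key = f"{is_windows}{is_specialport}{is_proxy}"
    return "is_suspicious" if key == "110" else ""


if __name__ == "__main__":
    ua = "Mozilla/5.0 (Windows NT 6.1; WOW64)"
    print(repr(is_suspected(ua, "", 8123)))
    print(repr(is_suspected(ua, "198.51.100.7", 8123)))
    print(repr(is_suspected(ua, "", 80)))
```

Only the combination "Windows User-Agent, special port, no X-Forwarded-For" produces the flag; every other combination yields an empty string, as in the nginx configuration.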

Log the value of the flag for all requests:

  log_format custom '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $bytes_sent '
                    '"$http_referer" "$http_user_agent" "$upstream_addr" '
                    '"$gzip_ratio" "[$upstream_response_time]" "$upstream_cache_status" "$request_time" "$is_suspected"';
  access_log /var/log/nginx/nginx.access.log custom buffer=128k;
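
Once the flag is in the log, a short script can summarize it. The sketch below assumes the log format shown above, where the flag is the last double-quoted field on each line; the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# The flag is the last double-quoted field on the line.
LAST_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_flags(lines):
    """Count flagged and unflagged requests in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LAST_FIELD.search(line)
        if match is None:
            continue  # not a line in the expected format
        # Any value other than "is_suspicious" (empty or "-") counts as normal.
        flagged = match.group(1) == "is_suspicious"
        counts["suspicious" if flagged else "normal"] += 1
    return counts


# Made-up sample lines in the format above (illustration only).
SAMPLE_LINES = [
    '192.0.2.10 - - [21/Oct/2014:12:00:00 +0400] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)" "127.0.0.1:8080" "-" "[0.012]" "-" '
    '"0.013" "is_suspicious"',
    '192.0.2.11 - - [21/Oct/2014:12:00:01 +0400] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)" "127.0.0.1:8080" "-" "[0.010]" "-" '
    '"0.011" ""',
]

if __name__ == "__main__":
    print(count_flags(SAMPLE_LINES))
```

For a real log, pass an open file object instead of the sample list, for example count_flags(open("/var/log/nginx/nginx.access.log")).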

Findings

Of course, I do not claim that the method is highly accurate, but watching the logs revealed:

clients that sent requests exclusively to statistics counters
bots that were built to scrape Vkontakte but wandered onto the site via a link from a social network
other junk of a special kind that is also not a live user

The hit rate is very good; it is really worth a look.

PS
Of course, I know that the defaults are easy to change and that TTL is not the only criterion that could be used in this mechanism.

UPDATE:
Following a suggestion in the comments, a counter was placed in the body of the article. According to it:

Referrer preservation in the group in question

Total	No referrer (% of all)	Suspicious (% of all)	Suspicious without referrer (% of all)	Suspicious without referrer (% of requests without referrer)
144623	2.12968%	6.70156%	0.407957%	19.1558%
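
The last column follows from the others, which a few lines of Python can confirm. The absolute counts here are back-computed from the published percentages by rounding, so treat them as an approximation rather than figures taken from the raw data.

```python
# Values copied from the table above.
total = 144623
no_referrer_pct = 2.12968
suspicious_pct = 6.70156
suspicious_no_referrer_pct = 0.407957

# Approximate absolute counts, back-computed by rounding (an assumption).
no_referrer = round(total * no_referrer_pct / 100)
suspicious = round(total * suspicious_pct / 100)
suspicious_no_referrer = round(total * suspicious_no_referrer_pct / 100)

# Share of the suspicious group among requests without a referrer.
share_among_no_referrer = 100 * suspicious_no_referrer / no_referrer

if __name__ == "__main__":
    print(no_referrer, suspicious, suspicious_no_referrer)  # 3080 9692 590
    print(f"{share_among_no_referrer:.4f}%")  # 19.1558%
```

In other words, the suspicious group makes up about 6.7% of all requests but about 19% of the requests that arrive without a referrer.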

Share of the group in question among requests from known anonymous proxies (AP)
According to MaxMind GeoIP2 dated 10.21.2014

Total	Requests from AP	Suspicious requests from AP	Share of the group among requests from AP
144623	160	124	77.5%

Raw Data: gist.github.com/Snawoot/d1f6ce46099555c668ca
The criterion detects unhealthy traffic fairly well, and if a lot of such traffic comes from a single paid traffic source, that is something to think about.

Source: https://habr.com/ru/post/241309/
