Detecting Phantom bots

Translation of the article "Detecting PhantomJS Based Visitors". There is also a good discussion of the article on Hacker News.

The article is old, so please do not throw tomatoes; better to share your own experience in the comments.

Nowadays attackers rely on automation in many security incidents. Web scraping, password reuse, click fraud: all of this is done by attackers who try (often successfully) to disguise themselves as ordinary users, that is, to look to the server like a regular user's browser. As a site owner, you probably want to be sure that you are serving people and not soulless machines; as a service provider, you probably also want to offer access to your content through an API rather than through a heavy and buggy web interface.

Suppose you already have a simple check for cURL and similar clients, and it is quite effective. The expected next step is to verify that your clients are real people using a real browser, clunky and buggy UI and all, and not bots built on tools such as PhantomJS or SlimerJS.
In this article we look at a few techniques for detecting PhantomJS bots. I consider only PhantomJS because it is the more popular tool, but many of the points also apply to SlimerJS and the like.

Important! The considered methods are applicable to both phantom branches (1.x and 2.x), unless explicitly stated otherwise.

To start: can we identify PhantomJS without even responding to it, that is, solely from its HTTP request?

HTTP stack

Note that PhantomJS is built on the Qt framework, and Qt implements its HTTP stack a little differently from other modern browsers.

First, let's look at a simple HTTP request from Chrome:

    GET / HTTP/1.1
    Host: localhost:1337
    Connection: keep-alive
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
    Accept-Encoding: gzip, deflate, sdch
    Accept-Language: en-US,en;q=0.8,ru;q=0.6

And now the same request from PhantomJS:

    GET / HTTP/1.1
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.8 Safari/534.34
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Connection: Keep-Alive
    Accept-Encoding: gzip
    Accept-Language: en-US,*
    Host: localhost:1337

Note that the PhantomJS headers differ from Chrome's (and from those of most modern browsers):

The Host header comes last (in Chrome it comes first)
The Connection header value is Keep-Alive (note the case; Chrome sends keep-alive)
Accept-Encoding lists gzip only
User-Agent contains "PhantomJS"

Checking for these header differences on the server side can help identify PhantomJS visitors.
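As a rough illustration, the four header differences listed above could be checked on the server along the following lines. This is only a sketch: the function name phantomHeaderSignals and the signal labels are made up for this example, and the input is assumed to be a flat [name, value, name, value, ...] array in arrival order, which is the shape Node.js exposes as request.rawHeaders.

```javascript
// Hypothetical helper (not from the original article): inspects request
// headers in the order they arrived. `rawHeaders` is assumed to be a flat
// [name, value, name, value, ...] array, as in Node.js `request.rawHeaders`.
function phantomHeaderSignals(rawHeaders) {
  var names = [];
  var values = {};
  for (var i = 0; i + 1 < rawHeaders.length; i += 2) {
    var name = rawHeaders[i].toLowerCase();
    names.push(name);
    values[name] = rawHeaders[i + 1];
  }

  var signals = [];
  // Qt sends Host last; Chrome sends it first.
  if (names.length > 1 && names[names.length - 1] === "host") {
    signals.push("host-last");
  }
  // Qt capitalizes the value; Chrome sends "keep-alive".
  if (values["connection"] === "Keep-Alive") {
    signals.push("connection-case");
  }
  // PhantomJS advertises gzip only.
  if (values["accept-encoding"] === "gzip") {
    signals.push("gzip-only");
  }
  // The default User-Agent names PhantomJS outright.
  if (/PhantomJS/.test(values["user-agent"] || "")) {
    signals.push("user-agent");
  }
  return signals;
}
```

In a Node.js request handler this could be called as phantomHeaderSignals(request.rawHeaders). Returning a list of signals rather than a single yes/no leaves room to weigh them, since each one on its own is weak evidence.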

But how safe is it to trust such a check? If an attacker uses a proxy to rewrite these headers, it will not be hard for them to pass as a normal browser.

So this solution is no silver bullet. Let's look at what we can do on the client, using the peculiarities of the PhantomJS JavaScript environment.

Checking User-Agent on the client

We may not trust the User-Agent received in the request, but what about the value on the client?

    if (/PhantomJS/.test(window.navigator.userAgent)) {
        console.log("PhantomJS environment detected.");
    }

Unfortunately, this is as easy to change as a request header, so this is clearly not enough.

Plugins

navigator.plugins contains an array of the plugins installed in the browser. It usually holds something like Flash, ActiveX, Java applet support, or Default Browser Helper, which indicates that the browser is the default one on OS X. Our research shows that most fresh installations of common browsers include at least one default plugin, even on mobile phones.

This is where PhantomJS differs: it does not ship with any plugins and, moreover, provides no way to install them (see the PhantomJS API).

The following check can be quite helpful:

    if (!(navigator.plugins instanceof PluginArray) || navigator.plugins.length == 0) {
        console.log("PhantomJS environment detected.");
    } else {
        console.log("PhantomJS environment not detected.");
    }

On the other hand, it is extremely easy to replace the navigator.plugins array by running JavaScript BEFORE the page loads (as here).

It is also not hard to make a custom build with plugins installed. This is much easier than it seems, because Qt, on which PhantomJS is built, supports loading NPAPI plugins.
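To give a feel for how little the navigator.plugins override described above requires, here is a minimal sketch. The function name spoofPlugins and the fake plugin entry are invented for illustration; the navigator object is passed in as a parameter so the snippet can also run outside a browser, whereas an attacker would apply it to the page's real navigator before any site script executes.

```javascript
// Hypothetical spoof (illustration only): make a navigator-like object
// report one fake plugin. The plugin name and filename are made up.
function spoofPlugins(nav) {
  var fakePlugins = [{ name: "Shockwave Flash", filename: "fake.plugin" }];
  Object.defineProperty(nav, "plugins", {
    get: function () { return fakePlugins; },
    configurable: true
  });
  return nav;
}

// After the spoof, a length-only check no longer fires.
var fakeNavigator = spoofPlugins({ plugins: [] });
console.log(fakeNavigator.plugins.length); // 1
```

Note that the instanceof PluginArray half of the check above would still catch a plain array like this one, so a more careful spoof would also have to fake the prototype. That is extra work for the attacker, not a real barrier.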

Timing

Another interesting point is how PhantomJS suppresses JavaScript dialogs:

    var start = Date.now();
    alert('Press OK');
    var elapse = Date.now() - start;
    if (elapse < 15) {
        console.log("PhantomJS environment detected. #1");
    } else {
        console.log("PhantomJS environment not detected.");
    }

After several runs, we can assume that if the dialog closes in less than 15 milliseconds, the browser is most likely not controlled by a person. However, this technique will annoy real users, who are forced to close a puzzling dialog. (In fact, this can be mitigated by tying the dialog to a user action, for example by offering something when the user hovers over an element and timing the moment they answer "no, thank you". It is still a bit intrusive, but at least what is happening makes some sense from the user's point of view. Translator's note.)
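The timing check can be wrapped in a small helper so that the dialog and the clock are both supplied by the caller. This is a sketch under assumptions: the name dialogDismissedTooFast and its parameters are hypothetical, and the 15 ms threshold is simply the value used in the snippet above.

```javascript
// Hypothetical wrapper around the timing check. `showDialog` and `now` are
// parameters so the logic can be exercised without a real browser; in a page,
// `showDialog` would be a function that calls alert() or confirm().
function dialogDismissedTooFast(showDialog, thresholdMs, now) {
  now = now || Date.now;
  var start = now();
  showDialog(); // blocks until dismissed in a real browser
  return now() - start < thresholdMs;
}

// Simulated run with a fake clock: the "dialog" returns after 2 ms.
var fakeTime = 0;
var fast = dialogDismissedTooFast(
  function () { fakeTime += 2; },
  15,
  function () { return fakeTime; }
);
console.log(fast); // true
```

In a page, the call would sit inside a handler for some user action, passing function () { alert('Press OK'); } as the dialog, so that the prompt appears in a context that makes sense to a human visitor.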

Globals

PhantomJS 1.x exposes two properties on the global object:

    if (window.callPhantom || window._phantom) {
        console.log("PhantomJS environment detected.");
    } else {
        console.log("PhantomJS environment not detected.");
    }

However, these properties are part of an experimental feature, so they may still change.

JavaScript engine features

PhantomJS 1.x and 2.x do not use the latest versions of WebKit, so they lack the newer features introduced in recent browser versions. This extends to the JS engine as well: some properties and methods behave differently or are missing entirely in PhantomJS (although, to be honest, it is not clear how any of this differs from simply running an old browser. Translator's note.)

One of these methods is Function.prototype.bind, which is absent in PhantomJS 1.x and older. The following example checks whether bind exists on the function prototype and, if it does, whether it is native rather than spoofed.

    (function () {
        if (!Function.prototype.bind) {
            console.log("PhantomJS environment detected. #1");
            return;
        }
        if (Function.prototype.bind.toString().replace(/bind/g, 'Error') != Error.toString()) {
            console.log("PhantomJS environment detected. #2");
            return;
        }
        if (Function.prototype.toString.toString().replace(/toString/g, 'Error') != Error.toString()) {
            console.log("PhantomJS environment detected. #3");
            return;
        }
        console.log("PhantomJS environment not detected.");
    })();

If this code seems a little cryptic, a more detailed explanation is available here (video).
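The comparison at the heart of checks #2 and #3 can be pulled out into a helper. This is a sketch, not the article's code: the name looksNative is made up, and it relies on the same assumption as the original snippet, namely that the engine prints native functions in a uniform way that differs only in the function name.

```javascript
// Sketch of the comparison used above: the source text of a native function,
// with its own name swapped for "Error", should equal the source text of the
// native Error constructor. Assumes the engine formats all natives alike.
function looksNative(fn, name) {
  var source = Function.prototype.toString.call(fn);
  var expected = Function.prototype.toString.call(Error);
  return source.split(name).join("Error") === expected;
}

// A user-supplied replacement has a real body, so the texts differ.
console.log(looksNative(Function.prototype.bind, "bind"));       // true where bind is native
console.log(looksNative(function bind() { return 1; }, "bind")); // false
```

The helper calls Function.prototype.toString directly instead of fn.toString(), so a toString property planted on the function itself is bypassed. Function.prototype.toString can still be replaced, which is exactly what check #3 above guards against.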

Stack traces

Errors thrown by JavaScript code that PhantomJS runs through its evaluate command carry a distinctive stack trace, which can be used to identify the "headless" browser.

Suppose PhantomJS ends up executing the following code:

    var err;
    try {
        null[0]();
    } catch (e) {
        err = e;
    }
    if (indexOfString(err.stack, 'phantomjs') > -1) {
        console.log("PhantomJS environment detected.");
    } else {
        console.log("PhantomJS environment is not detected.");
    }

Note that we use a custom indexOfString() function here (we leave its implementation as an exercise for the reader), because the native String.prototype.indexOf can be replaced by a PhantomJS user script so that it returns a negative result. (That replacement is also easy to detect, by the way. Translator's note.)
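The article does not show indexOfString(), so the following is just one possible implementation: a plain character-by-character scan that never calls String.prototype.indexOf. The sample stack lines are illustrative strings, not captured output.

```javascript
// One possible indexOfString (an assumption; the article omits its version).
// It scans character by character and avoids String.prototype.indexOf,
// which a user script could have replaced.
function indexOfString(haystack, needle) {
  var h = "" + haystack;
  var n = "" + needle;
  for (var i = 0; i + n.length <= h.length; i++) {
    var j = 0;
    while (j < n.length && h[i + j] === n[j]) {
      j++;
    }
    if (j === n.length) {
      return i; // first position where the whole needle matches
    }
  }
  return -1;
}

console.log(indexOfString("at phantomjs://webpage.evaluate():1", "phantomjs")); // 3
console.log(indexOfString("at http://example.com/app.js:1", "phantomjs"));      // -1
```

Coercing both arguments with "" + value means a missing err.stack does not throw; it is simply searched as the string "undefined" and yields -1.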

So how do you get PhantomJS to execute this code? One technique is to override the most frequently used DOM functions, which are likely to be called. For example, the code below overrides document.querySelectorAll to capture the browser's stack trace:

    var html = document.querySelectorAll('html');
    var oldQSA = document.querySelectorAll;
    Document.prototype.querySelectorAll = Element.prototype.querySelectorAll = function () {
        var err;
        try {
            null[0]();
        } catch (e) {
            err = e;
        }
        if (indexOfString(err.stack, 'phantomjs') > -1) {
            return html;
        } else {
            return oldQSA.apply(this, arguments);
        }
    };

Summary

In this article we looked at 7 different techniques for detecting PhantomJS, both on the server and on the client. By combining the detection results with a feedback mechanism (for example, measuring rendering speed or invalidating the session cookie), you can in principle make life difficult for PhantomJS visitors. But keep in mind that these techniques are neither strict nor infallible (in fact, none of them work against custom builds. Translator's note.), and an advanced adversary can break through the defenses. To go deeper into the topic, we recommend our presentation (video, slides). There is also a GitHub repository with examples and possible workarounds.
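As a rough sketch of what combining the results might look like on the client, the function below runs three of the checks from this article against a window-like object and lists the ones that fired. The name collectPhantomSignals, the signal labels, and the fake environment are all made up for illustration; how the signals are weighed and what feedback follows is left open.

```javascript
// Hypothetical aggregator (not from the original article): runs the
// User-Agent, plugins, and globals checks against a window-like object.
function collectPhantomSignals(win) {
  var nav = win.navigator || {};
  var signals = [];
  if (/PhantomJS/.test(nav.userAgent || "")) {
    signals.push("user-agent");
  }
  if (!nav.plugins || nav.plugins.length === 0) {
    signals.push("no-plugins");
  }
  if (win.callPhantom || win._phantom) {
    signals.push("globals");
  }
  return signals;
}

// Simulated PhantomJS 1.x-like environment (made-up values for illustration).
var fakeWindow = {
  navigator: { userAgent: "Mozilla/5.0 PhantomJS/1.9.8", plugins: [] },
  callPhantom: function () {}
};
console.log(collectPhantomSignals(fakeWindow).join(", ")); // user-agent, no-plugins, globals
```

In a page, the function would be called with the real window object. Since every individual check can be spoofed, treating the result as a score rather than a verdict seems the safer design.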

Thank you for your attention and good hunting!

Source: https://habr.com/ru/post/303378/
