Browser Fingerprint - anonymous browser identification

Valentin Vasiliev ( Machinio.com )

What is a browser fingerprint, or browser identification? Put very simply, it is the assignment of an identifier to a browser. The wording is simple, but the idea behind it is complex and interesting. What is it used for? Why would we want to assign an identifier to a browser?

First, we want to count our users. We want to know whether a user has come to us for the first, second, or third time. If the user has come a second time, we want to know which pages he visited and what he did before. With anonymous users this is not possible. If you have an account system and the user logs in, we know everything about him: we know his account and his personal data, and we can attach any actions to this user. Everything is simple here. With anonymous users, things get much more complicated.

The second scenario is personalized advertising. It is everywhere now. We open a site, and suddenly we are shown an ad for the pies we wanted to buy yesterday. How is it done? Through user identification.
The third scenario is internal analytics. If, in addition to Google Analytics or Yandex, you use your own home-grown analytics system, then FingerprintJS, and browser fingerprinting in general, can help you achieve almost complete identification of anonymous users. You will be able to see what a user did on your site, which pages he visited, which links he clicked, and so on, and build from this a complete picture, a map of user actions. All this is achieved with this one technique, browser fingerprinting.

Why not just use an HTTP cookie for this purpose? It's very simple, and everyone knows how it works.

The user comes to your site and we read his cookie. If there is an identifier in it, the user has been here before and we know who he is. All our analytics, tracking, and so on are tied to this user.

If there is no identifier, the user has come to us for the first time. We generate a unique identifier (a GUID or some other random string), write it to a cookie, and when the user comes back we read this cookie and understand that this is his second, third, or later visit.
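The cookie flow just described can be sketched as follows. This is a minimal illustration, not code from the talk: the cookie name `visitor_id`, the one-year lifetime, and all function names are assumptions, and the identifier is generated with `Math.random`, which is fine for a demo but not cryptographically strong.

```javascript
// Hypothetical cookie name chosen for this example.
const COOKIE_NAME = "visitor_id";

// Parse a Cookie header (or document.cookie) string into a plain object.
function parseCookies(cookieString) {
  const cookies = {};
  for (const part of cookieString.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip empty or malformed fragments
    const name = part.slice(0, eq).trim();
    if (name) cookies[name] = decodeURIComponent(part.slice(eq + 1).trim());
  }
  return cookies;
}

// Random 128-bit identifier as 32 hex characters (illustrative only).
function newVisitorId() {
  let id = "";
  for (let i = 0; i < 32; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// Returns the visitor id, whether this is a returning visitor,
// and the Set-Cookie value to send (null if nothing needs to be set).
function identify(cookieString) {
  const existing = parseCookies(cookieString)[COOKIE_NAME];
  if (existing) {
    return { id: existing, returning: true, setCookie: null };
  }
  const id = newVisitorId();
  const oneYear = 60 * 60 * 24 * 365; // seconds
  return {
    id,
    returning: false,
    setCookie: `${COOKIE_NAME}=${id}; Max-Age=${oneYear}; Path=/`,
  };
}
```

On the first visit `identify("")` reports a new visitor and returns a cookie to set; on later visits the same identifier comes back from the cookie string.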

Cookies have one big drawback: they can be cleared. Any user, even a technically unskilled one, knows how to clear cookies. He opens Settings and clears them. That's it: the user is anonymous to you again, and you do not know who he is.

All modern browsers, even Internet Explorer, it seems, offer an incognito mode. In this mode nothing is saved, so a user who visits your site in it leaves no traces. The next time he comes in incognito mode, you again won't know who he is or whether he has visited before. In other words, HTTP cookies do not work in incognito mode.

Nowadays, thanks to the popularity of figures such as Snowden, many people prefer various privacy modes, anonymity on the Internet, plugins, and so on. All of this impedes tracking and identification. Many users install these tools without even knowing why, simply because it is fashionable. And they become anonymous to you again. An HTTP cookie will not work in this case.

How have programmers tried to solve this problem?

The most successful project for storing information in a cookie so that it cannot be deleted is, in my opinion, evercookie: a persistent, hard-to-delete cookie. Its essence is that evercookie does not store information in a single storage such as the HTTP cookie; it uses every storage available in modern browsers and writes your information, for example an identifier, to each of them. It starts with HTTP cookies and writes the identifier there. Then, if Flash is available in the browser, it uses Local Shared Objects to write the information to so-called Flash cookies.

Until recently, Flash cookies were not cleared when you cleared regular cookies. Only the latest versions of Google Chrome clear Flash cookies along with regular ones. In other words, until recently Flash cookies were practically undeletable. There was a special page on the Macromedia website where you had to click a button, “Yes, I want to clear the Flash cookies”, and only then were they cleared; without this page it was impossible.

Next, evercookie uses Silverlight cookies, also known as Isolated Storage. This is a dedicated area on the user's hard disk where the cookie information is written. It is practically impossible to find unless you know the exact path. On Windows it hides somewhere under Documents and Settings, deep in the depths of the computer. And this data cannot be deleted by clearing cookies.

Next, evercookie uses innovative techniques such as PNG cookies. The idea is that the server sends the browser an image whose bytes encode the information you want to save, for example the identifier. The image is served with a caching directive that keeps it practically forever, say for the next 50 years. The browser caches the image, and on the user's next visit the Canvas API reads the bytes of this image and restores the information you wanted to keep in the cookie. So even if the user clears his cookies, the PNG-encoded cookie is still in the browser's cache, and the Canvas API can read it on a subsequent visit.
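To make the PNG cookie idea more concrete, here is a small sketch of how an identifier could be packed into pixel data and read back. The layout (three characters per pixel in the R, G, and B channels, a zero byte as terminator) and the function names are assumptions made for this example; evercookie's actual encoding may differ.

```javascript
// Pack a single-byte-character identifier into RGBA pixel data:
// three characters per pixel (R, G, B). Alpha stays 255 so that a canvas
// would not alter the color values through alpha premultiplication.
function encodeIdToPixels(id) {
  const pixelCount = Math.ceil(id.length / 3);
  const data = new Uint8ClampedArray(pixelCount * 4); // zero-filled
  for (let i = 0; i < id.length; i++) {
    const code = id.charCodeAt(i);
    if (code === 0 || code > 255) {
      throw new RangeError("id must consist of single-byte, non-NUL characters");
    }
    data[Math.floor(i / 3) * 4 + (i % 3)] = code;
  }
  for (let p = 0; p < pixelCount; p++) {
    data[p * 4 + 3] = 255; // opaque alpha
  }
  return data;
}

// Recover the identifier from RGBA pixel data.
// A zero byte in a color channel marks the end of the identifier.
function decodeIdFromPixels(data) {
  let id = "";
  for (let i = 0; i < data.length; i++) {
    if (i % 4 === 3) continue; // skip the alpha channel
    if (data[i] === 0) break;
    id += String.fromCharCode(data[i]);
  }
  return id;
}
```

In a browser, the decoding side would draw the cached image onto a canvas with `ctx.drawImage(img, 0, 0)` and pass `ctx.getImageData(0, 0, img.width, 1).data` to `decodeIdFromPixels`.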

Evercookie uses all available browser storages, including those of the modern HTML5 standard: Session Storage, Local Storage, IndexedDB, and others.

The ETag header is also used. It is a very short HTTP header, but some information can be encoded in it. And if Java is installed, the Java persistence API is used.

Evercookie is a very smart plugin that can save your data almost everywhere. For a regular user who does not know about all this, deleting these cookies is practically impossible: you need to visit 6-8 places on the hard disk and perform a number of manipulations just to clear them. Therefore the average user visiting a site that uses evercookie will certainly not be anonymous.

Despite all this, evercookie does not work in incognito mode. As soon as you enter incognito mode, no data is stored on disk, because that is the very essence of incognito mode: you must remain anonymous. Evercookie relies on hard disk storage, which does not work in this mode.

FingerprintJS is a small library I wrote that tries to solve these problems. I will tell you how it does that and what came of it.

I wrote it in 2012. At the time I was working at Ruby's Courier as a developer, and my task was to build an analytics system that would take into account not only logged-in users, i.e. the users we have in the system, but also anonymous ones. The CouponCupon website in particular had many anonymous users, because people often came from outside to look at coupons, discounts, and offers. They had no account, so the system that tracked visited pages and button clicks did not work for them.

FingerprintJS does not use cookies at all. No information is stored on the hard disk of the computer where the browser is installed. It works in incognito mode because, by design, it does not use any storage on disk. It has no dependencies, works even without jQuery, and is 1.2 KB gzipped.

It is currently used by companies and sites such as Baidu (the Google of China), MasterCard, the site of the US President, AddThis (a widget hosting site), and others.

The library quickly became very popular. At the moment it is used by about 6-7% of the most visited sites on the Internet.

Now, how it works. The library's code queries the user's browser for all the settings and data that are specific and unique to this browser, this system, and this computer. The data is combined into one huge string, which is then fed to a hash function. The hash function takes this data and turns it into a nice, compact identifier. Let me go through it in detail.

First, navigator.userAgent is read (it is truncated on the slide) and appended to the fingerprint string.

The browser language is read (English, Russian, Portuguese, etc.) and also appended to the fingerprint string.

The time zone is read as the offset from UTC in minutes, for example -180. The offset is UTC minus local time, so -180 means UTC+3, which turns out to be Moscow.

Next come the screen size (as an array) and the screen color depth.

Then support for HTML5 technologies is detected, since every browser supports a different set. FingerprintJS tries to determine which ones are supported and which are not, and for each technology the result of the check (whether it is available and to what degree it is supported) is appended to the fingerprint string.

SessionStorage, LocalStorage, IndexedDB, OpenDatabase and others.

User-specific and platform-specific data are queried, such as the doNotTrack setting (it is rather ironic that the doNotTrack setting is used precisely for tracking), the processor's cpuClass, the platform, and other data.
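The signals listed so far could be gathered roughly as follows. This is a sketch under stated assumptions rather than FingerprintJS's actual code: the function names, the order of the components, and the `###` separator are all made up for illustration. The browser objects are passed in as parameters so that the logic can also be exercised outside a browser.

```javascript
// Feature check that tolerates browsers throwing on storage access
// (for example when storage is disabled by policy).
function has(win, key) {
  try {
    return Boolean(win[key]);
  } catch (e) {
    return false;
  }
}

// Collect the basic signals from navigator-, screen- and window-like objects.
// In a browser: collectBasicComponents(navigator, screen, window, new Date()).
function collectBasicComponents(nav, scr, win, now) {
  return [
    nav.userAgent,
    nav.language,
    scr.colorDepth,
    [scr.width, scr.height].join("x"),
    now.getTimezoneOffset(), // minutes, e.g. -180 for UTC+3
    has(win, "sessionStorage"),
    has(win, "localStorage"),
    has(win, "indexedDB"),
    typeof win.openDatabase,
    nav.cpuClass, // only some browsers define it; undefined elsewhere
    nav.platform,
    nav.doNotTrack,
  ];
}

// Join all components into the single string that is later hashed.
function toFingerprintString(components) {
  return components.map(String).join("###");
}
```

Missing values are kept as the string `undefined` rather than dropped, so that the absence of a feature also contributes to the fingerprint.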

Here you may have a logical question: isn't this data the same for many users? Suppose a user lives in Moscow: he will have the same language, the latest Chrome, almost everything the same, and all the strings collected so far will be identical. How does this help identify the user?

There are two more techniques that add uniqueness.

The first is information about plugins. The code enumerates all plugins installed in the system. For each plugin it gets the name and description and, very importantly, the list of all MIME types the plugin supports. All this information is combined into a huge array of strings, which is concatenated and appended to the fingerprint string. As you can imagine, each computer has its own, fairly unique list of plugins; plugin versions differ, and the lists of supported MIME types differ as well.
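A sketch of how the plugin list might be turned into a string is shown below. The separators and the function name are assumptions for this example. It relies only on the standard shape of `navigator.plugins`: an array-like collection of plugins, each with `name` and `description` and each itself an array-like list of MIME type entries with `type` and `suffixes`.

```javascript
// Serialize a navigator.plugins-like collection into one string.
// Each plugin becomes "name::description::type~suffixes,type~suffixes".
function pluginsToString(plugins) {
  return Array.from(plugins, (plugin) => {
    const mimeTypes = Array.from(plugin, (mt) => [mt.type, mt.suffixes].join("~"));
    return [plugin.name, plugin.description, mimeTypes.join(",")].join("::");
  }).join(";");
}
```

In a browser this would be called as `pluginsToString(navigator.plugins)`.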

The next thing added to the fingerprint string is the Canvas Fingerprint, another technique that improves accuracy. Its essence is that a hidden canvas element draws a certain text with certain effects applied to it. The resulting image is then serialized to a byte array and converted to base64 by calling canvas.toDataURL().

Here you may ask: how does this help with identification? A study I found came as a surprise to me. It says that font rendering, in particular in the Canvas API, is very platform dependent. Outwardly identical images drawn in different browsers are converted to different byte arrays. Why? It depends on the processor, the video card, the video drivers, system libraries such as DirectX, and the rendering of fonts and shadows. All of this can differ from computer to computer, so the resulting byte array will be different on almost every machine with different hardware and software. This long string obtained by serializing the canvas is appended to the fingerprint, and we get one huge string.
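The canvas technique can be sketched like this. The text, colors, font, and coordinates are arbitrary choices for the example, not the values FingerprintJS uses; only standard Canvas 2D calls are involved. The `document` object is passed in as a parameter, which is an assumption made so the function can be tried with a stub.

```javascript
// Draw text with a few effects on an off-screen canvas and serialize it.
// In a browser: canvasFingerprint(document).
function canvasFingerprint(doc) {
  const canvas = doc.createElement("canvas");
  canvas.width = 300;
  canvas.height = 60;
  const ctx = canvas.getContext("2d");
  if (!ctx) return ""; // canvas not supported

  const text = "Cwm fjordbank glyphs vext quiz"; // arbitrary pangram
  ctx.textBaseline = "top";
  ctx.font = "14px 'Arial'";

  // A colored rectangle behind the text adds blending differences.
  ctx.fillStyle = "#f60";
  ctx.fillRect(125, 1, 62, 20);

  // The same text twice, the second time semi-transparent with a shadow.
  ctx.fillStyle = "#069";
  ctx.fillText(text, 2, 15);
  ctx.fillStyle = "rgba(102, 204, 0, 0.7)";
  ctx.shadowBlur = 4;
  ctx.shadowColor = "#000";
  ctx.fillText(text, 4, 17);

  // Base64-encoded PNG of the rendered pixels.
  return canvas.toDataURL();
}
```

The returned data URL is long, so in practice it would be appended to the fingerprint string and hashed rather than stored as is.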

This is how it works. We collect all this data and pass it to a hash function (FingerprintJS uses MurmurHash3), and the output is a 32-bit number. This is your identifier. So when a user visits the site, a number is assigned to him. You read this number and use it as you wish: base your analytics on it, and so on.
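The final hashing step can be illustrated with a 32-bit MurmurHash3. Treat the choice of hash as an assumption of this sketch: any hash that maps a long string to a 32-bit number would illustrate the step equally well, and the function name and the UTF-8 encoding of the input are also choices made for this example.

```javascript
// MurmurHash3 (x86, 32-bit variant) over the UTF-8 bytes of a string.
// Returns an unsigned 32-bit integer.
function murmurHash3_32(text, seed = 0) {
  const bytes = new TextEncoder().encode(text);
  const c1 = 0xcc9e2d51;
  const c2 = 0x1b873593;
  const rotl = (x, r) => (x << r) | (x >>> (32 - r));

  let h = seed | 0;
  const blockEnd = bytes.length & ~3; // length rounded down to a multiple of 4

  // Body: process 4-byte little-endian blocks.
  for (let i = 0; i < blockEnd; i += 4) {
    const k = bytes[i] | (bytes[i + 1] << 8) | (bytes[i + 2] << 16) | (bytes[i + 3] << 24);
    h ^= Math.imul(rotl(Math.imul(k, c1), 15), c2);
    h = (Math.imul(rotl(h, 13), 5) + 0xe6546b64) | 0;
  }

  // Tail: the remaining 1 to 3 bytes, if any.
  let tail = 0;
  switch (bytes.length & 3) {
    case 3:
      tail ^= bytes[blockEnd + 2] << 16; // falls through
    case 2:
      tail ^= bytes[blockEnd + 1] << 8; // falls through
    case 1:
      tail ^= bytes[blockEnd];
      h ^= Math.imul(rotl(Math.imul(tail, c1), 15), c2);
  }

  // Finalization: mix in the length and force avalanche of all bits.
  h ^= bytes.length;
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}
```

Calling `murmurHash3_32(fingerprintString)` on the concatenated signals yields the numeric identifier; changing a single character of the input produces an unrelated number.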

Here is the question: how unique and accurate is this identification? The study it was based on was done by the Electronic Frontier Foundation in a project called Panopticlick. It reports uniqueness of about 94%, but on the real data I had, uniqueness was about 90-91%.

Many people and companies began to use the library, and over time a number of shortcomings emerged. It is not perfect. The main disadvantage is that identification accuracy is only about 90%, but there are others.

UserAgent. In modern browsers the UserAgent changes very often: a new version of Google Chrome is released every two months or so. The UserAgent changes because the Chrome version embedded in it changes, and the UserAgent affects the final fingerprint. So when a new browser version comes out, the fingerprint changes, and from the point of view of FingerprintJS this is a new user.
iPhone, iPad, and other Apple products. They are very uniform in terms of hardware. If we take a single model, say the iPhone 5S, every iPhone 5S has the same processor, the same graphics accelerator, and the same system libraries; the plugins are the same too, or rather there are none. This means that the byte array produced by the Canvas Fingerprint is the same on every iPhone 5S, so identification accuracy for Apple products is lower.
Internet Explorer and China. I did not realize this was a problem until I learned that a lot of old versions of IE are used in China. In IE you cannot simply call, for example, navigator.plugins to get the list of plugins; that will not work. You need to know the names of all plugins in advance, take each one, and try to instantiate it as an ActiveX object. If the object is created, the plugin is installed. If an error is thrown, that plugin is not installed in IE. I had a list of plugins for IE, but it was short, about ten entries. It did not include plugins that are popular in China, such as QQ, Baidu, etc. There are many plugins that are used only there. I did not check for them, so the detected plugin list for Chinese users was shorter.
Another drawback of the first version is the lack of integration with Flash and Silverlight; integrating with these plugins would improve the quality of the fingerprint.
The last, but rather serious, blow to FingerprintJS came recently: starting with version 42, Google Chrome stopped activating all plugins that work through NPAPI. NPAPI is a very old API for instantiating plugins, originally developed by Netscape; the name stands for “Netscape Plugin API”. All plugins that rely on this API stopped loading. Therefore neither Silverlight nor Java, the two most popular NPAPI plugins, shows up in FingerprintJS: they are not detected at all, and their MIME type lists are not reported either. This means that in Chrome 42 and newer the accuracy of FingerprintJS is reduced.
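The Internet Explorer probing mentioned in the list above (try to instantiate each known plugin as an ActiveX object and see whether it throws) can be sketched as follows. The ProgID list is a short illustrative sample, not the list used by the library, and the constructor is passed in as a parameter so that the logic can be tried outside IE. Modern syntax is used for readability; old IE versions would need an ES3-style rewrite.

```javascript
// A few well-known ActiveX ProgIDs, as examples only.
const IE_PLUGIN_PROGIDS = [
  "ShockwaveFlash.ShockwaveFlash", // Flash
  "AgControl.AgControl",           // Silverlight
  "AcroPDF.PDF",                   // Adobe Reader
  "QuickTime.QuickTime",           // QuickTime
];

// Returns the subset of ProgIDs that can be instantiated.
// In IE: probeActiveXPlugins(window.ActiveXObject).
function probeActiveXPlugins(ActiveXCtor, progIds = IE_PLUGIN_PROGIDS) {
  return progIds.filter((progId) => {
    try {
      new ActiveXCtor(progId); // throws if the control is not installed
      return true;
    } catch (e) {
      return false;
    }
  });
}
```

The weakness described in the talk is visible here: only plugins whose ProgIDs are on the list can ever be detected.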

Therefore, having analyzed all this (I still use FingerprintJS myself), I came to the conclusion that it was time to develop a new library that would be free of practically all of these shortcomings.

I started on it quite recently; development is under way on GitHub.

How does it solve the existing problems? The most important thing is the use of fuzzy hashing, or locality-sensitive hashing. With ordinary hashing, if you change even one byte of the input, the output changes completely. With fuzzy hashing this does not happen: there is a sensitivity threshold, and a certain percentage of the input can change without affecting the resulting fingerprint. Suppose only the browser version in the UserAgent has changed, which happens very often in, say, Chrome. The resulting fingerprint stays the same, because the version makes up only 3-5% of the total length of the UserAgent.
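The talk does not say which algorithm is meant, so the sketch below uses MinHash, one well-known locality-sensitive scheme, purely as a stand-in. It splits a string into tokens, computes a signature of per-seed minimum hash values, and compares two signatures position by position; similar inputs agree in most positions. The tokenizer, the signature size of 128, and the hash construction are all assumptions made for this example.

```javascript
// 32-bit avalanche mixer (the finalizer used by MurmurHash3).
function mix32(h) {
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Seeded string hash: FNV-style accumulation followed by the mixer.
function hashToken(token, seed) {
  let h = mix32(seed + 0x9e3779b9);
  for (let i = 0; i < token.length; i++) {
    h = Math.imul(h ^ token.charCodeAt(i), 0x01000193);
  }
  return mix32(h);
}

// Split on whitespace and common UserAgent punctuation.
function tokenize(text) {
  return text.split(/[\s\/();,]+/).filter(Boolean);
}

// MinHash signature: for each seed, the smallest hash over all tokens.
function minHashSignature(text, size = 128) {
  const tokens = tokenize(text);
  const signature = new Array(size).fill(0xffffffff);
  for (let seed = 0; seed < size; seed++) {
    for (const token of tokens) {
      const h = hashToken(token, seed);
      if (h < signature[seed]) signature[seed] = h;
    }
  }
  return signature;
}

// Fraction of matching positions; estimates the token-set (Jaccard) similarity.
function signatureSimilarity(a, b) {
  let same = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) same++;
  }
  return same / a.length;
}
```

With such a scheme, two UserAgent strings that differ only in the Chrome version produce signatures that agree in most positions, so a threshold on `signatureSimilarity` can treat them as the same browser, while an unrelated UserAgent scores much lower.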

The second is that FingerprintJS 2 detects installed fonts, all the fonts installed on the system. How is this useful? If you install a program, say Adobe PDF, you add fonts to the system.

If you install Microsoft Office, you add fonts to the system; if you install some Quick office that ships its own fonts, you again add fonts to the system. So you can have two absolutely identical computers, with Office installed on one but not on the other. On the one without Office there will be 320 fonts available, and on the one with Office, 1700. If you can get the list of all fonts on the computer and add it to the final fingerprint, you get two different fingerprints, because the fonts are different.

By default Flash is used: a small SWF file of 916 bytes. It returns the list of all installed fonts, and in a platform-specific order, i.e. in the order in which they are available in the system. If Flash is not installed, another technique is used, a so-called side-channel technique, first published on lalit.org. It detects font availability using JavaScript only. How is it done? For each reference font, one that is set by default in the browser or the system, the width and height of a hidden piece of text are measured and saved (the text, by the way, is huge, say 72 pixels). Then a different font is applied to the hidden text. If that font is present in the system, the text changes its dimensions, and the measuring code gets a new width and height. If they differ from the reference values obtained with the default font, the font is installed. If they do not differ, the font is not installed.

A very simple idea, but it works. At the moment this code can reliably identify about 500 fonts without using Flash. Accordingly, a computer with Microsoft Office and one without it will be identified differently by FingerprintJS 2 thanks to this technique.
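The JavaScript-only font check can be sketched as below. The detection logic takes a `measure` function as a parameter, which is an assumption of this example that keeps the comparison logic separate from the DOM; `createDomMeasure` shows one way such a function might be built in a browser from a hidden 72px span. The test string and the three generic base fonts are illustrative choices.

```javascript
// Generic families that every browser resolves to some default font.
const BASE_FONTS = ["monospace", "sans-serif", "serif"];

// measure(fontFamily) must return { width, height } of a fixed test string.
// A candidate counts as installed if, against at least one base font,
// the text dimensions differ from the base font's own dimensions.
function detectFonts(candidates, measure) {
  const baseline = {};
  for (const base of BASE_FONTS) {
    baseline[base] = measure(base);
  }
  return candidates.filter((font) =>
    BASE_FONTS.some((base) => {
      // If the candidate is missing, the browser falls back to the base font.
      const size = measure(`'${font}', ${base}`);
      return size.width !== baseline[base].width || size.height !== baseline[base].height;
    })
  );
}

// Browser-only helper: measures a hidden span at a large font size.
function createDomMeasure(doc) {
  const span = doc.createElement("span");
  span.style.position = "absolute";
  span.style.left = "-9999px";
  span.style.fontSize = "72px";
  span.textContent = "mmmmmmmmmmlli";
  doc.body.appendChild(span);
  return (fontFamily) => {
    span.style.fontFamily = fontFamily;
    return { width: span.offsetWidth, height: span.offsetHeight };
  };
}
```

In a browser this would be used as `detectFonts(["Calibri", "Cambria"], createDomMeasure(document))`. Checking against three base fonts rather than one reduces the chance that a candidate happens to have the same metrics as the fallback.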

The third difference is the WebGL Fingerprint, a development of the Canvas Fingerprint idea. 3D triangles are drawn (it is not very clear on the slide, but this is 3D), with effects, a gradient, anisotropic filtering, etc. applied, and the result is converted to a byte array. As with the Canvas Fingerprint, the resulting byte array will differ across many computers. Then information about platform-specific constants defined in WebGL is added to this byte array. WebGL has a set of constants that every implementation must provide: color depth, maximum texture size, and so on. There are dozens of them. The code queries all these constants, and on Android devices, for example, they will differ: the color depth there may be different from that on Windows or Linux. All of this again adds up to a huge array, which is appended to the serialized 3D image of the triangle drawn with hardware effects.

Here is the same question again: how does this help with identification? 3D graphics are very platform-specific. The driver version, the video card version, the OpenGL standard in the system, the shader language version: all of this affects how the image is rendered internally. When it is converted to a byte array, it will differ across many computers.
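The part of the WebGL fingerprint that reads implementation constants can be sketched as follows. The selection of parameters and the output format are assumptions for this example; the parameter names themselves are standard WebGL 1 constants queried through `gl.getParameter`. The rendering of the triangle is left out.

```javascript
// A sample of standard WebGL parameters whose values vary by platform.
const WEBGL_PARAMS = [
  "VENDOR",
  "RENDERER",
  "VERSION",
  "SHADING_LANGUAGE_VERSION",
  "MAX_TEXTURE_SIZE",
  "MAX_VERTEX_ATTRIBS",
  "MAX_RENDERBUFFER_SIZE",
  "RED_BITS",
  "GREEN_BITS",
  "BLUE_BITS",
  "ALPHA_BITS",
  "DEPTH_BITS",
  "STENCIL_BITS",
];

// Query each parameter from a WebGL context and join the results.
function webglParameters(gl) {
  return WEBGL_PARAMS
    .map((name) => `${name}:${gl.getParameter(gl[name])}`)
    .join(";");
}
```

In a browser the context would come from `canvas.getContext("webgl") || canvas.getContext("experimental-webgl")`, and the resulting string would be appended to the serialized image before hashing.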

Why is the WebGL Fingerprint important? Because iOS 8.1 supports WebGL, and this helps identify the iOS devices whose identification problem I mentioned earlier. So WebGL improves the accuracy of the fingerprint.

What remains to be implemented?

As I said, the library is in development, and not everything I would like to do in it is done. There is already a small community of developers around it. By the way, I invite everyone to take part in the development. We are very informal, everyone suggests ideas, and it is quite interesting.

What remains to be implemented? WebRTC Fingerprinting.

WebRTC is a standard for peer-to-peer communication over audio streams in modern browsers. It lets you make audio calls and so on; it is supported in Firefox and will soon be supported in other browsers.

The implementation of the WebRTC standard is also platform dependent: it depends on the video card installed in the system, on the sound drivers, etc. Therefore, by measuring latency levels, the level of WebRTC support, and the constants defined in this standard, you can get different fingerprints for different computers.

More plugins for IE will be checked, namely those popular in different countries such as China and India, i.e. in the growing markets. The first version did not pay enough attention to this problem; the second version will solve it.

More information about the OS will be collected. How? Through integration with Flash and Silverlight. Flash lets you get information about the system, such as the kernel version and kernel patch level. Silverlight, on Windows, lets you get the Windows version and build number.

A few words about why integration with Silverlight is important too. The Silverlight plugin may not be very popular in Russia, but in the US, for example, there is the Netflix video streaming service, and I know for sure that they use Silverlight. It supports DRM (a system for restricting digital rights to content), and since Netflix often shows fresh Hollywood movies, they use Silverlight to make sure the video does not spread across the Internet. So in the US many desktop Internet users have the Silverlight plugin installed, and it is available on the Mac as well as on Windows.

Detection of multiple monitors will be implemented. If we request the screen size via JavaScript, we get just two numbers, the width and height of the screen. If we do the same through the Flash (ActionScript) API, we get an array of arrays, where each subarray is the screen size of one monitor. If a developer sits at five monitors, we get an array of five elements, so we learn that the person has five monitors and not just the main one that JavaScript would report.

All this data together currently gives an identification accuracy of about 94-95%. But, as you understand, this is not accurate enough. The question is: can it be improved, and how? I believe it can. The goal of this project is to achieve 100% identification, so that you can rely on the fingerprint in 100% of cases and say with certainty: “Yes, this user has visited us; yes, I know everything about him, even though he uses incognito mode, the Tor network ...”. It will not matter; all of this will be detected.

Contacts

» Github
» Machinio.com

FrontendConf

Source: https://habr.com/ru/post/321294/
