What else can a browser know about you?

It is well known that simple JavaScript running inside the browser cannot extract much information about the user. Service data, such as the name of the browser engine, the operating system, and their versions, gives a general idea of the user (and of the audience as a whole), but it is far from comprehensive.

For comprehensive user analysis, Universal Analytics offers User-ID, but data about the user can also be collected by independent software components that run in the computer’s memory alongside the browser. Information obtained directly from the browser’s memory makes it possible to analyze both an individual user and the entire audience. This article considers the family of browsers built on the WebKit engine, with Google Chrome as the specific example.

Browser as a repository of interesting information

Every day, millions of people trust their web browser with their most sensitive information: personal and banking data, lists of favorite sites. Most of the really “tasty” information (first of all, for attackers) is protected by the browser itself (password managers with encryption) and by the web resources people use through it (securely written and debugged code, SMS alerts and confirmations, and so on). Beyond that, however, some data remains open and easily accessible, such as the source code of the pages. This is where most of the material for a comprehensive analysis of users is located, and that material is not only technical in nature.
Browsers themselves (especially Chromium-based ones) are designed so that as little information about users as possible “leaks” to the outside world. That is, the developers try in every way to protect people from all kinds of leaks, even insignificant ones. Google Chrome, for example, runs each tab in its own sandbox as a separate process. Details can be found at the local address chrome://memory. It is logical to assume that each such tab process stores the source code of its own page.

How to find out what the browser knows?

The browser process is no different from other operating system processes: it also keeps all its data in RAM and follows the same principles of memory organization. Software components launched within the OS run at the same software level as the browser. This makes it possible to read the browser's memory and take a dump of it.
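
As a rough illustration of reading another process's memory, here is a minimal Python sketch. It is not the article's C# utility: it assumes a Linux system, where the kernel exposes the region list in /proc/<pid>/maps and the raw memory in /proc/<pid>/mem, and it assumes the caller has permission to read the target process. The names parse_maps and dump_process are hypothetical.

```python
import re

# One line of /proc/<pid>/maps looks like:
# 7f2c4a000000-7f2c4a021000 rw-p 00000000 00:00 0    [heap]
MAPS_LINE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})")


def parse_maps(text):
    """Return (start, end) address pairs of readable and writable regions."""
    regions = []
    for line in text.splitlines():
        match = MAPS_LINE.match(line)
        if match and match.group(3).startswith("rw"):
            regions.append((int(match.group(1), 16), int(match.group(2), 16)))
    return regions


def dump_process(pid):
    """Concatenate the contents of every readable and writable region.

    Assumption: Linux, and enough privileges to open /proc/<pid>/mem
    (same user with ptrace permission, or root).
    """
    with open(f"/proc/{pid}/maps") as maps_file:
        regions = parse_maps(maps_file.read())
    chunks = []
    with open(f"/proc/{pid}/mem", "rb", buffering=0) as mem:
        for start, end in regions:
            try:
                mem.seek(start)
                chunks.append(mem.read(end - start))
            except (OSError, ValueError, OverflowError):
                continue  # some regions cannot be read; skip them
    return b"".join(chunks)
```

Only writable data regions are collected here, on the assumption that page content lives in heap-like memory rather than in mapped code; a real tool might choose differently.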

For the demonstration, a small test project was written in C# that collects page source code from the browser's memory. The source code of the project can be viewed in the BitBucket repository. A ready-to-use build of the utility is here.

Next, it is worth understanding what determines the quality of the information collected from the source code of open pages held in the browser's memory. Chromium is an open-source project, so digging into its source code is enough to clarify many basic aspects.

For example, consider how a web document page is structured. The Document class, which is part of the WebKit engine, contains the following code (C++):

...
// "body element" as defined by HTML5 (https://html.spec.whatwg.org/multipage/dom.html#the-body-element-2).
// That is, the first body or frameset child of the document element.
HTMLElement* body() const;
// "HTML body element" as defined by CSSOM View spec (http://dev.w3.org/csswg/cssom-view/#the-html-body-element).
// That is, the first body child of the document element.
HTMLBodyElement* firstBodyElement() const;
...
HTMLHeadElement* head() const;
...

From this fragment one can judge, for example, that the browser stores information about the entities of a web page in objects arranged hierarchically. This can be confirmed by looking into the directory of all HTML elements known to the WebKit engine.

Combining this with knowledge of how the operating system allocates memory for each new program object, it follows that the HTML of a page may reside in memory in a completely fragmented form.
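
Because the markup may be scattered across the dump, one simple way to recover it is to scan the raw bytes for runs that look like HTML. The sketch below is only an illustration under stated assumptions: it assumes the text is stored as single-byte ASCII (markup stored as UTF-16 would need a separate pattern), and the name html_fragments and the 20-byte minimum length are arbitrary choices, not taken from the article's utility.

```python
import re

# A run that starts with "<" plus a letter and continues for as long as the
# bytes are printable ASCII or whitespace (at least 20 bytes in total).
TAG_RUN = re.compile(rb"<[a-zA-Z][\x20-\x7e\r\n\t]{18,}")


def html_fragments(dump):
    """Yield (offset, text) for every ASCII run in the dump that looks like markup."""
    for match in TAG_RUN.finditer(dump):
        text = match.group().decode("ascii")
        if ">" in text:  # require at least one complete tag
            yield match.start(), text


if __name__ == "__main__":
    # Synthetic "dump": two markup fragments separated by binary noise.
    sample = (
        b"\x00\x01"
        + b'<div class="a"><span>hello</span></div>'
        + b"\x00\xff<b\x00\x00"
        + b'<ul class="items"><li>x</li></ul>'
    )
    for offset, text in html_fragments(sample):
        print(offset, text)
```

Each fragment is reported with its offset so that neighboring pieces can later be compared or stitched together by hand.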

Scanning memory and isolating the data

Suppose a user has opened a new tab in the browser (this one). What, it would seem, can be extracted in such a simple case?

For example, what the user searched for in a search engine and what the browser has not yet erased from memory:

<div class="suggest2content suggest2contentthemelarge" style="min-width:693px;">
  <div class="suggest2group">
    <div class="suggest2title">
      <span class="suggest2a11y"> :</span>
    </div>
    <ul class="suggest2items">
      <li class="suggest2-item suggest2-itemtypetext i-bem" data-bem="{"suggest2-item":{}}"><span class="suggest2-itemtext"><!-- ,    --> <b>  </b></span></li>
      <li style="list-style: none">...</li>
    </ul>
  </div>
</div>

Some personal data can also be found, for example the mailbox of a specific user:

<div class="...">
  <a href="https://passport.yandex.ru/passport?mode=passport" tabindex="0"><span class="usericon"></span> <span class="username"> <!--  ""    --> <span class="userfirst-letter">n</span>astya.ivanova</span> <span class="notice noticemoreno i-bem noticejsinited" data-bem="{"notice":{}}"></span></a>
  <div class="popupcontent">
    <div class="b-menu-vert dropdown-menumenu dropdown-menumenuthemeffffff">
      <ul class="b-menu-vertlayout">
        <li style="list-style: none">...</li>
        <li class="b-menu-vertlayout-unit">
          <div class="b-menu-vertitem b-menu-vertitemthemegray multi-authaccount">
            <span class="link multi-authaccount-link user userblankyes useraccountyes" tabindex="0"></span>
            <div class="usericon">
              <span class="link multi-authaccount-link user userblankyes useraccountyes" tabindex="0"></span>
            </div><span class="link multi-authaccount-link user userblankyes useraccountyes" tabindex="0"><span class="username"><span class="userfirst-letter">n</span>astya.ivanova</span></span>
          </div>
        </li>
        <li style="list-style: none">...</li>
      </ul>
    </div>
  </div>
</div>
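
In the fragment above the login is split across nested elements (the first letter sits in its own span), so a plain substring search would miss it. A minimal sketch of reassembling it with Python's standard html.parser module follows. The class name "username" comes from the dump; the helper names ClassTextCollector and texts_by_class are hypothetical, and the sample string is a shortened copy of the fragment.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextCollector(HTMLParser):
    """Collect the text content of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting depth inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.buffer).strip())
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def texts_by_class(html, css_class):
    parser = ClassTextCollector(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    sample = (
        '<a href="x"><span class="usericon"></span> <span class="username"> '
        '<span class="userfirst-letter">n</span>astya.ivanova</span></a>'
    )
    print(texts_by_class(sample, "username"))  # ['nastya.ivanova']
```

The parser is tolerant of broken markup, which matters for dumps, but the depth counting assumes that tags inside the matched element are properly closed.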

Using the dump analyzer, we also managed to obtain the entire source code of the page. Apparently, somewhere in the browser's code a string is built that contains the whole source of the web page. It can be analyzed both manually and with the help of various HTML parsers.

Useful information is often obtained through so-called "cross-site linking", for example when a web resource supports authorization through a Google+ account, a Facebook account, and so on. In this case it becomes possible to obtain the analytical data closest to a specific user.

Suppose the target user has a Google account and has commented on some item on a site that is linked to that account. In this case, much of the user's Google information will be embedded in the source code of the web page. Here is an approximate part of the dump that can be extracted from most pages integrated with Google:

<div class="SR"> <h3 class="zi"><!--      --> <a class="ob tv Ub Hf" href="./113137362950198752663">Nastya Ivanova</a></h3>   YouTube </div>
<span class="uG Ve"><span class="ds Vt Hm dk Q9" tabindex="0" title=" ">   </span>  -  <span class="uv PL"><a class="oUs FI Rg" href="113137362950198752663/posts/T8xcx2jKjGP" rel="noreferrer" style="display:none" target="blank">2014-04-05</a></span></span>
undefinedundefinedundefined
<div class="Al pf"> undefined
  <div class="Xx xJ">
    <div class="Ig At dn">
      <div class="Bt Pm">
        <div class="tG QF"> </div><!--   -->
        <div class="Ct"> Dear not russian speakers! For full understanding the Soviet Anthem you should hear and understand it on russian language. Interpretation is not so bad but meaning is distorted here and there. And thats because English is not so rich language as Russian I think ;) </div>
      </div>
    </div>
  </div>
</div>
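
Pulling the profile link and display name out of such a fragment can be sketched in a few lines of Python. The rule used here, that a profile link is an href consisting only of a long numeric ID (optionally prefixed with "./"), is an assumption based on this single dump, and the names ProfileLinks and find_profiles are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: a profile href is just a long numeric ID, e.g. "./1131...663".
PROFILE_HREF = re.compile(r"^(?:\./)?(\d{15,})$")


class ProfileLinks(HTMLParser):
    """Collect (profile_id, display_name) pairs from anchors that look like profile links."""

    def __init__(self):
        super().__init__()
        self.current_id = None  # ID of the profile anchor being read, if any
        self.text = []
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            match = PROFILE_HREF.match(dict(attrs).get("href") or "")
            self.current_id = match.group(1) if match else None
            self.text = []

    def handle_data(self, data):
        if self.current_id:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_id:
            self.profiles.append((self.current_id, "".join(self.text).strip()))
            self.current_id = None


def find_profiles(html):
    parser = ProfileLinks()
    parser.feed(html)
    parser.close()
    return parser.profiles


if __name__ == "__main__":
    sample = (
        '<h3 class="zi"><a class="ob tv Ub Hf" href="./113137362950198752663">'
        "Nastya Ivanova</a></h3>"
        '<a class="oUs FI Rg" href="113137362950198752663/posts/T8xcx2jKjGP">2014-04-05</a>'
    )
    print(find_profiles(sample))  # [('113137362950198752663', 'Nastya Ivanova')]
```

The link to the individual post is deliberately skipped, since its href continues past the numeric ID.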

At the very least, we managed to get a link to the user's profile along with the first and last name. On the basis of this data alone, a large-scale collection of information about the user can be launched.

What is the result?

Large web projects often release separate applications for their clients. As a rule, these applications run at the same level as the browser, that is, at the level of the operating system. This approach makes it possible not only to simplify certain functions for the user, but also to collect valuable analytical data that can rarely be gathered inside the browser itself with standard web analytics methods. Some vendors of client applications may even obtain deeply personal data from the browser’s memory in order to “get to know” their user as fully as possible.

The browser's memory contains a huge amount of information about the user. The basic fragments can be isolated with a simple process memory analyzer, and these fragments can hold a wide range of information. Ultimately, it is these fragments that can form the basis for a comprehensive analysis of both an individual user and an entire audience.

Summary

The source code of the project that collects information about the user can be viewed here.
A comparable approach is the use of User-ID in Google Analytics.
The most accurate data about the user can be obtained through cross-site linking (social network accounts).

Source: https://habr.com/ru/post/263063/
