
Last summer's leaks of confidential data into search engines touched, directly or indirectly, everyone who follows the news. Both search robots and the personal data of Russian citizens went under the knife. Let's dig a little deeper and find out how private data can end up "in plain sight."
In the middle of the summer, media outlets competed to report on the leak of web pages containing the SMS texts of subscribers of one of the largest cellular operators. Among the many explanations on offer, some far-fetched and some more or less plausible, one finally emerged that was backed by actual research. A member of the rdot.org forum writing under the pseudonym “C3 ~ RET” published a short research paper, which in turn gave the media another, rather good, angle for new publications.
Meanwhile, as researchers separated the wheat from the chaff, amateur diggers composed more and more search queries that compromised both the owners of web resources processing personal data and the owners of search engines, who allegedly consolidated this data mercilessly on their servers. Examples of such leaks cached by search engines include:
1) personal data of sex shop buyers;
2) documents with the stamp "for official use";
3) deleted photos of the social network VKontakte;
4) personal data of Russian Railways ticket buyers.
Surface situation
Was there really an incident? The head-on collision between web resources and a search engine, so vividly described by the “propaganda department”, is in fact nothing more than the “Google Hacking” technique, demonstrated to the public on a different search engine.

A well-known fact came to the fore: by its documented behavior, a search robot must not index pages that are listed in the special “robots.txt” file in the root of the web resource or that carry the special “noindex” HTML tag, which, incidentally, was proposed by Yandex. However, had the “scapegoat”, Yandex, used only documented means of indexing, the hype would probably have died down quickly, and this article would never have been published ...
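As a rough illustration of the documented mechanism, the sketch below uses Python's standard `urllib.robotparser` to check whether a crawler that honors robots.txt may fetch a page. The robots.txt content, the `example.org` host and the paths are hypothetical and are not taken from the incident.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sms/
Disallow: /orders/
"""


def may_index(url: str, user_agent: str = "*") -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/index.html", "http://example.org/sms/12345"):
        print(url, "->", "allowed" if may_index(url) else "disallowed")
```

Note that robots.txt is only a request to well-behaved robots: it says nothing about other channels through which a page address may reach a search engine.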
In search of truth
The researchers' key findings and their preliminary explanation of the incident, which primarily concerned the leak of SMS messages into the search engine cache, come down to this: on the client side of the SMS gateway, a browser extension from the search engine's developer caused all user data to be transferred to the search engine. A cursory inspection of the source code of the SMS gateway web page, kindly provided by Google's cache after the cellular operator temporarily closed access to the gateway, showed that the page from which messages were sent contained JavaScript code belonging to the Yandex.Metrika service, a webmaster tool designed to collect statistics about visitors to a web page. Thus we have another potential channel through which confidential information could leak into the Yandex search engine cache.
Reading the user agreement is enough to send a paranoid reader over the edge at the following paragraph:
"8. ... The session recording function is fully automatic, cannot analyze the content and meaning of information placed on the pages or entered by third parties (visitors) into fields on the pages of the User’s site, and records it in full regardless of its content. ..." In other words, the user accepts that the Yandex.Metrika script records everything without analyzing it. Our task is to determine what exactly "everything" means.
To test this hypothesis, three pages were created, each containing the above-mentioned JavaScript and a unique text by which it could later be identified in search results. The first page, defectech.ru/p4ste3stOfYa, was open to the Yandex search engine. Another, defectech.ru/4tom1cprOfit, was listed in the robots.txt file. The page addresses were chosen so that they could not be guessed by enumeration, should the search bot have such functionality (this lets us judge whether so-called intelligent scanning played a part in the incident). To notify the bot about the new public page (defectech.ru/p4ste3stOfYa) more quickly, a link to it was placed on the root page (defectech.ru).
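A setup of this kind can be sketched as follows. This is not the tooling used in the experiment: the helpers `make_test_page` and `robots_txt_for` and the `example.org` host are hypothetical. The sketch only shows one way to obtain an address that cannot be guessed, a marker string that can later be searched for, and a robots.txt that closes selected pages.

```python
import secrets
from urllib.parse import urlsplit


def make_test_page(host: str) -> dict:
    """Build a hypothetical test page: a random path plus a unique marker."""
    slug = secrets.token_urlsafe(9)             # 12 random URL-safe characters
    marker = "marker-" + secrets.token_hex(8)   # unique text to search for later
    return {
        "url": f"http://{host}/{slug}",
        "marker": marker,
        "html": f"<html><body><p>{marker}</p></body></html>",
    }


def robots_txt_for(closed_urls) -> str:
    """Produce a robots.txt that asks all robots to skip the given pages."""
    lines = ["User-agent: *"]
    for url in closed_urls:
        lines.append(f"Disallow: {urlsplit(url).path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    open_page = make_test_page("example.org")
    closed_page = make_test_page("example.org")
    print(open_page["url"], open_page["marker"])
    print(robots_txt_for([closed_page["url"]]))
```

If the marker of the closed page ever shows up in search results, the page reached the index by some route other than ordinary crawling.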
Two days later, the results are reassuring: placing the JS code did not affect the indexing of pages closed to search bots in any way. Moreover, the public pages were not indexed either, which can only mean that the test pages had been on the World Wide Web for too short a time for the search robot to reach the unique data. Still, the web analytics code by itself did not trigger page indexing, which means that, at least at this stage, it cannot be regarded as a channel for leaking data to the search engine. According to Yandex representatives, corrections were made to Yandex.Metrika after the leak was discovered (http://bit.ly/qFtSs5 - comments from the search engine's representatives). No details are given. Moving on.
Yandex is not the only search engine out there, so we decided to take a cursory look at what the similar JS code from Google does. A quick look at the sniffer logs revealed the script's activity: it transmits statistical information about the user's actions to the Google Analytics service. This information includes the address of the page on which the code is placed, which means that Google at least “knows” that the page exists (the code is used to generate the statistics that Google displays in the resource’s admin panel, but who knows where else the data goes).
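To make the point concrete, here is a minimal sketch of how a page address can be read back out of a statistics request captured by a sniffer. The request URL, the collector host `stats.example.net` and the parameter name `page` are hypothetical; real analytics services use their own parameter names.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def reported_page(beacon_url: str, param: str = "page") -> Optional[str]:
    """Extract the page address that a statistics request reports, if any."""
    query = parse_qs(urlsplit(beacon_url).query)  # also decodes %-escapes
    values = query.get(param)
    return values[0] if values else None


if __name__ == "__main__":
    # Hypothetical line from a sniffer log.
    captured = (
        "http://stats.example.net/collect"
        "?page=http%3A%2F%2Fexample.org%2Fprivate%2Freport&lang=ru"
    )
    print(reported_page(captured))
```

Whoever receives such a request learns the address of the page, whether or not a search robot was ever allowed to visit it.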
Gas discharge effect
An analysis of the additional software that search engine developers provide in the form of browser extensions revealed side activity in these extensions that could leak information to third parties. The scale of the effect that opens a leakage channel for confidential data can be visualized by analogy with a gas discharge. Without wandering into the wilds of the exact sciences, picture the process: under the influence of an external force, one group of elementary particles collides with another group; those particles collide with others in turn, and the result is an avalanche. In our case, the elementary particles are confidential information or fragments of it, which travel through leakage channels to different parts of the network and eventually become public simply because they are so widespread. Thus, anonymized personal data, once it reaches search engines, can end up uniquely identifying its owner.

Let's take a closer look at the data leakage channels opened by these browser extensions from search engine developers, using Yandex.Bar and Google Toolbar as examples.
When examining the activity of Google Toolbar, the sniffer logs again show a line in which the address of the user's current page is sent to one of the Google services. Moreover, even if you do nothing in the browser, you can observe the plugin's background activity: at regular intervals it calls a Google service with the suggestive name "safebrowsing". The calls recur because the plugin transfers a fairly large amount of data.
Figure: Google Toolbar background activity.
Figure: Yandex.Bar transmits data over plain HTTP even though the page itself is served over an HTTPS connection.
Let us now turn to a similar product from Yandex. To make the results of the study “hotter”, let's look at the plug-in's behavior under combat conditions: we will make a payment in the popular Internet banking system “HandyBank”, deployed by one of the major Russian banks.
A classic of the genre: the plugin sends the page address to the search engine's service. We enter the payment details and click "Next." The Internet banking script passes all the entered information in a GET request, so the confidential information is forwarded to the Yandex service as part of the page address. A blunder ... and one on the part of the banking system's vendor.
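The mechanism is easy to reproduce on paper. In the sketch below, the form fields, the `bank.example` host and the helper names are hypothetical. It only shows that data submitted with GET becomes part of the page address, so any component that reports the current URL reports the data too, whereas with POST the fields travel in the request body.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_request(method: str, action: str, fields: dict) -> tuple:
    """Return (url, body) roughly the way a browser submits an HTML form."""
    encoded = urlencode(fields)
    if method.upper() == "GET":
        return f"{action}?{encoded}", ""   # fields end up in the address
    return action, encoded                 # fields end up in the body


def leaked_via_address(url: str) -> dict:
    """Fields that anyone who learns the page address also learns."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


if __name__ == "__main__":
    # Hypothetical payment form.
    form = {"account": "40817810000000000001", "amount": "1500.00"}
    get_url, _ = build_request("GET", "https://bank.example/pay", form)
    post_url, _ = build_request("POST", "https://bank.example/pay", form)
    print("GET :", leaked_via_address(get_url))
    print("POST:", leaked_via_address(post_url))
```

HTTPS protects the address on the wire, but not from software running inside the browser, which sees the full URL after it has been decrypted.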
Besides quick search, Yandex.Bar can check the spelling of data that the user enters on the current page. We turn this option on, and here is what we see: the plug-in sends all the payment details, packed into a specially composed XML file, to the Yandex spell-checking service (the structure of this file is clearly visible in the corresponding screenshot). And Yandex.Bar sends all of this to its servers over the unprotected HTTP protocol, bypassing the HTTPS connection.
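A check of this kind can be automated over a sniffer log. The sketch below is an illustration under stated assumptions: the log format (a list of request URL and body pairs), the hosts and the sample payload are hypothetical. The function simply flags requests sent over plain HTTP whose URL or body contains a value the user typed into an HTTPS page.

```python
from urllib.parse import unquote_plus, urlsplit


def plaintext_leaks(requests, sensitive_values):
    """Return (url, matched values) for plain-HTTP requests carrying secrets."""
    findings = []
    for url, body in requests:
        if urlsplit(url).scheme.lower() != "http":
            continue  # encrypted in transit, not what we are looking for here
        haystack = unquote_plus(url) + "\n" + unquote_plus(body)
        hits = [value for value in sensitive_values if value in haystack]
        if hits:
            findings.append((url, hits))
    return findings


if __name__ == "__main__":
    # Hypothetical captured traffic: (request URL, request body).
    log = [
        ("https://bank.example/pay", "account=40817810000000000001"),
        ("http://speller.example/check",
         "<request><text>40817810000000000001</text></request>"),
    ]
    for url, hits in plaintext_leaks(log, ["40817810000000000001"]):
        print("sent in the clear to", url, ":", hits)
```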

In the eyes of the analyst
Search engines resemble an octopus stretching its tentacles toward the most remote stores of information on the web. These tentacles are by no means limited to search bots: they also include the various plugins, web developer tools and other useful services that search engines, on the one hand, kindly provide to make end users' lives easier and that, on the other hand, serve the task of indexing "the whole Internet".
The particular attention paid to the news that started this scrutiny of search engines may well be an indirect result of the events unfolding around FZ-152, the amendments to which have shaken the Russian information industry. The published SMS correspondence and MegaFon being made to stand in the corner vividly demonstrate the allegedly poor protection of personal data, a category that, under the updated definition, now covers almost any information created by a human being. But that's another story.