
Have you ever wondered how targeted advertising decides what to show you? Why, without having liked anything while browsing, do you see ads related to the sites you visited when you return to Facebook? And who is interested in tracking users? As part of a study project, I had to find out which companies are behind the tracking of site visits and what they use to do it without attracting much attention.
Why discrimination
What is meant by discrimination against users on the Internet? It is when prices in online stores change depending on the device used to view the catalog. A violation of privacy begins when sites show you ads for insomnia remedies because you stay up late, since this means that data about when you are online is being passed to third-party companies.
What are web beacons
A web beacon (also known as a “1x1 pixel image”) is a tiny or transparent image that is embedded in a page and used to track user actions.

Such invisible beacons can be used not only for web analytics, but also to collect aggregated information for sale to third parties and to build social graphs. Another use of a web beacon is to check that an email has been read. The sender learns this as soon as the image is requested via its specific link, and the recipient may not even notice that it was in the body of the message.
Data collection and statistics
As input data I had several JSON files with links to images (both statically and dynamically loaded) from the top 800 domains (according to Alexa). What remained was to write a script that parses these files, follows the links, downloads the images, and stores information about them in an SQLite database.

These JSON files contained all image links, both first-party (the image is hosted on the same site as the link to it) and third-party (the image is hosted on another site). In the first case beacons may serve entirely innocuous purposes (web analytics within the site), but in the second case several parties are involved, and this is cross-site tracking. Since I was interested in the latter case, I used the tld library to extract the top-level domain.
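The first-party/third-party split can be sketched as follows. This is only an illustration, not the original script: the helper names `registered_domain` and `is_third_party` are hypothetical, and the "keep the last two host labels" rule is a deliberate simplification that mishandles suffixes such as co.uk, which is the kind of case a library like tld exists to handle.

```python
from urllib.parse import urlparse


def registered_domain(url):
    """Naive stand-in for a public-suffix lookup: keep the last two host labels.

    Assumption: this is wrong for multi-part suffixes such as "co.uk";
    a dedicated library handles those cases properly.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def is_third_party(page_url, image_url):
    """True when the image is served from a different registered domain."""
    return registered_domain(page_url) != registered_domain(image_url)
```

With this rule, an image on img.example.com embedded in www.example.com counts as first-party, while the same page loading an image from www.facebook.com counts as third-party.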
The script behaves as if all cookies were cleared before each request, so the Cookie header is empty in the initial requests to the servers. If the server's response contains a non-empty Set-Cookie header, its value is written to the database.
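The cookie check described above might look roughly like this. It is a sketch under assumptions: headers are represented as plain dicts, and the function names `cookie_names` and `sets_cookie_on_clean_request` are invented for the example; only the standard-library `http.cookies` module is used.

```python
from http.cookies import SimpleCookie


def cookie_names(set_cookie_value):
    """Return the cookie names found in one Set-Cookie header value."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    return sorted(jar.keys())


def sets_cookie_on_clean_request(request_headers, response_headers):
    """True if the request carried no cookies but the response sets one."""
    sent = request_headers.get("Cookie", "")
    received = response_headers.get("Set-Cookie", "")
    return sent == "" and received != ""
```

A response that sets a cookie on such a "clean" request is handing out a fresh identifier, which is the value worth recording.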
There are two ways to detect a beacon: check the image dimensions, or check the Content-Length field in the HTTP header. But not all responses contain the Content-Length and Content-Type fields, since they are optional and may even contain incorrect data. There are also 1x1 beacons that arrive with a Content-Length above 100 because the image is a PNG. Therefore, when plotting the graphs, I did not take the Content-Length value into account.
What if there is no image in the response? Sometimes the server returns status 204. This means there is no content, but the visit to the link is recorded nevertheless. Therefore, if the status is 204 and the Content-Type in the HTTP header contains “image/”, the script assumes that this is a web beacon and writes width = 0 and height = 0 to the database. There were 37,294 such beacons (1.53%).
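The detection rules from the last two paragraphs can be put together in a short sketch. It is not the original script: the function names are hypothetical, and reading the dimensions straight from PNG and GIF headers (rather than with an imaging library, and without JPEG support) is an assumption made to keep the example dependency-free. As in the text, Content-Length is ignored.

```python
import struct


def image_size(data):
    """Read (width, height) from PNG or GIF header bytes, else None."""
    # PNG: 8-byte signature, 4-byte chunk length, "IHDR", then width and height.
    if data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR" and len(data) >= 24:
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        return struct.unpack("<HH", data[6:10])
    return None


def classify(status, content_type, body):
    """Return (width, height) if the response looks like a beacon, else None."""
    # Empty response that still claims to be an image: stored as 0x0.
    if status == 204 and "image/" in (content_type or ""):
        return (0, 0)
    size = image_size(body or b"")
    if size is not None and size[0] <= 1 and size[1] <= 1:
        return size
    return None
```

The size test mirrors the condition “width <= 1 and height <= 1” used later in the queries, so the 0x0 rows for 204 responses are counted as beacons too.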
A total of 8,586,314 image links were checked. The database holds data on 5,873,372 third-party images, of which 2,441,277 are beacons (41% of third-party images are web beacons!).
And some more statistics
The image_domains table stores information about image providers (that is, not the top 800 sites that link to the images, but the servers that actually host them).
Number of domains: 800
Number of domains with at least one beacon: 760
Number of pages: 124,214
Number of pages with at least one beacon: 111,442
Number of image providers: 4,348
Number of beacon providers: 1,325
The fact that no beacon was found on 40 domains does not mean that they do not use them. They probably use beacons of non-standard sizes (1x2, 3x1), which also turned up during a random check of the links.
Top players in the web tracking market
So, the database contains 2,431,277 beacons. It is interesting to find out which of the 1,325 beacon providers appear most often on the pages of the top 800 domains.
def plot1(dbname, condition):
Here condition is “width <= 1 and height <= 1”.
Note that the same beacon may appear on different pages, so the number of beacons is not equal to the number of unique beacons; that is, the images table may contain duplicates (the url field is not unique).
On the x-axis - providers of beacons, on the y-axis - the number of beacons.
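The body of plot1 is not shown here, so the following is only a guess at the query behind such a graph, with plotting left out. The function name `beacons_per_provider` is hypothetical, and the table and column names are inferred from the SQL queries and field names that appear in this article. It returns both the raw count and the count of distinct URLs, which makes the duplicate effect visible.

```python
import sqlite3


def beacons_per_provider(conn, condition="width <= 1 AND height <= 1"):
    """Return rows of (provider, beacon rows, distinct beacon URLs).

    `condition` is pasted into the SQL text, so it must come from
    trusted code and never from user input.
    """
    query = f"""
        SELECT image_domains.domain,
               COUNT(*) AS beacons,
               COUNT(DISTINCT images.url) AS unique_beacons
        FROM images
        INNER JOIN image_domains ON image_domains.id = images.id_image_domains
        WHERE {condition}
        GROUP BY image_domains.domain
        ORDER BY beacons DESC
    """
    return conn.execute(query).fetchall()
```

A provider whose single pixel URL is embedded on two pages yields two beacon rows but one unique beacon.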

Now let us look at how many pages of the top 800 domains are tracked by beacon providers. That is, each of these pages contains at least one beacon.
SELECT image_domains.domain, COUNT(DISTINCT images.id_pages) AS pagescount
FROM images
INNER JOIN image_domains ON image_domains.id = images.id_image_domains
WHERE width <= 1 AND height <= 1
GROUP BY image_domains.domain
ORDER BY pagescount DESC;
On the x-axis - providers of beacons, on the y-axis - the number of pages.

Below you will see the most interesting graph that shows how many unique domains (the very top 800 ones) are tracked by beacon providers, that is, each such domain has at least one page with at least one beacon.
SELECT image_domains.domain, COUNT(DISTINCT pages.id_domains) AS domainscount
FROM images
INNER JOIN image_domains ON image_domains.id = images.id_image_domains
INNER JOIN pages ON images.id_pages = pages.id
WHERE width <= 1 AND height <= 1
GROUP BY image_domains.domain
ORDER BY domainscount DESC;
The x-axis is the beacon providers, and the y-axis is the number of domains.
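To see how the page count and the domain count differ, the two queries can be run together on a toy database. This is a sketch under assumptions: the schema is reconstructed from the column names used in the queries, the provider names are made up, and `provider_reach` and `demo` are hypothetical helpers.

```python
import sqlite3

# Assumed schema, reconstructed from the column names in the queries.
SCHEMA = """
CREATE TABLE pages (id INTEGER PRIMARY KEY, id_domains INTEGER);
CREATE TABLE image_domains (id INTEGER PRIMARY KEY, domain TEXT);
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    id_pages INTEGER,
    id_image_domains INTEGER,
    url TEXT,
    width INTEGER,
    height INTEGER
);
"""


def provider_reach(conn):
    """Return rows of (provider, pages tracked, domains tracked)."""
    return conn.execute(
        """
        SELECT image_domains.domain,
               COUNT(DISTINCT images.id_pages) AS pagescount,
               COUNT(DISTINCT pages.id_domains) AS domainscount
        FROM images
        INNER JOIN image_domains ON image_domains.id = images.id_image_domains
        INNER JOIN pages ON images.id_pages = pages.id
        WHERE width <= 1 AND height <= 1
        GROUP BY image_domains.domain
        ORDER BY domainscount DESC, pagescount DESC
        """
    ).fetchall()


def demo():
    """Toy data: tracker-a sits on three pages that belong to two domains."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO pages VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
    conn.executemany(
        "INSERT INTO image_domains VALUES (?, ?)",
        [(1, "tracker-a.test"), (2, "tracker-b.test")],
    )
    conn.executemany(
        "INSERT INTO images (id_pages, id_image_domains, url, width, height) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "https://tracker-a.test/p.gif", 1, 1),
            (2, 1, "https://tracker-a.test/p.gif", 1, 1),
            (3, 1, "https://tracker-a.test/p.gif", 1, 1),
            (3, 2, "https://tracker-b.test/b.gif", 1, 1),
            (1, 2, "https://tracker-b.test/banner.png", 728, 90),
        ],
    )
    return provider_reach(conn)
```

In the toy data, tracker-a is counted on three pages but only two domains, because two of its pages belong to the same site; the banner from tracker-b is filtered out by the size condition.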

The last graph shows that providers mostly fall into the “web analytics” and “advertising” categories. Interestingly, there are also beacons from the Google search engine and from the social networks Facebook and Twitter. These providers are the most interesting because, if the user is logged in, such tracking is not anonymous.
Invisible tracking images: Facebook and Google as examples
Facebook pixel
Facebook pixels are used for cross-site analytics; any Facebook user can create their own beacon in the Adverts Manager. It is easy to see why this is free: Facebook benefits too, because it receives data on users' visits to various sites and can use it for targeted advertising.
A total of 751 unique Facebook pixels were found in the database, on 59,023 pages.
The following experiment shows how this works. Create a test HTML page with a Facebook pixel, or find a site that has one. Before starting, go to your browser settings and delete all saved cookies. You also need to allow third-party cookies to be stored and sent.
If the user is not logged in to Facebook, all that is sent in the Cookie header of the request for the Facebook pixel is the “fr” field (apparently, the user's location).

In this case anonymous statistics are collected that do not violate the user's privacy. But what if the user is logged in to Facebook? Then values that identify the user are sent.

It turns out that Facebook knows which third-party sites its users visit, although, of course, we consent to this (see Facebook's cookies policy).
Google beacon
Google also tracks users who are signed in to its services. The screenshots show browsing on one of the major discount retailers, where I came across a Google beacon. After logging in to Gmail, browsing stopped being anonymous, because the user identifiers “SID”, “HSID”, “SSID”, “APISID”, and “SAPISID” began to be sent.
Before logging in to Gmail:

After logging in to Gmail:
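The before/after comparison can also be made programmatically by inspecting the Cookie header of the beacon request. The sketch below is an illustration only: the function name `identifying_cookies` is hypothetical, and the set of names is simply the five identifiers listed above, not a complete list.

```python
from http.cookies import SimpleCookie

# Cookie names observed in the beacon request after logging in to Gmail.
GOOGLE_ID_COOKIES = {"SID", "HSID", "SSID", "APISID", "SAPISID"}


def identifying_cookies(cookie_header, known_ids=GOOGLE_ID_COOKIES):
    """Return the identifying cookie names present in a Cookie request header."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return sorted(set(jar.keys()) & set(known_ids))
```

An empty result corresponds to the anonymous case; any name in the result means the request can be tied to an account.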

How to protect yourself
Everyone decides for themselves whether protection is needed; personally, I do not want information about the sites I visit to be sold to third parties without my knowledge (see cookie syncing, cookie matching). Let us consider whether there are ways to protect yourself.
Clear cookies every time
You can clear cookies at the start of each new session and so receive new identifiers from trackers. But, as the previous section showed, this is useless if you are logged in to, say, Facebook in a neighboring tab. In addition, browser fingerprinting is now popular; it makes it possible to re-identify hosts that have cleared their cookies. There are even so-called “eternal cookies” (evercookies): techniques that thwart the user's attempts to cover their tracks by restoring their identifiers. This is achieved by duplicating cookies in HTML5 localStorage, Flash LSOs, ETags, and the fingerprint cache. Techniques for identifying browser users keep improving, and a paper on cross-browser fingerprinting has recently appeared.
Disabling JavaScript
Disabling JavaScript is effective against trackers that need API access to collect data, but it is useless if the tracker uses HTML redirects and simply sets cookies via the HTTP header. In addition, web beacons load perfectly well without scripts. Here, for example, is how this is implemented in the Facebook pixel:
<script>
!function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?
n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
document,'script','https://connect.facebook.net/en_US/fbevents.js');
fbq('init', '777', {em: 'insert_email_variable,'});
fbq('track', 'PageView');
</script>
<noscript><img height="1" width="1" style="display:none"
src="https://www.facebook.com/tr?id=777&ev=PageView&noscript=1"
/></noscript>
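The noscript fallback is plain HTML, so it can be found without executing any JavaScript. A minimal sketch with the standard-library HTML parser follows; the class name `PixelFinder` and the helper `find_pixels` are hypothetical, and the check relies only on the width and height attributes declared in the markup, so images sized through CSS would be missed.

```python
from html.parser import HTMLParser


class PixelFinder(HTMLParser):
    """Collect the src of every <img> declared as at most 1x1 in the markup."""

    def __init__(self):
        super().__init__()
        self.pixels = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attributes = dict(attrs)
        width_ok = self._at_most_one(attributes.get("width"))
        height_ok = self._at_most_one(attributes.get("height"))
        if width_ok and height_ok:
            self.pixels.append(attributes.get("src"))

    @staticmethod
    def _at_most_one(value):
        return value is not None and value.strip().isdigit() and int(value) <= 1


def find_pixels(html):
    """Return the src values of all declared 1x1 (or smaller) images."""
    finder = PixelFinder()
    finder.feed(html)
    return finder.pixels
```

Fed a page that contains the snippet above, the function reports the facebook.com/tr image, which is the request a browser with scripts disabled would still make.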
Do Not Track
DNT is an HTTP header that asks sites not to track the user's actions. In reality, however, there is no guarantee that a “do not track” request will be honored. Moreover, this flag is even used as one of the many parameters in fingerprinting techniques, to identify the browser more accurately.
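Why the DNT flag can work against the user is easy to show with a toy fingerprint. This is not any real tracker's algorithm: the attribute names and values are invented, and hashing a sorted JSON dump with SHA-256 is just one simple way to turn a set of browser properties into a stable identifier.

```python
import hashlib
import json


def toy_fingerprint(attributes):
    """Hash a dict of browser attributes into a stable identifier."""
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute set; real fingerprints use many more signals.
browser = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "UTC+3",
    "dnt": "1",
}
```

Clearing cookies changes none of these inputs, so the hash stays the same between sessions, while switching the DNT flag yields a different hash: the setting becomes one more bit that separates this browser from others.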
Block third-party cookies
Fortunately, browsers have an option to block third-party cookies. It is off by default and well hidden in the depths of the settings. It was hardest to find in Chrome.
In Chrome:
Settings / Show advanced settings ... / Content settings / Block third-party cookies and site data
In Firefox:
Options / Privacy / Use custom settings for history / Accept third-party cookies: Never
In Safari:
Preferences / Cookies and website data / Allow from current website only
In Opera:
Settings / Cookies / Block third-party cookies and site data
If you enable this option and repeat the previous experiments, you will see that third-party cookies (from Facebook and Google, respectively) are neither accepted nor sent. It might seem that you can relax at this point, but what if the identifiers are stored somewhere else (recall the “eternal cookies”)? After you log in to Facebook, identifiers can be stored not only in standard cookies but also duplicated in the browser's local storage, and then, on another site, the code responsible for loading the web beacon can read these identifiers and link them to the user. Only disabling JavaScript would help here ...
Findings
It turns out that web tracking through invisible images is quite widespread. Beacons, while remaining invisible to the user, may have non-standard sizes (for example, 1x5) and come in different formats. And although modern browsers can block the sending and saving of third-party cookies, this option is off by default and is no panacea: as web technologies develop, we can expect services everywhere to adopt other ways of storing user identifiers, because this is their livelihood.
And if once nobody on the Internet knew that you were a cat, now, alas, that is no longer so.