
Web analytics "black holes": how much data is lost in GA and why


If you have ever compared data from two analytics tools on the same site, or compared analytics with sales reports, you have probably noticed that they do not always match. In this article I will explain why data goes missing from web analytics platforms and how large these losses can be.

In this article we will focus on Google Analytics as the most popular analytics service, although most analytics platforms implemented on-page have the same problems. Services that rely on server logs avoid some of these problems, but they are used so rarely that we will not cover them here.


Test analytics configurations at Distilled


On Distilled.net, we have a standard Google Analytics property, running from an HTML tag in Google Tag Manager. In addition, over the past two years I have run three parallel implementations of Google Analytics, designed to measure the differences between configurations.

Two of these additional implementations (one in GTM and the other on-page) use locally hosted, renamed copies of the Google Analytics JavaScript file (www.distilled.net/static/js/au3.js instead of www.google-analytics.com/analytics.js) to make them harder for ad blockers to detect.

I also used renamed JavaScript functions (“tcap” and “Buffoon” instead of the standard “ga”) and renamed trackers (“FredTheUnblockable” and “AlbertTheImmutable”) to avoid clashes between duplicate trackers, which often cause problems.

Finally, we have the configuration “DianaTheIndefatigable”, which has a renamed tracker, but uses standard code and is implemented at the page level.

image

All our configurations are shown in the table below:

image

I tested how they behave in different browsers and with different ad blockers by inspecting the pageview hits that appear in the browser’s developer tools:

image

Causes of data loss


1. Ad blockers


Ad blockers, mostly in the form of browser extensions, are becoming more common. Initially, the main reason for using them was to improve performance and the browsing experience on ad-heavy sites. In recent years, the growing emphasis on data privacy has also contributed to their popularity.

The impact of ad blockers

Some ad blockers block web analytics platforms by default, while others can be configured to do so. I tested the Distilled site using Adblock Plus and uBlock Origin, the two most popular desktop browser extensions for ad blocking, but it is worth noting that ad blockers are also increasingly used on smartphones.

The following results were obtained (all figures refer to April 2018):

image

As the table shows, the modified GA configurations do little to get around the blockers.

Data loss due to ad blockers: ~10%

Ad blocker usage can reach 15-25% depending on the region, but many of these installations are Adblock Plus with default settings, which, as we saw above, do not block tracking.

Adblock Plus's share of the ad blocker market is somewhere in the range of 50-70%, and according to the latest estimates it is closer to 50%. Therefore, if we assume that no more than 50% of installed ad blockers block analytics, we get a data loss of about 10%.
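The estimate above is simple arithmetic, and a small sketch makes the range explicit. The function name `estimated_loss` is hypothetical, and the calculation assumes that every visit from a browser whose blocker stops analytics is lost entirely, which the article does not state outright.

```python
def estimated_loss(adblock_usage: float, share_blocking_analytics: float) -> float:
    """Share of all visits lost to ad blockers.

    Assumption: a visit is lost completely whenever the visitor's
    ad blocker is one of those that block analytics.
    """
    return adblock_usage * share_blocking_analytics


# Usage figures from the article (15-25%); 50% of blockers assumed to block analytics.
for usage in (0.15, 0.20, 0.25):
    loss = estimated_loss(usage, 0.50)
    print(f"ad blocker usage {usage:.0%} -> estimated loss {loss:.1%}")
```

With these inputs the estimate spans 7.5% to 12.5%, which is where the rounded figure of about 10% comes from.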

2. “Do Not Track” function in browsers


This is another feature whose use is motivated by privacy. This time, however, it is not an extension but a function of the browsers themselves. Sites and platforms are not obliged to honor the Do Not Track request, but Firefox, for example, offers a stronger feature under the same set of options, which I also decided to test.

The influence of "Do Not Track"

Most browsers now offer the option to send a “Do Not Track” message. I tested the latest releases of Firefox and Chrome for Windows 10.

image
Again, it seems that the modified settings do not help much either.

Data loss due to “Do Not Track”: <1%

Testing showed that only the Tracking Protection feature in the Firefox Quantum browser affects trackers. Firefox holds about 5% of the browser market, but Tracking Protection is not enabled by default. Accordingly, the launch of this feature had no visible effect on Firefox traffic trends on Distilled.net.

3. Filters



The filters that you set up in your analytics system may, intentionally or unintentionally, reduce the amount of traffic shown in your reports.

For example, a filter that excludes certain screen resolutions, which can be bots or internal traffic, will obviously lead to some underestimation of traffic.

Data loss due to filters: N/A

The impact of this factor is difficult to assess, because filter setups vary from site to site. But I strongly recommend keeping a duplicate, unfiltered “main” view so that you can quickly spot any loss of important information.

4. GTM vs on-page vs wrong code


In recent years, Google Tag Manager has become an increasingly popular way to implement analytics because of its flexibility and ease of change. However, I have long noticed that implementing GA this way can lead to underreporting compared with an on-page implementation.

I was also curious about what would happen if I didn’t follow Google’s recommendations for setting up an on-page code.

Combining my own data with data from the site of my colleague Dom Woodman, which uses a Drupal analytics extension as well as GTM, I could see the difference between Tag Manager and misplaced on-page code (placed at the bottom of the page). I then compared this with my own GTM data to get the full picture across all 5 configurations.

Impact of GTM and misplaced on-page code

Traffic as a percentage of baseline (standard implementation using Tag Manager):

image

Main conclusions



It is also worth noting that the custom implementations actually record less traffic than the standard ones. For the on-page code the loss is within the margin of error, but for GTM there is another nuance that could affect the totals.

Since I used unfiltered profiles for the comparison, the main profile contained a lot of bot spam, most of it disguised as Internet Explorer.

Our main profile is the most spammed, yet it also serves as the baseline for the comparison, so the true difference between the on-page code and Tag Manager is somewhat larger than the figures suggest.

Data loss due to GTM: 1-5%


GTM losses vary depending on which browsers and devices your visitors use. On Distilled.net the difference is about 1.7%; our audience is technically advanced, mostly on desktop, and rarely uses Internet Explorer. Depending on the vertical, losses can reach 5%.

I also did a breakdown by device:

image

Data loss due to misplaced on-page code: ~10%

On Teflsearch.com, around 7.5% of the data was lost relative to GTM because of misplaced code. Given that Tag Manager itself underreports, the total losses could easily reach 10%.
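To see how the two losses stack, here is a rough sketch. The helper `combined_loss` is hypothetical, and it assumes the losses compound multiplicatively and that the 1-5% GTM range measured elsewhere in the article also applies to Teflsearch.com; neither assumption is confirmed by the source.

```python
def combined_loss(*losses: float) -> float:
    """Total share of data lost when several independent losses stack.

    Assumption: each loss applies to whatever survived the previous one,
    so the surviving shares are multiplied together.
    """
    kept = 1.0
    for loss in losses:
        kept *= 1.0 - loss
    return 1.0 - kept


misplaced_vs_gtm = 0.075  # loss of misplaced on-page code relative to GTM (from the article)

# GTM's own loss: low end, the Distilled.net figure, and the high end.
for gtm_loss in (0.01, 0.017, 0.05):
    total = combined_loss(misplaced_vs_gtm, gtm_loss)
    print(f"GTM loss {gtm_loss:.1%} -> total loss {total:.1%}")
```

Under these assumptions the total lands between roughly 8% and 12%, which is consistent with the "about 10%" figure.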

Bonus: data loss from channels


Above, we looked at areas where you can lose data altogether. However, there are other factors leading to incomplete data. We will look at them more briefly. The main problems here are the "dark" traffic and attribution.

"Dark" traffic

“Dark” traffic is traffic reported as direct that in reality is not direct, and it is becoming increasingly common.

Typical causes of “dark” traffic:


Also worth noting is the growth of genuinely direct traffic that historically would have been organic. For example, thanks to improved browser autocomplete, search history syncing across devices, and so on, people effectively “type in” a URL that they previously would have searched for.

Attribution


In general, a session in Google Analytics (and on any other platform) is a rather arbitrary construct. You may consider it obvious how a group of requests should be combined into one or more sessions, but in reality, this process relies on a number of rather dubious assumptions. In particular, it is worth noting that Google Analytics usually assigns direct traffic (including “dark” traffic) to a previous non-direct source, if one exists.
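The rule described above, where a direct session inherits the previous non-direct source, can be sketched as follows. This is a simplified illustration and not Google's implementation: the function name, the `(direct)` label and the source strings are assumptions, and the sketch ignores any lookback window or session timeout.

```python
from typing import List, Optional

DIRECT = "(direct)"  # label assumed here for direct traffic


def attribute_sessions(sources: List[str]) -> List[str]:
    """Apply a simplified last-non-direct rule to one user's sessions, in time order."""
    attributed: List[str] = []
    last_non_direct: Optional[str] = None
    for source in sources:
        if source != DIRECT:
            # A real source is recorded as-is and remembered for later direct visits.
            last_non_direct = source
            attributed.append(source)
        else:
            # A direct visit inherits the previous non-direct source, if one exists.
            attributed.append(last_non_direct or DIRECT)
    return attributed


visits = ["(direct)", "google / organic", "(direct)", "newsletter / email", "(direct)"]
print(attribute_sessions(visits))
```

In this example only the very first visit stays direct; the later direct visits are credited to organic search and then to the newsletter, which shows how much the reported channel mix depends on this one assumption.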

Conclusion


I was somewhat surprised by some of the results I received, but I am sure that I didn’t cover everything, and there are other ways to lose data. So, research in this area can be continued.

More articles like this are available on my Telegram channel (proroas).

Source: https://habr.com/ru/post/451282/

