Sampling data significantly reduces the load on computing resources. But can you judge how many holes a whole cheese has from a single slice? And what if sampling can easily cost you $20,000 or more a day?
Sampling often makes it difficult to analyze a data stream accurately, as the case below shows.

Sampling means selecting a representative subset of the data so that conclusions can be drawn about the whole population.
Representativeness can be ensured by selecting elements from the population at random. This means that each site visitor has the same chance of getting into the report. In most cases, this does not affect the shape of the graph, and the difference in values is not noticeable when converted to percentages. But sampling can affect statistically significant differences.
For the sampled data to convey conclusions about the entire population adequately, the sample must not contain any anomalies to begin with: outliers or dips. But no one is immune from them, and the sampled data may be distorted.

Moreover, these anomalies can even be hidden by the marketing effect, as described here.
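To make the effect of sample size concrete, here is a minimal simulation sketch. It assumes a hypothetical population of 100,000 visits with a true conversion rate of 20% and draws simple random samples from it; the numbers and the function name are illustrative and do not come from any analytics service.

```python
import random


def sampled_conversion_rate(visits, sample_size, rng):
    """Estimate the conversion rate from a simple random sample of visits."""
    sample = rng.sample(visits, sample_size)
    return sum(sample) / sample_size


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = the visit converted, 0 = it did not.
population = [1] * 20_000 + [0] * 80_000
true_rate = sum(population) / len(population)

# Draw 200 samples of each size and look at how far the estimates scatter.
for size in (100, 1_000, 10_000):
    estimates = [sampled_conversion_rate(population, size, rng) for _ in range(200)]
    spread = max(estimates) - min(estimates)
    print(f"sample={size:>6}  true={true_rate:.2%}  spread of estimates={spread:.2%}")
```

The smaller the sample, the wider the estimates scatter around the true rate, which is why a small sampled report can hide or invent a difference.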
Why use sampling
Google and Yandex use this technique to reduce the load on their servers. The report is generated much faster, but it may mislead the marketer.
Case: how to lose money due to sampling
Company X receives an average of 2 million users per day. At this volume, Google already applies data sampling. Every day the company buys 50 thousand users for $2 each. Thus, $100,000 per day is spent on advertising.
The average conversion rate of paid traffic into registrations was 25% according to Google Analytics. A check with the t.onthe.io service, which does not use sampling, showed an average conversion rate of 20%.

This means that some data was lost or distorted during sampling. Company X lost $20,000 a day because of this.
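The article does not spell out how the $20,000 figure is obtained. One reading that reproduces it, offered here as an assumption, is to price the registrations that the sampled report counted but that did not actually happen at the cost per registration the report implied. All variable names are illustrative.

```python
# Figures from the case.
paid_users_per_day = 50_000
cost_per_user = 2            # dollars
reported_rate_pct = 25       # conversion according to the sampled report
actual_rate_pct = 20         # conversion according to unsampled data

daily_spend = paid_users_per_day * cost_per_user                  # 100_000
reported_regs = paid_users_per_day * reported_rate_pct // 100     # 12_500
actual_regs = paid_users_per_day * actual_rate_pct // 100         # 10_000

# Assumed interpretation: the budget was planned at the reported cost
# per registration, so every "phantom" registration is money wasted.
reported_cost_per_reg = daily_spend / reported_regs               # 8.0
phantom_regs = reported_regs - actual_regs                        # 2_500
overpayment = phantom_regs * reported_cost_per_reg                # 20000.0

print(f"Overpayment per day: ${overpayment:,.0f}")
```

The same number falls out of the relative error: 5 percentage points out of a reported 25% is one fifth of the $100,000 daily budget.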
How to avoid GA sampling
The sampled data does not always objectively reflect the situation. There are several ways to avoid sampling.
1. Premium GA account
With a premium account, Google provides unsampled data for up to 1 billion hits per month. But such an account costs $150,000 a year, and there are cheaper ways.
2. Reducing the sampling time interval
If a report covers a long time period (for example, a year), Google will most likely sample the data. To prevent this, you can split the time interval into smaller parts, for example, months. And then happily glue all the months together by hand.
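Splitting the interval and gluing the pieces back together can be sketched as follows. The `fetch_sessions` function is a hypothetical placeholder for whatever export or API call returns the metric for one date range; it is not a real Google Analytics function.

```python
from datetime import date, timedelta


def month_ranges(start, end):
    """Yield (first_day, last_day) pairs covering start..end month by month."""
    current = start
    while current <= end:
        # First day of the following calendar month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        last_day = min(next_month - timedelta(days=1), end)
        yield current, last_day
        current = next_month


def fetch_sessions(first_day, last_day):
    """Hypothetical stand-in: pretend every day brings exactly one session."""
    return (last_day - first_day).days + 1


# Request each month separately, then add the results up.
total = sum(
    fetch_sessions(first, last)
    for first, last in month_ranges(date(2015, 1, 1), date(2015, 12, 31))
)
print(total)
```

This only works for additive metrics such as sessions or pageviews. Unique users cannot simply be summed across months, because the same person may appear in several of them.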
3. Increase accuracy
You can increase the sampling accuracy in the GA settings when generating a report. The error in the presentation of data will decrease, but will not be reduced to zero.
4. Data segmentation using views
Set up multiple data views. For example, if the site has 10 main sections, you can create 10 views, each receiving data only from its own section. Overall, the same 2 million users visit the site, so each section receives 200,000 visits. It turns out that the data in each section should not be sampled. The downside is that the analytics for the whole site will again have to be glued together by hand.
You can also use the Google Analytics Query Explorer tool or R scripts. Learn more about these methods here.
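When the per-view reports are glued together by hand, it is safer to add up raw counts and compute the rate once than to average the per-view percentages. A minimal sketch, with made-up view names and numbers used purely for illustration:

```python
# Hypothetical per-view exports: (view name, sessions, registrations).
view_reports = [
    ("news", 200_000, 38_000),
    ("sport", 200_000, 44_000),
    ("blog", 200_000, 41_000),
]


def combine_views(reports):
    """Sum raw counts across views and derive the overall conversion rate."""
    total_sessions = sum(sessions for _, sessions, _ in reports)
    total_regs = sum(regs for _, _, regs in reports)
    return total_sessions, total_regs, total_regs / total_sessions


sessions, registrations, rate = combine_views(view_reports)
print(f"sessions={sessions}  registrations={registrations}  conversion={rate:.2%}")
```

Averaging percentages gives the same answer only when every view has the same number of sessions, which real sections rarely do.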
Services that do not sample data
Many web analytics services do not sample data. These include t.onthe.io, stathat, Librato, Sumologic.
Summary
- Conclusions based on sampled data can lead to loss of information or money.
- You can get rid of GA sampling in several ways: by decreasing the time interval, by data segmentation, by adjusting the accuracy.
- Services that do not sample data: t.onthe.io, stathat, Librato, Sumologic.