
How we conduct A/B tests

Lately there have been more and more posts on Habr about the value of A/B testing (its benefits, how it raises conversion). If you read them carefully, an interesting picture emerges: a small change to the interface or to the logic of what gets shown can produce a noticeable change in conversion, and if you multiply together all the conversion gains quoted under the cut of such posts, you end up with a 2x increase.

I can already picture dozens or even hundreds of webmasters and managers checking every project change with A/B tests, ordering audits from usability labs, and waiting for the notorious 2x growth in conversion. Let's look at what actually happens...

In reality, an A/B test is an ordinary statistical experiment and obeys all the usual rules of statistics. According to those rules, a statement like "conversion changed by 17%" cannot exist on its own.
For those who are interested in the rationale:

It may exist as: a 17% change in conversion with a confidence interval of 10% (in this case the change is significant).

Or as: a 17% change in conversion with a confidence interval of 20% (in this case the change is not significant).
What quantities do you need to know in order to measure a difference together with its confidence interval?

1) The size of the control groups (assumptions: 95% confidence level, 80% statistical power).
The approximate formula for the size of each group:

n = 16 · p · (1 − p) / δ²

where

δ is the minimum detectable effect (the absolute change in conversion, e.g. 0.0034 for 0.34 percentage points);

p is the baseline conversion rate;

n is the number of participants in each group.

Then, for a baseline conversion of 2% and a 17% relative change (0.34 percentage points in absolute terms), you need:

n = 16 × 0.02 × (1 − 0.02) / 0.0034² ≈ 27,000 participants in each group in order to detect the change as significant.
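A minimal Python sketch of this rule of thumb (the function name and rounding are my illustration; the article itself only gives the formula):

```python
def sample_size_per_group(p, delta):
    """Participants needed per group for ~95% confidence and ~80% power,
    using the n = 16 * p * (1 - p) / delta**2 rule of thumb from the text.

    p     -- baseline conversion rate (e.g. 0.02 for 2%)
    delta -- minimum detectable effect, absolute (e.g. 0.0034 for 0.34 p.p.)
    """
    return int(round(16 * p * (1 - p) / delta ** 2))


# baseline conversion of 2%, detecting a 17% relative change (0.34 p.p. absolute)
print(sample_size_per_group(0.02, 0.02 * 0.17))  # ~27,000 per group
```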

2) The confidence interval:

CI = z · √(p · (1 − p) / n)

where

p is the baseline conversion rate;

n is the number of participants in each group;

z ≈ 2 for a 95% confidence level, ≈ 3 for a 99.8% confidence level.

An example of the confidence interval for a conversion of 2%, a 95% confidence level and a sample size of 10,000:

CI = 2 × √(0.02 × (1 − 0.02) / 10,000) ≈ 0.0028 (0.28 percentage points, meaning that any value obtained in the experiment that lies between 1.72% and 2.28% is not significantly different from the baseline).
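The same check as a Python sketch (the function name is mine):

```python
import math

def conversion_confidence_interval(p, n, z=2):
    """Half-width of the confidence interval for a measured conversion rate.

    p -- baseline conversion rate
    n -- participants in the group
    z -- ~2 for a 95% confidence level, ~3 for 99.8%
    """
    return z * math.sqrt(p * (1 - p) / n)


ci = conversion_confidence_interval(0.02, 10_000)
print(f"±{ci:.4f}")                           # ±0.0028 (0.28 p.p.)
print(f"{0.02 - ci:.2%} .. {0.02 + ci:.2%}")  # 1.72% .. 2.28%
```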

From this we can conclude that for most sites in the Runet, A/B tests simply cannot produce a significant result: they do not have enough traffic.

Meanwhile, most SaaS tools will happily display A/B test results on any sample, even a statistically insignificant one, and a lot of wrong conclusions get drawn from that.

We were handed a task from the top: learn how to run A/B tests and use them all the time.

"What could be simpler," I thought: split all traffic by user cookie into even/odd, append something to the URL, and Yandex.Metrika or GA will tell us the whole truth about the test. Only now do I understand how wrong I was.
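A minimal sketch of that kind of cookie-based split (the hashing approach and all names here are my illustration, not what we actually shipped):

```python
import hashlib

def ab_group(cookie_id: str, experiment: str = "recs-block") -> str:
    """Deterministically assign a visitor to group 'A' or 'B'.

    Hashing the cookie id together with an experiment name gives a stable,
    roughly 50/50 split and keeps different experiments independent.
    """
    digest = hashlib.md5(f"{experiment}:{cookie_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


print(ab_group("visitor-cookie-42"))  # the same cookie always lands in the same group
```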

We got very lucky: our system architect quickly studied the theory, understood it all, and built a simple interface for recalculating all the necessary parameters. We started measuring (for reference, the site's daily traffic rarely drops below 30,000 unique visitors):

A/B test #1 (testing the presence of a recommendation block for the user; I really wanted to launch something, breathe life into the project, get some leads):



I keep an eye on the interim numbers, so to speak "by the beacons", while the architect records everything properly.

Day one: +10% leads in the group with the block; I rub my hands (it smells like a bonus).

Day two: +9%; we don't jump to conclusions yet.

...

A week passed; on average, by those beacons, a solid +7%. I stop the A/B test and turn the block on for all users. (The architect says "the confidence intervals haven't separated." I attached no importance to that.)

A/B test #2 (testing the size of the recommendation block)



We ran the test for more than a month: no significant result.

We report to management in person: we have learned how to run A/B tests, we can measure everything. What's the new task? "Let's make the site like Apple.com, with one big picture and a minimalist menu."

A/B test #3 (aggregator vs. monobrand)



"You can't do that, we worked so hard on the dynamic recommendations, and now hello, Apple.com" — there were a lot of emotions, but there was nothing to be done: we built it and started the test.

Day one: the test page keeps pulling ahead, by 13%. I'm horrified — what about my expertise? — and go back to cramming the theory.

Day two: the numbers evened out, and I realized I just had to wait.

Ten days passed and the confidence intervals finally separated: the old page really did turn out to be better, though not by several times, just a little. On the overall conversion for refrigerators it was +12% with a confidence interval of 10%, i.e. you can only be certain of about a 2% improvement, or you have to keep the test running longer.

When you run A/B tests, please don't repeat my mistakes: check the statistical significance of your results, and ask how any given conclusion was reached and on what basis.
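One simple way to do such a check without trusting a tool's dashboard is a textbook two-proportion z-test; this sketch is my illustration, not the exact method we used:

```python
import math

def z_score(conv_a, visitors_a, conv_b, visitors_b):
    """Two-proportion z-test for the difference between two conversion counts."""
    p_a = conv_a / visitors_a
    p_b = conv_b / visitors_b
    # pooled conversion rate under the null hypothesis of "no difference"
    p = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(p * (1 - p) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / se


# 2.00% vs 2.34% conversion (a +17% relative lift) on 10,000 visitors per group
z = z_score(200, 10_000, 234, 10_000)
print(round(z, 2), abs(z) > 1.96)  # 1.65 False -- still not significant at 95%
```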

P.S. For projects in the second thousand of the Russian Alexa ranking, A/B tests are a myth. There simply isn't enough traffic.

Source: https://habr.com/ru/post/235597/

