This article describes a method for calculating cannibalization for a mobile application based on a classic A/B test. The target actions within re-attribution from an advertising source (Direct, Criteo, AdWords UAC and others) are compared with the target actions in a group for which the advertising was turned off.
The article gives an overview of the classical methods for comparing independent samples, with a brief theoretical basis and a description of the libraries used; it also briefly describes the essence of the bootstrap method and its implementation in the Facebook bootstrapped library, as well as the problems encountered in practice when applying these techniques and how to solve them.
Actual data is either obfuscated or not presented in order to comply with the non-disclosure agreement.
In the future, I plan to complement and slightly modify this article as new facts appear, so this version can be considered the first release. I would be grateful for comments and reviews.
Introduction
Cannibalization is the flow of traffic, both total and targeted, from one channel to another.
Marketers usually use this indicator as an additional coefficient K when calculating CPA: the pre-calculated CPA is multiplied by (1 + K). Here CPA means total spending on attracting traffic divided by the number of target actions that are monetized directly, i.e. bring actual profit (for example, a target call), and/or indirectly (for example, growth of the ad database, audience growth, and so on).
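As a toy illustration of this adjustment (the numbers below are invented and are not from our experiment):

# Illustrative numbers only, not from the experiment described below
ad_spend = 100_000           # total spend on the paid channel
target_actions = 2_000       # monetized target actions attributed to the channel
k_cannibalization = 0.15     # share of free-channel actions cannibalized by the channel

cpa = ad_spend / target_actions               # 50.0
cpa_adjusted = cpa * (1 + k_cannibalization)  # 57.5 -- effective cost per target action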
When free channels (for example, calls from organic SERPs, or referral links on sites that are free for us) are cannibalized by paid ones (Direct and AdWords instead of organic, ads in social network feeds instead of clicks on posts placed in groups for free, and so on), this carries risks of financial loss, so it is important to know the cannibalization rate.
In our case, the task was to calculate the cannibalization of "organic" transitions into the application by transitions from the Criteo advertising network. The unit of observation is a device, or rather its device/user identifier (GAID/ADVID and IDFA).
Preparing the experiment
You can prepare an audience for the experiment by splitting users into groups in the interface of the Adjust analytics system, isolating, by GAID/ADVID (Android) or IDFA (iOS) respectively, those who will see ads from the given advertising network (the control sample) and those to whom ads will not be shown (Adjust provides the Audience Builder API). Then, for the control sample, you can enable the advertising campaign in the advertising network under study.
I will note from my side that, intuitively, a more rigorous implementation of the experiment in this case would be to take four groups:
- (1) retargeting disabled for all channels (the experimental group);
- (2) only retargeting with Criteo enabled;
- (3) only retargeting with Criteo disabled;
- (4) all retargeting enabled.
Then one could calculate (1)/(2), obtaining the actual cannibalization of "organic" transitions to the application by Criteo advertising campaigns, and (3)/(4), obtaining the cannibalization by Criteo in a "natural" environment (after all, Criteo can obviously cannibalize other paid channels as well). The same experiment should be repeated for the other advertising networks to determine the impact of each of them; in an ideal world it would also be good to explore cross-cannibalization between all key paid sources that make up the largest share of total traffic, but that would take so long (both to prepare the experiments in terms of development and to evaluate the results) that it would draw criticism for unwarranted meticulousness.
In fact, our experiment was carried out under conditions (3) and (4); the samples were split in a ratio of 10% to 90%, and the experiment ran for two weeks.
Preliminary preparation and verification of data
An important stage before any analysis is proper preparation and cleaning of the data.
It should be noted that the number of devices actually active during the experiment was roughly half (42.5% and 50% of the control and experimental groups, respectively) of the number of devices in the full initial samples, which is explained by the nature of the data:
- firstly (and this is the key reason), the Adjust retargeting sample contains the identifiers of all devices that have ever installed the application, including devices that are no longer used and devices from which the application has already been deleted;
- secondly, not every device necessarily opened the application during the experiment.
Nevertheless, we calculated cannibalization on the data of the full samples. Personally, the correctness of such a calculation still seems a debatable question to me: in my opinion, it is more correct to remove everyone who deleted the application and did not reinstall it (such devices carry the corresponding tags), as well as those who have not opened the application for more than a year, since over such a period the user could have changed the device. The downside is that this way we could drop from the sample users who did not open the application during the experiment but might have done so had we shown them ads on the Criteo network. In a perfect world all these forced simplifications and assumptions should be investigated and verified separately, but we live in a world of "do it fast and furious".
In our case, it is important to check the following points:
- We check for intersections between our initial samples, experimental and control. In a properly implemented experiment there should be no such intersections, but in our case several devices from the experimental sample turned up in the control one. The share of these duplicates in the total number of devices involved in the experiment was small, so we neglected this. If the duplicates were > 1%, the experiment would have to be considered invalid and repeated after cleaning out the duplicates.
- We check that the experiment actually affected the data: retargeting should have been disabled for the experimental sample (at least with Criteo; in a correctly set up experiment, for all channels), so we need to verify that no DeviceID from the experimental group appears in Criteo retargeting. In our case some DeviceIDs from the experimental group did end up in retargeting, but they made up less than 1%, which is negligible. A sketch of both checks is given right after this list.
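A minimal sketch of both checks (the frame names data_exp, data_control and criteo_retargeting are assumptions for illustration, not the actual pipeline):

# One device_id per row in each frame
exp_ids = set(data_exp['device_id'])
control_ids = set(data_control['device_id'])
retargeted_ids = set(criteo_retargeting['device_id'])

# Share of duplicates between the experimental and control samples
overlap_share = len(exp_ids & control_ids) / len(exp_ids | control_ids)

# Share of experimental devices that nevertheless ended up in Criteo retargeting
leaked_share = len(exp_ids & retargeted_ids) / len(exp_ids)

print(f'overlap: {overlap_share:.2%}, leaked into retargeting: {leaked_share:.2%}')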
Direct evaluation of the experiment
We will look at the change in the following target metrics: the absolute one, the number of calls, and the relative one, the number of calls per user, in the control group (which saw ads on the Criteo network) and the experimental group (for which ads were turned off). In the code below, the variable data is a pandas.DataFrame built from the results of the experimental or control sample.
There are parametric and non-parametric methods for assessing the statistical significance of differences between independent samples. Parametric criteria have greater statistical power, but also restrictions on their applicability; in particular, one of the basic requirements is that the measured values of the observations in the sample be normally distributed.
1. Testing the distributions of values in the samples for normality
The first step is to examine the available samples for the type of distribution and the equality of the sample variances using standard tests: the Kolmogorov-Smirnov and Shapiro-Wilk criteria for normality and Bartlett's test for homogeneity of variances, implemented in the scipy.stats library, with a significance level of p-value = 0.05:
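A minimal sketch of these checks with scipy.stats (here data_conv_exp and data_conv_control are assumed to be the per-bucket metric series of the experimental and control groups, built as in the snippet below; the names are mine, not from the production code):

from scipy import stats

alpha = 0.05

# Shapiro-Wilk: H0 -- the values come from a normal distribution
_, p_shapiro = stats.shapiro(data_conv_exp)

# Kolmogorov-Smirnov against a normal law with the sample's own mean and std
_, p_ks = stats.kstest(data_conv_exp, 'norm',
                       args=(data_conv_exp.mean(), data_conv_exp.std()))

# Bartlett: H0 -- the variances of the two groups are equal
_, p_bartlett = stats.bartlett(data_conv_exp, data_conv_control)

print(p_shapiro > alpha, p_ks > alpha, p_bartlett > alpha)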
Additionally, for visual assessment of the results, you can use the histogram function.
# Calls and unique devices per bucket, then the conversion rate per bucket
data_agg = data.groupby(['bucket']).aggregate({'device_id': 'nunique', 'calls': 'sum'}).fillna(0)
data_conv = data_agg['calls'] / data_agg['device_id']
data_conv.hist(bins=20)

The histogram is read as follows: a conversion value of 0.08 occurs 10 times in the sample, a value of 0.14 occurs once. It says nothing about the number of devices behind any particular conversion value.
In our case, the distribution of the metric, both in absolute terms and in relative terms (the number of calls per device), is not normal in either sample.
In this situation, you can either apply the non-parametric Wilcoxon rank-sum (Mann-Whitney) test, implemented in the scipy.stats library, or try to bring the distribution of values in the samples to a normal form and apply a parametric criterion such as Student's t-test.
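A minimal sketch of the non-parametric path; scipy.stats.mannwhitneyu implements the Wilcoxon-Mann-Whitney test for independent samples (the series names are the same assumptions as above):

from scipy import stats

# H0: the distributions of the metric in the two groups are the same
stat, p_value = stats.mannwhitneyu(data_conv_exp, data_conv_control,
                                   alternative='two-sided')
print(p_value < 0.05)  # True -> the difference is statistically significant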
2. Methods of reducing the distribution of values in samples to normal
2.1. Sub-buckets
One approach to bringing a distribution to a normal form is the sub-bucket method. Its essence is simple, and its theoretical basis is the classical central limit theorem: the sum of n independent identically distributed random variables has a distribution close to normal, and, equivalently, the distribution of the sample means of n independent identically distributed random variables tends to normal. Therefore, we can split the existing buckets into sub-buckets and, taking the average values over the sub-buckets within each bucket, obtain a distribution close to normal:
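A minimal sketch of such a partition (the number of sub-buckets and the hashing scheme here are arbitrary illustrative choices, not our production code):

import numpy as np

N_SUBBUCKETS = 100

# Option 1: honest random assignment of every observation to a sub-bucket
data['subbucket'] = np.random.randint(0, N_SUBBUCKETS, size=len(data))

# Option 2: deterministic assignment via a hash of the device identifier,
# which keeps a device in the same sub-bucket between runs
data['subbucket'] = data['device_id'].apply(lambda x: hash(x) % N_SUBBUCKETS)

# Averages over sub-buckets: by the CLT their distribution tends to normal
subbucket_means = data.groupby('subbucket').calls.mean()
subbucket_means.hist(bins=20)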
There can be many ways to partition; it all depends on the imagination and moral principles of the developer: you can use an honest random split, or a hash of the original bucket, thereby carrying its assignment mechanism over into the scheme.
In practice, however, out of several dozen runs of the code we obtained a normal distribution only once; in other words, this method is neither guaranteed nor stable.
In addition, the ratio of target actions and users to the total number of actions and users in the sub-buckets may not be consistent with the original bucket, so you must first check that this ratio is preserved:
# Share of devices with at least one call, to compare against the same share in the sub-buckets
data[data['calls'] > 0].device_id.nunique() / data.device_id.nunique()
In the course of this check we found that the conversion ratios in the sub-buckets are not preserved relative to the original sample. Since we additionally need to guarantee that the share of calls in the output samples matches the initial ones, we apply class balancing with weighting, so that the data are sampled separately by subgroup: observations with target actions and observations without target actions, in the right proportion. In addition, in our case the samples were of unequal size; intuitively the mean should not change, but how the imbalance of the samples affects the variance is not obvious from the variance formula. To find out whether the difference in sample sizes affects the result, a chi-square test is used: if a statistically significant difference is detected, the larger data frame is downsampled to the smaller one:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def class_balancer(df1, df2, target='calls', pvalue=0.05):
    # Split both frames into observations with and without target actions
    df1_target, df1_other = df1[df1[target] > 0], df1[df1[target] == 0]
    df2_target, df2_other = df2[df2[target] > 0], df2[df2[target] == 0]

    # Check whether the numbers of observations with target actions differ significantly
    total_target_size = len(df1_target) + len(df2_target)
    _, pvalue_target, _, _ = chi2_contingency(
        [[len(df1_target), total_target_size],
         [len(df2_target), total_target_size]])

    # The same check for observations without target actions
    total_other_size = len(df1_other) + len(df2_other)
    _, pvalue_other, _, _ = chi2_contingency(
        [[len(df1_other), total_other_size],
         [len(df2_other), total_other_size]])

    # If the difference is significant, downsample the larger subgroup to the smaller one
    if pvalue_target < pvalue:
        sample_size = min(len(df1_target), len(df2_target))
        df1_target = df1_target.sample(n=sample_size, replace=False)
        df2_target = df2_target.sample(n=sample_size, replace=False)

    if pvalue_other < pvalue:
        sample_size = min(len(df1_other), len(df2_other))
        df1_other = df1_other.sample(n=sample_size, replace=False)
        df2_other = df2_other.sample(n=sample_size, replace=False)

    return pd.concat([df1_target, df1_other]), pd.concat([df2_target, df2_other])


exp_classes, control_classes = class_balancer(data_exp, data_control)
At the output we obtain samples that are balanced in size and consistent in conversion rates, and the metrics under study (calculated as averages over sub-buckets) are now normally distributed, which can be seen both visually and from the normality tests already familiar to us (p-value >= 0.05). For example, for the relative metric:
data_conv = (data[data['calls'] > 0].groupby(['subbucket']).calls.sum() * 1.0
             / data.groupby(['subbucket']).device_id.nunique())
data_conv.hist(bins=50)
Now the t-test can be applied to the averages over the sub-buckets (so it is not the device_id, i.e. the device, but the sub-bucket that acts as the observation).
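A minimal sketch, assuming the balanced frames returned by class_balancer above and a subbucket column assigned as in the earlier sketch:

from scipy import stats

# Sub-bucket averages act as the observations
exp_means = exp_classes.groupby('subbucket').calls.mean()
control_means = control_classes.groupby('subbucket').calls.mean()

# Two-sample t-test for independent samples
t_stat, p_value = stats.ttest_ind(exp_means, control_means)
print(p_value < 0.05)  # True -> the difference is statistically significant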
Having made sure that the changes are statistically significant, we can, with a clear conscience, do the thing we started all this for and calculate the cannibalization:
(data_exp.groupby(['subbucket']).calls.mean()
 - data_cntrl.groupby(['subbucket']).calls.mean()) / data_exp.groupby(['subbucket']).calls.mean()
The denominator should be the traffic without advertising, that is, the experimental group.
3. Bootstrap method
The bootstrap method is an extension and a more advanced, improved version of the sub-bucket method; a Python implementation can be found in the Facebook bootstrapped library.
Briefly, the idea of the bootstrap can be described as follows: the method is essentially a constructor of subsamples that are formed randomly, similarly to the sub-bucket method, but with possible repetitions; in other words, it is sampling with replacement from the general population (if the original sample can be called that). The output is the distribution of means (medians, sums, etc.) computed for each of the generated subsamples.
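Before turning to the library, here is a minimal hand-rolled sketch of the idea in pure numpy (an illustration only, not the library's implementation):

import numpy as np

def naive_bootstrap_mean(values, num_iterations=10000):
    """Means of subsamples drawn with replacement from the original sample."""
    values = np.asarray(values)
    means = np.empty(num_iterations)
    for i in range(num_iterations):
        resample = np.random.choice(values, size=len(values), replace=True)
        means[i] = resample.mean()
    return means

dist = naive_bootstrap_mean(data_conv_exp)
lower, upper = np.percentile(dist, [5, 95])  # interval analogous to the library's bounds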
The main methods of the Facebook bootstrapped library:

bootstrap() implements the subsample-generation mechanism. By default it returns the lower bound (5th percentile) and the upper bound (95th percentile); to get the discrete distribution within this range, you must set the parameter return_distribution=True (the distribution is produced by the auxiliary function generate_distributions()). The num_iterations parameter sets the number of iterations in which subsamples are formed, and iteration_batch_size sets the number of subsamples per iteration. At the output of generate_distributions(), a sample is formed with a size equal to the number of iterations num_iterations, whose elements are the averages of the values of the iteration_batch_size subsamples computed at each iteration. With large sample sizes the data may stop fitting into memory, so in such cases it is desirable to reduce the iteration_batch_size value.

Example: let the original sample contain 2,000,000 observations, num_iterations = 10,000 and iteration_batch_size = 300. Then at each of the 10,000 iterations, 300 lists of 2,000,000 items each will be held in memory.

The function also allows parallel computation on several processor cores / several threads; the required number is set with the num_threads parameter.
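A hedged usage sketch of bootstrap() with the parameters described above (the attribute names on the result object are taken from the library's readme; defaults and names may differ between versions):

import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as bs_stats

values = data_conv_exp.values  # the metric per observation as a numpy array

# Confidence bounds for the mean
result = bs.bootstrap(values, stat_func=bs_stats.mean,
                      num_iterations=10000, iteration_batch_size=300,
                      num_threads=4)
print(result.lower_bound, result.value, result.upper_bound)

# Or the full discrete distribution of the subsample means
distribution = bs.bootstrap(values, stat_func=bs_stats.mean,
                            num_iterations=10000, iteration_batch_size=300,
                            return_distribution=True)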
bootstrap_ab() performs the same actions as the bootstrap() function described above, but additionally aggregates the num_iterations resulting values using the method specified in stat_func. Next, the metric specified in the compare_func parameter is calculated and its statistical significance is evaluated.
compare_functions is a set of functions that provides the tools for forming the metric to be evaluated:

compare_functions.difference()
compare_functions.percent_change()
compare_functions.ratio()
compare_functions.percent_difference()
stats_functions is the set of functions from which the aggregation method for the studied metric is selected:

stats_functions.mean
stats_functions.sum
stats_functions.median
stats_functions.std
You can also pass your own custom function, here as the compare_func, for example:

import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as stats_functions

def test_func(test_stat, ctrl_stat):
    return (test_stat - ctrl_stat) / test_stat

bs.bootstrap_ab(test.values, control.values, stats_functions.mean, test_func,
                num_iterations=5000, alpha=0.05, iteration_batch_size=100,
                scale_test_by=1, num_threads=4)
In fact, (test_stat - ctrl_stat) / test_stat is exactly the formula for calculating our cannibalization.
Alternatively, or for practical experimentation, you can first obtain the distributions using bootstrap(), check the statistical significance of the differences in the target metrics with a t-test, and then apply whatever manipulations you need to them.
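A sketch of that alternative path, again under the naming assumptions introduced above:

from scipy import stats
import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as bs_stats

# Bootstrap distributions of the mean for both groups
dist_exp = bs.bootstrap(data_exp.calls.values, stat_func=bs_stats.mean,
                        num_iterations=5000, return_distribution=True)
dist_control = bs.bootstrap(data_control.calls.values, stat_func=bs_stats.mean,
                            num_iterations=5000, return_distribution=True)

# The distributions of bootstrapped means are close to normal, so a t-test applies
t_stat, p_value = stats.ttest_ind(dist_exp, dist_control)

# And the cannibalization estimate itself
cannibalization = (dist_exp.mean() - dist_control.mean()) / dist_exp.mean()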
An example of the "quality" of normal distribution that can be obtained with this method:

More detailed documentation can be found on the repository page.
At the moment, this is all I wanted (or managed) to tell. I have tried to describe the methods used and the process of their implementation briefly but clearly. It is possible that the methodology needs adjustments, so I will be grateful for feedback and reviews.
I also want to thank my colleagues for their help in preparing this work. In case the article receives mostly positive feedback, I will indicate here their names or nicknames (by prior agreement).
Best wishes to everyone! :)
PS
Dear Championship channel, the task of evaluating the results of A/B testing is one of the most important in Data Science, since not a single launch of a new ML model into production is complete without an A/B test. Maybe it is time to organize a competition to develop a system for evaluating A/B test results? :)