Python Bayesian A/B testing guide


If you landed on this page, you already know what an A/B test is, but maybe you have some doubts about how to analyse the results within a Bayesian framework.

To start the post, then, let me point out the practical difference between the Bayesian approach and the classical, ‘frequentist’ one: the difference is in the form that the metric you want to study takes. For frequentists it is a simple number, while for Bayesians it is a distribution.

While having a distribution instead of a number may look like a mere complication, it turns out to be extremely useful in many cases, A/B tests included. And with this post you will see that, in practice, it is not so difficult to deal with: in fact, it takes just a few lines of code.

Edit 18/01/2018: the code on this page has become the baseline of an A/B testing tool which works with Mixpanel.

Straight to the code

I promised an easy guide. So in this part I will just show you the code, and in the next one, if you are interested, you will see some of the details hidden behind it.

Note that no approximation methods have been used: no Markov Chain Monte Carlo (MCMC), nor any other stochastic (and slow) process. And so, no probabilistic programming framework is needed.

Let’s introduce the sample data: we have already run an A/B test to check the Conversion Rate (CR, from now on) of a web page, and the results are in Table 1, below.


Table 1: The data

Here is the code. You just have to install Scipy and Numba, insert your numbers, and you will get your result.
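
A minimal sketch of what the script looks like (the variable names and the counts below are placeholders for illustration, not the real Table 1 values; replace them with your own numbers, and note that calc_prob_between comes from the module shown next):

```python
# Driver script: a minimal sketch with placeholder counts.
from scipy.stats import beta
from calc_prob import calc_prob_between

# Observed data: visitors and conversions for each option (example values only).
visits_ctrl, convs_ctrl = 16500, 30
visits_test, convs_test = 17000, 50

# Posterior Beta distributions: a flat Beta(1, 1) prior updated with the observed counts.
beta_C = beta(convs_ctrl + 1, visits_ctrl - convs_ctrl + 1)
beta_T = beta(convs_test + 1, visits_test - convs_test + 1)

# Uplift of the Test CR over the Control CR.
lift = (beta_T.mean() - beta_C.mean()) / beta_C.mean()

# Exact probability that the Test option beats the Control one.
prob = calc_prob_between(beta_T, beta_C)

print(f"Test option lifts the CR by {lift * 100:.2f}% "
      f"with {prob * 100:.1f}% probability of being better.")
```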

And the content of the module imported, calc_prob.py:
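
Below is a sketch of that module, assuming integer Beta parameters (which is always true when they come from count data); the Numba @jit decoration is left out here for simplicity, and the closed-form sum it evaluates is spelled out later in the details section:

```python
# calc_prob.py -- sketch of the exact computation of P(p_T > p_C)
# for p_T ~ Beta(a_T, b_T) and p_C ~ Beta(a_C, b_C), with integer parameters.
import numpy as np
from scipy.special import betaln  # log of the Beta function, for numerical stability


def calc_prob_between(beta_T, beta_C):
    """Exact probability that a draw from beta_T exceeds a draw from beta_C.

    beta_T and beta_C are 'frozen' scipy.stats.beta objects; their shape
    parameters are recovered from the .args attribute.
    """
    a_T, b_T = beta_T.args
    a_C, b_C = beta_C.args
    total = 0.0
    for i in range(int(a_T)):
        total += np.exp(
            betaln(a_C + i, b_C + b_T)
            - np.log(b_T + i)
            - betaln(1 + i, b_T)
            - betaln(a_C, b_C)
        )
    return total
```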

Based on the numbers in Table 1, the Test option performs much better than the Control one: almost 60% uplift, with more than 98% probability of being better.

As you can see, these are bare, agnostic results. Whether they are good or not depends on your situation. A deeper discussion of the thresholds will come in another article, but speaking in terms of probability, anything above 95% should be good enough.

Easy, right? Now let’s dig a bit into the details.

The details

In the code I initialized two “Beta” functions, one for each option, feeding them with the numbers we had in the table:

Beta(1 + conversions, 1 + visitors − conversions), one for the Control option and one for the Test option.

Those are in fact the functions which model the data of an A/B test. To better introduce their behavior, check the animation below:


Figure 1: animation of the Beta distribution for a fixed CR. The ‘X’ axis has been zoomed.

Each function is built on top of a flat, uninformative one, defined by β(1,1). The more information (data) we add to this base, the narrower the function becomes.
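
As a quick illustration (the counts below are hypothetical, chosen so that the CR is roughly 0.3% in both cases), the posterior standard deviation shrinks as the number of visitors grows:

```python
from scipy.stats import beta

# Same CR (~0.3%), ten times more data in the second case: the distribution narrows.
print(beta(1 + 3, 1 + 997).std())     # ~0.0020  (3 conversions out of 1,000 visitors)
print(beta(1 + 30, 1 + 9970).std())   # ~0.00056 (30 conversions out of 10,000 visitors)
```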

Now you are probably thinking: is it really so simple? Why does a function exist that appears to be built exactly for modelling A/B tests?

The answer lies in a nicety of Bayes’ theorem. Without going too much into the details: the posterior in Bayes’ theorem is usually hard (if not impossible) to compute exactly, and that’s why a few approximation methods have been developed, like Markov Chain Monte Carlo (MCMC). You can find a lot of articles around the web which use such procedures.

But A/B tests are a lucky case, and we can actually have an exact solution based on the concept of “conjugate priors”. When this concept is applicable, the posterior function of Bayes’ theorem lies in the same family as the prior one, and so we can build the final function with an iterative process.

A/B tests are random experiments with exactly two possible outcomes, “convert” or “don’t”, so each visit is described as a Bernoulli trial, and the Beta distribution is the conjugate prior for such a process. That’s why we can dare to use it in the simple way I did. But if you still have doubts and/or you want to dig deeper, you can check the mathematical details here: the example in that Wikipedia page is exactly our case.
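
Spelled out with the standard Beta–Binomial conjugacy result (in my notation), the update is simply:

$$ \mathrm{Beta}(\alpha,\ \beta) \;+\; \{k \text{ conversions out of } n \text{ visitors}\} \;\longrightarrow\; \mathrm{Beta}(\alpha + k,\ \beta + n - k) $$

Starting from the flat β(1,1) of Figure 1, this is exactly the Beta(1 + conversions, 1 + visitors − conversions) initialization used in the code.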

Coming back to our case, let me show you the two distributions:


Figure 2: distributions for the two CRs.

You can notice that the peaks of the two distributions coincide with the values you calculate in the classical way:

CR = conversions / visitors, computed separately for the Control and the Test option.

The difference, as stated before, is that you have a whole Probability Density Function (PDF) for the CR, instead of a simple number. And thanks to this you can calculate, for example, the variance (popularly called ‘the error’) on the CRs, which is also shown in the label within the plot.
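
With the frozen scipy objects built earlier, both the peak value and the spread come for free (a sketch; beta_T and beta_C are the two distributions defined in the first snippet):

```python
# Posterior mean and standard deviation ("the error") for each option.
cr_T, err_T = beta_T.mean(), beta_T.std()
cr_C, err_C = beta_C.mean(), beta_C.std()
print(f"Test:    CR = {cr_T:.5f} +/- {err_T:.5f}")
print(f"Control: CR = {cr_C:.5f} +/- {err_C:.5f}")
```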

Now you can calculate what is called the ‘uplift’, i.e. how much the Test option increases the CR with respect to the Control one:

uplift = (CR_Test − CR_Control) / CR_Control

But, at this point, you need to assess how trustworthy this result is. How? Well, you can estimate the probability of one option being better than the other!

Note that in the classical, “frequentist” paradigm you have no way of calculating such a probability. In that approach you usually calculate the ‘p-value’, and then check whether or not it falls under an arbitrarily chosen threshold (the ‘α value’, typically 5%), and eventually you can declare something almost impossible to explain to any manager/client/board: “with 95% confidence level, we can reject the null hypothesis”. And then you have to explain that this statement is very different from “this hypothesis is better than the other with 95% probability”, which is the very sentence they really want to hear from you.

But here we are in the Bayesian world, and here we can actually say that. In fact, we have PDFs which define our CRs, and the probability is given by the area under the curve. This is, practically, the definition of a PDF.

A simple example: do you want the probability for the Test option’s CR to be greater than, say, 0.003? Well, it’s just the area under the Test curve between 0.003 and 1.0, which is shaded in Figure 3.


Figure 3: The probability for the Test option CR to be greater than 3E-03.

In mathematical language, you calculate the area by integrating the curve between the two limits: 0.003 (the limit that we chose) and 1 (the hard limit).

With Python, we can calculate this integral exactly (still: no Monte Carlo needed), courtesy of the Mpmath library:
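
Something like the following does the trick (a sketch: a_T and b_T stand for the Test option’s Beta parameters, i.e. 1 + conversions and 1 + visitors − conversions, and the values below are just the placeholder counts used earlier; mpmath.betainc with regularized=True gives exactly the area under the normalized Beta PDF between the two limits):

```python
from mpmath import betainc

# Beta parameters of the Test option (placeholders: 1 + conversions, 1 + non-conversions).
a_T, b_T = 51, 16951

# Area under the Test PDF between 0.003 and 1, i.e. P(CR_Test > 0.003).
prob_above = betainc(a_T, b_T, x1=0.003, x2=1, regularized=True)
print(prob_above)
```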

In this example we considered only one distribution (the Test one), but to measure the trustworthiness of our uplift we have to consider both distributions (Control and Test), and so in order to visualize the situation we have to add a dimension. As a consequence, the probability we are going to measure is defined not by an area anymore, but by a volume: in particular, the volume under the joint probability distribution of Test and Control, in the region where the Test is greater than the Control. Let’s see what it is:

Imagine the plot of Figure 4 on the left as the picture of a mountain, with its view from a satellite in the right plot. Focusing on the plot on the right, you can see the grey line as the border between the land belonging to the Test option (upper triangle) and the land of the Control option. In this scenario, the question becomes: how much more of the mountain does Test own than Control?
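
In formulas (my notation: f_T and f_C are the Test and Control posterior PDFs, which are independent), the quantity we are after is the volume over that upper triangle:

$$ P(\mathrm{CR}_{T} > \mathrm{CR}_{C}) \;=\; \int_0^1 f_C(y) \left( \int_y^1 f_T(x)\, dx \right) dy $$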

UPDATE (13/05/2019): Since I had several requests about this, below you can find the code to recreate the right plot of Figure 4. It uses Monte Carlo sampling, and a lot of points are needed for the figure to look nice:
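
What follows is a sketch of the idea (my own reconstruction, with the same placeholder counts as above): sample many points from the two independent posteriors and scatter them, with the grey diagonal acting as the border between the two “lands”:

```python
# Monte Carlo scatter plot of the joint distribution (the "satellite view").
import matplotlib.pyplot as plt
from scipy.stats import beta

# Posteriors as before (placeholder counts, replace with your own numbers).
beta_C = beta(30 + 1, 16500 - 30 + 1)
beta_T = beta(50 + 1, 17000 - 50 + 1)

n_points = 500_000                           # many samples are needed for a dense, nice cloud
x = beta_C.rvs(size=n_points)                # Control CR on the x axis
y = beta_T.rvs(size=n_points)                # Test CR on the y axis

plt.scatter(x, y, s=1, alpha=0.05)
lim = max(x.max(), y.max())
plt.plot([0, lim], [0, lim], color="grey")   # border: Test "owns" the region above this line
plt.xlabel("Control CR")
plt.ylabel("Test CR")
plt.title("Joint distribution of the two CRs, seen from above")
plt.show()
```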

Code for creating the right plot of Figure 4.

To calculate this volume (the one over the upper triangle of Figure 4), I saw a lot of posts around the web which, again, use approximate methods (like Monte Carlo). Well, the people who wrote them are apparently not aware that an exact method was already worked out by John Cook in 2005 (chapter 2, here). The code within the module calc_prob.py reproduces that formulation, and comes almost in full from a version originally written by Antti Rasinen.
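
For reference, a well-known closed form for this probability, valid when a_T is a positive integer (and the one the calc_prob.py sketch above evaluates), is the following sum, with B the Beta function and p_T ~ Beta(a_T, b_T), p_C ~ Beta(a_C, b_C):

$$ P(p_T > p_C) \;=\; \sum_{i=0}^{a_T - 1} \frac{B(a_C + i,\; b_C + b_T)}{(b_T + i)\; B(1 + i,\; b_T)\; B(a_C,\; b_C)} $$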

In our case this volume is 0.98, which means that the ‘Test’ option performs better than the ‘Control’ one with 98% probability.

Let me highlight that if the result were, hypothetically, 50%, then the mountain would be perfectly shared between the two, meaning that neither option is better than the other, while a smaller value would mean that the Test option is actually worse than the Control one.

That’s all.

As you see, the details behind the few lines of code at the start of this post are deep, and not so straightforward. But I hope I’ve been clear and simple in the description of the concepts, as promised.