A Guide to A/B Testing in Python
From experimental design to hypothesis testing

In this article we'll go over the process of analysing an A/B experiment, from formulating a hypothesis, through testing it, to finally interpreting the results. For our data, we'll use a dataset from Kaggle which contains the results of an A/B test on what appear to be two different designs of a website page (old_page vs. new_page). If you want to follow along with the code I used, feel free to download the Jupyter notebook from my GitHub page. Here's what we'll do:

1. Design our experiment
2. Collect and prepare the data
3. Visualise the results
4. Test the hypothesis
5. Draw conclusions
To make it a bit more realistic, here's a potential scenario for our study: your team has designed a new version of a product page in the hope of improving its conversion rate. The current conversion rate is about 13% on average, and the team would be happy with an increase of 2%, meaning the new design would be considered a success if it raised the conversion rate to 15%.
Before rolling out the change, the team would be more comfortable testing it on a small number of users to see how it performs, so you suggest running an A/B test on a subset of your user base.

1. Designing our experiment

Formulating a hypothesis

First things first, we want to make sure we formulate a hypothesis at the start of our project. This will ensure our interpretation of the results is both correct and rigorous. Given that we don't know whether the new design will perform better, worse, or the same as our current design, we'll choose a two-tailed test:

Hₒ: p = pₒ
Hₐ: p ≠ pₒ
where p and pₒ stand for the conversion rate of the new and old design, respectively. We'll also set a confidence level of 95%:

α = 0.05
The α value is a threshold we set, by which we say: "if the probability of observing a result as extreme or more extreme (the p-value) is lower than α, then we reject the Null hypothesis". Since our α = 0.05 (indicating 5% probability), our confidence (1 − α) is 95%. Don't worry if you are not familiar with the above; all this really means is that, whatever conversion rate we observe for our new design in our test, we want to be 95% confident it is statistically different from the conversion rate of our old design before we decide to reject the Null hypothesis Hₒ.

Choosing the variables

For our test we'll need two groups:

- A control group, which will be shown the old design
- A treatment (or experimental) group, which will be shown the new design
This will be our Independent Variable. The reason we have two groups, even though we know the baseline conversion rate, is that we want to control for other variables that could have an effect on our results, such as seasonality: by having a control group we can directly compare its results to the treatment group, since the only systematic difference between the groups is the design of the page.

For our Dependent Variable (i.e. what we are trying to measure), we are interested in capturing the conversion rate. A way we can encode this is with a binary variable per user session: 0 if the user did not buy the product during that session, 1 if they did.
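To make the binary encoding concrete, here is a minimal toy sketch (the values are made up; the column and group names mirror the dataset we'll use later) showing that the mean of a 0/1 converted column is exactly the conversion rate:

```python
import pandas as pd

# Toy data: each row is one user session; 'converted' is 1 if the user
# bought during that session, 0 otherwise (values are made up)
sessions = pd.DataFrame({
    'group':     ['control', 'control', 'control', 'treatment', 'treatment'],
    'converted': [0, 1, 0, 1, 0],
})

# The mean of a binary column is the proportion of 1s, i.e. the conversion rate
print(sessions.groupby('group')['converted'].mean())
```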
This way, we can easily calculate the mean of each group to get the conversion rate of each design.

Choosing a sample size

It is important to note that, since we won't test the whole user base (our population), the conversion rates that we get will inevitably be only estimates of the true rates. The number of people (or user sessions) we decide to capture in each group will affect the precision of our estimated conversion rates: the larger the sample size, the more precise our estimates (i.e. the smaller our confidence intervals), and the higher the chance of detecting a difference between the two groups, if one is present. On the other hand, the larger our sample gets, the more expensive (and impractical) our study becomes.

So how many people should we have in each group? The sample size we need is estimated through what is called a power analysis, and it depends on a few factors:

- The power of the test (1 − β): the probability of finding a statistical difference between the groups when a difference is actually present. A value of 0.8 is used by convention.
- The alpha value (α): the critical value, which we set earlier to 0.05.
- The effect size: how big a difference we expect there to be between the conversion rates.
Since our team would be happy with a difference of 2%, we can use 13% and 15% to calculate the effect size we expect. Luckily, Python takes care of all these calculations for us, as the sketch below shows.
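A minimal sketch of the power analysis using statsmodels, assuming the conventional 0.8 power and the α = 0.05 we set earlier (the original notebook may differ in its exact imports and plot styling):

```python
# Package imports used throughout the rest of the analysis
import pandas as pd
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

# Effect size for two proportions, based on our expected rates (13% vs. 15%)
effect_size = sms.proportion_effectsize(0.13, 0.15)

# Sample size needed per group for a two-sided z-test
required_n = sms.NormalIndPower().solve_power(
    effect_size,
    power=0.8,    # probability of detecting the 2% difference if it exists
    alpha=0.05,   # our significance threshold
    ratio=1       # equally sized groups
)
required_n = ceil(required_n)  # round up to the next whole number

print(required_n)
```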
We’d need at least 4720 observations for each group. Having set the 2. Collecting and preparing the dataGreat stuff! So now that we have our required sample size, we need to collect the data. Usually at this point you would work with your team to set up the experiment, likely with the help of the Engineering team, and make sure that you collect enough data based on the sample size needed. However, since we’ll use a dataset that we found online, in order to simulate this situation we’ll:
*Note: normally we would not need to perform step 4; it is just for the sake of the exercise.

Since I already downloaded the dataset, I'll go straight to step 2.

```python
df = pd.read_csv('ab_data.csv')
df.info()
```

There are 294478 rows in the DataFrame, each representing a user session, as well as 5 columns:

- user_id: the user ID of each session
- timestamp: the timestamp of the session
- group: which group the user was assigned to for that session (control or treatment)
- landing_page: which design the user saw in that session (old_page or new_page)
- converted: whether the session ended in a conversion (0 = not converted, 1 = converted)
We’ll actually only use the Before we go ahead and sample the data to get our subset, let’s make sure there are no users that have been sampled multiple times. session_counts = df['user_id'].value_counts(ascending=False)
There are, in fact, 3894 users that appear more than once. Since this number is pretty low, we'll go ahead and remove them from the DataFrame to avoid sampling the same users twice.

```python
users_to_drop = session_counts[session_counts > 1].index
df = df[~df['user_id'].isin(users_to_drop)]
```
Sampling

Now that our DataFrame is nice and clean, we can proceed to sample n = 4720 entries for each of the groups using pandas' DataFrame.sample() method. Note: I've set random_state=22 so that the results are reproducible if you want to follow along on your own machine.

```python
control_sample = df[df['group'] == 'control'].sample(n=required_n, random_state=22)
treatment_sample = df[df['group'] == 'treatment'].sample(n=required_n, random_state=22)

ab_test = pd.concat([control_sample, treatment_sample], axis=0).reset_index(drop=True)
ab_test.info()
ab_test['group'].value_counts()
```
Great, looks like everything went as planned, and we are now ready to analyse our results.

3. Visualising the results

The first thing we can do is calculate some basic statistics to get an idea of what our samples look like.

```python
# mean = conversion rate; std and sem describe its spread and precision
conversion_rates = ab_test.groupby('group')['converted'].agg(['mean', 'std', 'sem'])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']
conversion_rates
```

Judging by the stats above, it does look like our two designs performed very similarly, with our new design performing slightly better: approx. 12.3% vs. 12.6% conversion rate. Plotting the data will make these results easier to grasp:

```python
plt.figure(figsize=(8, 6))
sns.barplot(x=ab_test['group'], y=ab_test['converted'], ci=False)
plt.title('Conversion rate by group')
```

The conversion rates for our groups are indeed very close. Also note that the conversion rate of the control group is lower than what we would have expected given our known average conversion rate (12.3% vs. 13%), which goes to show that there is some variation in results when sampling from a population.

So… the new design performed slightly better than the old one, but is the difference statistically significant?

4. Testing the hypothesis

The last step of our analysis is testing our hypothesis. Since we have a very large sample, we can use the normal approximation for calculating our p-value (i.e. a z-test). Again, Python makes all the calculations very easy; we can use the statsmodels.stats.proportion module to get both the p-value and the confidence intervals.
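A sketch of the test code, assuming the proportions_ztest and proportion_confint functions from statsmodels (the intermediate variable names here are mine, not necessarily those of the original notebook):

```python
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

# Number of observations and number of conversions in each group
nobs = [control_results.count(), treatment_results.count()]
successes = [control_results.sum(), treatment_results.sum()]

# Two-sided z-test for a difference between the two proportions
z_stat, pval = proportions_ztest(successes, nobs=nobs)

# 95% confidence interval for each group's conversion rate
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(
    successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')
```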
5. Drawing conclusions

Since our p-value of 0.732 is way above our α = 0.05 threshold, we cannot reject the Null hypothesis Hₒ, which means that our new design did not perform significantly differently (let alone better) than our old one :(

Additionally, if we look at the confidence interval for the treatment group, we notice that it includes our baseline conversion rate of 13% but does not include our 15% target.
What this means is that the true conversion rate of the new design is more likely to be similar to our baseline than to the 15% target we had hoped for. This is further evidence that our new design is unlikely to be an improvement on our old one, and that, unfortunately, we are back to the drawing board!