Guide: calculating statistical significance in Python

Photo by Oğuzhan Akdoğan on Unsplash

Using Python to examine if a result is due purely to chance.

“Using ads creates more revenue for our product.” “The weight loss pill caused greater weight loss than the placebo.” “Battery ‘A’ lasts ten times longer than its competitor.” These types of statements appear often. When looking at data, it can be tempting to assume quickly that a variable caused a certain result. Being too hasty with these conclusions can cause a lot of problems. In data science, being able to determine whether a result is due to chance can help in selecting a model, and it also gives you a way to check for sampling errors.

One of the ways to build confidence that a result isn’t due purely to chance is by determining statistical significance. Let’s take a look at what statistical significance is and how to determine it by using the p-value.

This example uses Python to show how to work out statistical significance in your code or Jupyter notebook. It is recommended that you understand some Python basics. We will also use the following Python libraries: NumPy, Matplotlib, and Pandas. If you aren’t familiar with any of them, I recommend reading their documentation first.

The Data

We’re going to look at the “Sales of summer clothes in E-commerce Wish” dataset from kaggle.com. If you aren’t familiar with Wish.com, it is an online e-commerce platform. The data was collected by scraping the website for all items listed under the category of “Summer” in August 2020. You can read more about how the data was obtained, as well as each column/variable, on the dataset’s Kaggle page. The dataset shows that items where the company has a profile picture have greater total sales (in euros) than items where the company doesn’t have a profile picture. We are going to try to determine whether this result is statistically significant.

The Key Concepts

  1. Hypothesis testing by determining a null hypothesis and an alternative hypothesis
  2. P-value and p-value threshold
  3. Test statistic
  4. Permutation testing
  5. Using the p-value to reject or fail to reject the null hypothesis

Hypothesis Testing

To establish statistical significance we have to come up with a null hypothesis and an alternative hypothesis. The null hypothesis states that the variable has no impact on our end result. The alternative hypothesis states that the variable does have an impact on the end result.

For this example our null and alternative hypotheses will be:

  • Null Hypothesis: On average, the total sales of an item (in euros) when a company has a profile picture are the same as the total sales of an item (in euros) when a company doesn’t have a profile picture.
  • Alternative Hypothesis: On average, the total sales of an item (in euros) are higher when a company has a profile picture than when a company doesn’t have a profile picture.

The P-Value Threshold

The p-value is explained in depth later on, but we will touch on it briefly now. It is always important to establish our p-value threshold before we start testing. The p-value is a way for us to quantify how rare our results would be if the null hypothesis were true. The lower the p-value, the less likely it is that the results are due purely to chance.

The p-value threshold is a number we choose in advance: if the p-value falls below it, we reject the null hypothesis. We choose how high or how low we want this to be based on our test. If we are testing the statistical significance of something that needs to be very precise, we might choose a very low threshold like 0.001. For a test that doesn’t require as much precision, such as sales or a website’s engagement, we would go higher. The standard threshold is 0.05, or 5%. Since we are looking at sales, we will go with the standard threshold. If our p-value is under 0.05, we will reject the null hypothesis.

Preparing Our Data

Once we have downloaded the dataset from Kaggle we can use pandas to read it into a DataFrame.

There are three columns of interest to us: “price”, “units_sold”, and “merchant_has_profile_picture”. We will take the following steps before moving on.

  1. Drop all but the three columns mentioned above.
  2. Check to see if there are any missing values.
  3. Use df.describe() to determine if any values seem inaccurate and could give unexpected results.
  4. Remove or correct any rows that we think may contain incorrect information.
  5. Calculate the total amount in euros each item has generated by multiplying the “price” column by the “units_sold” column.

Below is the Python code used to take the above steps.

***the multi-line comment syntax is used to represent the output***
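A minimal sketch of these preparation steps, under stated assumptions: the function name `prepare_sales`, the commented-out file name, and the toy rows are all hypothetical and only illustrate the shape of the code. Dropping rows with missing values is one possible way to handle step 4, not necessarily the clean-up that was applied to the real dataset.

```python
import pandas as pd

COLUMNS = ["price", "units_sold", "merchant_has_profile_picture"]


def prepare_sales(df):
    """Keep the three columns of interest and add a total_sales column."""
    df = df[COLUMNS].copy()           # step 1: drop all other columns
    print(df.isna().sum())            # step 2: missing values per column
    print(df.describe())              # step 3: look for suspicious values
    df = df.dropna()                  # step 4 (assumption): drop incomplete rows
    df["total_sales"] = df["price"] * df["units_sold"]  # step 5
    return df


# With the real data you would load the Kaggle CSV first, for example:
# df = pd.read_csv("summer_products.csv")  # hypothetical file name

# Made-up rows, used here only so the sketch runs on its own
toy = pd.DataFrame({
    "title": ["dress", "shirt", "shorts", "hat", "top"],
    "price": [8.0, 5.5, 12.0, None, 3.0],
    "units_sold": [100, 1000, 50, 10, 5000],
    "merchant_has_profile_picture": [0, 1, 0, 1, 0],
})
prepared = prepare_sales(toy)
print(prepared)
```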

With our dataset ready, let’s split it into two groups. One will contain the items where the company has a profile picture (group a) and the other will contain the items where the company does not have a profile picture (group b).

Once broken into two groups, we will find the mean of the total sales for each group. We will also look at how many rows are in each group as it will help us determine how to build our random groups later on.
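The split could look something like the sketch below. It assumes the “merchant_has_profile_picture” column holds 1 for merchants with a picture and 0 otherwise, and that a “total_sales” column already exists; the helper name `split_by_profile_picture` and the toy values are made up for illustration.

```python
import pandas as pd


def split_by_profile_picture(df):
    """Return the total_sales values for group a (picture) and group b (no picture)."""
    has_picture = df["merchant_has_profile_picture"] == 1  # assumption: 1 means "has picture"
    group_a = df.loc[has_picture, "total_sales"]
    group_b = df.loc[~has_picture, "total_sales"]
    return group_a, group_b


# Made-up values, not taken from the Wish dataset
toy = pd.DataFrame({
    "merchant_has_profile_picture": [1, 0, 0, 1, 0],
    "total_sales": [900.0, 300.0, 500.0, 1100.0, 400.0],
})
group_a, group_b = split_by_profile_picture(toy)

# Mean total sales and row count per group
print("group a:", len(group_a), "rows, mean", group_a.mean())
print("group b:", len(group_b), "rows, mean", group_b.mean())

# Share of rows in group a, used later to size the random groups
share_a = len(group_a) / len(toy)
print("share of rows in group a:", share_a)
```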

Test Statistic

Now that we have the mean of the total sales from both groups, we can use that to determine our test statistic. The test statistic is a numerical value that we will use to determine whether the difference between our two groups could come from random chance. In this case, the test statistic will be the difference between the mean total sales of items with a profile picture and the mean total sales of items without one. Our test statistic is about 33,184 euros.
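As a small illustration, the test statistic is just one subtraction. The helper name `mean_difference` and the input numbers below are made up; on the real data the two inputs would be the “total_sales” values of group a and group b.

```python
import numpy as np


def mean_difference(group_a, group_b):
    """Mean of group a minus mean of group b."""
    return float(np.mean(group_a) - np.mean(group_b))


# Made-up numbers, only to show the calculation
observed = mean_difference([900.0, 1100.0], [300.0, 500.0, 400.0])
print(observed)
```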

Permutation Test

Our goal now is to recreate our two groups many times, at random, to see how likely it is that our test statistic occurred by chance. A permutation test is how we will accomplish this goal.

We’ll take the following steps in Python:

  1. Create an empty list to hold 1000 mean differences. This will be our generated sampling distribution.
  2. Create a Series containing the “total_sales” values.
  3. Make a for loop that will iterate 1000 times.
  4. For each iteration of the for loop, we’ll randomly place each total sale in either group “a” or group “b”. In order to recreate the size difference between the two groups, we will give each value a 14.4% chance of going into group “a” and an 85.6% chance of going into group “b”. These percentages are based on the row counts we determined earlier.
  5. Once our random group “a” and random group “b” are formed, we will find the mean of each group.
  6. We’ll then subtract the mean of our random group “b” from the mean of our random group “a” and append the result to the list initialized in step 1.
  7. Finally, we’ll create a histogram to visualize the frequency of this sampling distribution.
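The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact code behind the article: the function names, the fixed seed, the redraw when a random group comes out empty, and the placeholder values standing in for “total_sales” are all hypothetical. The plotting helper imports Matplotlib only when it is called.

```python
import numpy as np


def randomized_mean_differences(total_sales, p_group_a=0.144, n_iter=1000, seed=0):
    """Return n_iter mean differences (random group a minus random group b)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(total_sales, dtype=float)        # step 2
    mean_differences = []                                 # step 1
    for _ in range(n_iter):                               # step 3
        # Step 4: each value lands in group a with probability p_group_a
        in_a = rng.random(values.size) < p_group_a
        # Assumption: redraw in the rare case that either group is empty
        while in_a.all() or not in_a.any():
            in_a = rng.random(values.size) < p_group_a
        # Steps 5 and 6: difference between the two random group means
        mean_differences.append(values[in_a].mean() - values[~in_a].mean())
    return np.array(mean_differences)


def plot_mean_differences(mean_differences):
    """Step 7: histogram of the randomized mean differences."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    plt.hist(mean_differences, bins=30)
    plt.xlabel("Mean difference (random group a - random group b)")
    plt.ylabel("Frequency")
    plt.show()


# Placeholder values standing in for the "total_sales" column
placeholder_sales = np.arange(200, dtype=float)
differences = randomized_mean_differences(placeholder_sales)
print(len(differences), differences.mean())
```

Calling `plot_mean_differences(differences)` would then draw the histogram. Fixing the seed makes the run repeatable; drop it if you want a fresh randomization each time.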

histogram of the randomized mean differences

From the histogram, we can see the most frequent difference in means is around 0. This is very different from our test statistic. In fact, the histogram doesn’t even include the test statistic value. The p-value will give us a way to quantify how often a value at least as large as the test statistic appears in the randomized sampling distribution of mean differences.

P-Value

In the sampling distribution we generated, most of the values are centered around a mean difference of 0. If it were purely up to chance, it’s more likely both groups would have generated about the same amount of total sales (the null hypothesis). But since the observed test statistic is not near 0, it’s possible that having a profile picture could be responsible for the mean difference in the dataset.

Let’s look at the sampling distribution to determine the number of times a value of 33,184 or higher appeared. We can then divide that frequency by 1000. This will give us the probability that the mean difference of 33,184 or higher is purely due to random chance.

This probability is called the p-value. If this value is high, it means that the test statistic could easily have happened randomly and the profile picture probably didn’t play a role. On the other hand, a low p-value implies that there’s a small probability that the mean difference we observed was because of random chance.

Let’s create a for loop that will tally how many times a value at least as large as our test statistic shows up in our group of randomly generated mean differences. We will then take that tally and divide it by our total iterations (1000). Our output will be the p-value.
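A minimal sketch of that tally, assuming the randomized mean differences are held in a list or array. The function name `permutation_p_value` and the four example values are hypothetical.

```python
def permutation_p_value(mean_differences, observed):
    """Share of randomized mean differences at least as large as the observed one."""
    tally = 0
    for difference in mean_differences:
        if difference >= observed:
            tally += 1
    return tally / len(mean_differences)


# Made-up randomized differences: two of the four are >= 2
example_p = permutation_p_value([0.0, 1.0, 2.0, 3.0], observed=2.0)
print(example_p)
```

With 1000 iterations the smallest non-zero result this can return is 0.001, so an output of 0 means that none of the randomized differences reached the observed value.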

The Outcome

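Earlier we decided our p-value threshold would be 0.05. Our p-value came out to 0: none of the 1000 randomized mean differences reached the observed test statistic. 0 is less than 0.05, which means it is very unlikely that our test statistic was a result of random chance. We can now reject our null hypothesis in favor of our alternative hypothesis.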
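The decision rule itself is a single comparison. The sketch below uses a hypothetical helper named `decide` and applies it to the p-value of 0 reported above.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the threshold chosen before testing."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.0))
```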

Final Thoughts

Statistical significance is a great way to examine data and determine if a variable has an impact on the final outcome. In this example, we took some steps with the help of Python to determine the statistical significance of the effect that having a profile picture has on the total sales of an item.

While we were able to quantify this significance with the help of the p-value, we should still go back and look at the dataset. There are a few issues that should be addressed before any major conclusions are made.

  1. Our sample dataset from Wish.com is very limited. It only contains summer clothing items. It would be preferable to get more data that covers a longer period of time and a greater variety of items.
  2. The sold item count isn’t precise. On the website, the sold item count is listed in ranges, i.e. 1000+, 2500+, etc. When the data was gathered the plus signs must have been eliminated. It would be preferable to get the actual numbers.
  3. Although we quickly looked for outliers, we didn’t examine the items thoroughly enough to see if there were any other reasons for a high total sales value in the profile picture group. It’s possible that there are some other correlations we didn’t account for.

Although more information may be needed to decide whether or not having a profile picture causes more sales, we have taken a step in the right direction by determining the p-value. We can say that, based on this test, having a profile picture appears to have a statistically significant impact on the total sales of an item on Wish.com.

Statistical significance is a helpful tool in data science and should be understood and used to help make decisions when you need to know if a result is significant or happened by chance.