Guide: Calculating statistical significance in Python
Using Python to examine if a result is due purely to chance.

“Using ads creates more revenue for our product.” “The weight loss pill caused greater weight loss than the placebo.” “Battery ‘A’ lasts ten times longer than its competitor.” These types of statements appear often. When looking at data it can be tempting to quickly assume that a variable caused a certain result. Being too hasty with these conclusions can cause a lot of problems. In data science, being able to determine whether a result is due to chance can help in selecting a model and in checking for sampling errors. One of the ways to build confidence that a result isn’t due purely to chance is by determining statistical significance.

Let’s take a look at what statistical significance is and how to determine it by using the p-value. This example will use Python to show how to represent statistical significance in your code or Jupyter notebook. It is recommended you understand some Python basics. We will also use the following Python libraries: NumPy, Matplotlib, and Pandas. If you aren’t familiar with any of them, I recommend reading up on them first.

The Data

We’re going to look at the “Sales of summer clothes in E-commerce Wish” dataset from kaggle.com. If you aren’t familiar with Wish.com, it is an online e-commerce platform. The data was collected by scraping the website for all items listed under the category of “Summer” in August 2020. You can read more about how the data was obtained, as well as about each column/variable, on the dataset’s Kaggle page.

The dataset shows that items where the company has a profile picture have greater total sales (in euros) than items where the company doesn’t have a profile picture. We are going to try to determine if there is any statistical significance to this result.

The Key Concepts
Hypothesis Testing

To establish statistical significance we have to come up with a null hypothesis and an alternative hypothesis. The null hypothesis is that the variable has no impact on our end result. The alternative hypothesis is that the variable does have an impact on the end result. For this example our null and alternative hypotheses will be:

Null hypothesis: having a profile picture has no impact on an item’s total sales.
Alternative hypothesis: having a profile picture has an impact on an item’s total sales.
The P-Value Threshold

The p-value is explained in depth later on, but we will touch on it briefly now. It is always important to establish our p-value threshold before we start testing. The p-value is a way for us to quantify how rare our results would be if the null hypothesis were true. The lower the p-value, the less likely it is that the results are due purely to chance. The p-value threshold is a number we choose in advance: if the p-value falls below it, we reject the null hypothesis. We choose how high or how low we want this threshold to be based on our test. If we are testing the statistical significance of something that needs to be very precise, we might choose a very low threshold like .001. For a test that doesn’t require as much precision, such as sales or a website’s engagement, we would go higher. The standard threshold is .05, or 5%. Since we are looking at sales we will go with the standard threshold. If our p-value is under .05 we will reject the null hypothesis.

Preparing Our Data

Once we have downloaded the dataset from Kaggle we can use pandas to read it into a DataFrame. There are 3 columns of interest to us: “price”, “units_sold”, and “merchant_has_profile_picture”. We will take the following steps before moving on.
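A minimal sketch of this kind of preparation follows. It is an illustration under stated assumptions, not the article's original code: the rows are made-up stand-ins for the Kaggle file, the `prepare` helper is hypothetical, `total_sales` is assumed to be `price * units_sold`, and `merchant_has_profile_picture` is assumed to be coded as 0/1.

```python
import pandas as pd

# Hypothetical stand-in rows. With the real data you would instead read the
# CSV file downloaded from Kaggle with pd.read_csv(...).
raw = pd.DataFrame({
    "price": [8.0, 5.5, 12.0, None, 3.0],
    "units_sold": [100, 1000, 50, 20, 5000],
    "merchant_has_profile_picture": [1, 0, 0, 1, 0],
    "title": ["a", "b", "c", "d", "e"],  # a column we do not need
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the three columns of interest and add a total_sales column."""
    cols = ["price", "units_sold", "merchant_has_profile_picture"]
    # Drop rows with missing values in the columns we rely on.
    out = df[cols].dropna().copy()
    # Assumption: total sales (in euros) = price * units sold.
    out["total_sales"] = out["price"] * out["units_sold"]
    return out.reset_index(drop=True)


sales = prepare(raw)
print(sales)
```

The resulting `sales` DataFrame has one row per item with complete data, plus the derived `total_sales` column used in the rest of the analysis.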
With our dataset ready, let’s split it into two groups. One will contain items where the company has a profile picture (group A) and the other will contain the items where the company does not have a profile picture (group B). Once the data is broken into two groups, we will find the mean of the total sales for each group. We will also look at how many rows are in each group, as it will help us determine how to build our random groups later on.

Test Statistic

Now that we have the mean of the total sales from both groups, we can use that to determine our test statistic. The test statistic is a numerical value that we will use to determine if the difference between our two groups is from random chance. In this case, the test statistic will be the difference between the mean total sales of items with a profile picture and the mean total sales of items without one. Our test statistic is about 33,184 euros.

Permutation Test

Our goal now is to recreate our two groups many times to see how likely it is that our test statistic occurred by chance. A permutation test is how we will accomplish this goal. We’ll take the following steps in Python:
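A permutation test of this kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the article's original code: it uses synthetic numbers in place of the Wish dataset, assumes the `total_sales` and 0/1 `merchant_has_profile_picture` columns described above, and the helper `mean_difference` is a hypothetical name. It shuffles the pooled sales 1000 times, rebuilds two groups of the original sizes each time, and finally tallies how often a shuffled mean difference is at least as large as the observed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in data (hypothetical values, not the Wish dataset).
n_a, n_b = 40, 200
sales = pd.DataFrame({
    "merchant_has_profile_picture": [1] * n_a + [0] * n_b,
    "total_sales": np.concatenate([
        rng.normal(60_000, 20_000, n_a),
        rng.normal(30_000, 20_000, n_b),
    ]),
})


def mean_difference(values: np.ndarray, n_first: int) -> float:
    """Mean of the first n_first values minus the mean of the rest."""
    return values[:n_first].mean() - values[n_first:].mean()


# Split into group A (profile picture) and group B (no profile picture).
has_picture = sales["merchant_has_profile_picture"] == 1
group_a = sales.loc[has_picture, "total_sales"].to_numpy()
group_b = sales.loc[~has_picture, "total_sales"].to_numpy()

# Observed test statistic: difference in mean total sales.
observed = group_a.mean() - group_b.mean()

# Permutation test: shuffle the pooled values and re-split them into two
# random groups with the same sizes as the original groups.
n_iterations = 1000
pooled = np.concatenate([group_a, group_b])
mean_differences = []
for _ in range(n_iterations):
    shuffled = rng.permutation(pooled)
    mean_differences.append(mean_difference(shuffled, len(group_a)))

# p-value: share of shuffled differences at least as large as the observed one.
count = 0
for diff in mean_differences:
    if diff >= observed:
        count += 1
p_value = count / n_iterations

print(f"observed difference: {observed:.0f}")
print(f"p-value: {p_value}")
```

Passing `mean_differences` to Matplotlib's `plt.hist` draws the sampling distribution of mean differences. Because the p-value is estimated from a finite number of shuffles, a result of 0 here means "fewer than 1 in 1000 shuffles", not an exact probability of zero.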
From the histogram, we can see the most frequent difference in means is around 0. This is very different from our test statistic. In fact, the histogram doesn’t even show the test statistic value. The p-value will give us a way to quantify how often a value as large as the test statistic appears in the randomly generated distribution of mean differences.

P-Value

In the sampling distribution we generated, most of the values are centered around a mean difference of 0. If it were purely up to chance, it’s more likely both groups would have generated roughly the same mean total sales (the null hypothesis). But since the observed test statistic is not near 0, it’s possible that having a profile picture could be responsible for the mean difference in the dataset.

Let’s look at the sampling distribution to determine the number of times a value of 33,184 or higher appeared. We can then divide that frequency by 1000. This will give us the probability of seeing a mean difference of 33,184 or higher purely due to random chance. This probability is called the p-value. If this value is high, it means that the test statistic could easily have happened randomly and the profile picture probably didn’t play a role. On the other hand, a low p-value implies that there’s only a small probability that the mean difference we observed was because of random chance.

Let’s create a for loop that will tally how many times a value at least as large as our test statistic shows up among the randomly generated mean differences. We will then take that tally and divide it by our total iterations (1000). Our output will be the p-value.

The Outcome

Earlier we decided our p-value threshold would be .05. Our p-value came out to 0: none of the 1000 randomly generated mean differences reached the test statistic. 0 is less than .05, which means it is very unlikely that our test statistic was a result of random chance. We can now reject our null hypothesis in favor of our alternative hypothesis.

Final Thoughts

Statistical significance is a great way to examine data and determine if a variable has an impact on the final outcome.
In this example, we took some steps with the help of Python to determine the statistical significance of having a profile picture for the total sales of an item. While we were able to quantify this significance with the help of the p-value, we should still go back and look at the dataset. There are a few issues that should be addressed before any major conclusions are made.
Although more information may be needed to decide whether or not having a profile picture causes more sales, we have taken a step in the right direction by determining the p-value. Based on this test, we can say that having a profile picture appears to have a statistically significant impact on the total sales of an item on Wish.com. Statistical significance is a helpful tool in data science and should be understood and used to help make decisions when you need to know if a result is significant or happened by chance.