Google doesn't want to help!
I'm able to calculate z-scores, and we are trying to produce a function that given a z-score gives us a percent of the population in a normal distribution that would be under that z-score. All I can find are references to z-score to percentage tables.
Any pointers?
asked May 6, 2010 at 15:29
Is it this z-score (link) you're talking about?
If so, the function you're looking for is called the normal cumulative distribution, also sometimes referred to as the error function (although Wikipedia defines the two slightly differently). How to calculate it depends on what programming environment you're using.
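For reference, the relationship between the two functions can be written out. This is a sketch using the usual textbook notation, which is an assumption here since the answer itself names no symbols: Φ is the standard normal cumulative distribution function and erf is the error function as Wikipedia defines it.

```latex
\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2}\,dt
        = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]
```

So the two differ only by a rescaling of the argument (by √2) and of the result (shifted and halved so it runs from 0 to 1).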
answered May 6, 2010 at 15:33
David Z
If you're programming in C++, you can do this with the Boost library, which has routines for working with normal distributions. You are looking for the cdf accessor function, which takes a z-score as input and returns the probability you want.
answered May 6, 2010 at 15:35
Jim Lewis
Here's a code snippet for Python:
import math

def percentage_of_area_under_std_normal_curve_from_zcore(z_score):
    # Standard normal CDF written in terms of the error function
    return .5 * (math.erf(z_score / 2 ** .5) + 1)
Using the following photo for reference: //www.math.armstrong.edu/statsonline/5/cntrl8.gif
The z-score is 1.645, and that covers 95 percent of the area under the standard normal distribution curve.
When you run the code, it looks like this:
>>> percentage_of_area_under_std_normal_curve_from_zcore(1.645)
0.9500150944608786
More about the error function: //en.wikipedia.org/wiki/Error_function
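As a possible alternative on Python 3.8 or later, the standard library's statistics.NormalDist class exposes the same cumulative distribution directly. The sketch below is not part of the original answer; the helper names fraction_below and fraction_below_erf are made up for illustration.

```python
import math
from statistics import NormalDist


def fraction_below(z_score):
    """Hypothetical helper: share of a standard normal population below z_score."""
    # NormalDist() with no arguments is the standard normal (mean 0, stdev 1)
    return NormalDist().cdf(z_score)


def fraction_below_erf(z_score):
    """Same quantity computed with the error-function formula."""
    return 0.5 * (math.erf(z_score / math.sqrt(2)) + 1)


print(round(fraction_below(1.645), 4))  # 0.95
print(math.isclose(fraction_below(1.645), fraction_below_erf(1.645)))  # True
```

Both helpers return a fraction between 0 and 1; multiply by 100 if a percentage is wanted.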
answered Nov 29, 2012 at 22:05
import math

def zptile(z_score):
    return .5 * (math.erf(z_score / 2 ** .5) + 1)

zptile(0.95)
# excel says: 0.8289438737
0.8289438736915181
via: //stackoverflow.com/questions/2782284/function-to-convert-a-z-score-into-a-percentage
To get this to work in a dataframe you need to use the apply method:
df["PTILE"] = df["ZSCORE"].apply(zptile)
df["PTILE"] = zptile(df["ZSCORE"]) will not work, because math.erf accepts a single number rather than a whole column.
see: //stackoverflow.com/questions/23748842/understanding-math-errors-in-pandas-dataframes
To iterate over a slice of columns and perform the calculation on each one, start with the column 'pop', take every column to its right, and create a new z-score percentile column called columnname_p for each:
# .ix was removed in pandas 1.0; .loc performs the same label-based slice
for col in df.loc[:, 'pop':]:
    df[col + "_p"] = ((df[col] - df[col].mean()) / df[col].std(ddof=0)).apply(zptile) * 100
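If scipy is available, a vectorized call may be simpler than .apply. The sketch below assumes scipy.stats.norm.cdf, which accepts array-like input, and uses a small made-up ZSCORE column purely for illustration.

```python
import pandas as pd
from scipy.stats import norm

# Made-up sample data, only to show the shape of the call
df = pd.DataFrame({"ZSCORE": [-1.0, 0.0, 0.95, 1.645]})

# norm.cdf works element-wise on the whole column, so no .apply is needed
df["PTILE"] = norm.cdf(df["ZSCORE"])
print(df.round(4))
```

The result is a fraction between 0 and 1 per row, the same quantity zptile returns for a single number.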
In this tutorial, you’ll learn how to use Python to calculate a z-score for an array of numbers. You’ll learn a brief overview of what the z-score represents in statistics and how it’s relevant to machine learning. You’ll then learn how to calculate a z-score from scratch in Python as well as how to use different Python modules to calculate the z-score.
By the end of this tutorial, you’ll have learned how to use the scipy and pandas modules to calculate the z-score. Each of these approaches has different benefits and drawbacks. In large part, determining which approach works best for you depends on a number of different factors. For example, you may not want to import a different library only to calculate a statistical measure. Alternatively, you may want more control over how to calculate z-scores and rely on the flexibility that scipy gives you.
The Quick Answer: scipy.stats’ zscore() to Calculate a z-score in Python
# Calculate the z-score with scipy
import scipy.stats as stats
values = [4,5,6,6,6,7,8,12,13,13,14,18]
zscores = stats.zscore(values)
print(zscores)
# Returns: [-1.2493901 -1.01512945 -0.78086881 -0.78086881 -0.78086881 -0.54660817 -0.31234752 0.62469505 0.85895569 0.85895569 1.09321633 2.0302589 ]
- What is the Z-Score and how is it used in Machine Learning?
- How to Calculate a Z-Score from Scratch in Python
- How to Use Scipy to Calculate a Z-Score
- How to Use Pandas to Calculate a Z-Score
- Calculate a z-score From a Mean and Standard Deviation in Python
- Conclusion
- Additional Resources
What is the Z-Score and how is it used in Machine Learning?
The z-score measures how many standard deviations a data point is away from the mean. It allows us to determine how usual or unusual a data point is in a distribution. It also lets us more easily compare data points for a record across features, especially when the different features have significantly different ranges.
The z-score is easiest to interpret when the data is approximately normally distributed. A standard deviation, and therefore a z-score, can be calculated for other distributions too, but the familiar rules of thumb assume normality. We know that in a normal distribution, over 99% of values fall within 3 standard deviations from the mean. Because of this, we can assume that if the absolute value of a z-score is larger than 3, the value is quite unusual.
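The 3-standard-deviation rule of thumb can be checked numerically. The sketch below uses statistics.NormalDist from the standard library (Python 3.8+); the helper name share_within is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def share_within(k):
    """Hypothetical helper: share of a normal population within k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```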
The benefit of this standardization is that it doesn’t rely on the original values of the feature in the dataset. Because of this, we’re able to more easily compare the impact of one feature to another.
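The z-score is generally calculated for each value in a given feature. It takes into account the standard deviation and the mean of the feature. The formula for the z-score looks like this:
$$z = \frac{x - \mu}{\sigma}$$
Here, $x$ is the value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.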
For each value in an array, the z-score is calculated by dividing the difference between the value and the mean by the standard deviation of the distribution. Because of this, the z-score can be either positive or negative, indicating whether the value is larger or smaller than the mean.
In the next section, you’ll learn how to calculate the z-score from scratch in Python.
How to Calculate a Z-Score from Scratch in Python
In order to calculate the z-score, we need to first calculate the mean and the standard deviation of an array. To learn how to calculate the standard deviation in Python, check out my guide here.
To calculate the standard deviation from scratch, let’s use the code below:
# Calculate the Standard Deviation in Python
values = [4,5,6,6,6,7,8,12,13,13,14,18]
mean = sum(values) / len(values)
differences = [(value - mean)**2 for value in values]
sum_of_differences = sum(differences)
# Dividing by n - 1 gives the sample standard deviation
standard_deviation = (sum_of_differences / (len(values) - 1)) ** 0.5
print(round(standard_deviation, 3))
# Returns: 4.459
Now that we have the mean and the standard deviation, we can loop over the list of values and calculate the z-scores. We can do this by subtracting the mean from the value and dividing this by the standard deviation.
In order to do this, let’s use a Python list comprehension to loop over each value:
# Calculate the z-score from scratch
zscores = [(value - mean) / standard_deviation for value in values]
print([round(zscore, 3) for zscore in zscores])
# Returns: [-1.196, -0.972, -0.748, -0.748, -0.748, -0.523, -0.299, 0.598, 0.822, 0.822, 1.047, 1.944]
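This approach works, but it’s a bit verbose. I wanted to cover it here to provide a means to calculate the z-score with just pure Python. It can also be a good method to demonstrate in Python coding interviews. Note that this version divides by n - 1 (the sample standard deviation), so its results differ slightly from functions that divide by n by default, such as scipy.stats.zscore().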
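The from-scratch steps can also be wrapped in a small reusable function. This is a minimal sketch rather than the tutorial's own code: the function name zscores_from_scratch and its sample parameter are made up, and it leans on the standard library's statistics module (fmean needs Python 3.8+) instead of hand-written sums.

```python
import statistics


def zscores_from_scratch(values, sample=True):
    """Hypothetical helper: return the z-score of every value in `values`.

    sample=True divides by n - 1 (statistics.stdev);
    sample=False divides by n (statistics.pstdev).
    """
    mean = statistics.fmean(values)
    spread = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(value - mean) / spread for value in values]


values = [4, 5, 6, 6, 6, 7, 8, 12, 13, 13, 14, 18]
print([round(z, 3) for z in zscores_from_scratch(values)])
print([round(z, 3) for z in zscores_from_scratch(values, sample=False)])
```

The second call divides by n, so it should line up with libraries that use the population standard deviation by default.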
That being said, there are much easier ways to accomplish this. In the next section, you’ll learn how to calculate the z-score with scipy.
How to Use Scipy to Calculate a Z-Score
The most common way to calculate z-scores in Python is to use the scipy module. The module has numerous statistical functions available through the scipy.stats module, including the one we’ll be using in this tutorial: zscore().
The zscore() function takes an array of values and returns an array containing their z-scores. It implicitly handles calculating the mean and the standard deviation, so we don’t need to calculate those ourselves. This has the benefit of saving us many lines of code, but also allows our code to be more readable.
Let’s see how we can use the scipy.stats package to calculate z-scores:
# Calculate the z-score with scipy
import scipy.stats as stats
values = [4,5,6,6,6,7,8,12,13,13,14,18]
zscores = stats.zscore(values)
print(zscores)
# Returns: [-1.2493901 -1.01512945 -0.78086881 -0.78086881 -0.78086881 -0.54660817 -0.31234752 0.62469505 0.85895569 0.85895569 1.09321633 2.0302589 ]
We can see how easy it was to calculate the z-scores in Python using scipy! One important thing to note here is that the scipy.stats.zscore() function doesn’t return a list. It actually returns a numpy array.
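Two details of zscore() are worth a quick sketch: its ddof parameter, which defaults to 0 (divide by n) and can be set to 1 to divide by n - 1, and the conversion of the returned numpy array into a plain list. The variable names below are illustrative only.

```python
import scipy.stats as stats

values = [4, 5, 6, 6, 6, 7, 8, 12, 13, 13, 14, 18]

population_z = stats.zscore(values)       # default ddof=0: divide by n
sample_z = stats.zscore(values, ddof=1)   # divide by n - 1 instead

print(type(population_z).__name__)   # ndarray
print(population_z.tolist()[:2])     # first two z-scores as plain Python floats
print(sample_z[:2])                  # slightly smaller in magnitude than the ddof=0 values
```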
In the next section, you’ll learn how to use Pandas and scipy to calculate z-scores for a Pandas Dataframe.
How to Use Pandas to Calculate a Z-Score
There may be many times when you want to calculate the z-scores for a Pandas Dataframe. In this section, you’ll learn how to calculate the z-score for a Pandas column as well as for an entire dataframe. To accomplish this, we’ll be using the scipy library.
Let’s load a sample Pandas Dataframe to calculate our z-scores:
# Loading a Sample Pandas Dataframe
import pandas as pd
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Joe', 'Mitch', 'Alana'],
    'Age': [32, 30, 67, 34, 20],
    'Income': [80000, 90000, 45000, 23000, 12000],
    'Education' : [5, 7, 3, 4, 4]
})
print(df.head())
# Returns:
#     Name  Age  Income  Education
# 0    Nik   32   80000          5
# 1   Kate   30   90000          7
# 2    Joe   67   45000          3
# 3  Mitch   34   23000          4
# 4  Alana   20   12000          4
We can see, by using the Pandas .head() dataframe method, that we have a dataframe with four columns. Three of these are numerical columns, for which we can calculate the z-score.
We can use the scipy.stats.zscore() function to calculate the z-scores on a Pandas dataframe column. Let’s create a new column that contains the values from the Income column normalized using the z-score:
df['Income zscore'] = stats.zscore(df['Income'])
print(df.head())
# Returns:
#     Name  Age  Income  Education  Income zscore
# 0    Nik   32   80000          5       0.978700
# 1   Kate   30   90000          7       1.304934
# 2    Joe   67   45000          3      -0.163117
# 3  Mitch   34   23000          4      -0.880830
# 4  Alana   20   12000          4      -1.239687
One of the benefits of calculating z-scores is to actually normalize values across features. Because of this, it’s often useful to calculate the z-scores for all numerical columns in a dataframe.
Let’s see how we can convert our dataframe columns to z-scores using the Pandas .apply() method:
df = df.select_dtypes(include='number').apply(stats.zscore)
print(df.head())
# Returns:
#         Age    Income  Education
# 0 -0.288493  0.978700   0.294884
# 1 -0.413925  1.304934   1.769303
# 2  1.906565 -0.163117  -1.179536
# 3 -0.163061 -0.880830  -0.442326
# 4 -1.041085 -1.239687  -0.442326
In the example above, we first select only numeric columns using the .select_dtypes() method and then use the .apply() method to apply the zscore function.
The benefit of this is that we’re now able to compare the features in relation to one another in a way that isn’t impacted by their different scales.
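If you would rather not import scipy, the same standardization can be sketched with Pandas alone. This is an illustrative alternative, not the method used above; it passes ddof=0 to .std() because Pandas defaults to ddof=1, whereas scipy.stats.zscore() defaults to ddof=0.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [32, 30, 67, 34, 20],
    'Income': [80000, 90000, 45000, 23000, 12000],
    'Education': [5, 7, 3, 4, 4],
})

# Subtract each column's mean and divide by its population standard deviation
zscores = (df - df.mean()) / df.std(ddof=0)
print(zscores.round(6))
```

Because the arithmetic is applied column by column, no explicit loop or .apply() call is needed.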
Calculate a z-score From a Mean and Standard Deviation in Python
In this final section, you’ll learn how to calculate a z-score when you know a mean and a standard deviation of a distribution. The benefit of this approach is to be able to understand how far away from the mean a given value is. This approach is available only in Python 3.9 onwards.
For this approach, we can use the statistics library, which comes built into Python. The module provides a class, NormalDist, which allows us to pass in both a mean and a standard deviation. This creates a NormalDist object, whose .zscore() method takes a value and returns its z-score.
Let’s take a look at an example:
# Calculate a z-score from a provided mean and standard deviation
import statistics
mean = 7
standard_deviation = 1.3
zscore = statistics.NormalDist(mean, standard_deviation).zscore(5)
print(zscore)
# Returns: -1.5384615384615383
We can see that this returns a value of -1.538, meaning that the value is roughly 1.5 standard deviations below the mean.
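On Python 3.8, where NormalDist exists but its .zscore() method does not, the same number can be computed by hand from the object's mean and stdev properties. The sketch below uses a made-up helper name, zscore_from_parameters, and also shows how the resulting z-score maps to the share of the distribution below the value.

```python
from statistics import NormalDist


def zscore_from_parameters(value, mean, standard_deviation):
    """Hypothetical helper: z-score of `value` given a mean and standard deviation."""
    dist = NormalDist(mean, standard_deviation)
    # .mean and .stdev are read-only properties of NormalDist
    return (value - dist.mean) / dist.stdev


z = zscore_from_parameters(5, 7, 1.3)
share_below = NormalDist().cdf(z)  # share of the distribution below the value

print(round(z, 4))  # -1.5385
print(round(share_below, 4))
```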
Conclusion
In this tutorial, you learned how to use Python to calculate a z-score. You learned how to use the scipy module to calculate a z-score and how to use Pandas to calculate it for a column and an entire dataframe. Finally, you learned how to use the statistics library to calculate a z-score when you know a mean, a standard deviation, and a value.
To learn more about the scipy zscore function, check out the official documentation here.
Additional Resources
To learn more about related topics, check out these articles here:
- Python Standard Deviation Tutorial: Explanation & Examples
- Normalize a Pandas Column or Dataframe (w/ Pandas or sklearn)
- Pandas Describe: Descriptive Statistics on Your Dataframe
- Pandas Quantile: Calculate Percentiles of a Dataframe