Convert z-score to percentage python

Google doesn't want to help!

I'm able to calculate z-scores, and I'm trying to write a function that, given a z-score, returns the percentage of the population in a normal distribution that falls below that z-score. All I can find are references to z-score to percentage tables.

Any pointers?

asked May 6, 2010 at 15:29

Is it this z-score (link) you're talking about?

If so, the function you're looking for is called the normal cumulative distribution, also sometimes referred to as the error function (although Wikipedia defines the two slightly differently). How to calculate it depends on what programming environment you're using.
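To make the relationship between the two functions concrete, here is a minimal Python sketch. It assumes the standard identity that the standard normal cumulative distribution equals 0.5 * (1 + erf(z / sqrt(2))); the helper name cdf_via_erf is made up for illustration and is not part of any library.

```python
import math
from statistics import NormalDist

def cdf_via_erf(z):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Compare the erf-based formula with the standard library's normal CDF.
for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, cdf_via_erf(z), NormalDist().cdf(z))
```

The two columns should agree to within floating-point rounding, which is why the error function alone is enough to compute the percentage.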

answered May 6, 2010 at 15:33

David Z

If you're programming in C++, you can do this with the Boost library, which has routines for working with normal distributions. You are looking for the cdf accessor function, which takes a z-score as input and returns the probability you want.

answered May 6, 2010 at 15:35

Jim Lewis

Here's a code snippet for python:

import math

def percentage_of_area_under_std_normal_curve_from_zcore(z_score):
    return .5 * (math.erf(z_score / 2 ** .5) + 1)

Using the following photo for reference: http://www.math.armstrong.edu/statsonline/5/cntrl8.gif

The z-score is 1.645, and 95 percent of the area under the standard normal distribution curve lies to the left of it.

When you run the code, it looks like this:

>>> percentage_of_area_under_std_normal_curve_from_zcore(1.645)
0.9500150944608786

More about the error function: http://en.wikipedia.org/wiki/Error_function

answered Nov 29, 2012 at 22:05

import math

def zptile(z_score):
    # Fraction of the standard normal distribution below z_score
    return .5 * (math.erf(z_score / 2 ** .5) + 1)

zptile(0.95)
# excel says: 0.8289438737
# returns: 0.8289438736915181
via: http://stackoverflow.com/questions/2782284/function-to-convert-a-z-score-into-a-percentage
To get this to work in a dataframe you need to use the .apply() method:
df["PTILE"]=df["ZSCORE"].apply(zptile)
df["PTILE"]=zptile(df["ZSCORE"]) will not work, because math.erf() only accepts a single number, not a whole column.
see: http://stackoverflow.com/questions/23748842/understanding-math-errors-in-pandas-dataframes
To iterate over a slice of columns and perform the calculation on each one (in this case I'm starting with the column 'pop' and taking every column to the right), creating a new z-score percentile column called columnname_p:
# df.ix was removed in pandas 1.0; df.loc performs the same label-based slice
for col in df.loc[:,'pop':]:
    df[col+"_p"]=((df[col] - df[col].mean())/df[col].std(ddof=0)).apply(zptile)*100
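If scipy is installed, a vectorized alternative is to call scipy.stats.norm.cdf on the whole column, which avoids the element-by-element apply. This is only a sketch: the dataframe contents below are made-up example values, and the column names ZSCORE and PTILE simply follow the ones used above.

```python
import pandas as pd
from scipy.stats import norm

# Hypothetical z-scores, for illustration only
df = pd.DataFrame({"ZSCORE": [-1.0, 0.0, 0.95, 1.645]})

# norm.cdf accepts array-like input and evaluates every element at once
df["PTILE"] = norm.cdf(df["ZSCORE"])
print(df)
```

For 0.95 this should give the same 0.8289... value as the erf-based function.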

In this tutorial, you’ll learn how to use Python to calculate a z-score for an array of numbers. You’ll learn a brief overview of what the z-score represents in statistics and how it’s relevant to machine learning. You’ll then learn how to calculate a z-score from scratch in Python as well as how to use different Python modules to calculate the z-score.

By the end of this tutorial, you’ll have learned how to use scipy and pandas modules to calculate the z-score. Each of these approaches has different benefits and drawbacks. In large part, determining which approach works best for you depends on a number of different factors. For example, you may not want to import a different library only to calculate a statistical measure. Alternatively, you may want more control over how to calculate z-scores and rely on the flexibility that scipy gives you.

The Quick Answer: scipy.stats’ zscore() to Calculate a z-score in Python

# Calculate the z-score with scipy
import scipy.stats as stats
values = [4,5,6,6,6,7,8,12,13,13,14,18]

zscores = stats.zscore(values)
print(zscores)
# Returns: [-1.2493901  -1.01512945 -0.78086881 -0.78086881 -0.78086881 -0.54660817 -0.31234752  0.62469505  0.85895569  0.85895569  1.09321633  2.0302589 ]

  • What is the Z-Score and how is it used in Machine Learning?
  • How to Calculate a Z-Score from Scratch in Python
  • How to Use Scipy to Calculate a Z-Score
  • How to Use Pandas to Calculate a Z-Score
  • Calculate a z-score From a Mean and Standard Deviation in Python
  • Conclusion
  • Additional Resources

What is the Z-Score and how is it used in Machine Learning?

The z-score is a score that measures how many standard deviations a data point is away from the mean. The z-score allows us to determine how usual or unusual a data point is in a distribution. It also allows us to more easily compare data points for a record across features, especially when the different features have significantly different ranges.

A z-score can be calculated for any distribution that has a mean and a standard deviation, but translating a z-score into a percentage of the population assumes that the data are normally distributed. We know that in a normal distribution, over 99% of values fall within 3 standard deviations of the mean. Because of this, we can assume that a value whose z-score is larger than 3 or smaller than -3 is quite unusual.
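As a quick check of the 3-standard-deviation figure, here is a small sketch using the standard library's statistics.NormalDist class. The helper name share_within is an assumption made for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

This should print roughly 0.6827, 0.9545 and 0.9973, the familiar 68-95-99.7 rule.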

The benefit of this standardization is that it doesn’t rely on the original values of the feature in the dataset. Because of this, we’re able to more easily compare the impact of one feature to another.

The z-score is generally calculated for each value in a given feature. It takes into account the standard deviation and the mean of the feature. The formula for the z-score looks like this:

z = (x - μ) / σ
The formula for a z-score, where x is a value, μ is the mean, and σ is the standard deviation

For each value in an array, the z-score is calculated by dividing the difference between the value and the mean by the standard deviation of the distribution. Because of this, the z-score can be either positive or negative, indicating whether the value is larger or smaller than the mean.

In the next section, you’ll learn how to calculate the z-score from scratch in Python.

How to Calculate a Z-Score from Scratch in Python

In order to calculate the z-score, we need to first calculate the mean and the standard deviation of an array. To learn how to calculate the standard deviation in Python, check out my guide here.

To calculate the standard deviation from scratch, let’s use the code below:

# Calculate the Standard Deviation in Python
values = [4,5,6,6,6,7,8,12,13,13,14,18]
mean = sum(values) / len(values)
differences = [(value - mean)**2 for value in values]
sum_of_differences = sum(differences)
# Dividing by n - 1 gives the sample standard deviation
standard_deviation = (sum_of_differences / (len(values) - 1)) ** 0.5

print(standard_deviation)
# Returns: 4.4586 (rounded to four decimal places)

Now that we have the mean and the standard deviation, we can loop over the list of values and calculate the z-scores. We can do this by subtracting the mean from the value and dividing this by the standard deviation.

In order to do this, let’s use a Python list comprehension to loop over each value:

# Calculate the z-score from scratch
zscores = [(value - mean) / standard_deviation for value in values]

print(zscores)
# Returns (rounded to three decimal places): [-1.196, -0.972, -0.748, -0.748, -0.748, -0.523, -0.299, 0.598, 0.822, 0.822, 1.047, 1.944]

These values differ slightly from the scipy output above because scipy.stats.zscore() divides by n rather than n - 1 by default (ddof=0).

This approach works, but it’s a bit verbose. I wanted to cover it here to provide a means of calculating the z-score with just pure Python. It can also be a good method to demonstrate in Python coding interviews.
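A somewhat shorter pure-Python variant leans on the standard library's statistics module. The sketch below computes z-scores with both the sample standard deviation (statistics.stdev, which divides by n - 1) and the population standard deviation (statistics.pstdev, which divides by n), since the choice changes the result. The variable names are chosen for this example.

```python
import statistics

values = [4, 5, 6, 6, 6, 7, 8, 12, 13, 13, 14, 18]
mean = statistics.mean(values)

sample_sd = statistics.stdev(values)       # divides by n - 1
population_sd = statistics.pstdev(values)  # divides by n

z_sample = [(value - mean) / sample_sd for value in values]
z_population = [(value - mean) / population_sd for value in values]

print(sample_sd, population_sd)
print(z_population[0])
```

The population version is the one that lines up with the default behaviour of scipy.stats.zscore(), whose first value for this list is about -1.2494.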

That being said, there are much easier ways to accomplish this. In the next section, you’ll learn how to calculate the z-score with scipy.

How to Use Scipy to Calculate a Z-Score

The most common way to calculate z-scores in Python is to use the scipy module. The module has numerous statistical functions available through the scipy.stats module, including the one we’ll be using in this tutorial: zscore().

The zscore() function takes an array of values and returns an array containing their z-scores. It implicitly handles calculating the mean and the standard deviation, so we don’t need to calculate those ourselves. This not only saves us many lines of code, but also makes our code more readable.

Let’s see how we can use the scipy.stats package to calculate z-scores:

# Calculate the z-score with scipy
import scipy.stats as stats
values = [4,5,6,6,6,7,8,12,13,13,14,18]

zscores = stats.zscore(values)
print(zscores)
# Returns: [-1.2493901  -1.01512945 -0.78086881 -0.78086881 -0.78086881 -0.54660817 -0.31234752  0.62469505  0.85895569  0.85895569  1.09321633  2.0302589 ]

We can see how easy it was to calculate the z-scores in Python using scipy! One important thing to note here is that the scipy.stats.zscore() function doesn’t return a list. It actually returns a numpy array.
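Two details are worth a short sketch: the return type, and the ddof parameter of scipy.stats.zscore(), which controls whether the standard deviation divides by n (the default, ddof=0) or by n - 1 (ddof=1). The variable names below are chosen for this example.

```python
import numpy as np
import scipy.stats as stats

values = [4, 5, 6, 6, 6, 7, 8, 12, 13, 13, 14, 18]

z_population = stats.zscore(values)      # default ddof=0
z_sample = stats.zscore(values, ddof=1)  # sample standard deviation

print(type(z_population))     # a numpy array, not a list
print(z_population.tolist())  # convert to a plain list if needed
print(z_sample[0])
```

Because the sample standard deviation is slightly larger, the ddof=1 z-scores are slightly closer to zero.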

In the next section, you’ll learn how to use Pandas and scipy to calculate z-scores for a Pandas Dataframe.

How to Use Pandas to Calculate a Z-Score

There may be many times when you want to calculate the z-scores for a Pandas Dataframe. In this section, you’ll learn how to calculate the z-score for a Pandas column as well as for an entire dataframe. To accomplish this, we’ll be using the scipy library.

Let’s load a sample Pandas Dataframe to calculate our z-scores:

# Loading a Sample Pandas Dataframe
import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Joe', 'Mitch', 'Alana'],
    'Age': [32, 30, 67, 34, 20],
    'Income': [80000, 90000, 45000, 23000, 12000],
    'Education' : [5, 7, 3, 4, 4]
})

print(df.head())

# Returns:
#     Name  Age  Income  Education
# 0    Nik   32   80000          5
# 1   Kate   30   90000          7
# 2    Joe   67   45000          3
# 3  Mitch   34   23000          4
# 4  Alana   20   12000          4

By using the Pandas .head() dataframe method, we can see that we have a dataframe with four columns. Three of these are numerical columns, for which we can calculate the z-score.

We can use the scipy.stats.zscore() function to calculate the z-scores on a Pandas dataframe column. Let’s create a new column that contains the values from the Income column normalized using the z-score:

df['Income zscore'] = stats.zscore(df['Income'])
print(df.head())

# Returns:
#     Name  Age  Income  Education  Income zscore
# 0    Nik   32   80000          5       0.978700
# 1   Kate   30   90000          7       1.304934
# 2    Joe   67   45000          3      -0.163117
# 3  Mitch   34   23000          4      -0.880830
# 4  Alana   20   12000          4      -1.239687
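The same column can also be computed with Pandas alone, without scipy. This is a sketch under one assumption worth stating: Pandas' .std() defaults to ddof=1, so ddof=0 has to be passed explicitly to reproduce the scipy values.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Kate', 'Joe', 'Mitch', 'Alana'],
    'Income': [80000, 90000, 45000, 23000, 12000],
})

income = df['Income']
# ddof=0 matches the default used by scipy.stats.zscore()
df['Income zscore'] = (income - income.mean()) / income.std(ddof=0)
print(df)
```

The first row should come out at about 0.9787, as in the scipy-based output.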

One of the benefits of calculating z-scores is to actually normalize values across features. Because of this, it’s often useful to calculate the z-scores for all numerical columns in a dataframe.

Let’s see how we can convert our dataframe columns to z-scores using the Pandas .apply() method:

df = df.select_dtypes(include='number').apply(stats.zscore)
print(df.head())

# Returns:
#         Age    Income  Education
# 0 -0.288493  0.978700   0.294884
# 1 -0.413925  1.304934   1.769303
# 2  1.906565 -0.163117  -1.179536
# 3 -0.163061 -0.880830  -0.442326
# 4 -1.041085 -1.239687  -0.442326

In the example above, we first select only numeric columns using the .select_dtypes() method and then use the .apply() method to apply the zscore function.

The benefit of this is that we’re now able to compare the features in relation to one another in a way that isn’t impacted by their different scales.

Calculate a z-score From a Mean and Standard Deviation in Python

In this final section, you’ll learn how to calculate a z-score when you know a mean and a standard deviation of a distribution. The benefit of this approach is to be able to understand how far away from the mean a given value is. The .zscore() method used in this approach is available only in Python 3.9 onwards.

For this approach, we can use the statistics library, which comes packed into Python. The module comes with a class, NormalDist, which allows us to pass in both a mean and a standard deviation. This creates a NormalDist object, whose .zscore() method takes a value and returns that value’s z-score.

Let’s take a look at an example:

# Calculate a z-score from a provided mean and standard deviation
import statistics
mean = 7
standard_deviation = 1.3

zscore = statistics.NormalDist(mean, standard_deviation).zscore(5)
print(zscore)

# Returns: -1.5384615384615383

We can see that this returns a value of -1.538, meaning that the value is roughly 1.5 standard deviations below the mean.
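The same NormalDist object can also turn that z-score into a percentage of the distribution. A minimal sketch, assuming the data really are normally distributed with the mean and standard deviation given above:

```python
from statistics import NormalDist

mean = 7
standard_deviation = 1.3
dist = NormalDist(mean, standard_deviation)

zscore = dist.zscore(5)  # requires Python 3.9+

# Share of the distribution below the value, computed two equivalent ways
share_below_from_z = NormalDist().cdf(zscore)  # standard normal, using the z-score
share_below_direct = dist.cdf(5)               # original distribution, using the raw value

print(f"{share_below_from_z:.1%}")
```

Both routes should give the same answer, a little over 6 percent.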

Conclusion

In this tutorial, you learned how to use Python to calculate a z-score. You learned how to use the scipy module to calculate a z-score and how to use Pandas to calculate it for a column and an entire dataframe. Finally, you learned how to use the statistics library to calculate a z-score when you know a mean, a standard deviation, and a value.

To learn more about the scipy zscore function, check out the official documentation here.

Additional Resources

To learn more about related topics, check out these articles here:

  • Python Standard Deviation Tutorial: Explanation & Examples
  • Normalize a Pandas Column or Dataframe (w/ Pandas or sklearn)
  • Pandas Describe: Descriptive Statistics on Your Dataframe
  • Pandas Quantile: Calculate Percentiles of a Dataframe

How do you convert z

Subtract the value you just derived from 100 to calculate the percentage of values in your data set which are below the value you converted to a Z-score. In the example, you would calculate 100 minus 0.22 and conclude that 99.78 percent of students scored below 2,000.

Is the z

A z-score table shows the percentage of values (usually a decimal figure) to the left of a given z-score on a standard normal distribution. For example, imagine our Z-score value is 1.09.
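A row of such a table can be generated rather than looked up. The sketch below rebuilds the entries for z-scores 1.00 through 1.09 using the standard library's NormalDist; rounding to four decimal places is an assumption based on how printed tables are usually laid out.

```python
from statistics import NormalDist

std_normal = NormalDist()

# One table row: z-scores 1.00, 1.01, ..., 1.09 mapped to the area to their left
row = {
    round(1.0 + hundredths / 100, 2): round(std_normal.cdf(1.0 + hundredths / 100), 4)
    for hundredths in range(10)
}

print(row[1.09])
```

For a z-score of 1.09 this should give about 0.8621, meaning roughly 86 percent of values lie to the left.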

How do you scale Z

We can calculate z-scores in Python using scipy.stats.zscore, which uses the following syntax:
scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')
Step 1: Import modules.
Step 2: Create an array of values.
Step 3: Calculate the z-scores for each value in the array.

How do you find p

We use the scipy.stats.norm.sf() function to calculate a p-value from a z-score.
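A short sketch of that call, assuming a z-score of 1.645 purely as an example value. The survival function sf(z) is 1 - cdf(z), so it gives the upper-tail (one-sided) p-value; doubling the tail beyond |z| gives a two-sided p-value.

```python
from scipy.stats import norm

z = 1.645  # example z-score, chosen for illustration

p_upper_tail = norm.sf(z)          # P(Z > z), one-sided
p_two_sided = 2 * norm.sf(abs(z))  # P(|Z| > |z|), two-sided

print(p_upper_tail, p_two_sided)
```

For this z-score the one-sided p-value should be very close to 0.05.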