Guide to the Chi-square test with Python and pandas

We will provide a practical example of how to run a Chi-Square Test in Python. Assume that we want to test whether the gender distribution (M, F) differs significantly between Smokers and Non-Smokers. Let’s generate some sample data to work with.

Main contents

  • Sample Data
  • Contingency Table
  • Chi-Square Test
  • Example of Chi-Square Test in Python
  • Sample Data
  • Chi-square test of independence
  • Chi-square Assumptions
  • Chi-square test using scipy.stats.chi2_contingency
  • Chi-square post hoc testing
  • Chi-square test of independence using custom function
  • Comparing scipy.stats.chi2_contingency() to custom chi2_table() function
  • How do you run a chi
  • What is chi2 in Python?
  • Can I use chi

Sample Data

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
 
df = pd.DataFrame({'Gender' : ['M', 'M', 'M', 'F', 'F'] * 10,
                   'isSmoker' : ['Smoker', 'Smoker', 'Non-Smoker', 'Non-Smoker', 'Smoker'] * 10
                  })
df.head()
 
	Gender	isSmoker
0	M	Smoker
1	M	Smoker
2	M	Non-Smoker
3	F	Non-Smoker
4	F	Smoker
 

Contingency Table

To run the Chi-Square Test, the easiest way is to convert the data into a contingency table with frequencies. We will use the crosstab function from pandas.

contigency= pd.crosstab(df['Gender'], df['isSmoker'])
contigency
 
isSmoker  Non-Smoker  Smoker
Gender
F                 10      10
M                 10      20

Let’s say that we want to get the percentages by Gender (row):

contigency_pct = pd.crosstab(df['Gender'], df['isSmoker'], normalize='index')
contigency_pct
 
isSmoker  Non-Smoker    Smoker
Gender
F           0.500000  0.500000
M           0.333333  0.666667

If we want the percentages by column, then we should write normalize='columns', and if we want the percentage of the grand total then we should write normalize='all'.
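The two other normalization modes can be sketched as follows. This is a minimal illustration on toy data mirroring the example above (with the label spelled "Non-Smoker"); note that pandas spells the column option `'columns'`. The variable names `by_column` and `overall` are only illustrative.

```python
import pandas as pd

# Toy data mirroring the example above.
df = pd.DataFrame({
    "Gender": ["M", "M", "M", "F", "F"] * 10,
    "isSmoker": ["Smoker", "Smoker", "Non-Smoker", "Non-Smoker", "Smoker"] * 10,
})

# Each column sums to 1: share of each gender within a smoking group.
by_column = pd.crosstab(df["Gender"], df["isSmoker"], normalize="columns")

# All cells together sum to 1: share of the whole sample.
overall = pd.crosstab(df["Gender"], df["isSmoker"], normalize="all")

print(by_column)
print(overall)
```

Row percentages answer "what share of each gender smokes?", while column percentages answer "what share of smokers is of each gender?", so the choice depends on the question being asked.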


Heatmaps

Heatmaps are an easy way to visualize contingency tables.

plt.figure(figsize=(12,8))
sns.heatmap(contigency, annot=True, cmap="YlGnBu")
 


Chi-Square Test

Now that we have built the contingency table we can pass it to the chi2_contingency function from the scipy package, which returns:

  • chi2: The test statistic
  • p: The p-value of the test
  • dof: Degrees of freedom
  • expected: The expected frequencies, based on the marginal sums of the table

# Chi-square test of independence.
c, p, dof, expected = chi2_contingency(contigency)
p
 
0.3767591178115821

Inference

The p-value is 37.68%, which means that we do not reject the null hypothesis at the 5% significance level (95% confidence). The null hypothesis was that smoking status and Gender are independent. In this example, the contingency table was 2×2, so we could have applied a z-test for proportions instead of the Chi-Square test. Notice that the Chi-Square test can be extended to m x n contingency tables.
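The link between the two tests on a 2×2 table can be checked numerically. The sketch below computes a pooled two-proportion z-test by hand and compares it with chi2_contingency run with correction=False. By default chi2_contingency applies Yates' continuity correction to 2×2 tables, which is why the p-value reported above differs from the uncorrected one computed here. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Counts from the contingency table above: rows F, M; columns non-smoker, smoker.
table = np.array([[10, 10],
                  [10, 20]])

smokers = table[:, 1]
totals = table.sum(axis=1)
p_f, p_m = smokers / totals              # smoker share among F and among M
p_pool = smokers.sum() / totals.sum()    # pooled share under the null hypothesis

# Pooled two-proportion z-test, two-sided.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / totals[0] + 1 / totals[1]))
z = (p_f - p_m) / se
p_z = 2 * norm.sf(abs(z))

# Chi-square test without the continuity correction.
chi2_stat, p_chi2, dof, _ = chi2_contingency(table, correction=False)

print(f"z^2 = {z**2:.4f}, chi2 = {chi2_stat:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi2) = {p_chi2:.4f}")
```

The squared z statistic equals the uncorrected Chi-square statistic, and the two p-values coincide.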


Sections of this page:

  • What is a Chi-square test of independence?
  • Chi-square test of independence assumptions
  • Data used in this example
  • Chi-square using scipy.stats.chi2_contingency
  • Chi-square post hoc testing
  • Chi-square test of independence using custom function
    • Comparing stats.chi2_contingency() to custom function

Chi-square test of independence

The Chi-square test of independence tests whether there is a significant relationship between two categorical variables. It does so by comparing the observed frequencies to the frequencies that would be expected if the two variables were independent. The data is usually displayed as a cross-tabulation, with each row representing a category of one variable and each column representing a category of the other. The Chi-square test of independence is an omnibus test, meaning it tests the data as a whole. If the table is larger than 2×2, a significant result does not tell you which levels (categories) of the variables are responsible for the relationship, so post hoc testing is required. If this doesn’t make much sense right now, don’t worry. Further explanation will be provided when we start working with the data.

The H0 (Null Hypothesis): There is no relationship between variable one and variable two.

The H1 (Alternative Hypothesis): There is a relationship between variable 1 and variable 2.

If the p-value is below the chosen significance level (commonly 0.05), you can reject the null hypothesis and claim that the findings support the alternative hypothesis.
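To make "observed versus expected" concrete, here is a minimal sketch that computes the statistic by hand with NumPy. The counts are the ones from the crosstab built later on this page; the expected count for each cell is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts (rows: willingness categories, columns: No, Yes).
observed = np.array([[51, 29],
                     [119, 194],
                     [237, 267],
                     [51, 24]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (4, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# Expected counts under independence.
expected = row_totals @ col_totals / n

# Pearson Chi-square statistic and its degrees of freedom.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the Chi-square distribution.
p_value = chi2.sf(chi2_stat, dof)

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.3g}")
```

These values agree with what scipy.stats.chi2_contingency reports for the same table, since no continuity correction is applied when the degrees of freedom exceed 1.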

Chi-square Assumptions

The following assumptions need to be met in order for the results of the Chi-square test to be trusted.

  • When testing the data, the cells should be frequencies or counts of cases and not percentages. It is okay to convert to percentages after testing the data
  • The levels (categories) of the variables being tested are mutually exclusive
  • Each participant contributes to only one cell within the Chi-square table
  • The groups being tested must be independent
  • The expected frequency in each cell should be 5 or greater

If all of these assumptions are met, then Chi-square is the correct test to use.
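The expected-frequency assumption is the one that is easiest to check in code. The helper below is a hypothetical convenience function (not part of SciPy or of the original page) that reads the expected counts returned by chi2_contingency and reports whether all of them reach a minimum of 5.

```python
import numpy as np
from scipy.stats import chi2_contingency


def expected_counts_ok(table, minimum=5):
    """Return True if every expected cell count is at least `minimum`."""
    _, _, _, expected = chi2_contingency(table)
    return bool((expected >= minimum).all())


# A table with plenty of observations in every cell.
large = np.array([[51, 29],
                  [119, 194]])

# A small table whose expected counts are all 2.
small = np.array([[3, 1],
                  [1, 3]])

print(expected_counts_ok(large))
print(expected_counts_ok(small))
```

The other assumptions (mutually exclusive categories, independent groups, one cell per participant) depend on how the data were collected and cannot be verified from the table alone.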

This page will go over how to conduct a Chi-square test of independence using Python, how to interpret the results, and will provide a custom function that was developed by Python for Data Science, LLC for you to use! It cleans up the output, can calculate row/column percentages, and can export the results to a csv file.

First we need to import Pandas and Scipy Stats!

import pandas as pd
from scipy import stats

Data used for this Example

The data used in this example is from Kaggle.com, published by Open Sourcing Mental Illness, LTD. The data set is from the 2016 OSMI Mental Health in Tech Survey, which aims to measure attitudes towards mental health in the tech workplace and examine the frequency of mental health disorders among tech workers.

For this example, we will test if there is an association between willingness to discuss a mental health issue with a direct supervisor and currently having a mental health disorder. In the data set, these are the variables “Would you have been willing to discuss a mental health issue with your direct supervisor(s)?” and “Do you currently have a mental health disorder?” respectively. Let’s take a look at the data!

df['Do you currently have a mental health disorder?'].value_counts()

Category  Count
Yes         575
No          531
Maybe       327

df['Would you have been willing to discuss a mental health issue with your direct supervisor(s)?'].value_counts()

Category                              Count
Some of my previous employers           654
No, at none of my previous employers    416
I don’t know                            101
Yes, at all of my previous employers     93

For the variable “Do you currently have a mental health disorder?”, we are going to drop the responses of “Maybe” since we are only interested in if people know they do or do not have a mental health disorder. In order to do this, we need to use a function to recode the data. In addition, the variables will be renamed to shorten them.

def drop_maybe(response):
    # Keep only definite "Yes"/"No" answers; anything else becomes None (missing)
    if response.lower() == 'yes' or response.lower() == 'no':
        return response
    else:
        return None

df['current_mental_disorder'] = df['Do you currently have a mental health disorder?'].apply(drop_maybe)
df['current_mental_disorder'].value_counts()

Category  Count
Yes         575
No          531

df['willing_discuss_mh_supervisor'] = df['Would you have been willing to discuss a mental health issue with your direct supervisor(s)?']
df['willing_discuss_mh_supervisor'].value_counts()

Category                              Count
Some of my previous employers           654
No, at none of my previous employers    416
I don’t know                            101
Yes, at all of my previous employers     93

Our data is set, so let’s take a look at the crosstab frequencies of the two groups.

pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])

current_mental_disorder                No  Yes
willing_discuss_mh_supervisor
I don’t know                           51   29
No, at none of my previous employers  119  194
Some of my previous employers         237  267
Yes, at all of my previous employers   51   24

Chi-square test using scipy.stats.chi2_contingency

You should have already imported scipy.stats as stats; if you haven’t yet, do so now. The chi2_contingency() function conducts the Chi-square test on a contingency table (crosstab). Full details are in the official SciPy documentation. With that, first we need to assign our crosstab to a variable so we can pass it to the function.

crosstab = pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])
crosstab

Now we can simply pass the crosstab variable through the chi2_contingency() method to conduct a Chi-square test of independence. The output doesn’t look neat in formatting, but all the required information is there.

While we check the results of the chi2 test, we also need to check that the expected cell frequencies are greater than or equal to 5; this is one of the assumptions (as mentioned above) for the chi2 test. If a cell has an expected frequency less than 5, then Fisher’s Exact test should be used to overcome this problem. The interpretation of the results is the same. The expected frequencies are also provided in the output.

stats.chi2_contingency(crosstab)

(32.408194625396376,
4.2928597930482389e-07,
3,
array([[ 37.69547325, 42.30452675],
[ 147.48353909, 165.51646091],
[ 237.48148148, 266.51851852],
[ 35.33950617, 39.66049383]]))

The first value (32.408) is the Chi-square value, followed by the p-value (4.29e-07), then comes the degrees of freedom (3), and lastly it outputs the expected frequencies as an array. Since all of the expected frequencies are greater than 5, the chi2 test results can be trusted. We can reject the null hypothesis as the p-value is less than 0.05. Thus, the results indicate that there is a relationship between willingness to discuss a mental health issue with a direct supervisor and currently having a mental health disorder within the tech/IT workplace.
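Had an expected frequency fallen below 5, Fisher's exact test would be the fallback mentioned earlier. The sketch below uses scipy.stats.fisher_exact on a small, made-up 2×2 table (the counts are purely illustrative and do not come from the survey data).

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with very small counts; every expected count is 2.
small_table = [[3, 1],
               [1, 3]]

# Returns the sample odds ratio and the exact two-sided p-value.
odds_ratio, p_value = fisher_exact(small_table, alternative="two-sided")

print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
```

The p-value is read the same way as for the Chi-square test: values below the chosen significance level lead to rejecting the null hypothesis of independence.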

Although our Chi-square test was significant, our table is 4×2, so we can’t yet state exactly where the relationship is, because the Chi-square test is an omnibus test. We have to conduct post hoc tests to test where the relationship is between the different levels (categories) of each variable. This example will use the Bonferroni-adjusted p-value method, which will be covered in the next section.

If you don’t like the output of the Chi-square test, see the section Chi-square test of independence using custom function. The section provides code of a function that conducts the Chi-square test the same as we did, but the output is returned in a table with easy to read formatting, presentation of the values, rounding of data, can calculate the row/column percentages, and has the ability to output results directly to a csv file.

Chi-square post hoc testing

Now that we know our Chi-square test of independence is significant, we want to test where the relationship is between the levels of the variables. In order to do this, we need to conduct multiple 2×2 Chi-square tests using the Bonferroni-adjusted p-value.

Some of you may ask why? When multiple levels (categories) are compared against each other, the chance of at least one false positive grows with each test. A single test at the 0.05 level carries a 5% chance of a false positive; across two tests the chance of at least one false positive is close to 10%, and so forth. If we were to conduct all 6 possible pairwise comparisons, the family-wise error rate could be as high as 30% (for independent tests it is 1 − 0.95^6 ≈ 26.5%), which is not acceptable.

To avoid this, the Bonferroni method adjusts the significance level by the number of planned pairwise comparisons. The formula is α/N, where “α” = the original significance level and “N” = the number of planned pairwise comparisons.
In our example, if we were planning on conducting all possible pairwise comparisons then the formula would be 0.05/6 = 0.008. Meaning, a post hoc 2×2 Chi-square test would have to have a p-value less than 0.008 to be significant. However, we are not interested in the “I don’t know” category of the “willing_discuss_mh_supervisor” variable. This makes the formula 0.05/3, which equals 0.017. So for our planned comparisons to be significant, the p-value must be less than 0.017.
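The arithmetic behind these thresholds is short enough to sketch directly. The two helper functions below are hypothetical names introduced for illustration; the family-wise error rate formula assumes the tests are independent.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni adjustment."""
    return alpha / n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05

for n_tests in (3, 6):
    print(
        f"{n_tests} tests: "
        f"unadjusted family-wise error = {familywise_error_rate(alpha, n_tests):.3f}, "
        f"Bonferroni threshold = {bonferroni_threshold(alpha, n_tests):.4f}"
    )
```

Testing each comparison at the adjusted threshold keeps the overall chance of a false positive at or below the original 0.05.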

To conduct multiple 2×2 Chi-square tests, one needs to regroup the variables for each test to where it is one category against the rest. For us, it will be:

  • No, at none of my previous employers vs. the rest
  • Some of my previous employers vs. the rest
  • Yes, at all of my previous employers vs. the rest

pandas makes this task easy! The pd.get_dummies() function creates dummy variables, where each new variable represents one category of the original variable and equals “1” if the respondent belongs to that category and “0” if not. (In pandas 2.0 and later the values are True/False by default; pass dtype=int to get 1/0.) We will assign the dummy variables to a new pandas data frame.

dummies = pd.get_dummies(df['willing_discuss_mh_supervisor'])
dummies.drop(["I don't know"], axis= 1, inplace= True)
dummies.head()

   No, at none of my previous employers  Some of my previous employers  Yes, at all of my previous employers
0                                     0                              1                                     0
1                                     0                              1                                     0
2                                     0                              0                                     0
3                                     0                              1                                     0
4                                     0                              1                                     0
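A small toy example shows what pd.get_dummies() does to a single column, including how it treats a missing answer. The series `answers` is made up for illustration and is not taken from the survey data.

```python
import pandas as pd

# Hypothetical answers, including one missing value.
answers = pd.Series(["Yes", "No", "Some", "No", None], name="answer")

# dtype=int gives 1/0 instead of True/False.
dummies = pd.get_dummies(answers, dtype=int)

print(dummies)
```

Each column is a "this category vs. the rest" indicator. A row with a missing answer gets 0 in every column, so it is counted with "the rest" unless such rows are dropped beforehand.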

Now that we have our dummy variables set, we can conduct our planned post hoc comparisons. This will be easy using a for loop. There is going to be a bit extra code in the for loop to clean up the output.

for series in dummies:
    nl = "\n"

    # Cross-tabulate one dummy (this category vs. the rest) against the outcome
    crosstab = pd.crosstab(dummies[series], df['current_mental_disorder'])
    print(crosstab, nl)
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")

current_mental_disorder No Yes
No, at none of my previous employers
0 411 380
1 119 194

Chi2 value= 16.90623905539159
p-value= 3.927228826835633e-05
Degrees of freedom= 1

current_mental_disorder No Yes
Some of my previous employers
0 294 308
1 236 266

Chi2 value= 0.29589978434689185
p-value= 0.5864643795737425
Degrees of freedom= 1

current_mental_disorder No Yes
Yes, at all of my previous employers
0 479 550
1 51 24

Chi2 value= 12.040740742132103
p-value= 0.0005205028333059755
Degrees of freedom= 1

Using the Bonferroni-adjusted p-value of 0.017, 2 of the 3 planned pairwise comparisons are significant. There is a significant relationship between current_mental_disorder & No, at none of my previous employers, and current_mental_disorder & Yes, at all of my previous employers. Now we can compare the cells within the Chi-square test table.

  • Looking at current_mental_disorder & No, at none of my previous employers, it can be stated that a higher proportion of individuals with a current mental illness reported they would not have been willing to discuss a mental health issue with their direct supervisor.
  • Looking at current_mental_disorder & Yes, at all of my previous employers, it can be stated that a lower proportion of those with a current mental illness reported they would have been willing to discuss a mental health issue with their direct supervisor.
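The decision rule applied above can be sketched as a simple comparison of each post hoc p-value against the Bonferroni-adjusted threshold. The p-values are copied from the output above; the dictionary and variable names are illustrative.

```python
# p-values reported by the three post hoc 2x2 tests above.
p_values = {
    "No, at none of my previous employers": 3.927228826835633e-05,
    "Some of my previous employers": 0.5864643795737425,
    "Yes, at all of my previous employers": 0.0005205028333059755,
}

# Bonferroni-adjusted threshold for three planned comparisons.
threshold = 0.05 / len(p_values)

# A comparison is significant only if its p-value is below the threshold.
significant = {label: p < threshold for label, p in p_values.items()}

for label, is_significant in significant.items():
    print(f"{label}: p = {p_values[label]:.4g}, significant = {is_significant}")
```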

Chi-square test of independence using custom function

This custom function was developed by Python for Data Science, LLC and we want you to use it! It requires pandas to be imported as pd and scipy.stats to be imported as stats, just like the libraries used in this example were imported at the top of this page. These are the most common ways of importing these libraries.

What this function does is conduct the Chi-square test of independence using the scipy chi2_contingency() function, but it cleans up the formatting of the results, allows one to easily calculate row or column proportions (percentages) if desired, and allows one to export the results to a csv file if desired. The percentages are rounded to 2 decimal places while the Chi-square value and the p-value are rounded to 4 decimal places.

Examples will be provided at the bottom of the code. So here it is! Just copy and paste into the file you are working in, or save function to a separate .py file and import it from there.

def chi2_table(series1, series2, to_csv = False, csv_name = None, 
                prop= False):
    
    if type(series1) != list:
        crosstab = pd.crosstab(series1, series2)
        crosstab2 = pd.crosstab(series1, series2, margins= True)
        crosstab_proprow = round(crosstab2.div(crosstab2.iloc[:,-1], axis=0).mul(100, axis=0), 2)
        crosstab_propcol = round(crosstab2.div(crosstab2.iloc[-1,:], axis=1).mul(100, axis=1), 2)
        chi2, p, dof, expected = stats.chi2_contingency(crosstab)
        
        if prop == False:
            print("\n",
          f"Chi-Square test between " + series1.name + " and " + series2.name,
          "\n", "\n",
          crosstab2,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
            if to_csv == True:
                if csv_name == None:
                    csv_name = f"{series2.name}.csv"
                                             
                file = open(csv_name, 'a')
                file.write(f"{crosstab2.columns.name}\n")
                file.close()
                crosstab2.to_csv(csv_name, header= True, mode= 'a')
                file = open(csv_name, 'a')
                file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                file.write("\n")
                file.close()              
                
        if prop == 'Row':
            print("\n",
          f"Chi-Square test between " + series1.name + " and " + series2.name,
          "\n", "\n",
          crosstab_proprow,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
            if to_csv == True:
                if csv_name == None:
                    csv_name = f"{series2.name}.csv"
                
                file = open(csv_name, 'a')
                file.write(f"{crosstab_proprow.columns.name}\n")
                file.close()
                crosstab_proprow.to_csv(csv_name, header= True, mode= 'a')
                file = open(csv_name, 'a')
                file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                file.write("\n")
                file.close()

        if prop == 'Col':
            print("\n",
          f"Chi-Square test between " + series1.name + " and " + series2.name,
          "\n", "\n",
          crosstab_propcol,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")

            if to_csv == True:
                if csv_name == None:
                    csv_name = f"{series2.name}.csv"
                    
                file = open(csv_name, 'a')
                file.write(f"{crosstab_propcol.columns.name}\n")
                file.close()
                crosstab_propcol.to_csv(csv_name, header= True, mode= 'a')
                file = open(csv_name, 'a')
                file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                file.write("\n")
                file.close()

    elif type(series1) == list and type(series2) == list:
        for entry2 in series2:
            for entry1 in series1:
                crosstab = pd.crosstab(entry1, entry2)
                crosstab2 = pd.crosstab(entry1, entry2, margins= True)
                crosstab_proprow = round(crosstab2.div(crosstab2.iloc[:,-1], axis=0).mul(100, axis=0), 2)
                crosstab_propcol = round(crosstab2.div(crosstab2.iloc[-1,:], axis=1).mul(100, axis=1), 2)
                chi2, p, dof, expected = stats.chi2_contingency(crosstab)
                
                if prop == False:
            
                    print("\n",
          f"Chi-Square test between " + entry1.name + " and " + entry2.name,
          "\n", "\n",
          crosstab2,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
                    if to_csv == True:
                        
                        file = open("%s.csv" %(entry2.name), 'a')
                        file.write(f"{crosstab2.columns.name}\n")
                        file.close()
                        crosstab2.to_csv("%s.csv" %(entry2.name), header= True, mode= 'a')
                        file = open("%s.csv" %(entry2.name), 'a')
                        file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                        file.write("\n")
                        file.close()                        

                if prop == 'Row':
            
                    print("\n",
          f"Chi-Square test between " + entry1.name + " and " + entry2.name,
          "\n", "\n",
          crosstab_proprow,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
                    if to_csv == True:
                        file = open("%s.csv" %(entry2.name), 'a')
                        file.write(f"{crosstab_proprow.columns.name}\n")
                        file.close()
                        crosstab_proprow.to_csv("%s.csv" %(entry2.name), header= True, mode= 'a')
                        file = open("%s.csv" %(entry2.name), 'a')
                        file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                        file.write("\n")
                        file.close()
                    
                if prop == 'Col':
            
                    print("\n",
          f"Chi-Square test between " + entry1.name + " and " + entry2.name,
          "\n", "\n",
          crosstab_propcol,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
                    if to_csv == True:
                        file = open("%s.csv" %(entry2.name), 'a')
                        file.write(f"{crosstab_propcol.columns.name}\n")
                        file.close()
                        crosstab_propcol.to_csv("%s.csv" %(entry2.name), header= True, mode= 'a')
                        file = open("%s.csv" %(entry2.name), 'a')
                        file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                        file.write("\n")
                        file.close()


    elif type(series1) == list:
        for entry in series1:
            crosstab = pd.crosstab(entry, series2)
            crosstab2 = pd.crosstab(entry, series2, margins= True)
            crosstab_proprow = round(crosstab2.div(crosstab2.iloc[:,-1], axis=0).mul(100, axis=0), 2)
            crosstab_propcol = round(crosstab2.div(crosstab2.iloc[-1,:], axis=1).mul(100, axis=1), 2)
            chi2, p, dof, expected = stats.chi2_contingency(crosstab)
            
            if prop == False:
                print("\n",
          f"Chi-Square test between " + entry.name + " and " + series2.name,
          "\n", "\n",
          crosstab2,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
                if to_csv == True:
                    file = open("%s.csv" %(series2.name), 'a')
                    file.write(f"{crosstab2.columns.name}\n")
                    file.close()
                    crosstab2.to_csv("%s.csv" %(series2.name), header= True, mode= 'a')
                    file = open("%s.csv" %(series2.name), 'a')
                    file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                    file.write("\n")
                    file.close()

            if prop == 'Row':
                print("\n",
          f"Chi-Square test between " + entry.name + " and " + series2.name,
          "\n", "\n",
          crosstab_proprow,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
                if to_csv == True:
                    file = open("%s.csv" %(series2.name), 'a')
                    file.write(f"{crosstab_proprow.columns.name}\n")
                    file.close()
                    crosstab_proprow.to_csv("%s.csv" %(series2.name), header= True, mode= 'a')
                    file = open("%s.csv" %(series2.name), 'a')
                    file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                    file.write("\n")
                    file.close()

            if prop == 'Col':
                print("\n",
          f"Chi-Square test between " + entry.name + " and " + series2.name,
          "\n", "\n",
          crosstab_propcol,
          "\n", "\n",
          f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
            
                if to_csv == True:
                    file = open("%s.csv" %(series2.name), 'a')
                    file.write(f"{crosstab_propcol.columns.name}\n")
                    file.close()
                    crosstab_propcol.to_csv("%s.csv" %(series2.name), header= True, mode= 'a')
                    file = open("%s.csv" %(series2.name), 'a')
                    file.write(f"Pearson Chi2({dof})= {chi2:.4f} p-value= {p:.4f}")
                    file.write("\n")
                    file.close()

Now to see the function in action. Some examples are below!

Comparing scipy.stats.chi2_contingency() to custom chi2_table() function

I will use the example from above so the data will be familiar.

crosstab = pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])

stats.chi2_contingency(crosstab)

(32.408194625396376,
4.2928597930482389e-07,
3,
array([[ 37.69547325, 42.30452675],
[ 147.48353909, 165.51646091],
[ 237.48148148, 266.51851852],
[ 35.33950617, 39.66049383]]))

Now see the same analysis conducted with the custom chi2_table() function.

chi2_table(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])

Chi-Square test between willing_discuss_mh_supervisor and current_mental_disorder

current_mental_disorder                No  Yes  All
willing_discuss_mh_supervisor
I don’t know                           51   29   80
No, at none of my previous employers  119  194  313
Some of my previous employers         237  267  504
Yes, at all of my previous employers   51   24   75
All                                   458  514  972

Pearson Chi2(3)= 32.4082 p-value= 0.0000

The function can also handle a list, or multiple lists, being passed. This comes in handy if you want to conduct multiple Chi-square tests on multiple variables. It will conduct all possible Chi-square test comparisons. If only a single list is being passed, it has to be passed as the first series entry.

chi2_table(list_1, df['current_mental_disorder'])

chi2_table(list_1, list_2)

If exporting to a csv file, the file will be named after the column variable (second series entry). In the first example above the file name would be “current_mental_disorder.csv”; in the second, each file is named after the corresponding variable in list_2.

Getting Row/Column Proportions (Percentages)

To have proportions returned in the crosstab, pass either ‘Row’ or ‘Col’ into the “prop= ” argument. The test is run on the count data, so we don’t violate one of the assumptions of the test, and the proportions are then returned for display.

To get row proportions, pass ‘Row’ into the “prop= ” argument for the function.

chi2_table(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'], prop= 'Row')

Chi-Square test between willing_discuss_mh_supervisor and current_mental_disorder

current_mental_disorder                  No    Yes    All
willing_discuss_mh_supervisor
I don’t know                          63.75  36.25    100
No, at none of my previous employers  38.02  61.98    100
Some of my previous employers         47.02  52.98    100
Yes, at all of my previous employers  68.00  32.00    100
All                                   47.12  52.88    100

Pearson Chi2(3)= 32.4082 p-value= 0.0000

To get column proportions, pass ‘Col’ into the “prop= ” argument for the function.

chi2_table(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'], prop= 'Col')

Chi-Square test between willing_discuss_mh_supervisor and current_mental_disorder

current_mental_disorder                  No    Yes    All
willing_discuss_mh_supervisor
I don’t know                          11.14   5.64   8.23
No, at none of my previous employers  25.98  37.74  32.30
Some of my previous employers         51.75  51.95  51.85
Yes, at all of my previous employers  11.14   4.67   7.72
All                                     100    100    100

Pearson Chi2(3)= 32.4082 p-value= 0.0000

Exporting to csv file

If you wish to export your results to a csv file, pass “to_csv= True”. If you wish to name the file, pass your file name as a string in the “csv_name= ” argument. If no file name is entered, it will automatically use the column variable name as the csv file name.

chi2_table(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'], 
           to_csv= True)

chi2_table(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'], 
           to_csv= True, csv_name = "Chi_square_tests.csv")

How do you run a chi

To use the chi-square test, we can take the following steps:

  • Define the null (H0) and alternative (H1) hypotheses.
  • Determine the value of alpha (α) according to the domain you are working in. ...
  • Check the data for NaNs or other kinds of errors.
  • Check the assumptions for the test.

What is chi2 in Python?

The Chi-square (χ2) test for independence (Pearson Chi-square test) is a non-parametric (distribution-free) method used to test the relationship between two categorical (nominal) variables in a contingency table.

Can I use chi

Both correlations and chi-square tests can test for relationships between two variables. However, a correlation is used when you have two quantitative variables and a chi-square test of independence is used when you have two categorical variables.