Chi-square test unequal sample size python

You can use a chi-squared test in your example with different sample sizes. Your "another verb type" would be verbs that are not oral verbs, i.e. all the other verbs.

Suppose in your example, $10$ of the $82$ verbs in sample one were oral verbs and $72$ were not, while $20$ of the $89$ verbs in sample two were oral verbs and $69$ were not. Then the table for your four-cell chi-squared test could look like

            oral  other | total
sample 1      10     72 |    82
sample 2      20     69 |    89
            ----  -----   -----
total         30    141 |   171

and in R you might get

chisq.test(rbind(c(10, 72), c(20, 69)))

#     Pearson's Chi-squared test with Yates' continuity correction
#
# data:  rbind(c(10, 72), c(20, 69))
# X-squared = 2.4459, df = 1, p-value = 0.1178

so the difference in this example would not be statistically significant at the 5% level.
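Since the question asks about Python, here is a minimal sketch of the same test with `scipy.stats.chi2_contingency`, assuming SciPy is installed. The counts are the ones from the table above; the variable names are only illustrative. It should give the same statistic and p-value as the R output.

```python
from scipy.stats import chi2_contingency

# Rows: sample one, sample two; columns: oral verbs, other verbs
table = [[10, 72], [20, 69]]

# correction=True (the default) applies Yates' continuity correction to 2x2 tables
chi2_stat, p_value, dof, expected = chi2_contingency(table, correction=True)

print(f"X-squared = {chi2_stat:.4f}, df = {dof}, p-value = {p_value:.4f}")
print(expected)  # expected counts are built from the row and column totals
```

The two rows have different totals (82 and 89), and the function needs no special handling for that.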

I have two sets of data as shown below. Each data set has a different length.

X_data1 and Y_data1 (black binned data) have a length of 40, whereas X_data2 and Y_data2 (red) have a length of 18k.

I would like to perform a Chi-Square Goodness of Fit Test on these two data sets as follows:

from scipy import stats
stats.chisquare(f_obs=Y_data1, f_exp=Y_data2)

But I cannot, since the vectors are not the same size and I receive an error:

~/opt/miniconda3/lib/python3.9/site-packages/scipy/stats/stats.py in chisquare(f_obs, f_exp, ddof, axis)
   6850
   6851     """
-> 6852     return power_divergence(f_obs, f_exp=f_exp, ddof=ddof, axis=axis,
   6853                             lambda_="pearson")
   6854

~/opt/miniconda3/lib/python3.9/site-packages/scipy/stats/stats.py in power_divergence(f_obs, f_exp, ddof, axis, lambda_)
   6676     if f_exp is not None:
   6677         f_exp = np.asanyarray(f_exp)
-> 6678         bshape = _broadcast_shapes(f_obs_float.shape, f_exp.shape)
   6679         f_obs_float = _m_broadcast_to(f_obs_float, bshape)
   6680         f_exp = _m_broadcast_to(f_exp, bshape)

~/opt/miniconda3/lib/python3.9/site-packages/scipy/stats/stats.py in _broadcast_shapes(shape1, shape2)
    184                 n = n1
    185             else:
--> 186                 raise ValueError(f'shapes {shape1} and {shape2} could not be '
    187                                  'broadcast together')
    188             shape.append(n)

ValueError: shapes (40,) and (18200,) could not be broadcast together

Is there a way in Python to compare these two data sets?
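One possible approach, sketched here under assumptions rather than offered as a definitive answer: `scipy.stats.chisquare` needs one expected value per observed bin, so the dense curve can be evaluated at the 40 binned x positions with `numpy.interp` and rescaled so both totals agree. The arrays below are synthetic stand-ins with the same shapes as those in the question. The sketch assumes the binned Y values are counts and that the dense curve describes the expected count at each x.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins (assumption): a dense 18200-point curve and 40 binned counts
rng = np.random.default_rng(0)
X_data2 = np.linspace(0.0, 10.0, 18200)
Y_data2 = 50.0 * np.exp(-0.3 * X_data2) + 5.0
X_data1 = np.linspace(0.5, 9.5, 40)
Y_data1 = rng.poisson(50.0 * np.exp(-0.3 * X_data1) + 5.0).astype(float)

# Evaluate the dense curve at the binned x positions (X_data2 must be increasing)
f_exp = np.interp(X_data1, X_data2, Y_data2)

# chisquare expects observed and expected totals to agree, so rescale the expected values
f_exp *= Y_data1.sum() / f_exp.sum()

chi2_stat, p_value = stats.chisquare(f_obs=Y_data1, f_exp=f_exp)
print(chi2_stat, p_value)
```

If the Y values are not counts (for example normalized densities), the chi-square goodness-of-fit test is not appropriate in this form.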

Renesh Bedre

Chi-square (χ²) test for independence (Pearson Chi-square test)

The chi-square test is a non-parametric (distribution-free) method used to test for an association between two categorical (nominal) variables in a contingency table.
For example, if we have different treatments (treated and nontreated) and treatment outcomes (cured and noncured), we could use the chi-square test for independence to check whether treatments are related to treatment outcomes.
The chi-square test relies on an approximation (it gives an approximate p value) and hence requires a larger sample size. The expected frequency count should not be < 5 for more than 20% of cells. If the sample size is small, the chi-square test is not accurate, and you should use Fisher’s exact test.
Note: The chi-square test for independence is different from the chi-square goodness of fit test.

Formula

The chi-square (χ²) test for independence on a 2x2 contingency table is equivalent to the two-sample Z-test for proportions. The uncorrected χ² statistic is equal to the square of the Z statistic obtained from the two independent samples’ proportions.
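The Pearson statistic is $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where each expected count is (row total × column total) / grand total. The following sketch checks the equivalence with the pooled two-proportion Z statistic numerically. The 2x2 counts are illustrative and the variable names are not from any particular library.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts: rows are the two groups, columns are outcome yes / no
observed = np.array([[60, 10], [30, 25]])

# Uncorrected Pearson chi-square (no Yates' correction)
chi2_stat, p_chi2, dof, expected = chi2_contingency(observed, correction=False)

# Two-sample Z statistic for proportions, using the pooled proportion
n1, n2 = observed.sum(axis=1)
p1 = observed[0, 0] / n1
p2 = observed[1, 0] / n2
p_pool = observed[:, 0].sum() / (n1 + n2)
z = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(chi2_stat, z ** 2)  # the two values agree up to floating-point error
```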

Hypotheses for Chi-square test for independence

- Null hypothesis: The two categorical variables are independent (no association between the two variables) (H0: Oi = Ei)
- Alternative hypothesis: The two categorical variables are dependent (there is an association between the two variables) (Ha: Oi ≠ Ei)
- Note: The p value from this test is not one-tailed or two-tailed in the usual sense. The rejection region of the chi-square test is always in the right tail of the distribution.


Chi-square test assumptions

The two variables are categorical (nominal) and the data are randomly sampled
The levels of the variables are mutually exclusive
The expected frequency count is at least 5 for at least 80% of the cells in the contingency table. Fisher’s exact test is appropriate for small frequency counts.
No expected frequency count should be less than 1
Observations should be independent of each other
Observation data should be frequency counts and not percentages, proportions or transformed data
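The expected-count assumptions can be checked before choosing a test. This is a minimal sketch, assuming a 2x2 table of made-up counts. It uses `expected_freq` from `scipy.stats.contingency` and falls back to `scipy.stats.fisher_exact` when the rules of thumb above are violated.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
from scipy.stats.contingency import expected_freq

# Made-up small-sample 2x2 table (assumption, for illustration only)
observed = np.array([[3, 7], [8, 2]])

expected = expected_freq(observed)
share_below_5 = np.mean(expected < 5)  # fraction of cells with expected count < 5

if expected.min() < 1 or share_below_5 > 0.20:
    # Rules of thumb violated: use Fisher's exact test (2x2 tables)
    odds_ratio, p_value = fisher_exact(observed)
    test_used = "fisher"
else:
    chi2_stat, p_value, dof, _ = chi2_contingency(observed)
    test_used = "chi-square"

print(test_used, p_value)
```

Here half of the cells have an expected count of 4.5, so the sketch selects Fisher's exact test.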

Calculate a chi-square test for independence in Python

We will use the bioinfokit (v0.9.5 or later) and scipy Python packages.
Check the bioinfokit documentation for installation instructions.
A hypothetical dataset is used for the chi-square test for independence.

Note: If you have your own dataset, you should import it as a pandas DataFrame.

Chi-square test for independence using bioinfokit:

from bioinfokit.analys import stat, get_data
# load example dataset
df = get_data('drugdata').data
df.head()
# output
   treatments  cured  noncured
0     treated     60        10
1  nontreated     30        25
# set treatments column as index
df = df.set_index('treatments')
df.head()
# output
            cured  noncured
treatments
treated        60        10
nontreated     30        25

# run chi-square test for independence
res = stat()
res.chisq(df=df)

# output
print(res.summary)
# corrected for the Yates’ continuity
Chi-squared test for independence

Test              Df    Chi-square      P-value
--------------  ----  ------------  -----------
Pearson            1       13.3365  0.000260291
Log-likelihood     1       13.4687  0.000242574

print(res.expected_df)

Expected frequency counts

      cured    noncured
--  -------  ----------
 0     50.4        19.6
 1     39.6        15.4

Chi-square test for independence using the chi2_contingency function from the scipy package:

import numpy as np
from scipy.stats import chi2_contingency
# using Pearson’s chi-squared statistic
# corrected for the Yates’ continuity
observed = np.array([[60, 10], [30, 25]])
chi_val, p_val, dof, expected = chi2_contingency(observed)
chi_val, p_val, dof, expected
# output
(13.3364898989899, 0.0002602911116400899, 1, array([[50.4, 19.6],
       [39.6, 15.4]]))

# without Yates’ correction for continuity
chi_val, p_val, dof, expected = chi2_contingency(observed, correction=False)
chi_val, p_val, dof, expected
# output
(14.842300556586274, 0.00011688424010613195, 1, array([[50.4, 19.6],
       [39.6, 15.4]]))

# for log-likelihood method run command as below
chi_val, p_val, dof, expected = chi2_contingency(observed, lambda_="log-likelihood")

Yates’ correction for continuity

In the χ² test, the discrete distribution of the observed counts is approximated by the continuous chi-squared distribution. The approximation introduces error, particularly with small counts, which a continuity correction aims to reduce.
Yates’ correction for continuity applies to the 2x2 contingency table and adjusts the difference between observed and expected counts by subtracting 0.5 from its absolute value before squaring: χ² = Σ (|Oi − Ei| − 0.5)² / Ei.
Yates’ correction for continuity increases the p value by reducing the χ² value. The corrected p value is usually close to that of exact tests such as Fisher’s exact test. Sometimes, Yates’ correction may give an overcorrected p value.
χ² and Yates’ corrected χ² produce similar results on large samples, but Yates’ corrected χ² can be conservative on smaller samples and gives a higher p value.
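To make the effect of the correction concrete, the sketch below computes both statistics by hand for a 2x2 table with counts 60, 10, 30 and 25, then compares them with `chi2_contingency`. The variable names are illustrative. SciPy limits the adjustment so that it never exceeds |O − E|, which makes no difference here because every |O − E| is larger than 0.5.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[60, 10], [30, 25]])

# Expected counts from the margins: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Uncorrected and Yates-corrected Pearson statistics
chi2_plain = ((observed - expected) ** 2 / expected).sum()
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# Right-tail p values with 1 degree of freedom
p_plain = chi2.sf(chi2_plain, df=1)
p_yates = chi2.sf(chi2_yates, df=1)

print(chi2_plain, p_plain)
print(chi2_yates, p_yates)
```

The corrected statistic is smaller and its p value larger, which is the conservative behaviour described above.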

Interpretation

The p value obtained from the chi-square test for independence is significant (p < 0.05). We therefore conclude that there is a significant association between treatment (treated and nontreated) and treatment outcome (cured and noncured).




This work is licensed under a Creative Commons Attribution 4.0 International License

Can you do a chi-square test with unequal sample sizes?

Unequal sample sizes do not affect the ability to calculate the chi-square test statistic. In fact, it is fairly rare to have equal sample sizes. The expected values take the sample sizes into account.
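A small sketch of why this works, using made-up counts: the second group is ten times larger than the first, but both have the same 10% rate. Because each expected count is built from the row and column totals, the expected table scales with each group's size and the uncorrected statistic comes out as zero.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: group B is ten times larger than group A, same 10% rate in both
observed = np.array([[10, 90], [100, 900]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)            # each row's expected counts scale with that row's total
print(chi2_stat, p_value)  # identical proportions give no evidence of association
```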

Can you run a t-test with unequal sample sizes?

Yes, you can perform a t-test when the sample sizes are not equal. Equal sample sizes are not one of the assumptions of a t-test. The real issues arise when the two samples do not have equal variances, which is one of the assumptions of the standard (Student's) t-test.
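As a hedged illustration with simulated data, `scipy.stats.ttest_ind` accepts samples of different lengths, and passing `equal_var=False` runs Welch's t-test, which does not assume equal variances. The sample sizes and spreads below are arbitrary choices for the sketch.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated samples (assumption): unequal sizes and unequal spreads
rng = np.random.default_rng(42)
small = rng.normal(loc=0.0, scale=1.0, size=15)
large = rng.normal(loc=0.0, scale=3.0, size=200)

# Student's t-test assumes equal variances; Welch's t-test does not
student = ttest_ind(small, large, equal_var=True)
welch = ttest_ind(small, large, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```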

How do you compare data with different sample sizes?

One way to compare two data sets of different sizes is to divide the larger set into N equal-sized subsets, each the size of the smaller set. The comparison can be based on the sum of absolute differences. This measures how many of the N subsets closely match the single 4-sample set.
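A rough sketch of that idea with NumPy, using simulated values. The 4-sample set, the 200-sample set and the `threshold` for a "close match" are all hypothetical choices, and the large set's length is assumed to be an exact multiple of the small set's length.

```python
import numpy as np

# Simulated data (assumption): one 4-sample set and a larger 200-sample set
rng = np.random.default_rng(1)
small = rng.normal(size=4)
large = rng.normal(size=4 * 50)

# Split the large set into N chunks, each the size of the small set
chunks = large.reshape(-1, small.size)

# Sum of absolute differences between each chunk and the small set
sad = np.abs(chunks - small).sum(axis=1)

threshold = 2.0  # hypothetical cut-off for a "close match"
n_close = int((sad <= threshold).sum())

print(f"{n_close} of {len(chunks)} chunks are within the threshold")
```

This is a descriptive comparison, not a hypothesis test, and the result depends on the threshold and on how the large set is ordered before splitting.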

Do sample sizes need to be equal?

A sample size imbalance isn't a tell-tale sign of a poor study. You don't need equal-sized groups to compute accurate statistics. If the sample size imbalance is due to drop-outs rather than due to design, simple randomisation or technical glitches, this is something to take into account when interpreting the results.