How to calculate p-values for a pandas correlation matrix in Python

Solving your problem requires both math and programming. Since df.corr() returns quickly enough in your case, I will focus on the p-value instead:

Programming

scipy.stats.pearsonr(col_x, col_y) does not like dealing with NaN, so for each pair of columns you have to remove every row where either element is NaN. With 550 columns, that's 550 * 549 / 2 = 150,975 pairs, so you'd better make sure your loop is extremely efficient. The naive approach, sketched right below, is exactly what you want to avoid.
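For reference, here is a minimal sketch of that naive approach (the function name is mine, for illustration only):

import scipy.stats

def naive_corr(df):
    # one pearsonr call per pair: 150,975 dropna() scans over the rows
    results = {}
    for i, a in enumerate(df.columns):
        for b in df.columns[:i]:
            pair = df[[a, b]].dropna()  # keep only rows where both values are present
            results[(a, b)] = scipy.stats.pearsonr(pair[a], pair[b])
    return results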

If you view its source code, DataFrame.corr is so blisteringly fast for 2 reasons:

  • It's coded in Cython and runs outside the Global Interpreter Lock (GIL). That means the looping happens in bare-metal C and hence is very fast
  • It implements its own single-pass variance algorithm (Welford's method) and does not rely on scipy.stats. The complexity of this algorithm is O(n * m^2), where n is the number of rows and m is the number of columns. A rough sketch of the idea follows this list.
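To give a feel for Welford's method, here is my own minimal Python sketch of the single-pass correlation update (not pandas' actual implementation):

def welford_corr(xs, ys):
    # running means and (co)moments, updated in one pass over the data
    n = 0
    mean_x = mean_y = 0.0
    m2_x = m2_y = c_xy = 0.0
    for x, y in zip(xs, ys):
        n += 1
        dx = x - mean_x          # deviation from the *old* mean
        mean_x += dx / n
        dy = y - mean_y
        mean_y += dy / n
        m2_x += dx * (x - mean_x)
        m2_y += dy * (y - mean_y)
        c_xy += dx * (y - mean_y)
    # the 1/(n-1) factors cancel in the ratio
    return c_xy / (m2_x * m2_y) ** 0.5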

Math

The pearsonr documentation describes how the p-value is calculated:

r = <the Pearson correlation coefficient>
n = <the number of non-NaN pairs of observations>
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))

Wikipedia gives us the formula for the CDF of the Beta distribution:

cdf(x; alpha, beta) = B(x; alpha, beta) / B(alpha, beta)
                    = scipy.special.betainc(alpha, beta, x)

Fortunately for us, betainc is vectorized: if we pass in three arrays of the same length as parameters, it returns an array as the output.
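As a quick sanity check (my own snippet, using made-up random data), the betainc formula reproduces pearsonr's p-value:

import numpy as np
import scipy.special
import scipy.stats

rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)

r, p_ref = scipy.stats.pearsonr(a, b)

# same quantity via the regularized incomplete Beta function:
# dist.cdf(-abs(r)) with loc=-1, scale=2 evaluates betainc at (1 - |r|) / 2
alpha = len(a) / 2 - 1
p = 2 * scipy.special.betainc(alpha, alpha, (1 - abs(r)) / 2)
assert np.isclose(p, p_ref)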


Solution 1

This solution is in native Python and provides reasonable performance on your dataset (500 × 550). It took about 30 seconds on my 2014 iMac with 16GB of RAM:

import numpy as np
import pandas as pd
import scipy.special

def corr1(df):
    mask = df.notna().to_numpy()
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Initialize the return arrays for better performance
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=object)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    # For each 2-combination of columns, let `n` be the number of rows whose
    # elements are both non-NaN. We will need that later to calculate the
    # p-value
    k = -1
    for i in range(n_cols):
        for j in range(i):
            n = 0
            for row in range(n_rows):
                n += 1 if mask[row, i] and mask[row, j] else 0

            k += 1
            idx[k] = (i, j)
            correl[k] = corr[i,j]
            count[k] = n

    # The p-value can be obtained with the incomplete Beta function (betainc)
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2                # maps -1 <= correl <= 1 onto 0 <= x <= 1
    x = np.where(correl < 0, x, 1 - x)  # either way, x becomes (1 - |correl|) / 2
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    
    return idx, correl, p

# Manipulate the return values into the right format
index, corr, p = corr1(df)

idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()
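
The columns of result form a two-level MultiIndex, so the square matrices can be pulled out directly (the diagonal stays NaN because corr1 only visits pairs with i > j):

p_matrix = result['p']        # square DataFrame of pairwise p-values
corr_matrix = result['corr']  # matches df.corr() everywhere except the NaN diagonal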

Solution 2

For the absolute fastest solution, you would have to write it in Cython. That brings the execution time down from 30 seconds to 5. I'm sure further optimizations are possible (but I'm too lazy to explore them). The trade-off is a more complex build and deployment process.

First, make sure you have a C compiler. Then install the Cython package:

pip install cython

Next, make 2 files: setup.py and utility.pyx:

#################################################
# setup.py
#################################################
from setuptools import setup  # distutils is deprecated; setuptools provides the same API
from Cython.Build import cythonize
import numpy

compiler_directives = {
    'language_level': '3'
}
setup(
    ext_modules=cythonize("utility.pyx", compiler_directives=compiler_directives),
    include_dirs=[numpy.get_include()]
)
#################################################
# utility.pyx
#################################################
import cython
from cython import Py_ssize_t

import numpy as np
from numpy cimport (
    ndarray,
    float64_t,
    uint8_t,
    uint64_t,
)

import scipy.special

@cython.boundscheck(False)
@cython.wraparound(False)
def corr2(object df):
    # These variables go into the `nogil` context (i.e. into C land) so they
    # must be statically typed
    cdef:
        Py_ssize_t n_rows, n_cols, i, j, row, n, k
        ndarray[uint8_t, ndim=2] mask
        ndarray[float64_t, ndim=2] corr
        #
        ndarray[uint64_t, ndim=2] idx
        ndarray[float64_t, ndim=1] correl
        ndarray[uint64_t, ndim=1] count

    # We are still in Python land and thus have full access to all functions in
    # numpy and pandas. Converting the pandas dataframe to 2D numpy arrays
    # gives a huge speed boost
    mask = df.notna().to_numpy(dtype='uint8')
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Initialize the return arrays
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=np.uint64)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    with nogil:
        # It's C-land in here. Everything must be statically defined

        k = -1
        # For each 2-combination of columns, let `n` be the number of rows whose
        # elements are both non-NaN. We will need that later to calculate the
        # p-value
        for i in range(n_cols):
            for j in range(i):
                n = 0
                for row in range(n_rows):
                    n += 1 if mask[row, i] and mask[row, j] else 0
            
                k += 1
                idx[k, 0] = i
                idx[k, 1] = j
                correl[k] = corr[i,j]
                count[k] = n
    
    # Back to Python-land
    # The p-value can be obtained with the incomplete Beta function (betainc)
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2                # since -1 <= correl <= 1, this makes 0 <= x <= 1
    x = np.where(correl < 0, x, 1 - x)  # either way, x becomes (1 - |correl|) / 2
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    return idx, correl, p

Next, compile utility.pyx into machine code:

python setup.py build_ext --inplace

Then you can use utility just like any other Python module:

from utility import corr2

index, corr, p = corr2(df)

idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()