Solving your problem requires both math and programming. Since df.corr
returns pretty quickly for you case, i will focus on the p-value instead:
Programming
scipy.stats.pearsonr[col_x, col_y]
does not like dealing with NaN. So for each pair of columns, you have to remove all rows where one or both of the elements are NaN. You have 550 columns, hence 550 * 549 / 2 = 150,975 pairs. You better make sure your loop is extremely efficient.
If you view its
source code, DataFrame.corr
does it so blisteringly fast for 2 reasons:
- It's coded in Cython and ran outside the Global Interpreter Lock [GIL]. That means the looping is in bare-metal C and hence very fast
- It implements its own variance algorithm
[Welford's method] and does not rely on
scipy.stats
. The complexity of this algorithm isO[n * m^2]
wheren
is the number of rows andm
is the number of columns.
Math
The pearsonr
documentation provides a
description on how the p-value is calculated:
r =
dist = scipy.stats.beta[n/2 - 1, n/2 - 1, loc=-1, scale=2]
p = 2 * dist.cdf[-abs[r]]
Wikipedia gives us the formula for the CDF of the Beta distribution:
cdf[x, alpha, beta] = B[x, alpha, beta] / B[alpha, beta]
= scipy.special.betainc[alpha, beta, x]
Fortunately for us, the betainc
function is vectorized so if we pass in 3 arrays of the same length as parameters, it will return an array as the output.
Solution 1
This solution is in native Python provide a reasonable performance on your dataset [500 * 550]. Took about 30 seconds on my 2014 iMac with 16GB of RAM:
import scipy.special
def corr1[df]:
mask = df.notna[].to_numpy[]
corr = df.corr[].to_numpy[]
n_rows, n_cols = df.shape
# Initialize the return arrays for better performance
length = int[n_cols * [n_cols - 1] / 2]
idx = np.empty[[length, 2], dtype=object]
correl = np.empty[length, dtype=np.float64]
count = np.empty[length, dtype=np.uint64]
# For 2-combination of columns, let `n` be the number of pairs whose
# elements are all non-NaN. We will need that later to calculate the
# p-value
k = -1
for i in range[n_cols]:
for j in range[i]:
n = 0
for row in range[n_rows]:
n += 1 if mask[row, i] and mask[row, j] else 0
k += 1
idx[k] = [i, j]
correl[k] = corr[i,j]
count[k] = n
# The p-value can be obtained with the incomplete Beta function [betainc]
# We just need to massage the input a little
alpha = count / 2 - 1
x = [correl + 1] / 2
x = np.where[correl < 0, x, 1 - x]
p = 2 * scipy.special.betainc[alpha, alpha, x]
return idx, correl, p
# Manipulate the return values into the right format
index, corr, p = corr1[df]
idx = pd.MultiIndex.from_tuples[
[[df.columns[i], df.columns[j]] for i, j in index] +
[[df.columns[j], df.columns[i]] for i, j in index]
]
full_index = pd.MultiIndex.from_product[[df.columns, df.columns]]
result = pd.DataFrame[{
'corr': np.tile[corr, 2],
'p': np.tile[p, 2]
}, index=idx].reindex[full_index].unstack[]
Solution 2
For the absolutely fastest solution, you would have to write it in Cython. This brings down the execution time from 30 to 5 seconds. Am I sure further optimization is possible [but I'm too lazy to explore them]. The trade off is a more complex build and deployment process.
First, make sure you have a C compiler. Then install the Cython package:
pip install cython
Next, make 2 files: setup.py
and utility.pyx
:
#################################################
# setup.py
#################################################
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy
compiler_directives = {
'language_level': '3'
}
setup[
ext_modules=cythonize["utility.pyx", compiler_directives=compiler_directives],
include_dirs=[numpy.get_include[]]
]
#################################################
# utility.pyx
#################################################
import cython
from cython import Py_ssize_t
import numpy as np
from numpy cimport [
ndarray,
float64_t,
uint8_t,
uint64_t,
]
import scipy.special
@cython.boundscheck[False]
@cython.wraparound[False]
def corr2[object df]:
# These variables go into the `nogil` context [i.e. into C land] so they
# must be statically typed
cdef:
Py_ssize_t n_rows, n_cols, i, j, row, n, k
ndarray[uint8_t, ndim=2] mask
ndarray[float64_t, ndim=2] corr
#
ndarray[uint64_t, ndim=2] idx
ndarray[float64_t, ndim=1] correl
ndarray[uint64_t, ndim=1] count
# We are still in Python land and thus have full access of all functions in
# numpy and pandas. Converting pandas dataframes to a 2D numpy array
# gives a huge speed boost
mask = df.notna[].to_numpy[dtype='uint8']
corr = df.corr[].to_numpy[]
n_rows, n_cols = df.shape
# Initialize the return arrays
length = int[n_cols * [n_cols - 1] / 2]
idx = np.empty[[length, 2], dtype=np.uint64]
correl = np.empty[length, dtype=np.float64]
count = np.empty[length, dtype=np.uint64]
with nogil:
# It's C-land in here. Everything must be statically defined
k = -1
# For 2-combination of columns, let `n` be the number of pairs whose
# elements are all non-NaN. We will need that later to calculate the
# p-value
for i in range[n_cols]:
for j in range[i]:
n = 0
for row in range[n_rows]:
n += 1 if mask[row, i] and mask[row, j] else 0
k += 1
idx[k, 0] = i
idx[k, 1] = j
correl[k] = corr[i,j]
count[k] = n
# Back to Python-land
# The p-value can be obtained with the incomplete Beta function [betainc]
# We just need to massage the input a little
alpha = count / 2 - 1
x = [correl + 1] / 2 # since -1