Hướng dẫn calculate p-value python pandas

Solving your problem requires both math and programming. Since df.corr returns pretty quickly for you case, i will focus on the p-value instead:

Programming

scipy.stats.pearsonr[col_x, col_y] does not like dealing with NaN. So for each pair of columns, you have to remove all rows where one or both of the elements are NaN. You have 550 columns, hence 550 * 549 / 2 = 150,975 pairs. You better make sure your loop is extremely efficient.

If you view its source code, DataFrame.corr does it so blisteringly fast for 2 reasons:

It's coded in Cython and ran outside the Global Interpreter Lock [GIL]. That means the looping is in bare-metal C and hence very fast
It implements its own variance algorithm [Welford's method] and does not rely on scipy.stats. The complexity of this algorithm is O[n * m^2] where n is the number of rows and m is the number of columns.

Math

The pearsonr documentation provides a description on how the p-value is calculated:

r = 
dist = scipy.stats.beta[n/2 - 1, n/2 - 1, loc=-1, scale=2]
p = 2 * dist.cdf[-abs[r]]

Wikipedia gives us the formula for the CDF of the Beta distribution:

cdf[x, alpha, beta] = B[x, alpha, beta] / B[alpha, beta]
                    = scipy.special.betainc[alpha, beta, x]

Fortunately for us, the betainc function is vectorized so if we pass in 3 arrays of the same length as parameters, it will return an array as the output.

Solution 1

This solution is in native Python provide a reasonable performance on your dataset [500 * 550]. Took about 30 seconds on my 2014 iMac with 16GB of RAM:

import scipy.special

def corr1[df]:
    mask = df.notna[].to_numpy[]
    corr = df.corr[].to_numpy[]
    n_rows, n_cols = df.shape

    # Initialize the return arrays for better performance
    length = int[n_cols * [n_cols - 1] / 2]
    idx = np.empty[[length, 2], dtype=object]
    correl = np.empty[length, dtype=np.float64]
    count = np.empty[length, dtype=np.uint64]

    # For 2-combination of columns, let `n` be the number of pairs whose
    # elements are all non-NaN. We will need that later to calculate the
    # p-value
    k = -1
    for i in range[n_cols]:
        for j in range[i]:
            n = 0
            for row in range[n_rows]:
                n += 1 if mask[row, i] and mask[row, j] else 0

            k += 1
            idx[k] = [i, j]
            correl[k] = corr[i,j]
            count[k] = n

    # The p-value can be obtained with the incomplete Beta function [betainc]
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = [correl + 1] / 2
    x = np.where[correl < 0, x, 1 - x]
    p = 2 * scipy.special.betainc[alpha, alpha, x]
    
    return idx, correl, p

# Manipulate the return values into the right format
index, corr, p = corr1[df]

idx = pd.MultiIndex.from_tuples[
    [[df.columns[i], df.columns[j]] for i, j in index] +
    [[df.columns[j], df.columns[i]] for i, j in index]
]
full_index = pd.MultiIndex.from_product[[df.columns, df.columns]]
result = pd.DataFrame[{
    'corr': np.tile[corr, 2],
    'p': np.tile[p, 2]
}, index=idx].reindex[full_index].unstack[]

Solution 2

For the absolutely fastest solution, you would have to write it in Cython. This brings down the execution time from 30 to 5 seconds. Am I sure further optimization is possible [but I'm too lazy to explore them]. The trade off is a more complex build and deployment process.

First, make sure you have a C compiler. Then install the Cython package:

pip install cython

Next, make 2 files: setup.py and utility.pyx:

#################################################
# setup.py
#################################################
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy

compiler_directives = {
    'language_level': '3'
}
setup[
    ext_modules=cythonize["utility.pyx", compiler_directives=compiler_directives],
    include_dirs=[numpy.get_include[]]
]

#################################################
# utility.pyx
#################################################
import cython
from cython import Py_ssize_t

import numpy as np
from numpy cimport [
    ndarray,
    float64_t,
    uint8_t,
    uint64_t,
]

import scipy.special

@cython.boundscheck[False]
@cython.wraparound[False]
def corr2[object df]:
    # These variables go into the `nogil` context [i.e. into C land] so they
    # must be statically typed
    cdef:
        Py_ssize_t n_rows, n_cols, i, j, row, n, k
        ndarray[uint8_t, ndim=2] mask
        ndarray[float64_t, ndim=2] corr
        #
        ndarray[uint64_t, ndim=2] idx
        ndarray[float64_t, ndim=1] correl
        ndarray[uint64_t, ndim=1] count

    # We are still in Python land and thus have full access of all functions in
    # numpy and pandas. Converting pandas dataframes to a 2D numpy array
    # gives a huge speed boost
    mask = df.notna[].to_numpy[dtype='uint8']
    corr = df.corr[].to_numpy[]
    n_rows, n_cols = df.shape

    # Initialize the return arrays
    length = int[n_cols * [n_cols - 1] / 2]
    idx = np.empty[[length, 2], dtype=np.uint64]
    correl = np.empty[length, dtype=np.float64]
    count = np.empty[length, dtype=np.uint64]

    with nogil:
        # It's C-land in here. Everything must be statically defined

        k = -1
        # For 2-combination of columns, let `n` be the number of pairs whose
        # elements are all non-NaN. We will need that later to calculate the
        # p-value
        for i in range[n_cols]:
            for j in range[i]:
                n = 0
                for row in range[n_rows]:
                    n += 1 if mask[row, i] and mask[row, j] else 0
            
                k += 1
                idx[k, 0] = i
                idx[k, 1] = j
                correl[k] = corr[i,j]
                count[k] = n
    
    # Back to Python-land
    # The p-value can be obtained with the incomplete Beta function [betainc]
    # We just need to massage the input a little
    alpha = count / 2 - 1
    x = [correl + 1] / 2                # since -1


				
					

                 
	Bài Viết Liên Quan
	
	 	
		
		   
		   
		   
		
		
			Hướng dẫn dùng picture uploader trong PHP

		
	

		
		
		   
		   
		   
		
		
			Css nth-child from 1 to 5

		
	

		
		
		   
		   
		   
		
		
			Xổ số 9/3/2023

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn dấu tích trong html

		
	

		
		
		   
		   
		   
		
		
			How to remove spaces between decimals and units in python

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn php market share 2022

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn php exec in background

		
	

		
		
		   
		   
		   
		
		
			What is class abstract class and interface in php?

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn dùng die fun trong PHP

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn php const array

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn dùng php3 trong PHP

		
	

		
		
		   
		   
		   
		
		
			Can you have two for loops in python?

		
	

		
		
		   
		   
		   
		
		
			Get parameter from url php

		
	

		
		
		   
		   
		   
		
		
			Set which python to use

		
	

		
		
		   
		   
		   
		
		
			File lịch de bàn 2023

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn hangman python stack overflow

		
	

		
		
		   
		   
		   
		
		
			Python read log file continuously

		
	

		
		
		   
		   
		   
		
		
			Hướng dẫn dùng command brand python

		
	

		
		
		   
		   
		   
		
		
			What is password in javascript?

		
	

		
		
		   
		   
		   
		
		
			How to remove letter from string in php?

		
	

	
	




Toplist mới

 
	
	 
		#1
		
			Top 9 tập bản đồ lớp 8 bài 31 2023
			5 tháng trước
		
	



	
	 
		#2
		
			Top 6 kết quả thi hsg đà nẵng 2022 2023
			5 tháng trước
		
	



	
	 
		#3
		
			Top 9 tủ nhựa đài loan 4 cánh 3d 2023
			5 tháng trước
		
	



	
	 
		#4
		
			Top 9 chất khí có thể làm mất màu dung dịch nước brom là: a. so2. b. co2. c. o2. d. hcl. 2023
			5 tháng trước
		
	



	
	 
		#5
		
			Top 8 tìm việc làm tiện, phay bảo q7 2023
			5 tháng trước
		
	



	
	 
		#6
		
			Top 3 tôi xuyên thành tiểu kiều the của lão đại phản 2 2023
			5 tháng trước
		
	



	
	 
		#7
		
			Top 9 đổi mới phong cách, thái độ phục vụ của cán bộ y tế hướng tới sự hài lòng của người bệnh 2023
			5 tháng trước
		
	



	
	 
		#8
		
			Top 2 bài the dục phát triển chung lớp 6 2022 2023
			5 tháng trước
		
	



	
	 
		#9
		
			Top 3 bài giảng vũ điệu sắc màu (lớp 4) 2023
			5 tháng trước
		
	






		


	Bài mới nhất
	
	 	
		
		   
		   
		   
		
		
			Camera điều khiển tren không gọi là gì năm 2024

		
	

		
		
		   
		   
		   
		
		
			Ký hiệu ext trên tay cầm ps4 pro là gì năm 2024

		
	

		
		
		   
		   
		   
		
		
			Nghĩa vụ cơ bản của hợp đồng là gì năm 2024

		
	

		
		
		   
		   
		   
		
		
			Dùng ram như thế nào cho máy tính xem phim năm 2024

		
	

		
		
		   
		   
		   
		
		
			Qua tiêu hóa lipid sẽ được biến đổi thành gì năm 2024

		
	

		
		
		   
		   
		   
		
		
			Anh không có font nền gọi là gì năm 2024

		
	

		
		
		   
		   
		   
		
		
			Lỗi page fault in nonpaged area win 7 năm 2024

		
	

		
		
		   
		   
		   
		
		
			Bài tập định luật bôi lơ ma ri ốt năm 2024

		
	

		
		
		   
		   
		   
		
		
			Quỹ đầu tư pdf là viết tắt của từ gì năm 2024

		
	

		
		
		   
		   
		   
		
		
			Top 100 gương mặt nữ đẹp nhất thế giới 2023 năm 2024

		
	

	
	
                 
	Chủ Đề
	
	
	
		  programming
		  Hỏi Đáp
		  Toplist
		  Mẹo Hay
		  Là gì
		  Địa Điểm Hay
		  Học Tốt
		  mẹo hay
		  Công Nghệ
		  Nghĩa của từ
		  Bao nhiêu
		  Khỏe Đẹp
		  đánh giá
		  Top List
		  bao nhieu
		  bao nhiêu
		  hướng dẫn
		  Xây Đựng
		  So Sánh
		  Bài tập
		  So sánh
		  Tiếng anh
		  Sản phẩm tốt
		  Ngôn ngữ
		  Bài Tập
		  javascript
		  Ở đâu
		  Thế nào
		  Hướng dẫn
		  Dịch 
		  Máy tính
		  Tại sao
		  Đại học
		  Món Ngon
		  Khoa Học