In this short guide, I’ll show you how to create a Correlation Matrix using Pandas. I’ll also review the steps to display the matrix using Seaborn and Matplotlib.
To start, here is a template that you can apply in order to create a correlation matrix using pandas:
df.corr()
Next, I’ll show you an example with the steps to create a correlation matrix for a given dataset.
Step 1: Collect the Data
Firstly, collect the data that will be used for the correlation matrix.
For example, I collected the following data about 3 variables:
A | B | C |
45 | 38 | 10 |
37 | 31 | 15 |
42 | 26 | 17 |
35 | 28 | 21 |
39 | 33 | 12 |
Step 2: Create a DataFrame using Pandas
Next, create a DataFrame in order to capture the above dataset in Python:
import pandas as pd

data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]
        }

# Build the DataFrame, keeping the columns in A, B, C order
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
Once you run the code, you’ll get the following DataFrame:
    A   B   C
0  45  38  10
1  37  31  15
2  42  26  17
3  35  28  21
4  39  33  12
Step 3: Create a Correlation Matrix using Pandas
Now, create a correlation matrix using this template:
df.corr()
This is the complete Python code that you can use to create the correlation matrix for our example:
import pandas as pd

data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]
        }

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Pairwise correlation between all columns (Pearson by default)
corrMatrix = df.corr()
print(corrMatrix)
Run the code in Python, and you’ll get the following matrix:
          A         B         C
A  1.000000  0.518457 -0.701886
B  0.518457  1.000000 -0.860941
C -0.701886 -0.860941  1.000000
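By default, df.corr() computes the Pearson correlation coefficient for each pair of columns. As a rough sanity check, the sketch below recomputes the coefficient for columns A and B by hand with NumPy and compares it with the pandas result. The helper name pearson is hypothetical and only used for illustration; it is not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [45, 37, 42, 35, 39],
                   'B': [38, 31, 26, 28, 33],
                   'C': [10, 15, 17, 21, 12]})

def pearson(x, y):
    """Hypothetical helper: Pearson r computed from deviations about the mean."""
    dx = np.asarray(x, dtype=float) - np.mean(x)
    dy = np.asarray(y, dtype=float) - np.mean(y)
    # Sum of co-deviations divided by the product of the deviation norms
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r_ab = pearson(df['A'], df['B'])
print(round(r_ab, 6))                                # 0.518457
print(np.isclose(r_ab, df.corr().loc['A', 'B']))     # True
```

The same check works for any other pair, for example pearson(df['B'], df['C']).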
Step 4 (optional): Get a Visual Representation of the Correlation Matrix using Seaborn and Matplotlib
You can use the seaborn and matplotlib packages in order to get a visual representation of the correlation matrix.
First import the seaborn and matplotlib packages:
import seaborn as sn
import matplotlib.pyplot as plt
Then, add the following syntax at the bottom of the code:
sn.heatmap(corrMatrix, annot=True)
plt.show()
So the complete Python code would look like this:
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]
        }

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

corrMatrix = df.corr()

# annot=True writes each coefficient inside its cell
sn.heatmap(corrMatrix, annot=True)
plt.show()
Run the code, and you’ll get a heatmap of the correlation matrix, with each cell annotated with its correlation coefficient.
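When no limits are given, seaborn picks the color range from the values in the matrix, which can make heatmaps from different datasets hard to compare. Here is a minimal sketch of one possible variation, assuming you want the scale pinned to the full -1 to 1 range and the redundant upper triangle hidden. The output file name correlation_heatmap.png is just a placeholder.

```python
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

df = pd.DataFrame({'A': [45, 37, 42, 35, 39],
                   'B': [38, 31, 26, 28, 33],
                   'C': [10, 15, 17, 21, 12]})
corrMatrix = df.corr()

# True marks cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(corrMatrix.shape, dtype=bool), k=1)

ax = sn.heatmap(corrMatrix, mask=mask, annot=True, fmt=".2f",
                vmin=-1, vmax=1, center=0, cmap="coolwarm", square=True)
ax.set_title("Correlation matrix")

# Save to disk (placeholder file name); use plt.show() instead for an interactive window
plt.savefig("correlation_heatmap.png", bbox_inches="tight")
```

Since the matrix is symmetric, masking the upper triangle loses no information and leaves one cell per pair of variables.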
That’s it! You may also want to review the following source that explains the steps to create a Confusion Matrix using Python. Alternatively, you may check this guide about creating a Covariance Matrix in Python.
Surprised to see no one mentioned more capable, interactive, and easier-to-use alternatives.
A) You can use plotly:
Just two lines and you get:
interactivity,
smooth scale,
colors based on whole dataframe instead of individual columns,
column names & row indices on axes,
zooming in,
panning,
built-in one-click ability to save it as a PNG,
auto-scaling,
comparison on hovering,
hover bubbles showing values, so the heatmap still looks good and you can see values wherever you want:
import plotly.express as px
fig = px.imshow(df.corr())
fig.show()
B) You can also use Bokeh:
All the same functionality with a tad more hassle. But still worth it if you do not want to opt for plotly and still want all these things:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import BasicTicker, ColorBar, LinearColorMapper, PrintfTickFormatter
from bokeh.transform import transform

output_notebook()

colors = ['#d7191c', '#fdae61', '#ffffbf', '#a6d96a', '#1a9641']
TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"

# Long format: one row per cell; the unnamed index levels become level_0 and level_1
corr = df.corr()
data = corr.stack().rename("value").reset_index()

# One shared mapper so the cells and the color bar use the same scale
mapper = LinearColorMapper(palette=colors, low=data['value'].min(), high=data['value'].max())

# The correlation matrix is square, so both axes take their labels from corr itself
p = figure(x_range=list(corr.columns), y_range=list(corr.index), tools=TOOLS, toolbar_location='below',
           tooltips=[('Row, Column', '@level_0 x @level_1'), ('value', '@value')], height=500, width=500)

p.rect(x="level_1", y="level_0", width=1, height=1,
       source=data,
       fill_color=transform('value', mapper),
       line_color=None)

color_bar = ColorBar(color_mapper=mapper, major_label_text_font_size="7px",
                     ticker=BasicTicker(desired_num_ticks=len(colors)),
                     formatter=PrintfTickFormatter(format="%f"),
                     label_standoff=6, border_line_color=None, location=(0, 0))
p.add_layout(color_bar, 'right')

show(p)