Some regression problems involve a large number of features, which can slow model development and training and require a large amount of system memory. Reducing the number of features lowers the computational cost and can improve the performance of the model.

You can evaluate the relationship between each feature and the target using a correlation measure, and then select the features that have the strongest relationship with the target variable.

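Feature selection techniques can be supervised or unsupervised. The difference is whether features are selected based on the target variable. Unsupervised feature selection techniques ignore the target variable; an example is a method that removes redundant variables using correlation. Supervised techniques use the target variable, for example by keeping the features most strongly correlated with it.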

The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficient (often abbreviated as Pearson’s r), which measures the linear dependence between pairs of features. The correlation coefficients are in the range –1 to 1. Two features have a perfect positive correlation if r = 1, no correlation if r = 0, and a perfect negative correlation if r = –1.
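To make the three reference values concrete, here is a minimal sketch that computes Pearson's r directly from its definition (the covariance of two variables divided by the product of their standard deviations). The helper name pearson_r and the toy arrays are illustrative assumptions, not part of the tutorial's dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Toy values chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_pos = pearson_r(x, 2 * x + 1)       # perfectly increasing linear relationship
r_neg = pearson_r(x, -3 * x + 7)      # perfectly decreasing linear relationship
r_zero = pearson_r(x, (x - 3) ** 2)   # symmetric parabola: dependent, but no linear trend

print(r_pos, r_neg, r_zero)
```

The last case is a reminder that r = 0 only rules out a linear relationship; the two variables can still be strongly related in a nonlinear way.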

It is common to use correlation-type statistical measures between input and output variables as the basis for filter feature selection.
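As a rough sketch of what such a filter could look like, the snippet below ranks features by the absolute value of their Pearson correlation with the target and keeps the top k. The function name top_k_by_correlation, the column names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def top_k_by_correlation(df, target, k):
    """Return the k feature names with the largest absolute Pearson correlation to `target`."""
    correlations = df.corr()[target].drop(target)
    ranked = correlations.abs().sort_values(ascending=False)
    return ranked.head(k).index.tolist()

# Synthetic data: the target depends strongly on one feature, moderately on
# another, and not at all on the third.
rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
moderate = rng.normal(size=n)
noise = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": strong,
    "moderate": moderate,
    "noise": noise,
    "target": 3.0 * strong - 1.5 * moderate + rng.normal(scale=0.5, size=n),
})

selected = top_k_by_correlation(toy, target="target", k=2)
print(selected)
```

Ranking by absolute value matters here: a strongly negative correlation is just as informative for a linear model as a strongly positive one.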

In the following code example, we will call pandas' corr() method on the 10 feature columns, and then use seaborn's heatmap() function to plot the resulting correlation matrix as a heat map.

For this tutorial, we will use the California housing prices dataset. The dataset poses a regression task in which you need to predict house prices based on 10 features. There are a total of 20,640 observations in the dataset. Your first task is to load the dataset so that you can proceed.

import pandas as pd
import seaborn as sns
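The loading step itself might look like the sketch below. The file name housing.csv is a placeholder assumption; adjust the path to wherever your copy of the dataset is stored.

```python
import pandas as pd

def load_housing(path="housing.csv"):
    """Read the housing CSV into a DataFrame. The default path is a placeholder."""
    return pd.read_csv(path)

# Usage, assuming the file exists locally:
# df = load_housing()
# df.head()
```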

[Figure: Housing prices dataset]

Statistical measures are highly dependent upon the variable data types. Common data types include numerical and categorical, although each may be further subdivided, such as integer and floating-point for numerical variables, and boolean, ordinal, or nominal for categorical variables. To compute a correlation, we first need to convert our categorical feature, ocean_proximity, to a numerical one.

# Convert the categorical column to integer codes so that corr() can include it
df['ocean_proximity'] = df['ocean_proximity'].astype('category').cat.codes
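Two common ways of turning a nominal column into numbers are sketched below on a small made-up frame; the four category values are toy inputs, not a sample of the real data. Integer codes are compact but impose an arbitrary order on the categories, so a correlation computed on them should be read with care. One-hot columns avoid that implied order at the cost of extra columns.

```python
import pandas as pd

# Toy column for illustration only
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

# Option 1: integer codes. Categories are sorted alphabetically, so the order is arbitrary.
as_category = toy["ocean_proximity"].astype("category")
codes = as_category.cat.codes
mapping = dict(enumerate(as_category.cat.categories))

# Option 2: one-hot columns, which do not imply any order between categories.
one_hot = pd.get_dummies(toy["ocean_proximity"], prefix="ocean", dtype=int)

print(mapping)
print(codes.tolist())
print(one_hot)
```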


Next, we will create a correlation matrix to quantify and summarize linear relationships between variables. A correlation matrix is closely related to the covariance matrix. We can interpret the correlation matrix as being a rescaled version of the covariance matrix. In fact, the correlation matrix is identical to a covariance matrix computed from standardized features.
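The last claim can be checked numerically. The sketch below uses synthetic columns (an assumption made purely for illustration), standardizes them, and compares the covariance matrix of the standardized data with the correlation matrix of the raw data.

```python
import numpy as np
import pandas as pd

# Synthetic columns on very different scales
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "a": rng.normal(loc=10.0, scale=5.0, size=200),
    "b": rng.uniform(0.0, 100.0, size=200),
})
toy["c"] = 2.0 * toy["a"] + rng.normal(size=200)

# Standardize each column: subtract its mean, divide by its sample standard deviation.
standardized = (toy - toy.mean()) / toy.std()

corr_matrix = toy.corr()
cov_of_standardized = standardized.cov()
matrices_match = np.allclose(corr_matrix.values, cov_of_standardized.values)

print(matrices_match)
```

Both std() and cov() in pandas use the sample (n - 1) normalization by default, which is why the two matrices agree up to floating-point error.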

[Figure: pandas correlation matrix]
The correlation matrix helps us identify patterns in the data by showing which features tend to increase or decrease together.

We can see that longitude and latitude have a negative correlation with median_house_value, whereas median_income has a positive correlation (approximately 0.68). Note that a value of 1 describes a perfect positive correlation, whereas a value of –1 corresponds to a perfect negative correlation.

Correlation matrix heatmap

A heat map of the correlation matrix provides another useful summary graphic that can help us select features based on their respective linear correlations:

sns.set(rc={'figure.figsize': (16, 8)})
sns.heatmap(df.corr(), annot=True, fmt='.2g', cmap='coolwarm')
[Figure: correlation heatmap]

To fit a linear regression model, we are interested in those features that have a high correlation with our target variable, median_house_value. Looking at the previous correlation matrix, we can see that median_house_value shows the largest correlation with median_income (0.68), which seems to be a good choice for an explanatory variable to introduce the concepts of a simple linear regression model.

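If features are highly correlated with each other, the matrices involved in fitting a linear model are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can remove one feature column from each highly correlated pair. When two features are almost perfectly correlated, removing one of them loses very little information, because the remaining feature carries nearly the same signal.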
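One possible way to automate this is sketched below. The function name drop_highly_correlated, the 0.9 threshold, and the synthetic columns are assumptions for illustration; in practice you would pass only the feature columns, not the target.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic features: "bedrooms" is almost a rescaled copy of "rooms".
rng = np.random.default_rng(1)
rooms = rng.normal(size=300)
features = pd.DataFrame({
    "rooms": rooms,
    "bedrooms": 0.2 * rooms + rng.normal(scale=0.01, size=300),
    "income": rng.normal(size=300),
})

reduced, dropped = drop_highly_correlated(features, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

Which column of a pair is kept depends only on column order in this sketch, so it is worth reordering the columns if one of the two is easier to interpret or cheaper to collect.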
