Correlation measures the dependence between two variables: how they move together and how strongly they are related. A positive correlation, for instance, means that an increase in one variable tends to come with an increase in the other. It helps you get a deeper understanding of your data.
For example, sales might increase in the festival season, or a customer’s purchase on an e-commerce website might depend on a number of factors.
For illustration, I’ll use the Auto-mpg dataset, which contains the miles-per-gallon (MPG) performance of various cars.
import numpy as np
import pandas as pd

auto_df = pd.read_csv('auto-mpg.csv')
Correlation Coefficient between two Columns
Correlation is expressed in the form of a correlation coefficient. For example, the following data shows the number of cylinders and the displacement of cars.
Using the correlation coefficient you can find out how these two variables are related and to what degree. Please note that this is only a part of the whole dataset.
To calculate the correlation coefficient, select the columns and then apply the .corr() method, for example auto_df[['cylinders', 'displacement']].corr(). The method computes the correlation pairwise, so it also works for more than two columns.
              cylinders  displacement
cylinders      1.000000      0.950721
displacement   0.950721      1.000000
In this way, we found that the correlation coefficient between ‘Cylinders’ and ‘Displacement’ is 0.95. The exact value can change depending on which rows of the dataset are included.
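As a minimal sketch of the column selection and .corr() call described above, the snippet below builds a small stand-in table. The numbers in it are made up for illustration and are not rows from the Auto-mpg file; only the two column names follow the output shown earlier.

```python
import pandas as pd

# Made-up stand-in values, NOT rows from auto-mpg.csv
sample_df = pd.DataFrame({
    "cylinders": [4, 4, 6, 6, 8, 8],
    "displacement": [97.0, 120.0, 200.0, 232.0, 318.0, 350.0],
})

# Select the two columns, then compute their pairwise (Pearson) correlation matrix
pair_corr = sample_df[["cylinders", "displacement"]].corr()
print(pair_corr)

# The same coefficient as a single number, via Series.corr
r = sample_df["cylinders"].corr(sample_df["displacement"])
print(round(r, 3))
```

The matrix form is handy when you want every pair at once. Series.corr is shorter when you only care about one pair.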
Positive Correlation Coefficient
The correlation coefficient is measured on a scale from -1 to +1. A positive coefficient means that the two variables tend to increase together, and a value of exactly +1 means a perfect positive relationship.
Here both features move together in the same direction. An increase in one is accompanied by an increase in the other.
Negative Correlation Coefficient
A correlation coefficient of -1 represents a perfect negative correlation. More generally, a negative coefficient means that when one variable increases, the other tends to decrease, and vice versa.
A value of 0 means that there is no linear relationship between the two variables. They can still be related in a non-linear way.
In our example, we got a positive number for the correlation coefficient, which indicates that an increase in the number of cylinders is associated with an increase in displacement.
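The reference values +1, -1, and 0 can be checked on small hand-made series. This is only an illustrative sketch: the series names and numbers below are invented for the purpose and have nothing to do with the Auto-mpg data.

```python
import pandas as pd

x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])

same_direction = 3 * x + 1        # rises whenever x rises
opposite_direction = -2 * x + 5   # falls whenever x rises
u_shape = x ** 2                  # fully determined by x, but not linearly

r_pos = x.corr(same_direction)       # perfect positive correlation
r_neg = x.corr(opposite_direction)   # perfect negative correlation
r_zero = x.corr(u_shape)             # no linear relationship

print(r_pos, r_neg, r_zero)
```

The last case is worth remembering: the U-shaped series depends entirely on x, yet its Pearson coefficient is zero, because the coefficient only captures linear association.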
Making a correlation matrix is a great way to summarize all the data. With it, you can pick the best features and use them for further processing of your data.
Pandas’ DataFrame class has a corr() method that can compute three different correlation coefficients, selected through its method parameter: Pearson correlation (the default), Kendall Tau correlation, and Spearman correlation. The coefficients calculated with these methods all range from -1 to +1.
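To show how the method parameter changes the result, here is a small sketch on an invented two-column table in which y grows with x along a curve. The table and its column names are assumptions made for this example only.

```python
import pandas as pd

# Invented data: y always increases with x, but along a curve (y = x cubed)
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association, so the curve pulls it below 1
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship gives 1
spearman = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson, 3), round(spearman, 3))
```

Kendall Tau is selected the same way with method="kendall". As far as I know, pandas relies on SciPy for that one, so it is left out here to keep the sketch dependent on pandas only.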
Below is a correlation matrix to find out which factors have the most effect on MPG. All the variables involved are placed along both the column header and the row header of the table. The correlation coefficient for each pair of variables is calculated and placed at their intersection.
This matrix tells a lot about the relationships between the variables involved. You will find a correlation of 1.0 along the diagonal of the matrix, because each variable is perfectly correlated with itself. You can also see that the correlation between “mpg” and “weight” is -0.8. This means that as a car’s weight increases, its mpg tends to decrease.
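One way to read such a matrix with a target in mind is to take the target's column and rank the other variables by the absolute value of their coefficient. The sketch below does this on a made-up table. The values are invented for illustration, not taken from the Auto-mpg file, and the column names are assumed to resemble it.

```python
import pandas as pd

# Made-up stand-in values, NOT rows from auto-mpg.csv
cars = pd.DataFrame({
    "mpg": [30.0, 27.0, 22.0, 18.0, 15.0, 13.0],
    "weight": [2100, 2300, 2900, 3400, 4000, 4400],
    "cylinders": [4, 4, 6, 6, 8, 8],
    "acceleration": [16.5, 17.0, 15.0, 14.5, 12.0, 11.5],
})

# Full correlation matrix, then only the column for the target variable
corr_with_mpg = cars.corr()["mpg"].drop("mpg")  # drop the trivial self-correlation

# Order by strength of the relationship, ignoring the sign
order = corr_with_mpg.abs().sort_values(ascending=False).index
ranked = corr_with_mpg.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as weight against mpg, near the top instead of at the bottom.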
A good way to quickly check correlations among columns is by visualizing the correlation matrix as a heatmap.
A visualization is generally easier to read than tabular data, and heatmaps are typically used to visualize correlation matrices. A simple way to plot a heatmap in Python is with the Seaborn library.
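import seaborn as sns

# numeric_only=True (pandas 1.5+) skips text columns, such as a car-name column if the CSV has one
sns.heatmap(auto_df.corr(numeric_only=True), annot=True, fmt='.2g', cmap='coolwarm')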
With the coolwarm colormap, dark red marks the strongest positive correlations and dark blue the strongest negative ones. The stronger the color, the larger the correlation magnitude.
Remove Correlated Features
Correlated features, in general, don’t improve models, although they affect specific models in different ways and to varying extents. Highly correlated features carry largely the same information, so it is logical to remove one of each such pair.
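corrMatrix = auto_df.corr(numeric_only=True).abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once
upperMatrix = corrMatrix.where(np.triu(np.ones(corrMatrix.shape), k=1).astype(bool))
# Find feature columns with a correlation greater than 0.90
corrFutures = [column for column in upperMatrix.columns if any(upperMatrix[column] > 0.90)]
# drop() returns a new DataFrame, so assign the result
auto_df = auto_df.drop(columns=corrFutures)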
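To see what the upper-triangle mask does in practice, the following sketch wraps the same steps in a helper and runs it on a tiny invented table. The function name drop_highly_correlated, the toy columns, and their values are all hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.90):
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr = df.corr().abs()
    # True strictly above the diagonal, so each pair appears once and
    # a column is never compared with itself
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Invented data: "b" is almost exactly 2 * "a"; "c" is only weakly related to both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],
    "c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

reduced, dropped = drop_highly_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is removed. Which of the two to keep is a design choice, and column order decides it here.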