Detect and Remove Outliers from Pandas DataFrame

An outlier is an extremely high or extremely low value in the dataset. Let’s look at some data and see how this works. I have a list of Prices.

80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78, 75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71

Most of the numbers are in the range of 70-86, but a few, such as 160 and 180, are nowhere near the others. Those are our outliers. An outlier can be just a typing mistake, or it can reflect genuine variance in your data.

Outliers can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models, and ultimately poorer results.
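
To make the skew concrete, here is a minimal sketch using only Python's standard library. It assumes a shortened, illustrative price list (ten typical prices plus one extreme value) and compares the mean and the median with and without the extreme value.

```python
from statistics import mean, median

# Illustrative prices: ten typical values plus one extreme value (160).
prices = [80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160]
typical = [p for p in prices if p != 160]

mean_with, mean_without = mean(prices), mean(typical)
median_with, median_without = median(prices), median(typical)

print(f"mean:   {mean_with:.2f} with the extreme value, {mean_without:.2f} without")
print(f"median: {median_with} with the extreme value, {median_without} without")
```

A single extreme value moves the mean by almost eight units, while the median shifts by only half a unit. Any method that relies on means or squared errors is pulled in the same way.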

Univariate vs. Multivariate

For univariate outliers, we look at the distribution of values in a single feature space. Multivariate outliers are found in an n-dimensional space (of n features). Looking at distributions in n-dimensional spaces can be very difficult for the human brain.

A value being extremely high or low is not enough on its own; you want to make sure it satisfies a well-defined criterion. Statistics offers plenty of methods to discover outliers, but we will only discuss the Z-score and the IQR. We will also visualize outliers using a box plot.

Seaborn and SciPy provide easy-to-use functions and classes that work alongside Pandas and NumPy.

import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

# 28 observed prices; most lie between 70 and 86
price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78, 75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71])
price_df = pd.DataFrame(price, columns=['price'])

Visualize Outliers using Box Plot

A box plot graphically depicts groups of numerical data through their quartiles. Lines extending from the box (the whiskers) indicate variability outside the upper and lower quartiles. Outliers may be plotted as individual points.

sns.boxplot(x=price_df['price'])

The above plot shows three points between 100 and 180. These are outliers because they are not included in the box of observations, i.e. they are nowhere near the quartiles.

The box plot uses the IQR method to display the shape of the data and its outliers, but to get a list of the outliers we need to apply the mathematical formula ourselves and retrieve the outlier data.

Z-Score

The Z-score rescales and centers (standardizes) the data, expressing each point as the number of standard deviations it lies from the mean. Data points that end up too far from zero (the center) are treated as outliers. In most cases a threshold of 3 is used: if the Z-score is greater than 3 or less than -3, the data point is identified as an outlier.
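
The definition above can be sketched by hand with NumPy alone. This is an illustration under stated assumptions rather than a replacement for a library routine: the 28-value price array is assumed for the example, and the population standard deviation (ddof=0) is used because that is the default of both np.std and scipy.stats.zscore.

```python
import numpy as np

# Price data assumed for illustration (28 values).
price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78,
                  75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71])

# Z-score: distance from the mean, measured in standard deviations.
# np.std defaults to the population standard deviation (ddof=0).
mean = price.mean()
std = price.std()
z_manual = (price - mean) / std

# Show the scores of the first price and of the two largest prices.
print(np.round(z_manual[[0, 10, 25]], 4))
```

After standardizing, the scores have mean 0 and standard deviation 1, and the prices 160 and 180 come out at roughly 3.06 and 3.86, the only two values beyond the threshold of 3.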

We will use the zscore function defined in the SciPy library to detect the outliers.

z=np.abs(stats.zscore(price_df))
print(z)

[[0.12826865]
 [0.48742088]
 [0.16817446]
 [0.88647891]
 [0.20808026]
 [0.40760928]
 [0.24798606]
 [0.36770347]
 [0.28789187]
 [0.32779767]
 [3.0641956 ]
 [0.16817446]
 [0.12826865]
 [0.20808026]
 [0.32779767]
 [0.20808026]
 [0.11116617]
 [0.12826865]
 [0.32779767]
 [0.04845705]
 [0.56723249]
 [0.66984741]
 [0.44751508]
 [0.36770347]
 [0.32779767]
 [3.86231166]
 [0.44751508]
 [0.48742088]]

From the raw scores alone it is difficult to tell which data points are outliers. Let’s define a threshold to identify them.

print(np.where(z > 3))

(array([10, 25]), array([0, 0]))

The first array contains the row numbers and the second array the respective column numbers, which means z[10][0] and z[25][0] have a Z-score higher than 3. These rows hold the prices 160 and 180.
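
Row and column indices are not very readable, so it can help to map them back to the actual prices. The following is a minimal sketch that assumes the 28-value price data and computes the Z-score directly in Pandas (with ddof=0, matching the SciPy default) so that a boolean mask can select the flagged rows. The names z and outlier_rows are chosen for this example only.

```python
import pandas as pd

# Price data assumed for illustration (28 values).
price_df = pd.DataFrame({'price': [80, 71, 79, 61, 78, 73, 77, 74, 76, 75,
                                   160, 79, 80, 78, 75, 78, 86, 80, 75, 82,
                                   69, 100, 72, 74, 75, 180, 72, 71]})

# Population Z-score of each price (ddof=0 matches scipy.stats.zscore).
col = price_df['price']
z = (col - col.mean()) / col.std(ddof=0)

# Boolean mask: keep only the rows whose absolute Z-score exceeds 3.
outlier_rows = price_df[z.abs() > 3]
print(outlier_rows)
```

The result keeps the original row labels, so the flagged prices can be inspected next to their positions in the data.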

Remove Outliers

Now we want to remove the outliers and clean the data. This can be done with just one line of code, as we have already calculated the Z-score.

z_price=price_df[(z < 3).all(axis=1)]
price_df.shape,z_price['price'].shape

((28, 1), (26,))

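Interquartile Range (IQR)

The IQR is a measure of variability based on dividing a data set into quartiles. The three cut points are called the first, second, and third quartiles and are denoted by Q1, Q2, and Q3, respectively. The IQR itself is the distance between the first and third quartiles, IQR = Q3 - Q1.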

Q1 is the middle value in the first half.
Q2 is the median value in the set.
Q3 is the middle value in the second half.
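
As a rough illustration of these definitions, the sketch below computes the quartiles of the price data in two ways: with np.percentile, which interpolates linearly between neighboring sorted values by default, and with the "middle value of each half" rule described above. The 28-value array and the variable names are assumptions made for this example.

```python
import numpy as np

# Price data assumed for illustration (28 values).
price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78,
                  75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71])

# Quartiles by linear interpolation (NumPy's default method).
q1, q2, q3 = np.percentile(price, [25, 50, 75])
iqr = q3 - q1

# Quartiles as the median of the lower and upper half of the sorted data.
sorted_price = np.sort(price)
half = len(sorted_price) // 2
q1_halves = np.median(sorted_price[:half])
q3_halves = np.median(sorted_price[half:])

print("interpolated:", q1, q2, q3, "IQR:", iqr)
print("median of halves:", q1_halves, q3_halves)
```

On this data the two conventions disagree slightly on Q1 (73.75 versus 73.5) and agree on Q3 (80), which is worth keeping in mind when comparing results across tools.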

Q1 = price_df.quantile(0.25)
Q3 = price_df.quantile(0.75)
IQR = Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles are treated as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(lower_bound, upper_bound)

IQR is similar to the Z-score in that both describe the distribution of the data and then apply a threshold to identify outliers.

Just like with the Z-score, we can use the previously calculated IQR bounds to filter out the outliers by keeping only valid values.

IQR_price = price_df[~((price_df < lower_bound) | (price_df > upper_bound)).any(axis=1)]
IQR_price.shape   #(24, 1)
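
The two rules do not always agree. The sketch below applies both to the same data with plain NumPy and lists what each one flags. It assumes the 28-value price array, a Z-score threshold of 3, and the usual 1.5 * IQR multiplier; the variable names are illustrative.

```python
import numpy as np

# Price data assumed for illustration (28 values).
price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78,
                  75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71])

# Z-score rule: flag values more than 3 standard deviations from the mean.
z = (price - price.mean()) / price.std()
z_flagged = price[np.abs(z) > 3]

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(price, [25, 75])
iqr = q3 - q1
iqr_flagged = price[(price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)]

print("Z-score flags:", z_flagged)
print("IQR flags:    ", iqr_flagged)
```

Here the IQR rule is stricter: it also flags 61 and 100, which the Z-score rule lets through because the extreme values themselves inflate the standard deviation.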

There is no precise way to define and identify outliers in general because of the specifics of each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not.