An outlier is an extremely high or extremely low value in a dataset. Let’s look at some data to see how this works. Suppose we have a list of prices.
All the numbers are in the range 70-86 except the number 4. That is our outlier, because it is nowhere near the other numbers. It may simply be a typing mistake, or it may reflect genuine variance in your data.
Outliers can skew and mislead the training process of a machine learning model, resulting in longer training times, less accurate models, and poorer results.
Univariate vs. Multivariate
For univariate outliers, we look at the distribution of values in a single feature space. Multivariate outliers are found in an n-dimensional space (of n features). Distributions in n-dimensional spaces can be very difficult for the human brain to picture.
A value being extremely high or low is not enough on its own; you also want to make sure that it satisfies a defined criterion. Statistics offers plenty of methods to discover outliers, but we will only discuss the Z-score and the IQR. We will also visualize outliers using a box plot.
Seaborn and SciPy, together with pandas and NumPy, provide easy-to-use functions and classes for implementing these methods.
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78, 75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71, 120])
price_df = pd.DataFrame(price)
price_df.columns = ['price']
Visualize Outliers using Box Plot
A box plot graphically depicts groups of numerical data through their quartiles. Lines extending from the box (the whiskers) indicate variability outside the upper and lower quartiles. Outliers may be plotted as individual points.
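A box plot of the price column could be drawn roughly as follows. This is a minimal sketch that assumes Seaborn's boxplot function is applied to the price column defined earlier; the title text is an illustrative choice, not something taken from the original plot.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Same prices as in the setup code above.
price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78,
                  75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71, 120])
price_df = pd.DataFrame({"price": price})

# Draw a horizontal box plot; points beyond the whiskers are drawn individually.
ax = sns.boxplot(x=price_df["price"])
ax.set_title("Price distribution")  # illustrative title (assumption)
```

In a notebook the figure is displayed automatically; in a plain script you would typically call matplotlib's plt.show() afterwards.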
The plot above shows three points between 100 and 180. These are outliers because they are not included in the box of observations, i.e. they are nowhere near the quartiles.
A box plot uses the IQR method to display the shape of the data and its outliers, but to get a list of the outliers we need to apply the mathematical formula and retrieve the outlier data ourselves.
The Z-score rescales and centers (standardizes) the data, so that each value is expressed as its distance from the mean in units of standard deviation. Data points that are too far from zero (the center) are treated as outliers. In most cases a threshold of 3 is used: if a Z-score is greater than 3 or less than -3, that data point is identified as an outlier.
We will use the zscore function from SciPy's stats module to detect the outliers.
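The calculation step might look like the sketch below. It assumes the absolute value of the scores is stored in a variable named z, as the later code expects, and it compares SciPy's result with the by-hand formula (x - mean) / std. The exact scores depend on the contents of the price array, so rerun the code rather than relying on printed figures.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same prices as in the setup code above.
price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78,
                  75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71, 120])
price_df = pd.DataFrame({"price": price})

# Absolute Z-scores, one row per price. By default SciPy uses the
# population standard deviation (ddof=0).
z = np.abs(stats.zscore(price_df))
print(z)

# The same quantity by hand; NumPy's std() also defaults to ddof=0.
manual = np.abs((price - price.mean()) / price.std())
matches = np.allclose(np.asarray(z).ravel(), manual)
```

Taking the absolute value means a single comparison against 3 catches both unusually high and unusually low prices.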
[[0.12826865]
 [0.48742088]
 [0.16817446]
 [0.88647891]
 [0.20808026]
 [0.40760928]
 [0.24798606]
 [0.36770347]
 [0.28789187]
 [0.32779767]
 [3.0641956 ]
 [0.16817446]
 [0.12826865]
 [0.20808026]
 [0.32779767]
 [0.20808026]
 [0.11116617]
 [0.12826865]
 [0.32779767]
 [0.04845705]
 [0.56723249]
 [0.66984741]
 [0.44751508]
 [0.36770347]
 [0.32779767]
 [3.86231166]
 [0.44751508]
 [0.48742088]]
From the raw scores alone it is difficult to say which data points are outliers. Let’s define a threshold to identify them.
print(np.where(z > 3))
(array([10, 25]), array([0, 0]))
The first array contains the row numbers and the second array the corresponding column numbers of the entries in z whose Z-score is higher than 3.
Now we want to remove the outliers and clean the data. This can be done with just one line of code, as we have already calculated the Z-scores.
z_price = price_df[(z < 3).all(axis=1)]
price_df.shape, z_price['price'].shape
((29, 1), (27,))
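The interquartile range (IQR) is a measure of variability based on dividing a data set into quartiles. The first, second, and third quartiles are denoted by Q1, Q2, and Q3, respectively. The IQR is the difference Q3 - Q1, and a common rule treats values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers.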
- Q1 is the middle value in the first half.
- Q2 is the median value in the set.
- Q3 is the middle value in the second half.
Q1 = price_df.quantile(0.25)
Q3 = price_df.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(lower_bound, upper_bound)
Like the Z-score method, the IQR method characterizes the spread of the data and then applies a threshold to identify outliers.
Just as with the Z-score, we can use the previously calculated bounds to filter out the outliers, keeping only the valid values.
# Keep only the rows in which no value falls outside the bounds.
IQR_price = price_df[~((price_df < lower_bound) | (price_df > upper_bound)).any(axis=1)]
IQR_price.shape #(24, 1)
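To list the outliers themselves rather than drop them, the same bounds can be used as a selection mask. This is a small sketch under the same 1.5 * IQR rule; the names is_outlier and outliers are illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

# Same prices as in the setup code above.
price = np.array([80, 71, 79, 61, 78, 73, 77, 74, 76, 75, 160, 79, 80, 78,
                  75, 78, 86, 80, 75, 82, 69, 100, 72, 74, 75, 180, 72, 71, 120])
price_df = pd.DataFrame({"price": price})

# Quartiles and the 1.5 * IQR fences for the single price column.
q1 = price_df["price"].quantile(0.25)
q3 = price_df["price"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask that is True for rows lying outside the fences.
is_outlier = (price_df["price"] < lower_bound) | (price_df["price"] > upper_bound)
outliers = price_df[is_outlier]
print(outliers)
```

Working on the price column directly gives scalar bounds, which keeps the comparison simple when there is only one feature.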
There is no precise way to define and identify outliers in general, because of the specifics of each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not.