In machine learning, many models perform poorly on unnormalized data because the ranges of the raw features vary widely. If you don’t normalize the data, variables measured on a larger scale can dominate the model and degrade its performance, so normalization is often an essential step. Min-max scaling normalizes the range of the independent variables (features). In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
In this guide, we’ll use a simple Height Weight data set from Kaggle. It contains only the heights in inches and weights in pounds of 25,000 different 18-year-old humans. This dataset can be used to build a model that predicts a person’s height or weight.
Let’s begin by looking at the summary of the variables, using the describe() function.
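As a minimal sketch of this step, the snippet below builds a small stand-in DataFrame and calls describe() on it. The five rows are made-up values for illustration, not records from the Kaggle file; only the column names Height(Inches) and Weight(Pounds) follow the ones used in this guide.

```python
import pandas as pd

# Hypothetical stand-in rows; the real Kaggle file has 25,000 records.
hw_df = pd.DataFrame({
    "Height(Inches)": [65.8, 71.5, 69.4, 68.2, 67.8],
    "Weight(Pounds)": [113.0, 136.5, 153.0, 142.3, 144.3],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = hw_df.describe()
print(summary)
```

Comparing the min and max rows of the two columns is a quick way to see how far apart the feature scales are.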
The output above confirms that the numerical variables have different units and scales: ‘Height’ is in inches and ‘Weight’ is in pounds. These differences can unduly influence the model; therefore, the range of all features should be normalized so that each feature contributes approximately proportionately.
Rescaling, also known as min-max scaling or min-max normalization, is the simplest method and consists of rescaling the range of the features to [0, 1] or [−1, 1]. Because the data are squeezed into a bounded range, the standard deviations become smaller. Keep in mind, however, that the minimum and maximum themselves are sensitive to outliers, so a single extreme value can compress the rest of the data. Selecting the target range depends on the nature of the data. The general formula for min-max scaling to [0, 1] is given as:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where x is an original value and x' is the normalized value. For example, suppose the weights span [140 pounds, 180 pounds]. To rescale this data, we first subtract 140 from each weight and then divide the result by 40 (the difference between the maximum and minimum weights).
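The arithmetic of the weight example can be checked with a few lines of plain Python. This is only a sketch: the function name min_max_scale and the four sample weights are hypothetical, chosen so that the minimum is 140 and the maximum is 180.

```python
def min_max_scale(values):
    """Rescale numbers to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max - min), so constant data cannot be scaled.
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical weights in pounds spanning [140, 180].
weights = [140, 150, 160, 180]
scaled = min_max_scale(weights)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest weight maps to 0, the largest to 1, and every other value lands proportionally in between.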
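To rescale to an arbitrary target range [a, b], the formula becomes:
$$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$$
where a and b are the lower and upper bounds of the target range.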
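A short sketch of rescaling to an arbitrary range [a, b] follows. The function name rescale_to_range and the sample weights are hypothetical; the code simply stretches the [0, 1] result by (b − a) and shifts it by a.

```python
def rescale_to_range(values, a, b):
    """Rescale numbers to [a, b]: x' = a + (x - min) * (b - a) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant data has max - min == 0, so the formula is undefined.
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


# Hypothetical weights in pounds, rescaled to [-1, 1].
weights = [140, 150, 160, 180]
rescaled = rescale_to_range(weights, -1, 1)
print(rescaled)  # [-1.0, -0.5, 0.0, 1.0]
```

Setting a = 0 and b = 1 reduces this to the [0, 1] formula.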
In this guide, you will learn how to perform min-max normalization using the minmax_scale function from scikit-learn (sklearn.preprocessing). It transforms features by scaling each feature into a given range.
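    from sklearn.preprocessing import minmax_scale
    import pandas as pd

    # hw_df is assumed to hold the Height/Weight data as a pandas DataFrame.
    # Scale both columns to [0, 1]; the result is an array with one column per feature.
    hw_scaled = minmax_scale(hw_df[['Height(Inches)', 'Weight(Pounds)']], feature_range=(0, 1))
    hw_df['Height(Norm)'] = hw_scaled[:, 0]
    hw_df['Weight(Norm)'] = hw_scaled[:, 1]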
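The minmax_scale function scales and translates each feature individually so that it falls in the given range, e.g. between zero and one. The related MinMaxScaler estimator applies the same transformation but learns each feature’s minimum and maximum from the training set, so the same scaling can be reused on new data.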
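When the scaling has to be learned on a training set and then reused on unseen data, the estimator form is usually more convenient than the function. The sketch below uses scikit-learn’s MinMaxScaler; the tiny train and test arrays are hypothetical [height, weight] rows, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical [height, weight] rows, made up for illustration.
train = np.array([[60.0, 100.0], [65.0, 120.0], [70.0, 140.0]])
test = np.array([[62.5, 150.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# fit_transform learns the per-column min and max from the training rows
# and scales those rows in one step.
train_scaled = scaler.fit_transform(train)

# transform reuses the training min and max; it does not refit on the test rows.
test_scaled = scaler.transform(test)

print(train_scaled)
print(test_scaled)
```

Because the test weight of 150 is above the training maximum of 140, its scaled value is greater than 1. This is expected with the default settings; MinMaxScaler accepts clip=True if values outside the range should be clipped.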
The output above shows that all the values have been scaled to between 0 and 1.
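The distributions don’t look much different from the original distributions seen in the histogram plots, because min-max scaling is a linear transformation: it changes the scale of the data but not the shape of its distribution.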