Many machine learning models are designed with the assumption that each feature takes values close to zero, or that all features vary on comparable scales. Gradient-based models in particular tend to behave best on standardized data. So before we feed data to any machine learning algorithm, the first thing we need to do is put it into the format the algorithm expects.
This tutorial uses scikit-learn's MinMaxScaler and StandardScaler to normalize and preprocess data for machine learning, bringing the feature values into a pre-defined range.
Dataset
In this tutorial, we will use the California housing dataset. We’ve reduced the number of input features to make visualization easier.
import pandas as pd
import seaborn as sb

# Load the full training set, then keep three features to make visualization easier.
housing_train_all = pd.read_csv('sample_data/california_housing_train.csv')
housing_df = housing_train_all[['total_rooms', 'population', 'median_income']]

# Plot the first two features; median_income is left out for now.
sb.kdeplot(housing_df['total_rooms'])
sb.kdeplot(housing_df['population'])
# sb.kdeplot(housing_df['median_income'])
The two features are on a relatively similar scale, as you can see on the x-axis of the kdeplot below.

Then I added the third feature, median_income, whose values are on a very different (much smaller) scale.
# This time, include median_income in the plot as well.
sb.kdeplot(housing_df['total_rooms'])
sb.kdeplot(housing_df['population'])
sb.kdeplot(housing_df['median_income'])
Now our kdeplot looks like this:

Squint hard at the monitor and you can barely make out the orange population curve of large values stretching to the right; the narrow median_income spike near zero dominates the plot. Here are the descriptive statistics for our features.

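You can generate these statistics with pandas' describe() method, which reports the count, mean, standard deviation, minimum, quartiles, and maximum for each column:
housing_df.describe()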
The median_income and total_rooms features of the California housing dataset have very different scales. This makes the data difficult to visualize and, more importantly, it can degrade the predictive performance of many machine learning algorithms. Unscaled data can also slow down, or even prevent, the convergence of many gradient-based estimators.
Scaling
Scaling means changing the range of a feature's values; the shape of the distribution doesn't change. Think of a scale model of a building: it has the same proportions as the original, just smaller. (Here the target range is 0 to 1.)
MinMaxScaler subtracts the minimum value of the feature and then divides by the range, i.e. the difference between the original maximum and the original minimum.
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the [0, 1] range, then plot the scaled distributions.
housing_df_min_max_scale = pd.DataFrame(MinMaxScaler().fit_transform(housing_df))
sb.kdeplot(housing_df_min_max_scale[0])
sb.kdeplot(housing_df_min_max_scale[1])
sb.kdeplot(housing_df_min_max_scale[2])
Note that MinMaxScaler doesn't reduce the importance of outliers; it simply rescales them along with everything else.
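To see why, here is a minimal sketch with made-up numbers: a single extreme value still maps to 1.0 and squeezes the remaining values into a narrow band near 0.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: four typical values plus one extreme outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
print(MinMaxScaler().fit_transform(values).ravel())
# ~[0, 0.001, 0.002, 0.003, 1.0] -- the outlier keeps its full weight and compresses the rest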

Notice how the features are now all on the same relative scale, and the relative spacing between each feature's values has been maintained. MinMaxScaler rescales the dataset so that every feature value falls in the range [0, 1], as shown in the plot above.
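As a quick sanity check, here is a minimal sketch that reproduces the scaled total_rooms column by hand and compares it with column 0 of the MinMaxScaler output computed above:
import numpy as np

# Manual min-max scaling: subtract the minimum, then divide by the range.
rooms = housing_df['total_rooms']
rooms_manual = (rooms - rooms.min()) / (rooms.max() - rooms.min())

# Column 0 of the scaled frame is total_rooms, so the two results should match.
print(np.allclose(rooms_manual.to_numpy(), housing_df_min_max_scale[0].to_numpy()))  # True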
Standardization
Standardizing generally means changing the values so that the distribution is centered around 0, with a standard deviation of 1. It does not make the data normally distributed; the shape of each feature's distribution is left unchanged.
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance, then plot the results.
housing_df_standard_scale = pd.DataFrame(StandardScaler().fit_transform(housing_df))
sb.kdeplot(housing_df_standard_scale[0])
sb.kdeplot(housing_df_standard_scale[1])
sb.kdeplot(housing_df_standard_scale[2])
StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, where unit variance means dividing all the values by the standard deviation.
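The same result can be reproduced by hand; this is a minimal sketch comparing a manual z-score for total_rooms with column 0 of the StandardScaler output above. Note that StandardScaler uses the population standard deviation (ddof=0), while pandas defaults to ddof=1.
import numpy as np

# Manual standardization: subtract the mean, then divide by the population standard deviation.
rooms = housing_df['total_rooms']
rooms_manual = (rooms - rooms.mean()) / rooms.std(ddof=0)

# Column 0 of the standardized frame is total_rooms, so the two results should match.
print(np.allclose(rooms_manual.to_numpy(), housing_df_standard_scale[0].to_numpy()))  # True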

In the plot above, you can see that all three distributions have a mean close to zero and unit variance. The values are on a similar scale, but the range is larger than after MinMaxScaler.
StandardScaler does not distort the relative distances within a feature, but outliers do influence the empirical mean and standard deviation, which shrinks the range of the remaining values. Because the outliers on each feature have different magnitudes, the spread of the transformed data can still differ from feature to feature:
StandardScaler cannot guarantee balanced feature scales in the presence of outliers.
Column Wise Scaling
If you have mixed-type columns in a pandas DataFrame and you'd like to apply sklearn's scalers to only some of them, the following pattern scales just the selected columns:
scaler.fit_transform(df[['total_rooms','population']])
The outer brackets are selector brackets, telling pandas to select columns from the DataFrame. The inner brackets indicate a list, so you are passing a list of column names to the pandas selector.
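Putting it together, here is a minimal sketch that scales just those two columns and writes the result back into the frame; the choice of MinMaxScaler and the copy() are assumptions for illustration:
from sklearn.preprocessing import MinMaxScaler

# Work on a copy so the original frame stays untouched (an assumption for this example).
df = housing_train_all.copy()
scaler = MinMaxScaler()

# fit_transform returns a NumPy array; assigning it back replaces only the selected columns.
df[['total_rooms', 'population']] = scaler.fit_transform(df[['total_rooms', 'population']])
print(df[['total_rooms', 'population']].describe())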