Many machine learning models are designed with the assumption that each feature values close to zero or all features vary on comparable scales. The gradient-based model assumes standardized data. Before we code any Machine Learning algorithm, the first thing we need to do is to put our data in a format that the algorithm will want.
In this tutorial, we will use the California housing dataset. We’ve reduced the number of input features to make visualization easier.
import seaborn as sb sb.kdeplot(housing_df['total_rooms']) sb.kdeplot(housing_train_all['population']) # sb.kdeplot(housing_df['median_income'])
The values are relatively similar scale, as can be seen on the X-axis of the
Then I added a third distribution with much larger values.
sb.kdeplot(housing_df['total_rooms']) sb.kdeplot(housing_train_all['population']) sb.kdeplot(housing_df['median_income'])
kdeplot looks like this:
Squint hard at the monitor and you might notice the tiny Orange bar of big values to the right. Here are the descriptive statistics for our features.
median income and
Total room of the California housing dataset have very different scales. These characteristics lead to difficulties to visualize the data and, more importantly, they can degrade the predictive performance of machine learning algorithms. Unscaled data can also slow down or even prevent the convergence of many gradient-based estimators.
Scale means changing the range of the feature‘s values. The shape of the distribution doesn’t change. Think about the scale model of a building that has the same proportions as the original, just smaller(The scale range is set at 0 to 1)
MinMaxScaler subtracts the minimum value in the feature and then divides it by the range(the difference between the original maximum and original minimum).
housing_df_min_max_scale= pd.DataFrame(MinMaxScaler().fit_transform(housing_df)) sb.kdeplot(housing_df_min_max_scale) sb.kdeplot(housing_df_min_max_scale) sb.kdeplot(housing_df_min_max_scale)
MinMaxScaler doesn’t reduce the importance of outliers.
Notice how the features are all on the same relative scale. The relative spaces between each feature’s values have been maintained. It rescales the data set such that all feature values are in the range
[0, 1] as shown in the above plot.
Standardize generally means changing the values so that the distribution is centered around 0, with a standard deviation of 1. It outputs something very close to a normal distribution.
housing_df_standard_scale=pd.DataFrame(StandardScaler().fit_transform(housing_df)) sb.kdeplot(housing_df_standard_scale) sb.kdeplot(housing_df_standard_scale) sb.kdeplot(housing_df_standard_scale)
StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.
In the plot above, you can see that all four distributions have a mean close to zero and unit variance. The values are on a similar scale, but the range is larger than after MinMaxScaler.
StandardScaler does distort the relative distances between the feature values. The outliers have an influence when computing the empirical mean and standard deviation which shrinks the range of the feature values. The outliers on each feature have different magnitudes, and the spread of the transformed data on each feature is very different:
StandardScaler cannot guarantee balanced feature scales in the presence of outliers.
Column Wise Scaling
If you have mixed-type columns in a pandas’ data frame and you’d like to apply sklearn’s scaler to some of the columns. The following code works for selected column scaling:
The outer brackets are selector brackets, telling pandas to select a column from the DataFrame. The inner brackets indicate a list. You’re passing a list to the pandas’ selector.