When we train classifiers, standardization often improves results, and many classifiers expect their input features to be standardized. In this tutorial, we will standardize data by hand and then reproduce the result with Scikit-learn’s StandardScaler.

Dataset

In this tutorial, we will use the Heights and Weights Dataset from Kaggle. It is a simple dataset to start with: it contains only the heights (inches) and weights (pounds) of 25,000 18-year-old humans. It can be used to build a model that predicts a person’s weight from their height, or vice versa.

Weight depends strongly on height: a taller person is likely to weigh more than a shorter person. We can treat each (height, weight) pair as a two-dimensional coordinate in a measurements matrix. Let’s plot these coordinates.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

data = pd.read_csv("/content/SOCR-HeightWeight.csv")

data = data.drop("Index", axis=1)
data.columns #Index(['Height(Inches)', 'Weight(Pounds)'], dtype='object')

ax2 = data.plot.scatter(x='Height(Inches)',y='Weight(Pounds)')
[Figure: scatter plot of Height(Inches) vs Weight(Pounds)]

The linear relationship between height and weight is visible in the plot. Also, as expected, the height and weight axes are scaled differently: the people in the dataset are on average about 68 inches tall. We can eliminate these scale differences through standardization.

In the next section, we discuss why large X values impede performance. We’ll limit that impediment through a process called standardization, in which X is adjusted to equal (X - X.mean(axis=0)) / X.std(axis=0)
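As a quick numeric sketch of that formula (using a few made-up heights and weights, not the Kaggle data), subtracting each column mean and dividing by each column standard deviation leaves every column with mean 0 and standard deviation 1:

```python
import numpy as np

# Hypothetical measurements matrix: heights (inches) and weights (pounds)
X = np.array([[64.0, 120.0],
              [68.0, 150.0],
              [72.0, 180.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # columns now have mean 0
print(X_std.std(axis=0))   # and standard deviation 1
```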

Improving Performance through Standardization 

Perceptron training is impeded by large feature values in X because of the discrepancy between the coefficient shifts and the bias shift: each coefficient shift is proportional to its associated feature value, while the bias shift is not.

Furthermore, these feature values can be quite large. For instance, the average height is greater than 67 inches, so the inches_coef shift is more than 60-fold larger than the bias shift; we can’t tweak the bias a little without tweaking the coefficient a lot. Thus, by tuning the bias, we are liable to shift inches_coef significantly toward a less-than-optimal value.

However, we can lower these shifts by reducing the column means of matrix X. We also need to lower the dispersion in the matrix; otherwise, unusually large measurements could still cause overly large coefficient shifts. We therefore need to decrease both the column means and the column standard deviations.
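To make the shift discrepancy concrete, here is a minimal sketch of a single perceptron-style update (the learning rate and error values are hypothetical), comparing the coefficient shift against the bias shift for a raw height of 68 inches:

```python
lr = 0.1       # hypothetical learning rate
error = 1.0    # hypothetical prediction error for one sample
height = 68.0  # raw feature value in inches

coef_shift = lr * error * height  # coefficient shift scales with the feature value
bias_shift = lr * error * 1.0     # bias shift does not

print(coef_shift / bias_shift)  # the coefficient moves 68x as far as the bias
```

After standardization, the feature value would be on the order of 1, so both shifts would have comparable magnitudes.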

To start, let’s print the current values of X.mean(axis=0) and X.std(axis=0).

means_x = data["Height(Inches)"].mean(axis=0)
stds_x = data["Height(Inches)"].std(axis=0)

print(f"Mean values: {np.round(means_x, 2)}") #Mean values: 67.99
print(f"STD values: {np.round(stds_x, 2)}") #STD values: 1.9

The feature means and standard deviations are relatively high. How do we make them smaller? Shifting a dataset’s mean to zero is trivial: we simply subtract the column means from X. Adjusting the standard deviations is less straightforward, but it can be shown that (X - means) / stds returns a matrix whose column standard deviations all equal 1.0.

def standardize(X, means, stds):
    return (X - means) / stds

means_x = data["Height(Inches)"].mean(axis=0)
stds_x = data["Height(Inches)"].std(axis=0)

X = standardize(data["Height(Inches)"], means_x, stds_x)

means_y = data["Weight(Pounds)"].mean(axis=0)
stds_y = data["Weight(Pounds)"].std(axis=0)

Y = standardize(data["Weight(Pounds)"], means_y, stds_y)
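A quick sanity check on synthetic heights (standing in for the Kaggle column) confirms that a standardized column really does end up with mean 0 and standard deviation 1:

```python
import pandas as pd

# Synthetic heights (inches), not the Kaggle data
heights = pd.Series([64.0, 66.0, 68.0, 70.0, 72.0])

standardized = (heights - heights.mean()) / heights.std()

print(round(standardized.mean(), 2))  # 0.0
print(round(standardized.std(), 2))   # 1.0
```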

Standardization is similar to normalization: both techniques shrink the values in the input data and eliminate unit differences (such as inches versus centimeters). Scikit-learn provides a standardization class called StandardScaler. Here, we import and initialize that class.

from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler().set_output(transform="pandas")

data_scale = standard_scaler.fit_transform(data)
data_scale.head(5)
[Figure: first five rows of the standardized DataFrame]

Running standard_scaler.fit_transform() returns a standardized Pandas DataFrame: each column’s mean equals 0 and its standard deviation equals 1. The result matches our manually standardized values almost exactly; the only discrepancy is that pandas’ .std() divides by n - 1 while StandardScaler divides by n, which is negligible with 25,000 rows.
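Since standardization was contrasted with normalization earlier, here is a small side-by-side on synthetic heights (not the Kaggle data), using MinMaxScaler as scikit-learn’s normalization-style counterpart:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

heights = np.array([[64.0], [68.0], [72.0]])  # synthetic heights in inches

normalized = MinMaxScaler().fit_transform(heights)      # squeezed into [0, 1]
standardized = StandardScaler().fit_transform(heights)  # mean 0, std 1

print(normalized.ravel())    # [0.  0.5 1. ]
print(standardized.ravel())
```

Normalization pins the extremes to 0 and 1, while standardization centers the data and rescales by its spread; which is preferable depends on the model and on how sensitive the data is to outliers.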


The standard_scaler object has learned the means and standard deviations of our features, so it can now standardize new data using those same statistics.
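The fit-then-transform pattern can be sketched on synthetic data (standing in for the Kaggle columns): the scaler learns statistics from the training frame, then applies them unchanged to unseen rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic training data, not the Kaggle dataset
train = pd.DataFrame({"Height(Inches)": [64.0, 68.0, 72.0],
                      "Weight(Pounds)": [120.0, 150.0, 180.0]})

scaler = StandardScaler().fit(train)  # learns the column means and stds

# New, unseen measurements are scaled with the *training* statistics
new = pd.DataFrame({"Height(Inches)": [68.0],
                    "Weight(Pounds)": [150.0]})
print(scaler.transform(new))  # [[0. 0.]] -- exactly the training means
```

Using the training statistics for new data (rather than refitting) is what keeps a deployed model’s inputs on the same scale it was trained on.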
