In machine learning applications, we can encounter various types of features: continuous, unordered categorical, and ordered categorical. 

Note that while numeric data can be either continuous or discrete, in the context of machine learning libraries, numeric data usually refers to continuous data of a floating-point type.

Real-world datasets often contain one or more categorical feature columns. In this tutorial, we will use simple yet effective examples to see how to deal with this type of data in pandas. Let’s create a new DataFrame to illustrate the problem:

import pandas as pd

df = pd.DataFrame([['Red', 'M', 9.5],
                   ['Green', 'L', 12.8],
                   ['Blue', 'XL', 16.5]])

df.columns = ['Color', 'Size', 'Price']

When we talk about categorical data, we have to further distinguish between ordinal and nominal features. Ordinal features are categorical values that can be sorted or ordered. For example, t-shirt size is an ordinal feature, because we can define an order: S < M < L < XL.

In contrast, nominal features don’t imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn’t make sense to say that, for example, red is larger than blue. 

Encoding Categorical Data with Pandas

The DataFrame created above contains a nominal feature (Color), an ordinal feature (Size), and a numerical feature (Price) column.

Mapping Ordinal Features

To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. There is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually.

# Larger integers correspond to larger sizes
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['Size'] = df['Size'].map(size_mapping)
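
If the integer values need to be turned back into the original string labels later, the mapping can be reversed. The following is a minimal sketch; the names `sizes`, `encoded`, `decoded`, and `inv_size_mapping` are illustrative and not part of the original example.

```python
import pandas as pd

# Same ordinal mapping as above: a larger integer means a larger size.
size_mapping = {'XL': 3, 'L': 2, 'M': 1}

sizes = pd.Series(['M', 'L', 'XL'])
encoded = sizes.map(size_mapping)

# Reverse the dictionary to recover the string labels.
inv_size_mapping = {code: label for label, code in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

print(encoded.tolist())  # [1, 2, 3]
print(decoded.tolist())  # ['M', 'L', 'XL']
```

Note that any label missing from the dictionary is mapped to NaN, so the mapping should cover every size that occurs in the data.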

Encoding Nominal Features 

Linear models typically perform poorly on raw, unscaled data, so standardization is usually needed to achieve the best results. They also cannot use categorical features directly: string categories must first be converted to numbers.

The simplest way to represent categories is with integers: 0 for Red, 1 for Green, and 2 for Blue. However, this representation can mislead a linear model. Although the color values don’t come in any particular order, the learning algorithm will now assume that Blue is larger than Green and Green is larger than Red. The algorithm could still produce useful results despite this incorrect assumption, but those results would not be optimal.
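
To make the problem concrete, here is a small sketch of what integer codes imply. The `color_mapping` dictionary is a hypothetical mapping that follows the numbering used in the text; it is not something pandas produces by default.

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue'])

# Hypothetical integer codes, matching the numbering in the text.
color_mapping = {'Red': 0, 'Green': 1, 'Blue': 2}
codes = colors.map(color_mapping)

# The codes now support comparisons and arithmetic that mean nothing for colors.
blue_greater_than_red = codes[2] > codes[0]
midpoint = (codes[0] + codes[2]) / 2  # "halfway between Red and Blue"

print(blue_greater_than_red)  # True
print(midpoint == codes[1])   # True: the midpoint equals the code for Green
```

A model that multiplies these codes by a weight treats Green as lying exactly between Red and Blue, which is an artifact of the numbering rather than a property of the data.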

A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. 

A convenient way to create those dummy features via one-hot encoding is the get_dummies function implemented in pandas. Applied to a DataFrame, get_dummies by default converts only string (object or category) columns and leaves all other columns unchanged:

pd.get_dummies(df[['Price', 'Color', 'Size']])
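
As a self-contained sketch of this call, the block below rebuilds the example DataFrame, applies the size mapping, and inspects the result. Passing `dtype=int` is an optional choice made here so that the dummy columns hold 0/1 integers regardless of the pandas version; the variable name `encoded` is illustrative.

```python
import pandas as pd

df = pd.DataFrame([['Red', 'M', 9.5],
                   ['Green', 'L', 12.8],
                   ['Blue', 'XL', 16.5]],
                  columns=['Color', 'Size', 'Price'])
df['Size'] = df['Size'].map({'XL': 3, 'L': 2, 'M': 1})

# Only the string column (Color) is expanded; Price and Size pass through.
encoded = pd.get_dummies(df[['Price', 'Color', 'Size']], dtype=int)

print(list(encoded.columns))
# ['Price', 'Size', 'Color_Blue', 'Color_Green', 'Color_Red']
print(encoded['Color_Red'].tolist())  # [1, 0, 0]
```

Each row has exactly one dummy column set to 1, so no artificial order between the colors is introduced.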

Normalize Categorical Data

Normalization rescales data so that it has a specific distributional characteristic. This could be a fixed range (for example, between 0 and 1), a mean or median fixed to some constant value, or a variance or spread fixed to some constant value.
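
Two common instances of these ideas are min-max normalization (fixed range) and standardization (fixed mean and spread). The following sketch applies both to the Price values from the example using plain pandas arithmetic; the variable names are illustrative.

```python
import pandas as pd

prices = pd.Series([9.5, 12.8, 16.5])

# Min-max normalization: rescale to the fixed range [0, 1].
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# Standardization: shift to mean 0 and scale to unit standard deviation.
# Series.std() uses the sample standard deviation (ddof=1) by default.
standardized = (prices - prices.mean()) / prices.std()

print(minmax.min(), minmax.max())  # 0.0 1.0
```

Libraries may define the standard deviation differently (population versus sample), so standardized values can differ slightly between tools.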

Data normalization prevents features with large values from dominating the learning outcome. The dominant features are usually not the categorical ones, because those are one-hot encoded.

One-hot encoded features are conceptually booleans rather than scalar values. Normalizing such features amounts to rescaling the positive value, which in many cases won’t do anything at all.

One-hot encoded variables take only the values 0 and 1, so they do not show the large scale differences (say, 10 versus 1000) seen in numerical features. There is therefore usually no need to normalize or standardize them.
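
As a quick check of this claim for the min-max case, the sketch below scales one-hot columns and compares them with the originals. It assumes every dummy column contains both a 0 and a 1, which holds for this small example; the variable names are illustrative.

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue'])
dummies = pd.get_dummies(colors, dtype=float)

# Min-max scaling of columns that already span exactly 0 to 1.
scaled = (dummies - dummies.min()) / (dummies.max() - dummies.min())

print((scaled == dummies).all().all())  # True
```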

Finally, normalization and standardization do not affect the ordering of values. If 𝑥1 is larger than 𝑥2, both will generally have different values after normalization or standardization, but the relation between them does not change.
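
The order-preserving property can be checked by comparing ranks before and after scaling. This is a minimal sketch using the Price values in shuffled order; the variable names are illustrative.

```python
import pandas as pd

prices = pd.Series([12.8, 9.5, 16.5])
standardized = (prices - prices.mean()) / prices.std()

# Ranks describe the ordering; they are identical before and after scaling.
print(prices.rank().tolist())        # [2.0, 1.0, 3.0]
print(standardized.rank().tolist())  # [2.0, 1.0, 3.0]
```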
