Identifying your variables as categorical or continuous directly affects the performance of your ML model. The right choice depends on the context and on how you want to encode each variable's meaning.

In machine learning, we classify numerical features into two categories: categorical and continuous. A continuous feature can take any value within a range, such as the price of an item.

A categorical feature takes one of a finite set of values, such as the month of the year (1 to 12). We can subdivide the categorical family into three main types:

  • Binary (or dichotomous), when you have only two choices (0/1, true/false)
  • Ordinal, when the categories have a meaningful ordering (e.g., the position in a race)
  • Nominal, when the categories have no specific ordering (e.g., the color of an item)

Figure: The different types of numerical features.

We frequently represent qualitative information in categories such as gender, t-shirt size, colors, or brand of car. However, not all categorical data is the same. Sets of categories with no intrinsic ordering are called nominal. Examples of nominal categories include:

  • Blue, Red, Green
  • Man, Woman
  • Banana, Strawberry, Apple 

In contrast, when a set of categories has some natural ordering we refer to it as ordinal. For example:

  • Small, Medium, Large, Extra Large
  • Low, Medium, High
  • Agree, Neutral, Disagree
  • Young, Old
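
The nominal/ordinal distinction can also be stated explicitly in pandas. The following is a minimal sketch, not part of the original tutorial: it assumes pandas' `CategoricalDtype`, and the variable names and sample values are illustrative only. An ordered categorical supports comparisons, while an unordered one only supports equality checks.

```python
import pandas as pd

# Ordinal: the order of the categories list defines Small < Medium < Large
size_dtype = pd.CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)
sizes = pd.Series(["Medium", "Small", "Large"], dtype=size_dtype)

# Nominal: the same kind of container, but with no ordering between categories
color_dtype = pd.CategoricalDtype(categories=["Blue", "Red", "Green"], ordered=False)
colors = pd.Series(["Red", "Blue", "Green"], dtype=color_dtype)

# Order-aware operations work on the ordinal series
smallest = sizes.min()
bigger_than_small = (sizes > "Small").tolist()
size_codes = sizes.cat.codes.tolist()  # position of each value in the categories list

# The same comparison on the nominal series is rejected by pandas
try:
    colors > "Blue"
    nominal_comparison_failed = False
except TypeError:
    nominal_comparison_failed = True

print(smallest, bigger_than_small, size_codes, nominal_comparison_failed)
```

Note that the integer codes start at 0 and simply follow the order in which the categories were listed.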

The problem is that most machine learning algorithms require inputs to be numerical values. In this tutorial, we will only cover techniques for encoding ordinal data, first with a pandas mapping and then with Scikit-Learn’s OrdinalEncoder.

Encoding Ordinal Categorical Features

Ordinal values have a strict order, but the distance between consecutive values is not defined. A good example is ordering a small, medium, or large drink, with small mapped to the value 1, medium to 2, and large to 3. The large drink is bigger than the medium, in the same way that 3 is bigger than 2, but the encoding doesn’t tell us anything about how much bigger.
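
To make the point concrete, here is a small sketch using plain Python dictionaries. The drink volumes are hypothetical numbers chosen only for illustration; they do not come from any dataset in this tutorial. The codes reproduce the ranking of the sizes but not the gaps between them.

```python
# Ordinal codes for the three sizes
size_codes = {"Small": 1, "Medium": 2, "Large": 3}

# Hypothetical real volumes in millilitres (illustrative assumption)
volume_ml = {"Small": 250, "Medium": 350, "Large": 600}

# Sorting by code or by volume gives the same ranking
order_by_code = sorted(size_codes, key=size_codes.get)
order_by_volume = sorted(volume_ml, key=volume_ml.get)

# The steps between consecutive codes are equal, but the real gaps are not
code_steps = [size_codes["Medium"] - size_codes["Small"],
              size_codes["Large"] - size_codes["Medium"]]
volume_steps = [volume_ml["Medium"] - volume_ml["Small"],
                volume_ml["Large"] - volume_ml["Medium"]]

print(order_by_code == order_by_volume)
print(code_steps, volume_steps)
```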

Here we have an ordinal categorical feature (e.g., small, medium, large), which we want to transform into numerical values.

import pandas as pd

# Each row holds a drink name, its size (the ordinal feature), and its price
drink_df = pd.DataFrame([['Coffee', 'Small', 9.5],
                         ['Drinks', 'Medium', 14.8],
                         ['Beer', 'Large', 11.5],
                         ['Wine', 'Medium', 10.5],
                         ['Milk', 'Small', 12.5],
                         ['Soda', 'Large', 14.5]],
                        columns=['Name', 'Size', 'Price'])
print(drink_df)

First, we use the pandas replace method on the Size column to transform the string labels into their numerical equivalents:

# Create mapper
scale_mapper = {"Small": 1,
                "Medium": 2,
                "Large": 3}

# Replace feature values with scale
drink_df['Size_en'] = drink_df["Size"].replace(scale_mapper)
print(drink_df)

When encoding the feature for use in machine learning, we need to transform the ordinal classes into numerical values that maintain the notion of ordering. The most common approach is to create a dictionary that maps the string label of the class to a number and then apply that map to the feature.
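
The dictionary approach can also be written with the pandas `Series.map` method. This is an alternative sketch rather than the method used above; the `sizes` series and its "Extra Large" entry are made up for illustration. Unlike `replace`, which leaves unmapped labels untouched, `map` turns any label missing from the dictionary into NaN, which makes gaps in the mapping easy to spot.

```python
import pandas as pd

# Illustrative data: "Extra Large" is deliberately absent from the mapper
sizes = pd.Series(["Small", "Medium", "Large", "Extra Large"], name="Size")
scale_mapper = {"Small": 1, "Medium": 2, "Large": 3}

# Labels found in the dictionary get their code; anything else becomes NaN
encoded = sizes.map(scale_mapper)

# List the labels that the mapper does not cover
unmapped = sizes[encoded.isna()].unique().tolist()

print(encoded)
print(unmapped)
```

Because of the NaN, the encoded column is stored as floats until the missing label is added to the mapper.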

Encoding Using OrdinalEncoder

Most machine learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. For this, we can use Scikit-Learn’s OrdinalEncoder class: 

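from sklearn.preprocessing import OrdinalEncoder

# List the categories in their intended order. The leading None is a
# placeholder that takes code 0, so Small, Medium, and Large become 1, 2, and 3.
ordinal_encoder = OrdinalEncoder(categories=[[None, "Small", "Medium", "Large"]], dtype=int)

# The encoder expects 2D input; ravel() flattens the result back to one column
drink_df['Size_oren'] = ordinal_encoder.fit_transform(drink_df['Size'].values.reshape(-1, 1)).ravel()
print(drink_df)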

You can get the list of categories using the categories_ instance variable. It is a list containing a 1D array of categories for each categorical attribute (in this case, a list containing a single array since there is just one categorical attribute): 

ordinal_encoder.categories_ #[array([None, 'Small', 'Medium', 'Large'], dtype=object)]
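
A fitted encoder can also map codes back to labels and, if configured to do so, assign a sentinel code to labels it has not seen. The sketch below is a standalone illustration under a few assumptions: it omits the None placeholder (so codes start at 0), uses a made-up `sizes` array, and picks -1 as the sentinel through the `handle_unknown` and `unknown_value` parameters.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative 2D input: one column, four rows
sizes = np.array(["Small", "Large", "Medium", "Small"], dtype=object).reshape(-1, 1)

encoder = OrdinalEncoder(
    categories=[["Small", "Medium", "Large"]],  # explicit order, codes 0, 1, 2
    handle_unknown="use_encoded_value",         # do not raise on unseen labels
    unknown_value=-1,                           # sentinel code for unseen labels
)

codes = encoder.fit_transform(sizes)

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)

# A label that was not in the categories list gets the sentinel code
unseen = encoder.transform(np.array([["Extra Large"]], dtype=object))

print(codes.ravel(), decoded.ravel(), unseen.ravel())
```

Without `handle_unknown="use_encoded_value"`, the default behavior is to raise an error when `transform` meets an unknown label.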
