Machine learning models perform a lot of mathematical operations, and to perform mathematical operations on our data, we must make sure all the data is numerical.

If we have any columns with categorical data, we must turn them into numbers. In this tutorial, we learn two ways to do this using Scikit-Learn’s OrdinalEncoder and LabelEncoder classes. Let’s first understand the difference between ordinal and nominal values.

Ordinal vs Nominal Categorical Values 

Continuous values are the most intuitive when represented as numbers. They are strictly ordered, and the difference between two values has a fixed meaning. For example, saying that package A is 2 kilograms heavier than package B, or that package B came from 100 miles farther away than package A, means the same thing no matter which packages are compared.

Ordinal values have a strict order, but the fixed relationship between values no longer applies. For example, consider ordering a small, medium, or large drink, with small mapped to the value 1, medium to 2, and large to 3. The large drink is bigger than the medium, in the same way that 3 is bigger than 2, but the numbers don’t tell us anything about how much bigger.

Finally, nominal values have neither ordering nor numerical meaning. These are often just enumerations of possibilities assigned to arbitrary numbers. Assigning water to 1, coffee to 2, soda to 3, and milk to 4 is a good example.

There’s no real logic to placing water first and milk last; they simply need distinct values to differentiate them. Because the numerical values bear no meaning, they are said to be on a nominal scale. 
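The distinction can be sketched with pandas categoricals. This is only an illustration: the variable names are made up for the example, and the codes start at 0 rather than at 1 as in the text above.

```python
import pandas as pd

# Ordinal: the category order is meaningful, so it is declared explicitly.
sizes = pd.Categorical(["small", "large", "medium"],
                       categories=["small", "medium", "large"],
                       ordered=True)
size_codes = sizes.codes.tolist()   # position of each value in the declared order
largest = sizes.max()               # well defined because the categories are ordered

# Nominal: the codes are only identifiers. Without an explicit list,
# pandas sorts the categories alphabetically.
drinks = pd.Categorical(["water", "coffee", "soda", "milk"])
drink_codes = drinks.codes.tolist()

print(size_codes, largest)
print(drink_codes, drinks.ordered)
```

The codes of the drinks depend only on alphabetical sorting, so comparing them says nothing about the drinks themselves.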

LabelEncoder vs OrdinalEncoder

You can use OrdinalEncoder to preserve the order of categorical data, e.g. small, medium, large, extra large or low, medium, high. You can use LabelEncoder for categorical data that has no order, e.g. red, green, blue.

import pandas as pd

df = pd.DataFrame([['red', 'S', 10.1,   'class2'],
                   ['green', 'M', 12.5, 'class1'],
                   ['black', 'L', 14.5,   'class1'],
                   ['yellow', 'XL', 16.5, 'class2'],
                   ['blue', 'S', 11.5,   'class2'],
                   ['black', 'XL', 17.5, 'class2'],
                   ['yellow', 'L', 14.5, 'class1'],
                   ['blue', 'M', 11.3, 'class2']])

df.columns = ['color', 'size', 'price', 'classlabel']

Ordinal encoding should be used for ordinal variables (where order matters, like cold, warm, hot).

from sklearn.preprocessing import OrdinalEncoder

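# The categories list fixes the order: each value is encoded as its position in that list.
ordinal_encoder = OrdinalEncoder(categories=[[None, "S", "M", "L", "XL"]], dtype=int)

# OrdinalEncoder expects 2-D input, so the single column is reshaped to (n_samples, 1).
df['size_en'] = ordinal_encoder.fit_transform(df['size'].values.reshape(-1, 1))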
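To check how an encoder mapped the values, you can inspect its categories_ attribute and round-trip the codes with inverse_transform. The following is a minimal sketch on a small stand-alone column; the sizes DataFrame is made up for the example, and the category list omits None, so the codes start at 0 for S.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-alone column, used only for this illustration.
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "S"]})

encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]], dtype=int)
codes = encoder.fit_transform(sizes[["size"]])   # 2-D input: (n_samples, 1)

print(encoder.categories_)   # the order the encoder uses
print(codes.ravel())         # each value replaced by its position in that order

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded.ravel())
```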

The categories parameter passed to the OrdinalEncoder constructor determines the order: each value is encoded as its position in the list.

Label encoding, by contrast, should be used for non-ordinal (aka nominal) variables, where order doesn’t matter, like blonde and brunette.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['classlabel_en'] = le.fit_transform(df['classlabel'])

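LabelEncoder assigns integers in sorted (alphabetical) order, so here class1 becomes 0 and class2 becomes 1. Applied to the size column, it would produce 'L' < 'M' < 'S' < 'XL', which is not the real order of the sizes.

The two classes also differ in the input they accept. OrdinalEncoder can fit data that has the shape of (n_samples, n_features), while LabelEncoder can only fit data that has the shape of (n_samples,), because it is intended for a single column of target labels.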
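Both points can be seen in a short sketch. The arrays below are made up for the illustration; OrdinalEncoder is used here with its default categories='auto', which also sorts the values of each column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder takes a 1-D array and sorts the labels alphabetically.
sizes = np.array(["S", "M", "L", "XL"])
le = LabelEncoder()
label_codes = le.fit_transform(sizes)
print(le.classes_)    # sorted labels, not the real size order
print(label_codes)

# OrdinalEncoder takes a 2-D array and encodes every column independently.
X = np.array([["S", "red"],
              ["M", "blue"],
              ["L", "red"],
              ["XL", "green"]], dtype=object)
oe = OrdinalEncoder()
ordinal_codes = oe.fit_transform(X)
print(ordinal_codes.shape)
```

Because neither call was given an explicit order, both fall back to alphabetical sorting. Passing categories to OrdinalEncoder is what makes the encoding respect the real order.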

If the data contains a value that is missing from the categories list, fitting fails with an error like this:

ValueError: Found unknown categories ['P'] in column 0 during fit

The problem is that the OrdinalEncoder has encountered a value in the data set that is not among the categories you defined. To handle this, add the handle_unknown and unknown_value arguments to your OrdinalEncoder.

ordinal_encoder = OrdinalEncoder(categories=[[None, "S", "M", "L", "XL"]],
                                 handle_unknown='use_encoded_value',
                                 unknown_value=9,
                                 dtype=int)

The handle_unknown parameter defaults to 'error'. With this setting, an error is raised whenever an unknown category is encountered.

When set to 'use_encoded_value', unknown categories are encoded as the value given for the unknown_value parameter, which must not collide with any of the codes used for known categories. In inverse_transform, an unknown category will be denoted as None.
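The two settings can be compared in a small sketch. The train and new arrays are made up for the example, and -1 is an arbitrary choice for unknown_value; the 'use_encoded_value' option assumes a reasonably recent scikit-learn release.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

order = [["S", "M", "L", "XL"]]
train = np.array([["S"], ["M"], ["L"], ["XL"]], dtype=object)
new = np.array([["M"], ["P"]], dtype=object)   # 'P' was never seen during fit

# Default behaviour (handle_unknown='error'): unknown values raise a ValueError.
strict = OrdinalEncoder(categories=order).fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# With 'use_encoded_value', unknown values are replaced by unknown_value.
tolerant = OrdinalEncoder(categories=order,
                          handle_unknown="use_encoded_value",
                          unknown_value=-1,
                          dtype=int).fit(train)
codes = tolerant.transform(new)
decoded = tolerant.inverse_transform(codes)   # the unknown value comes back as None
print(codes.ravel(), decoded.ravel())
```

A value outside the range of regular codes, such as -1, makes unknown entries easy to spot later in the pipeline.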
