In practice, raw data is rarely ready for a machine-learning model: it is often unformatted, dirty, and inconsistent, and it requires several steps of cleaning and feature engineering.

In the context of structured data, a developer needs to deal with all sorts of problems, like missing values, denormalized data, unformatted strings, duplicated rows, etc.
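As a rough illustration of these cleaning steps, here is a minimal pandas sketch. The small `raw` DataFrame and its values are made up for this example, and filling the missing value with the median is only one possible choice.

```python
import pandas as pd

# Hypothetical raw data with typical problems: inconsistent strings,
# a duplicated row, and a missing value.
raw = pd.DataFrame({
    "area": [7420, 8960, 8960, None],
    "furnishingstatus": [" Furnished", "semi-furnished", "semi-furnished", "UNFURNISHED "],
})

clean = raw.copy()
# Normalize string formatting: strip whitespace and lower-case.
clean["furnishingstatus"] = clean["furnishingstatus"].str.strip().str.lower()
# Drop exact duplicate rows and rebuild the index.
clean = clean.drop_duplicates().reset_index(drop=True)
# Fill the missing area with the median of the remaining values (an assumption).
clean["area"] = clean["area"].fillna(clean["area"].median())
```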

You also need to improve the data representation by normalizing numerical features, embedding categorical features, creating more meaningful columns, and taking many other steps that increase ML model performance or improve the quality of dashboards.
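To make two of these steps concrete, the sketch below applies min-max scaling to numerical columns and derives one extra column. The `df_num` frame and the `area_per_bedroom` feature are hypothetical, chosen only to show the idea.

```python
import pandas as pd

# Hypothetical numerical columns on very different scales.
df_num = pd.DataFrame({"area": [2000, 4000, 6000], "bedrooms": [2, 3, 4]})

# Min-max scaling: (x - min) / (max - min), applied column by column.
scaled = (df_num - df_num.min()) / (df_num.max() - df_num.min())

# A derived column that may be more meaningful than the raw ones.
features = df_num.assign(area_per_bedroom=df_num["area"] / df_num["bedrooms"])
```

After scaling, both columns lie in the range 0 to 1, so neither dominates purely because of its units.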

Categorical Features

One common type of nonnumerical data is categorical data. For example, imagine you are exploring some data on housing prices, and along with numerical features like 'price', 'area', and 'bedrooms', you also have categorical features like 'mainroad' and 'furnishingstatus'. Your data might look something like this:

import pandas as pd

df = pd.read_csv("/content/Housing.csv")
df.head(10)

[Figure: Multiple columns using OrdinalEncoder]

You might be tempted to encode this data with a straightforward numerical mapping.

{'furnished': 1, 'semi-furnished': 2, 'unfurnished': 3}
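A minimal sketch of applying such a mapping with pandas `Series.map` might look like this. The `status` series is a small made-up sample, not the real Housing data.

```python
import pandas as pd

# Hypothetical sample of the furnishingstatus column.
status = pd.Series(["furnished", "unfurnished", "semi-furnished", "furnished"])

mapping = {"furnished": 1, "semi-furnished": 2, "unfurnished": 3}

# Replace each label with its integer code.
encoded = status.map(mapping)
```

Keep in mind that `map` turns any label missing from the dictionary into NaN, so the mapping has to cover every value in the column.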

In a previous post, LabelEncoder vs OrdinalEncoder in Machine Learning, I demonstrated how to use label encoding and ordinal encoding to convert categorical text data into numbers.

Encoding Multiple Columns Using OrdinalEncoder

There is no need to implement a custom class to label-encode multiple columns. You can simply use OrdinalEncoder, which saves you from creating a separate LabelEncoder object for each column.

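from sklearn.preprocessing import OrdinalEncoder

columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
           'airconditioning', 'prefarea', 'furnishingstatus']

# categories[i] lists the expected values of columns[i], in encoding order.
# ['yes', 'no'] encodes 'yes' as 0 and 'no' as 1. The leading None is a
# placeholder at index 0, so 'furnished' is encoded as 1.
enc = OrdinalEncoder(dtype=int,
                     categories=[['yes', 'no']] * 6 +
                                [[None, 'furnished', 'semi-furnished', 'unfurnished']])

# Encode all listed columns in a single call
df[columns] = enc.fit_transform(df[columns])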

[Figure: Multiple columns label encoding using scikit-learn]

The categories parameter takes a list in which categories[i] holds the categories expected in the i-th column, in the order they should be encoded. Within a single feature, the passed categories should not mix strings and numeric values, and numeric values should be sorted.
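To see how the order of each categories list drives the codes, here is a small self-contained sketch. The `toy` DataFrame is made up, and it lists 'no' before 'yes' (unlike the snippet above) so that 'yes' maps to 1; that ordering is just a choice for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical two-column sample.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["unfurnished", "furnished", "semi-furnished"],
})

# One list per column; the position of a label in its list is its code.
enc = OrdinalEncoder(
    dtype=int,
    categories=[["no", "yes"], ["furnished", "semi-furnished", "unfurnished"]],
)

codes = enc.fit_transform(toy)            # 2-D integer array
restored = enc.inverse_transform(codes)   # back to the original labels
```

Because the encoder remembers the categories, `inverse_transform` recovers the original strings, which is handy when you need to interpret predictions later.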

Encoding Multiple Columns Using OneHotEncoder

But it turns out that a plain integer mapping is not generally a useful approach in Scikit-Learn. The package's models make the fundamental assumption that numerical features reflect algebraic quantities, so such a mapping would imply, for example, that furnished < semi-furnished < unfurnished, or even that unfurnished - semi-furnished = furnished, which does not make much sense.
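The problem is easy to reproduce with plain Python. This short sketch uses the same integer mapping as before and shows the arithmetic a model would be free to exploit; the variable names are illustrative only.

```python
mapping = {"furnished": 1, "semi-furnished": 2, "unfurnished": 3}

# The codes impose an ordering that the labels do not really have.
implied_order = mapping["furnished"] < mapping["semi-furnished"] < mapping["unfurnished"]

# Subtracting two codes yields another valid code.
implied_difference = mapping["unfurnished"] - mapping["semi-furnished"]

# Look up which label that difference corresponds to.
code_to_label = {code: label for label, code in mapping.items()}
label_for_difference = code_to_label[implied_difference]
```

Here `label_for_difference` comes out as 'furnished', which is exactly the meaningless relationship described above.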

In this case, one proven technique is to use one-hot encoding, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively. Scikit-Learn’s OneHotEncoder will do this for you:

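from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense array (requires scikit-learn 1.2 or later)
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype=int)

# Encode the column and join the new indicator columns back onto df
df = df.join(pd.DataFrame(
                enc.fit_transform(df['furnishingstatus'].values.reshape(-1, 1)),
                index=df.index,
                columns=enc.get_feature_names_out()))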

[Figure: OneHot Label encoding]

Notice that the furnishingstatus column has been expanded into three separate columns representing the three furnishingstatus labels, and that each row has a 1 in the column associated with its furnishing status. With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.
