Sometimes a Pandas DataFrame holds multiple values in a single cell. We may want to break up such a cell so that each column stores a single value.

If you look closely, you will find that lists are everywhere. Here are some practical cases where you will probably encounter list values.

  • Lists of all authors, artists, and producers.
  • List of favorite fruits.
  • Audio or video tags.

Dataset

In this tutorial, we use the Movie Genre dataset from Kaggle. For each movie, the dataset contains the IMDB ID, IMDB link, title, IMDB score, genre, and a link to download the movie poster.

Each movie poster belongs to at least one genre and can have at most 3 genre labels assigned to it. The genre labels are stored in a pipe-separated string.

import pandas as pd

# https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster/data
df = pd.read_csv('/content/MovieGenre.csv', encoding="ISO-8859-1")

Pandas List Type Column

The main problem with lists in pandas is that most of Pandas' built-in functions aren't compatible with them. Pandas stores every value in a column as one data type. For strings, integers, floats, or datetime values, this doesn't pose a problem.

For list-like data read from a CSV file, however, Pandas treats each whole list as a single string and cannot work with the individual items inside it. In our DataFrame, each list of genres is stored as a string object.

There is an easy way to fix this using the str.split("|") method. str.split() uses a delimiter to split a string into substrings, so we can split each Genre string at every pipe (|). Let's overwrite the original Genre column with the new one:

# Make sure every cell is a string before splitting
df["Genre"] = df["Genre"].astype(str)
df["Genre"] = df["Genre"].str.split("|")

print(df["Genre"].dtype)  # object

Note that Pandas still reports the series as the object dtype ("O"), which is also the dtype typically used for strings. This isn't a problem, but it might confuse you at first glance.
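To make the effect of the split easier to see, here is a minimal sketch on a small made-up DataFrame. The titles and genres are hypothetical stand-ins, not rows from the Kaggle file. It checks that each cell now holds a real Python list even though the column dtype is still object.

```python
import pandas as pd

# Hypothetical toy data standing in for the Kaggle file
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Action|Comedy", "Drama", "Comedy|Drama|Romance"],
})

# Split each pipe-separated string into a list of genre labels
toy["Genre"] = toy["Genre"].str.split("|")

# The column dtype is still object, but each cell is a Python list
first_cell_type = type(toy["Genre"].iloc[0])
# .str.len() also works on list cells and returns the number of labels
genre_counts = toy["Genre"].str.len().tolist()

print(toy["Genre"].dtype)
print(first_cell_type)
print(genre_counts)
```

Inspecting the type of a single cell, rather than the column dtype, is the quickest way to confirm that the split worked.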

One Hot Encoding using MultiLabelBinarizer

MultiLabelBinarizer is a class in the scikit-learn library for Python. It transforms multi-label targets into a binary representation for use in machine learning algorithms.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)

sparse_output=True returns the result in a sparse format. This saves a lot of memory when most of the elements of the array are zero.
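As a rough illustration of the difference, the sketch below fits the binarizer twice on a small hypothetical list of genre labels, once with the default dense output and once with sparse_output=True. The label lists are made up for the example.

```python
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists, one per movie
labels = [["Action", "Comedy"], ["Drama"], ["Comedy", "Drama", "Romance"]]

# Default: a dense NumPy array that stores every 0 and 1
dense = MultiLabelBinarizer().fit_transform(labels)

# sparse_output=True: a SciPy sparse matrix that stores only the non-zero entries
sparse_result = MultiLabelBinarizer(sparse_output=True).fit_transform(labels)

print(type(dense).__name__)
print(sparse.issparse(sparse_result))
print(sparse_result.nnz, "non-zero entries out of", dense.size)
```

On three rows the saving is negligible, but the gap grows with the number of rows and distinct labels, since each row has only a few non-zero entries.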

df = df.join(pd.DataFrame.sparse.from_spmatrix(
    mlb.fit_transform(df.pop('Genre')),
    index=df.index,
    columns=mlb.classes_))

df.shape  # (40108, 34)

MultiLabelBinarizer converts the lists of genre labels into a binary matrix where each column represents one of the possible classes and each row represents an instance. Joining that matrix back onto the DataFrame gives one 0/1 column per genre.
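The same steps can be tried end to end on a small made-up DataFrame, which makes the resulting binary matrix easy to read. This is only a sketch: the titles and genres are hypothetical, and the dense_view variable exists purely for display.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data; Genre already holds lists of labels
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": [["Action", "Comedy"], ["Drama"], ["Comedy", "Drama", "Romance"]],
})

mlb = MultiLabelBinarizer(sparse_output=True)

# Remove the Genre column, encode it, and wrap the sparse result in a DataFrame
encoded = pd.DataFrame.sparse.from_spmatrix(
    mlb.fit_transform(toy.pop("Genre")),
    index=toy.index,
    columns=mlb.classes_,
)
toy = toy.join(encoded)

# Densify only the genre columns to print them as ordinary 0/1 values
dense_view = toy[list(mlb.classes_)].sparse.to_dense()

print(toy.columns.tolist())
print(dense_view)
```

Each row has a 1 in the columns of the genres it belongs to and a 0 everywhere else, and the column order follows mlb.classes_.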
