Extracting a subset of data is one of the most common operations in data analysis. A subset is a portion of a larger data set that satisfies some condition.

Suppose that we want to generate a list of all Comedy movies. To accomplish this task, we need to filter our Movie data set based on the values in the Genre column. The list of Comedy movies is a subset of all movies. 

Pandas List Type Column

Sometimes a single cell in a Pandas DataFrame holds multiple values. In this data set, each movie poster belongs to at least one genre and can have at most 3 genre labels assigned to it. The genre labels are stored in a pipe-separated string.

import pandas as pd

# https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster/data
df = pd.read_csv('/content/MovieGenre.csv', encoding="ISO-8859-1")

df.Genre.dtype  # dtype('O')

The main problem with lists in pandas is that most of Pandas’ built-in functions aren’t compatible with them. Pandas stores every value in a column under one data type. For scalar values such as strings, integers, floats, or datetimes, this doesn’t pose a problem.

Pandas treats each pipe-separated list as one string, so it cannot work with the individual labels inside it. In our DataFrame, each Genre value is stored as a string object.
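The limitation can be seen in a minimal sketch. It uses a hypothetical three-row sample (the titles and genres are made up, not taken from the Kaggle file): an equality check only matches cells whose whole string equals the label, so a multi-genre movie is missed.

```python
import pandas as pd

# Hypothetical sample; titles and genres are invented for illustration.
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Comedy", "Comedy|Romance", "Drama"],
})

# Each cell is one string, so == compares against the whole cell.
exact_match = sample["Genre"] == "Comedy"
print(exact_match.tolist())
```

Movie B is a Comedy as well, but its cell reads "Comedy|Romance", so the comparison returns False for it.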

There is an easy way to fix this using the str.split("|") method. str.split() uses a delimiter to split a string into a list of substrings. We can split each Genre string at every pipe (|). Let’s overwrite the original Genre column with the new one:

# Make sure every cell is a string before splitting
df.Genre = df.Genre.astype(str)
# Turn "A|B|C" into ["A", "B", "C"]
df.Genre = df["Genre"].str.split("|")
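To see what the split produces, here is a small sketch on a hypothetical sample DataFrame (the rows are assumptions for illustration, not data from the Kaggle file):

```python
import pandas as pd

# Hypothetical sample standing in for the Genre column.
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Comedy|Romance", "Adventure|Comedy|Family", "Drama"],
})

# Split each pipe-separated string into a list of genre labels.
sample["Genre"] = sample["Genre"].str.split("|")

first_genres = sample.loc[0, "Genre"]
last_genres = sample.loc[2, "Genre"]
print(first_genres)
print(last_genres)
```

Every cell is now a Python list, including single-genre rows, which become one-element lists.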

Filter using apply() and in

Some filtering operations are more complex than simple equality or inequality checks. Luckily, pandas ships with many helper methods that generate Boolean Series for these types of extractions.

The Pandas apply() function applies a custom function to a Pandas object. It is available as both a Series method and a DataFrame method. When called on a Series, it invokes the function once for each value.

A lambda expression is a way of creating a small function inline, without all the syntax of a def. Here is a lambda with a single x parameter that returns True or False.

df[df.Genre.apply(lambda x: 'Comedy' in x)]
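The same filter can be sketched on a hypothetical sample whose Genre column already holds lists. The helper name has_genre is made up for this example; it does what the lambda does, written as a named function.

```python
import pandas as pd

# Hypothetical sample; Genre already holds lists of labels.
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": [["Comedy", "Romance"], ["Adventure", "Comedy", "Family"], ["Drama"]],
})

def has_genre(genres, wanted="Comedy"):
    """Return True when the wanted label is one of the list items."""
    return wanted in genres

# apply() calls has_genre once per row and collects a Boolean Series.
mask = sample["Genre"].apply(has_genre)
comedies = sample[mask]
print(comedies["Title"].tolist())
```

Because each cell is a list, the in operator tests for list membership, so "Comedy" matches only a complete label and never a fragment of a longer one.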

Filter using map() and issubset()

The Pandas map() function is used to map each value from a Series object to another value using a dictionary/function/Series. It is a convenient function to map the values of a Series from one domain to another domain.

df[df.Genre.map(set(["Adventure"]).issubset)]

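A subset is a set whose elements are all contained in another set. issubset() is the set method that checks whether every element of a set appears in another collection. Here, map() passes each row's genre list to set(["Adventure"]).issubset, which returns True when "Adventure" is among the row's labels.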
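issubset() becomes more useful when several labels must all be present. A minimal sketch on a hypothetical sample (the rows and the required set are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample; Genre holds lists of labels.
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": [["Comedy", "Romance"], ["Adventure", "Comedy", "Family"], ["Drama"]],
})

# Keep rows whose genre list contains every label in `required`.
required = {"Adventure", "Comedy"}
mask = sample["Genre"].map(required.issubset)
matches = sample[mask]
print(matches["Title"].tolist())
```

set.issubset() accepts any iterable as its argument, so the genre lists do not need to be converted to sets first.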

Filter using str.contains()

Any time we’d like to perform string manipulations, we invoke a method on the StringMethods object, available through the Series str attribute, rather than on the Series itself. Some methods work like Python’s native string methods, whereas other methods are exclusive to pandas.

Suppose that we want to isolate all movies whose genre includes the word “Drama”. In vanilla Python, we use the in operator to search for a substring within a string:

"Drama" in "Comedy, Family, Romance, Drama"

The biggest challenge in string matching is case sensitivity. Python will not find the string “drama” in “Comedy, Family, Romance, Drama”, for example, because of the mismatch in the casing of the “d” character.
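The casing issue can be checked directly in plain Python. A small sketch, with the genre string chosen only for illustration:

```python
genres = "Comedy|Family|Romance|Drama"

exact_case = "Drama" in genres             # casing matches
wrong_case = "drama" in genres             # lowercase "d" does not match "D"
normalized = "drama" in genres.lower()     # compare against a lowercased copy

print(exact_case, wrong_case, normalized)
```

Lowercasing creates a new string, so the original genres value keeps its casing.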

To solve this problem, we need to ensure consistent casing across all column values before we check for the presence of a substring. We can look for a lowercase “drama” in an all-lowercase Series or an uppercase “DRAMA” in an all-uppercase Series.

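# Genre must hold the original pipe-separated strings here (before str.split);
# na=False treats missing values as non-matches
df[df['Genre'].str.lower().str.contains('comedy', regex=False, na=False)]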

The contains method checks for a substring’s inclusion in each Series value. The method returns True when pandas finds the method’s argument within the row’s string and False when it does not.

Notice that pandas preserves the original letter casing of the values in Genre. The original DataFrame is never mutated. The lower method returns a new Series, and the contains method we invoke on it returns another new Series, which pandas uses to filter rows from the original DataFrame.
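A minimal sketch of the case-insensitive filter on a hypothetical sample with mixed casing and one missing value (the rows are assumptions for illustration). It also shows the case=False parameter of str.contains as an alternative to lowercasing first.

```python
import pandas as pd

# Hypothetical sample; Genre holds the raw pipe-separated strings.
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Genre": ["Comedy|Romance", "Crime|DRAMA", "drama|History", None],
})

# Lowercase a temporary copy, then search it; na=False maps missing values to False.
mask = sample["Genre"].str.lower().str.contains("drama", regex=False, na=False)

# Alternative: let contains() ignore casing itself.
alt_mask = sample["Genre"].str.contains("drama", case=False, regex=False, na=False)

dramas = sample[mask]
print(dramas["Genre"].tolist())
```

The filtered rows keep their original casing because lower() worked on a temporary Series only.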
