In machine learning problems, a lot of effort goes into preparing the data. PyTorch provides many classes to make data loading easy and code more readable. In this tutorial, we will see how to load and preprocess/augment a custom dataset.

PyTorch provides two classes, torch.utils.data.Dataset and torch.utils.data.DataLoader, that allow you to load your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.

Along the way, we will write and use a custom Dataset, transforms, and a DataLoader.

Dataset

In this tutorial, we use the Movie Posters dataset. It is available on Kaggle and is large enough for training a deep learning model, yet small enough for this post.

The Movie Posters dataset contains around 7800 images spanning 25 different movie genres. First of all, download the dataset and extract it.

Here are a few rows of data from the CSV file of the dataset that we will use to train our deep learning model.

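A sketch of the layout (the Id and Genre values below are hypothetical; the real ones come from train.csv):

Id          Genre                  Action  Adventure  ...  Western
tt0086425   ['Comedy', 'Drama']    0       0          ...  0
tt0085549   ['Drama', 'Romance']   0       0          ...  0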

The Id column contains all the image file names and the Genre column contains all the genres that the movie belongs to.

Targets

Then we have 25 more columns with the genres as the column names. If a movie poster belongs to a particular genre, then that column value is 1, else it is 0.

We will start by preparing the dataset, splitting it into two parts: training and testing.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("/content/Multi_Label_dataset/train.csv")
train_set, test_set = train_test_split(df, test_size=0.25)
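
As a quick sanity check on the split (the counts are approximate, since the exact number of rows may differ):

print(df.shape)                       # about (7800, 27): Id, Genre, and 25 genre columns
print(len(train_set), len(test_set))  # roughly a 75/25 split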

We will use the torchvision and torch.utils.data packages for loading the data.

import torch
import torchvision.transforms as transforms
import torchvision

from torch.utils.data import DataLoader,Dataset

import cv2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Dataset Class

A custom Dataset class must implement three functions: __init__, __len__, and __getitem__. Take a look at this implementation; the Movie Poster images are stored in the directory img_folder, and their labels are stored separately in a CSV file.

class ImageDataset(Dataset):
  def __init__(self, csv, img_folder, transform):
    self.csv = csv
    self.transform = transform
    self.img_folder = img_folder

    # File names come from the Id column; the label matrix is everything
    # except the Id and Genre columns (the 25 binary genre columns).
    self.image_names = self.csv['Id']
    self.labels = np.array(self.csv.drop(['Id', 'Genre'], axis=1))

  # The __len__ function returns the number of samples in our dataset.
  def __len__(self):
    return len(self.image_names)

  def __getitem__(self, index):
    # Read the image from disk and convert OpenCV's BGR layout to RGB.
    image = cv2.imread(self.img_folder + self.image_names.iloc[index] + '.jpg')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    image = self.transform(image)
    targets = self.labels[index]

    sample = {'image': image, 'labels': targets}

    return sample

__init__

The __init__ function is run once when instantiating the Dataset object. We initialize the directory containing the images, the CSV file, and transforms.

__getitem__

The __getitem__ function loads and returns a sample from the dataset at the given index. Based on the index, it identifies the image’s location on disk, reads it with OpenCV, converts it from BGR to RGB, applies the transform to the image (which turns it into a tensor), retrieves the corresponding labels, and returns the tensor image and labels in a dictionary.

Transforms

Most neural networks expect images of a fixed size. Therefore, we will need to write some preprocessing code. Let’s create transforms:

train_transform = transforms.Compose([
                transforms.ToPILImage(),
                transforms.Resize((200, 200)),
                transforms.RandomHorizontalFlip(p=0.5),
                transforms.RandomRotation(degrees=45),
                transforms.ToTensor()])

test_transform =transforms.Compose([
                transforms.ToPILImage(),
                transforms.Resize((200, 200)),
                transforms.ToTensor()])
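
As a quick check, the train transform turns any H×W×3 uint8 array into a 3×200×200 float tensor with values in [0, 1] (the input size below is arbitrary):

dummy = np.random.randint(0, 256, size=(300, 180, 3), dtype=np.uint8)
out = train_transform(dummy)
print(out.shape, out.dtype)  # torch.Size([3, 200, 200]) torch.float32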

# img_folder is the directory containing the poster images; adjust this path
# to wherever you extracted the dataset. Note the trailing slash, since the
# Dataset concatenates folder + file name + '.jpg'.
img_folder = "/content/Multi_Label_dataset/Images/"

train_dataset = ImageDataset(train_set, img_folder, train_transform)
test_dataset = ImageDataset(test_set, img_folder, test_transform)
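
Indexing the Dataset calls __getitem__ and returns a single sample:

sample = train_dataset[0]
print(sample['image'].shape)  # torch.Size([3, 200, 200])
print(sample['labels'])       # the 25-element 0/1 genre vector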

DataLoaders

You can retrieve one sample at a time from the dataset. While training a model, however, we typically want to pass samples in “mini-batches”, reshuffle the data at every epoch to reduce overfitting, and use Python’s multiprocessing to speed up data retrieval. DataLoader is an iterable that abstracts this complexity for us behind an easy API.

BATCH_SIZE = 32

train_dataloader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=4,
    shuffle=True
)
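
To take advantage of multiprocessing, pass num_workers when creating the loader; the value below is only an example, so tune it to your machine:

train_dataloader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2  # two worker processes load batches in parallel
)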

Iterate DataLoader

We have loaded the dataset into the DataLoader and can iterate through it as needed. Each iteration below returns a dictionary holding a batch of BATCH_SIZE=32 images and their corresponding labels. Because we specified shuffle=True, the data is reshuffled after we iterate over all batches.

sample = next(iter(train_dataloader))
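
We can confirm the batch structure by printing the tensor shapes: 32 RGB images of 200×200, and a 32×25 label matrix.

print(sample['image'].shape)   # torch.Size([32, 3, 200, 200])
print(sample['labels'].shape)  # torch.Size([32, 25])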

Visualize Images

Let’s visualize a few training images in order to understand the data augmentations. We use matplotlib to display some samples from our training data.

def imshow(inp, title=None):
    """imshow for Tensor."""
    # Convert from CxHxW tensor layout to HxWxC for matplotlib.
    inp = inp.numpy().transpose((1, 2, 0))
    inp = np.clip(inp, 0, 1)
    plt.imshow(inp)
    if title is not None:
        plt.title(title)


# Get a batch of training data
images = next(iter(train_dataloader))

# Make a grid from the batch
output = torchvision.utils.make_grid(images['image'])

imshow(output)
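
To see which genres the first poster in the batch carries, we can map its 0/1 label vector back to the genre column names (the drop() call mirrors the one in the Dataset class):

genre_cols = df.drop(['Id', 'Genre'], axis=1).columns
first = images['labels'][0]
print([genre for genre, flag in zip(genre_cols, first) if flag == 1])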


You can run this code in Google Colab.