DataPipe and DataLoader2 provide flexible, reusable building blocks that let you assemble a data pipeline quickly, without rewriting commonly used code.

In this tutorial, we’ll talk about the major components of the TorchData library, then walk through a demo that shows how DataPipe and DataLoader2 work together. We start by loading data with the built-in DataPipes that TorchData provides.

PyTorch 2.0 improves the API and the data loading experience. The existing DataLoader implementation is overloaded with too many parameters and options in a single place.

Dataset

The flowers dataset consists of examples which are labeled images of flowers. Each example contains a JPEG flower image and the class label: what type of flower it is. Let’s display a few images together with their labels.

PyTorch DataLoader2-DataPipe

The images fall into 5 possible class labels: daisy, dandelion, roses, sunflowers, and tulips. Download and extract the archive:

!wget http://download.tensorflow.org/example_images/flower_photos.tgz
!tar zxvf /content/flower_photos.tgz

Create DataPipe

We’ll take a look at some major components of the library starting with DataPipe. The DataPipe can be described as a series of steps that your program executes in order to prepare your samples for training or inference. These steps can be seen as a graph.
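The idea of small steps chained into a pipeline can be sketched without TorchData at all. The snippet below is only a conceptual illustration using plain Python generators; the helper names `keep_if`, `apply`, and `in_batches` and the file names are hypothetical and are not part of the TorchData API.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def keep_if(source: Iterable[T], predicate: Callable[[T], bool]) -> Iterator[T]:
    """Yield only the items for which predicate(item) is true."""
    return (item for item in source if predicate(item))


def apply(source: Iterable[T], fn: Callable[[T], U]) -> Iterator[U]:
    """Yield fn(item) for every item, one at a time."""
    return (fn(item) for item in source)


def in_batches(source: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group consecutive items into lists of at most `size` elements."""
    iterator = iter(source)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical file names standing in for a directory listing.
files = ["roses/1.jpg", "roses/notes.txt", "tulips/2.jpg", "daisy/3.jpg"]

# Each step wraps the previous one, so nothing runs until we iterate.
steps = keep_if(files, lambda name: name.endswith(".jpg"))
steps = apply(steps, lambda name: name.split("/")[0])  # keep the folder name
steps = in_batches(steps, size=2)

batches = list(steps)
print(batches)
```

Each helper wraps the previous step and does one small job, which is the same shape as the filter, map, and batch steps of a DataPipe graph.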

TorchData introduces these composable, reusable building blocks under the name DataPipe. Here’s an example of a data pipeline built from built-in DataPipes.

import os
import PIL
import numpy as np

from PIL import Image
from io import BytesIO

import torch

import torchvision.transforms as transforms
from torchdata.datapipes.iter import StreamReader,FileOpener,FileLister

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

import matplotlib.pyplot as plt

print(torch.__version__)

cat_to_id={'daisy':0,'dandelion':1,'roses':2,'sunflowers':3,'tulips':4}
id_to_cat=['daisy','dandelion','roses','sunflowers','tulips']
data_transform = transforms.Compose(
    [
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)


def file_was_jpg(filename):
    return filename.endswith(".jpg")

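def load_image(sample):
    # StreamReader yields (path, raw bytes) pairs.
    path, data = sample
    # The parent directory name is the class label, e.g. ".../roses/x.jpg".
    label = cat_to_id[path.split("/")[-2]]
    # Decode the JPEG bytes; convert to RGB so Normalize always sees 3 channels.
    img = Image.open(BytesIO(data)).convert("RGB")
    img = data_transform(img)
    return img, label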

FOLDER = '/content/flower_photos'
datapipe = FileLister([FOLDER],recursive=True).filter(filter_fn=file_was_jpg)
datapipe = FileOpener(datapipe, mode='b')
datapipe = StreamReader(datapipe)
datapipe = datapipe.map(fn=load_image)
#datapipe = datapipe.shuffle()
datapipe = datapipe.batch(8)

You start by listing the files on your file system and filtering them down to the JPEG images. Each DataPipe then performs one small transformation: opening the file, reading its bytes, mapping the loading function over each sample, optionally shuffling (commented out above), and batching. These composable and reusable building blocks simplify the process of building new data pipelines.
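In the pipeline above, the mapping step takes the label from the name of the folder that contains each image. Here is a minimal sketch of that step in isolation, assuming POSIX-style paths laid out as `<root>/<class_name>/<file>.jpg`. The helper `label_from_path` and the example file name are hypothetical; the sketch uses `pathlib` rather than `split("/")` and fails loudly on an unknown folder.

```python
from pathlib import PurePosixPath

# Same mapping as in the tutorial code.
cat_to_id = {"daisy": 0, "dandelion": 1, "roses": 2, "sunflowers": 3, "tulips": 4}


def label_from_path(path: str) -> int:
    """Return the class id encoded by the parent directory of `path`."""
    class_name = PurePosixPath(path).parent.name
    if class_name not in cat_to_id:
        raise ValueError(f"unexpected class directory {class_name!r} in {path!r}")
    return cat_to_id[class_name]


# Hypothetical file name; only the directory layout matters here.
example_label = label_from_path("/content/flower_photos/roses/example.jpg")
print(example_label)
```

Raising on an unknown directory makes a misplaced file show up as an explicit error instead of a bare `KeyError` deep inside the pipeline.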

Create DataLoader2

Now that we have the concept of a graph of DataPipes, the next question is how to load data from it in your script. DataLoader2 is the new front-end API for data loading. It provides most of the user-facing functionality, such as iterating over the graph, yielding data, and resetting random states.

The first argument of DataLoader2 is the DataPipe, that is, the DataPipe graph with its different data transformations.

rs = MultiProcessingReadingService(num_workers=1)
dl = DataLoader2(datapipe, reading_service=rs)

DataLoader2 also takes a reading service, passed here as reading_service, which lets it modify and execute the DataPipe graph for a specific use case such as multiprocessing. An optional adapter function (the datapipe_adapter_fn parameter) can adjust the configuration of the graph before it runs, for example to turn shuffling on or off.

Visualize Images from DataLoader2

Once the pipeline is in place, a quick sanity check confirms that the result is correct. Let’s visualize a few training images to understand the data.

dp=next(iter(dl))

figure = plt.figure(figsize=(8, 4))
cols, rows = 4, 2

for i in range(0, cols * rows):
    
    img=dp[i][0]
    label=dp[i][1]
    
    inp = img.numpy().transpose((1, 2, 0))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    
    figure.add_subplot(rows, cols, i+1)
    
    plt.title(id_to_cat[label])
    plt.axis("off")
    plt.imshow(inp)
plt.show()

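Each batch returned by DataLoader2 here is a Python list of (image, label) pairs, so we can index it like a list: dp[index]. We use matplotlib to display some of the samples, undoing the normalization first so that the colors look natural.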
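The plotting loop multiplies by std and adds mean because Normalize computes (value - mean) / std per channel, so the inverse is value * std + mean. A small worked sketch for a single RGB pixel in plain Python follows; the pixel values and helper names are illustrative assumptions.

```python
from typing import List

# Per-channel statistics used by transforms.Normalize in the tutorial.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize(pixel: List[float]) -> List[float]:
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(pixel, MEAN, STD)]


def denormalize(pixel: List[float]) -> List[float]:
    """Invert normalize: value * std + mean, then clip to [0, 1]."""
    restored = [v * s + m for v, m, s in zip(pixel, MEAN, STD)]
    return [min(max(v, 0.0), 1.0) for v in restored]


# Illustrative pixel with channel values in [0, 1], as produced by ToTensor.
original = [0.2, 0.5, 0.9]
restored = denormalize(normalize(original))
print(restored)
```

The clip mirrors the `np.clip(inp, 0, 1)` call in the plotting loop, which guards against tiny floating-point overshoots before `imshow`.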
