DataPipe and DataLoader2 provide flexible, reusable building blocks that let you create a data pipeline from scratch quickly, without rewriting commonly used code.
In this tutorial, we’ll cover some major components of the TorchData library and then present a demo that shows how DataPipe and DataLoader2 work together. We will start by loading data with the built-in DataPipes provided by TorchData.
Alongside PyTorch 2.0, TorchData aims to improve the data loading API and experience: the existing DataLoader implementation is overloaded with too many parameters and options in one place.
Dataset
The flowers dataset consists of labeled images of flowers. Each example contains a JPEG flower image and a class label, the type of flower, with 5 possible classes. We first download and extract the archive; a few images are displayed together with their labels at the end of this tutorial.
!wget http://download.tensorflow.org/example_images/flower_photos.tgz
!tar zxvf /content/flower_photos.tgz
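As a quick sanity check after extraction, the sketch below counts the `.jpg` files in each class folder. It assumes the archive unpacks into one subfolder per class; the helper name `count_images_per_class` is hypothetical and is not part of TorchData.

```python
from collections import Counter
from pathlib import Path


def count_images_per_class(root):
    """Count .jpg files in each immediate subfolder of `root`.

    Hypothetical helper: it assumes one subfolder per class label.
    """
    counts = Counter()
    for path in Path(root).glob("*/*.jpg"):
        # The parent folder name is treated as the class label.
        counts[path.parent.name] += 1
    return dict(counts)
```

Calling `count_images_per_class('/content/flower_photos')` after the extraction step should list one entry per class folder, which is an easy way to confirm the download completed before building the pipeline.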
Create DataPipe
We’ll take a look at some major components of the library starting with DataPipe. The DataPipe can be described as a series of steps that your program executes in order to prepare your samples for training or inference. These steps can be seen as a graph.
TorchData introduces composable, reusable building blocks called DataPipes. Here’s an example of what a data pipeline looks like using built-in DataPipes.
import os
import PIL
import numpy as np
from PIL import Image
from io import BytesIO
import torch
import torchvision.transforms as transforms
from torchdata.datapipes.iter import StreamReader, FileOpener, FileLister
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
import matplotlib.pyplot as plt

print(torch.__version__)

# Map class names to integer ids and back.
cat_to_id = {'daisy': 0, 'dandelion': 1, 'roses': 2, 'sunflowers': 3, 'tulips': 4}
id_to_cat = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
data_transform = transforms.Compose(
    [
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)
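def file_was_jpg(filename):
    return filename.endswith(".jpg")


def load_image(sample):
    # Each sample is a (path, raw bytes) pair produced by StreamReader.
    path, stream = sample
    # The parent folder name is the class label.
    label = path.split("/")[-2]
    label = cat_to_id[label]
    # Force 3 channels so Normalize always receives an RGB image.
    img = Image.open(BytesIO(stream)).convert("RGB")
    img = data_transform(img)
    return img, label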
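FOLDER = '/content/flower_photos'

# List all files recursively and keep only the JPEGs.
datapipe = FileLister([FOLDER], recursive=True).filter(filter_fn=file_was_jpg)
# Open each file in binary mode and read its full contents as bytes.
datapipe = FileOpener(datapipe, mode='b')
datapipe = StreamReader(datapipe)
# Decode, transform, and label each image.
datapipe = datapipe.map(fn=load_image)
# Uncomment to shuffle samples before batching.
# datapipe = datapipe.shuffle()
datapipe = datapipe.batch(8)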
You start by listing the files on your file system and filtering them to keep only the JPEGs. Each DataPipe then performs one small transformation: opening a file, reading its stream, mapping a function over each sample, shuffling (commented out in the code above), and batching. These composable and reusable building blocks simplify the process of building new data pipelines.
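To make the composition concrete, here is a minimal plain-Python analogy of the same chain (list, filter, map, batch) built from generators. It is not how TorchData implements DataPipes; the function names `filter_stage`, `map_stage`, `batch_stage`, and `path_to_label`, as well as the sample paths, are hypothetical and chosen only for illustration.

```python
from itertools import islice

CAT_TO_ID = {"daisy": 0, "dandelion": 1, "roses": 2, "sunflowers": 3, "tulips": 4}


def filter_stage(source, predicate):
    """Lazily yield only the items for which predicate(item) is true."""
    return (item for item in source if predicate(item))


def map_stage(source, fn):
    """Lazily apply fn to every item."""
    return (fn(item) for item in source)


def batch_stage(source, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    iterator = iter(source)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def path_to_label(path):
    """Derive the class id from the parent folder name, as load_image does."""
    return path, CAT_TO_ID[path.split("/")[-2]]


# Hypothetical file listing standing in for the output of FileLister.
paths = [
    "flower_photos/daisy/a.jpg",
    "flower_photos/LICENSE.txt",
    "flower_photos/roses/b.jpg",
    "flower_photos/tulips/c.jpg",
]

pipeline = filter_stage(paths, lambda p: p.endswith(".jpg"))
pipeline = map_stage(pipeline, path_to_label)
pipeline = batch_stage(pipeline, batch_size=2)

batches = list(pipeline)
```

Each stage wraps the previous one and does no work until the result is iterated. The non-image file is dropped by the filter before the mapping step ever sees it, which is why the order of the stages matters.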
Create DataLoader2
Now that we have a graph of DataPipes, the next question is how to load data from it in a script. DataLoader2 is a new front-end API for data loading. It provides most of the user-facing functionality, such as iterating over the graph, yielding data, and resetting random states.
The first argument of DataLoader2 is a DataPipe, that is, the DataPipe graph containing the different data transformations.
rs = MultiProcessingReadingService(num_workers=1)
dl = DataLoader2(datapipe, reading_service=rs)
Besides the DataPipe, DataLoader2 takes a reading service, passed above as reading_service, which lets DataLoader2 modify and execute the DataPipe graph to suit a specific use case (here, loading with worker processes). It also takes an optional adapter (the datapipe_adapter_fn argument) that adjusts the configuration of the DataPipe graph before it runs.
Visualize Images from DataLoader2
After you have this implementation you can quickly do a sanity check to make sure the result is correct. Let’s visualize a few training images so as to understand the data.
dp = next(iter(dl))  # first batch: a list of 8 (image, label) tuples
figure = plt.figure(figsize=(8, 4))
cols, rows = 4, 2
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
for i in range(cols * rows):
    img, label = dp[i]
    # Convert the CHW tensor to an HWC array and undo the normalization.
    inp = img.numpy().transpose((1, 2, 0))
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    figure.add_subplot(rows, cols, i + 1)
    plt.title(id_to_cat[label])
    plt.axis("off")
    plt.imshow(inp)
plt.show()
Each batch is a plain list of (image, label) tuples, so we can index it like a list: dp[i]. We use matplotlib to visualize some samples from our training data.
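The `std * inp + mean` step in the plotting code undoes `transforms.Normalize`, which computes `(x - mean) / std` per channel. Here is a minimal worked check in plain Python for a single pixel; the pixel values and the helper names `normalize` and `denormalize` are hypothetical.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize(pixel):
    """Per-channel (x - mean) / std, the formula Normalize applies."""
    return [(x - m) / s for x, m, s in zip(pixel, MEAN, STD)]


def denormalize(pixel):
    """Per-channel x * std + mean, the inverse used before plotting."""
    return [x * s + m for x, m, s in zip(pixel, MEAN, STD)]


pixel = [0.2, 0.5, 0.9]  # hypothetical RGB values in [0, 1]
recovered = denormalize(normalize(pixel))
```

Up to floating-point rounding, `recovered` matches `pixel`. The `np.clip(inp, 0, 1)` call in the plotting code handles that rounding, since it can push values slightly outside the displayable range.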