PyTorch 2.0 is the next-generation 2-series release of PyTorch. It fundamentally changes and supercharges how PyTorch operates at the compiler level under the hood, delivering faster performance and support for dynamic shapes and distributed training, while keeping first-class Python integration, the imperative style, and the simplicity of the API.

In this tutorial, we will use the torchdata library. It provides common modular data loading primitives for easily constructing flexible and performant data pipelines.


In this tutorial, we will use the Butterfly Image Classification Dataset from Kaggle. It has 75 different classes of butterflies and contains more than 1,000 labeled images, including the validation images. Each image belongs to only one butterfly category. The label of each image is stored in Training_set.csv.
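For reference, the labels can be loaded into a Pandas DataFrame named butterfly_list, the name used in the code below. A minimal sketch, using a tiny inline stand-in for Training_set.csv (the real tutorial would call pd.read_csv instead):

```python
import pandas as pd

# Stand-in for Training_set.csv: a `filename` and a `label` column,
# matching the columns unpacked by the rest of this tutorial.
# In practice this would be: butterfly_list = pd.read_csv("Training_set.csv")
butterfly_list = pd.DataFrame(
    {
        "filename": ["Image_1.jpg", "Image_2.jpg", "Image_3.jpg"],
        "label": ["MONARCH", "SWALLOWTAIL", "MONARCH"],
    }
)
print(butterfly_list.shape)  # (3, 2)
```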


from sklearn.preprocessing import LabelEncoder

# Encode the string labels into integer targets.
le = LabelEncoder()
butterfly_list['target'] = le.fit_transform(butterfly_list['label'])

PyTorch Load Custom Data using Datapipe

Load Data Using Datapipe

In this tutorial, we will build a DataPipe graph and load the data via DataLoader2, which supports different backend systems (ReadingService). In this section, we will demonstrate how you can use a DataPipe with DataLoader2. For the most part, you should be able to use it just by passing the datapipe as the first argument to DataLoader2.

We will build our DataPipes to load the Pandas DataFrame using the IterableWrapper DataPipe.

from PIL import Image

def load_image(sample):
    # Each sample is a row of the DataFrame: (filename, label, target).
    filename, label, target = sample
    img = Image.open(filename)
    return {'image': img, 'label': label, 'target': target}

from torchdata.datapipes.iter import IterableWrapper

# Keep only RGB images (drop grayscale or other modes).
def filterRGB(img):
    return img.mode == 'RGB'

datapipe = IterableWrapper(butterfly_list.values)
datapipe = datapipe.shuffle()
datapipe = datapipe.map(load_image)                       # decode each image file
datapipe = datapipe.filter(filterRGB, input_col='image')  # filter on the 'image' key

DataPipes can be invoked using their functional forms. You can find the full list of built-in IterDataPipes here and MapDataPipes here.
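To illustrate the functional forms, here is a minimal sketch using the DataPipes bundled with torch.utils.data (torchdata's behave the same way): every registered DataPipe is available both as a class and as a method on an existing DataPipe.

```python
from torch.utils.data.datapipes.iter import IterableWrapper, Mapper

dp = IterableWrapper([1, 2, 3])

# Functional form: Mapper is registered under the name `map`.
functional = dp.map(lambda x: x * 10)

# Equivalent class form.
class_form = Mapper(dp, lambda x: x * 10)

print(list(functional))  # [10, 20, 30]
print(list(class_form))  # [10, 20, 30]
```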

Split Datapipe using RandomSplitter

RandomSplitter (functional name: random_split) randomly splits samples from the source DataPipe into groups (such as train, valid, and test). By default, multiple iterations of this DataPipe will yield the same split, for consistency across epochs.

train, valid = datapipe.random_split(total_length=sample_size, weights={"train": 0.8, "valid": 0.2}, seed=45)

weights determines how many output DataPipes there will be and the relative length of each. total_length is the length of the source DataPipe; it is optional, but providing an integer is highly encouraged, because not every IterDataPipe has a len, even ones whose length can easily be known in advance.

You can also specify a target key if you only need a specific group of samples.

train = datapipe.random_split(total_length=sample_size, weights={"train": 0.8, "valid": 0.2}, target='train', seed=45)

from torchvision import transforms

data_transform = transforms.Compose([
    transforms.ToTensor(),  # Normalize expects a tensor, so convert first
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

data_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # placeholder augmentation; the original list was elided
])

train_dp = train.map(data_augment, input_col='image')
train_dp = train_dp.map(data_transform, input_col='image')


Be careful to use the same seed as before when specifying `target` to get the correct split.

Iterate DataLoader2

Lastly, we will pass the DataPipe into DataLoader2. Note that if you use Batcher while also setting batch_size > 1 on the loader, your samples will be batched more than once; you should choose one or the other.

rs = MultiProcessingReadingService(num_workers=1)
dl = DataLoader2(train_dp, reading_service=rs)
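The caution above can be sketched with the DataPipes bundled in torch.utils.data (Batcher's functional form is .batch): batch once in the pipeline, and leave the loader's batch size alone.

```python
from torch.utils.data.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(8))

# Batch inside the pipeline itself (Batcher, functional form `.batch`) ...
batched = dp.batch(batch_size=4)
print(len(list(batched)))  # 2 batches of 4 elements each

# ... and then do NOT also set batch_size > 1 on the loader,
# otherwise each sample would be batched twice.
```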


import numpy as np
import matplotlib.pyplot as plt

figure = plt.figure(figsize=(10, 5))
cols, rows = 4, 2

for i, sample in enumerate(dl):
    if i >= cols * rows:
        break
    label = sample['label']
    img = sample['image']

    # Undo the normalization so the image displays with its true colors.
    inp = img.numpy().transpose((1, 2, 0))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    figure.add_subplot(rows, cols, i + 1)
    plt.title(label)
    plt.axis('off')
    plt.imshow(inp)
plt.show()

Run this code in Google Colab
