In a real-world setting, the input data may be stored remotely for example, on Google Cloud Storage or HDFS. A dataset pipeline that works well when reading data locally might become bottlenecked on I/O when reading data remotely because reading the first byte of a file from remote storage can take orders of magnitude longer than from local storage.

In addition, once the raw bytes are loaded into memory, it may also be necessary to deserialize, which requires additional computation. This overhead is present irrespective of whether the data is stored locally or remotely but can be worse in the remote case if data is not prefetched effectively.

GPUs and TPUs can radically reduce the time required to execute a single training batch. Achieving peak performance requires an efficient input pipeline that delivers data for the next batch before the current batch has finished. The PyTorch DataLoader helps to build flexible and efficient input pipelines. 

When preparing data, input elements may need to be pre-processed. To this end, the PyTorch Data API offers the transforms to transformation to each element of the input dataset. Because input elements are independent of one another, the pre-processing can be parallelized across multiple CPU cores. To make this possible.

While your pipeline is fetching the data, your model is sitting idle. Conversely, while your model is training, the input pipeline is sitting idle. The training step time is thus the sum of opening, reading, and training times. This tutorial demonstrates how to use the num_workers parameter of DataLoader API to build multi-process data loading input pipelines.

While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data.

Python Global Interpreter Lock (GIL) prevents true fully parallelizing Python code across threads. To avoid blocking computation code with data loading, PyTorch provides an easy switch to perform multi-process data loading by simply setting the argument num_workers to a positive integer.

Single-process data loading 

In this mode, data fetching is done in the same process a DataLoader is initialized. Therefore, data loading may block computing. However, this mode may be preferred when a resource(s) used for sharing data among processes are limited, or when the entire dataset is small and can be loaded entirely in memory.

DataLoader(ds, num_workers=0)

Single-process loading often shows more readable error traces and thus is useful for debugging.

Multi-process data loading

Setting the argument num_workers as a positive integer will turn on multi-process data loading with the specified number of loader worker processes. In this mode, each time an iterator of a DataLoader is created or when you call enumerate(dataloader), num_workers worker processes are created.

DataLoader(ds, num_workers=2)

At this point, the dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize, and fetch data. This means that dataset access together with its internal IO, transforms (including collate_fn) runs in the worker process.

If you set num_workers = 1, there’s only 1 parallel process running to generate the data, which might’ve caused your GPU to sit idle for the period your data gets available by that parallel process. As you increase the no. of workers (processes), more processes are now running in parallel to fetch data in batches essentially using CPU’s multiple cores to generate the data.

DataLoader doesn’t just randomly return from what’s available in RAM right now, it uses a batch sampler to decide which batch to return next. Each batch is assigned to a worker, and the main process will wait until the desired batch is retrieved by the assigned worker.

Choosing the best value for the num_workers argument depends on your hardware, characteristics of your training data (such as its size and shape), the cost of your transform function, and what other processing is happening on the CPU at the same time. A simple heuristic is to use the number of available CPU cores.

Memory Usage

You probably noticed memory usage getting out of control when training a Pytorch model. Because Dataloaders utilize multiprocessing under the hood, which leads to unfortunate interactions with shared memory across iterations of Dataloader __getitem__ calls.

Storing arbitrary Python objects in shared memory triggers copy-on-write behavior, due to the addition of reference counting every time something reads from these objects.

If your Dataloaders iterate across a list of filenames, the references to that list add up over time, occupying memory. This is not a memory leak, but a copy-on-access problem of forked python processes due to changing refcounts.

It isn’t a Pytorch issue either but simply is due to how Python is structured. Iterating across a huge list leads to memory increasing over time while iterating across a NumPy array doesn’t. Each element in a list has its own refcount, while a NumPy array only has a single refcount. There’s an exception; NumPy arrays with type objects behave like lists in this context. The simplest workaround is to replace native Python objects (dicts, lists) with array objects that only have one refcount (pandas, numpy, pyarrow).