We should not only compare the GPU execution time relative to the CPU execution time to decide whether to run on GPU or CPU. We also need to consider the cost of moving data across the PCI-e bus, especially when we are initially porting code to CUDA. Because CUDA’s heterogeneous programming model uses both the CPU and GPU, code can be ported to CUDA one kernel at a time. In the initial stages of porting, data transfers may dominate the overall execution time. It’s worthwhile to keep tabs on time spent on data transfers separately from time spent in kernel execution.

The peak bandwidth between the device memory and the GPU is much higher than the peak bandwidth between the host memory and device memory. This disparity means that your implementation of data transfers between the host and GPU devices can make or break your overall PyTorch model training performance. You can achieve higher bandwidth between the host and the device using pinned memory.

Pinned Host Memory

CPU data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory.

pinned memory PyTorch

Pinned memory is used as a staging area for transfers from the device to the host. We can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory.

CPU to GPU copies is much faster when they originate from pinned memory. CPU tensors and storages expose a pin_memory() method, that returns a copy of the object, with data put in a pinned region.

pin_memory=True

For data loading, passing pin_memory=True to a DataLoader will automatically put the fetched data Tensors in pinned memory, and enables faster data transfer to CUDA-enabled GPUs.

trainloader=DataLoader(data_set,batch_size=32,shuffle=True,num_workers=2,pin_memory=True)

You can make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.

The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).

class SimpleCustomBatch:
    def __init__(self, data):
        transposed_data = list(zip(*data))
        self.inp = torch.stack(transposed_data[0], 0)
        self.tgt = torch.stack(transposed_data[1], 0)

    # custom memory pinning method on custom type
    def pin_memory(self):
        self.inp = self.inp.pin_memory()
        self.tgt = self.tgt.pin_memory()
        return self

def collate_wrapper(batch):
    return SimpleCustomBatch(batch)

inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
dataset = TensorDataset(inps, tgts)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                    pin_memory=True)

for batch_ndx, sample in enumerate(loader):
    print(sample.inp.is_pinned())
    print(sample.tgt.is_pinned())

set non_blocking to True

Once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.

for data in tqdm(trainloader):
    
    inputs=data[0].to(device, non_blocking=True)
    labels=data[1].to(device, non_blocking=True)
    
     # The following code will be called asynchronously,
    # such that the kernel will be launched and returns control 
    # to the CPU thread before the kernel has actually begun executing
    # has to wait for data to be pushed onto device (synch point)
    
    outputs=model(inputs)
    
    loss=loss_fn(outputs,labels)
    
    #Replaces pow(2.0) with abs() for L1 regularization
    
    l2_lambda = 0.001
    l2_norm = sum(p.pow(2.0).sum()
                  for p in model.parameters())

    loss = loss + l2_lambda * l2_norm
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    running_loss += loss.item()
    
    _, predicted = outputs.max(1)
    total += labels.size(0)
    correct += predicted.eq(labels).sum().item()