Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and NumPy arrays can often share the same underlying memory, eliminating the need to copy data. In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.

In this tutorial, we will discuss three functions used to copy tensors, with examples and the scenarios each one suits.

  • clone
  • detach
  • deepcopy

Clone

.clone() creates a copy of a tensor that keeps the history of ops, so gradients still flow through the copy back to the original. Working on a clone also helps avoid errors with in-place ops. The two main errors with in-place ops are overwriting data needed for the backward pass and applying an in-place op to a leaf tensor that requires grad; autograd raises a RuntimeError in both cases.

import torch

x = torch.ones(5, requires_grad=True)

y = x.clone()  # copy with its own memory, still tracked by autograd

print(x, 'x')
print(y, 'clone of x')

tensor([1., 1., 1., 1., 1.], requires_grad=True) x
tensor([1., 1., 1., 1., 1.], grad_fn=<CloneBackward0>) clone of x

clone() is recognized by autograd: the new tensor gets grad_fn=<CloneBackward0>, and it has the same requires_grad setting as the original tensor.
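The properties described above can be checked directly. The following is a minimal sketch, not taken from the original post; the names same_memory and leaf_error are made up for illustration. It shows that the clone has its own memory, mirrors requires_grad, and accepts an in-place op that the leaf tensor itself rejects.

```python
import torch

x = torch.ones(5, requires_grad=True)
y = x.clone()

# The clone owns separate memory but stays attached to the autograd graph.
same_memory = y.data_ptr() == x.data_ptr()
print(same_memory, y.requires_grad, y.is_leaf)

# An in-place op directly on a leaf that requires grad is rejected by autograd...
try:
    x.mul_(2)
    leaf_error = None
except RuntimeError as err:
    leaf_error = err
print(type(leaf_error).__name__)

# ...while the same op on the clone is allowed and remains differentiable.
y.mul_(2)
y.sum().backward()
print(x.grad)
```

Because y is a non-leaf tensor produced by clone(), modifying it in place leaves x untouched, and the gradient of y.sum() still reaches x.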

x = torch.ones(5, requires_grad=True)

y = x.clone() * 2

z = x.clone() * 3

total = (y + z).sum()  # named total to avoid shadowing the built-in sum()

total.backward()

print(x.grad)  # tensor([5., 5., 5., 5., 5.])

Note that although y and z are computed from clones of x, calling backward() propagates the gradients to the original tensor. Since y is x*2 and z is x*3, the summed result is x*2 + x*3 added up over all elements. The derivative with respect to each element of x is therefore 2*1 + 3*1 = 5, so x.grad is a vector of 5 elements, each with the value 5.
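As a worked version of the arithmetic above, written elementwise (the symbol s for the scalar result is introduced here only for illustration):

```latex
s = \sum_{i=1}^{5} \left( 2x_i + 3x_i \right) = \sum_{i=1}^{5} 5x_i
\qquad\Longrightarrow\qquad
\frac{\partial s}{\partial x_i} = 2 + 3 = 5 \quad \text{for every } i .
```

For autograd, clone() acts as an identity operation, so it contributes a factor of 1 to the chain rule and does not change this result.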

print(y.grad)

The .grad attribute of a Tensor that is not a leaf Tensor is being accessed.

We get no value (None, along with the warning above) when we try to read the gradient of y. This is because y is computed from x by the clone operation, so it is not a leaf tensor, and autograd does not store .grad for non-leaf tensors by default. Backpropagation still passes the gradient through y to x, so x gets the grad.
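If the gradient of a non-leaf tensor such as y is actually needed, one option is to call retain_grad() on it before the backward pass. This is a small sketch of that option, not something the original example does:

```python
import torch

x = torch.ones(5, requires_grad=True)
y = x.clone() * 2
y.retain_grad()      # ask autograd to keep .grad for this non-leaf tensor

y.sum().backward()

print(y.is_leaf)     # False: y was produced by an operation
print(y.grad)        # gradient of the sum with respect to y
print(x.grad)        # gradient propagated back to the leaf x
```

Without the retain_grad() call, y.grad would stay None while x.grad would be filled in exactly the same way.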

If you want a tensor that is cut off from the computation graph, use detach(); combine it with clone() if you also need an independent copy of the data.

detach()

tensor.detach() creates a tensor that shares storage with the original tensor but does not require grad. Use detach() when you want to remove a tensor from a computation graph, and clone() when you want to copy a tensor while keeping the copy as part of the computation graph it came from.
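To make the shared storage concrete, here is a short sketch (the variable names d and shares_memory are illustrative assumptions) that compares memory addresses and then writes through the detached tensor:

```python
import torch

x = torch.ones(5, requires_grad=True)
d = x.detach()

# Same underlying memory, but no autograd tracking on the detached tensor.
shares_memory = d.data_ptr() == x.data_ptr()
print(shares_memory, d.requires_grad, d.grad_fn)

# Writing through the detached tensor is visible in the original as well.
d[0] = 10.0
print(x)
```

This is why detach() on its own is not a copy: any in-place change to d is also a change to x.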

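x = torch.ones(5, requires_grad=True)

y = x * 2

z = x.detach() * 3  # computed from the detached tensor, so not tracked

total = (y + z).sum()  # named total to avoid shadowing the built-in sum()

total.backward()

print(x.grad)  # tensor([2., 2., 2., 2., 2.])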

Here y is x*2 and z is x*3 in value, so the summed result is x*2 + x*3. But because z is calculated from the detached x, autograd treats it as a constant and does not include it when calculating the gradient of x. The derivative with respect to each element of x is therefore 2*1 = 2, so x.grad is a vector of 5 elements, each with the value 2.

In-place operation

a = torch.ones(5, requires_grad=True)

b = a**2
c = a.detach()
c.zero_()

b.sum().backward()

print(a.grad)

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:

Autograd detects that ‘a’ was changed in place, which breaks the gradient calculation: the backward pass of a**2 needs the original values of a. This happens because .detach() doesn’t create a copy of the tensor, so modifying the detached tensor also modifies the tensor on the upstream side of .detach(). Cloning the detached tensor before modifying it avoids the issue.
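A minimal sketch of the fix mentioned above, assuming the same setup as the failing example: cloning the detached tensor gives c its own memory, so zeroing it no longer touches a.

```python
import torch

a = torch.ones(5, requires_grad=True)

b = a ** 2
c = a.detach().clone()   # independent memory, no link to the graph
c.zero_()                # safe: only the copy is modified

b.sum().backward()

print(c)        # the copy is all zeros
print(a.grad)   # d(a**2)/da = 2*a, evaluated at a = 1
```

Here the backward pass succeeds because the values of a saved for the gradient of a**2 are never overwritten.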

clone().detach() vs detach().clone()

Both give an equivalent end result. The minor optimization of calling detach() first is that the clone operation isn’t tracked by autograd. If you clone first, autograd info is created for the clone and then discarded after the detach because it is no longer reachable. So the end result is the same, but clone().detach() does a bit of extra, useless work.

.clone().detach()

If we clone and then detach, we get a new tensor with its own memory, and we’ve blocked the gradient flow to the earlier graph.

.detach().clone()

Here we first create a tensor that shares the same memory but drops the old gradient flow, and then clone it. The clone has new memory of its own, and since it’s a copy of the detached tensor, it still has no gradient flow to the earlier part of the graph.
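The equivalence of the two orderings can be checked with a short sketch like the one below (the names a and b are illustrative): both results hold the same values, live in fresh memory, and carry no autograd history.

```python
import torch

x = torch.ones(5, requires_grad=True)

a = x.clone().detach()   # clone is tracked first, then the history is dropped
b = x.detach().clone()   # history is dropped first, so the clone is never tracked

for name, t in (("clone().detach()", a), ("detach().clone()", b)):
    has_own_memory = t.data_ptr() != x.data_ptr()
    print(name, t.requires_grad, t.grad_fn, has_own_memory)

print(torch.equal(a, b))
```

Since the results match, detach().clone() is usually the one to prefer, because it skips the bookkeeping for the clone.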

deepcopy

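copy.deepcopy() creates a new tensor instance with its own memory and its own metadata (such as requires_grad), none of which is shared with the original. Semantically, a true deep copy would also need a deep replication of the autograd history, which would be expensive. In practice, at the time of writing PyTorch supports deepcopy only for leaf tensors (tensors created directly by the user), which have no history to copy; calling it on a non-leaf tensor raises a RuntimeError.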

import copy

x=torch.ones(5,requires_grad=True)

x_deepcopy = copy.deepcopy(x)

print(x,'x')
print(x_deepcopy,'x deepcopy')

#tensor([1., 1., 1., 1., 1.], requires_grad=True) x
#tensor([1., 1., 1., 1., 1.], requires_grad=True) x deepcopy
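The sketch below, which is an illustration rather than part of the original example, checks that the deep copy is independent of x and shows what happens for a non-leaf tensor. The non-leaf behavior is what the PyTorch versions known to us do and may change; nonleaf_error is a made-up name.

```python
import copy
import torch

x = torch.ones(5, requires_grad=True)
x_deepcopy = copy.deepcopy(x)

# Modify only the copy; no_grad() permits the in-place op on a leaf tensor.
with torch.no_grad():
    x_deepcopy.add_(1)

print(x)            # unchanged
print(x_deepcopy)   # every element increased by 1

# A non-leaf tensor (one with autograd history) is rejected by deepcopy.
y = x * 2
try:
    copy.deepcopy(y)
    nonleaf_error = None
except RuntimeError as err:
    nonleaf_error = err
print(type(nonleaf_error).__name__)
```

For a non-leaf tensor, detach().clone() is the usual way to obtain an independent copy of the values.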

Conclusion

Use clone() when we want a copy of a tensor and want operations on the copy to propagate gradients back to the original tensor. Use detach() when we don’t want a tensor to be included in the resulting computational graph.

Module has no clone method, so you can either use copy.deepcopy or create a new instance of the model and copy the parameters over, as proposed in the post “Copy PyTorch Model using deepcopy() and state_dict()”.
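Both options can be sketched as follows. The tiny nn.Linear model and the names model_copy and model_new are assumptions made for illustration; the second option assumes you can construct a fresh instance with the same architecture.

```python
import copy
import torch
from torch import nn

model = nn.Linear(3, 2)

# Option 1: deep-copy the whole module, parameters included.
model_copy = copy.deepcopy(model)

# Option 2: build a new instance and copy the parameter values into it.
model_new = nn.Linear(3, 2)
model_new.load_state_dict(model.state_dict())

# Changing the original afterwards does not affect either copy.
with torch.no_grad():
    model.weight.zero_()

print(torch.equal(model_copy.weight, model_new.weight))
print(model_copy.weight.data_ptr() != model.weight.data_ptr())
```

deepcopy is the shorter route, while load_state_dict() is handy when the new instance is created elsewhere, for example after loading saved weights.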