Tensors are the primary data structure PyTorch uses to store and manipulate numerical data. They appear everywhere in PyTorch: as the inputs to a model, as the weights of its layers, and as its outputs.

PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can automatically provide the chain of derivatives of such operations with respect to their inputs.

A PyTorch tensor can be created with the argument requires_grad. When it is set to True, PyTorch tracks the operations performed on the tensor, and after a backward pass the tensor’s gradient is stored in an attribute called grad.

import torch
params = torch.tensor([1.0, 0.0], requires_grad=True)

The requires_grad=True argument to the tensor constructor tells PyTorch to track the entire family tree of tensors resulting from operations on params. In other words, any tensor that has params as an ancestor will have access to the chain of functions that were called to get from params to that tensor.
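The chain of functions can be inspected directly. The following is a minimal sketch, assuming two arbitrary operations on params (the names scaled and total are made up for illustration); it prints the grad_fn node that autograd attaches to each derived tensor.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)

# Two operations that descend from params (chosen only for illustration).
scaled = params * 3.0
total = scaled.sum()

# A tensor created directly by the user is a "leaf" and has no grad_fn.
print(params.is_leaf, params.grad_fn)

# Derived tensors inherit requires_grad and record the operation that made them.
print(scaled.requires_grad, scaled.grad_fn)
print(total.grad_fn)

# next_functions links each node back towards the tensors it was computed from.
print(total.grad_fn.next_functions)
```

Each grad_fn is one node in the graph that backward() later walks in reverse.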

The value of the derivative is automatically populated as the grad attribute of the params tensor once backward() has been called. In general, all PyTorch tensors have an attribute named grad. Until a backward pass has run, it is None:

print(params.grad)  # None

PyTorch builds the autograd graph as operations are executed. When we call tensor.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients.

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)
f = x**2 + y**2 + z**2
f.backward()  # df/dx = 2x, df/dy = 2y, df/dz = 2z

print(x.grad, y.grad, z.grad, f)

# tensor(4.) tensor(6.) tensor(8.) tensor(29., grad_fn=<AddBackward0>)

The call to backward() computes the partial derivative of the output f with respect to each of the input variables. In the case of neural networks, we can write the network as f(x, θ), where f is the neural network, x is a vector representing the input, and θ denotes the parameters of f.
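To make the f(x, θ) picture concrete, here is a minimal sketch that assumes a single nn.Linear layer as the network and a made-up squared-error loss with zero targets. Neither comes from the text above; they only serve to show that backward() fills in a gradient for every parameter in θ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in for f(x, theta): theta is the layer's weight and bias.
f = nn.Linear(3, 1)
x = torch.randn(4, 3)        # batch of 4 inputs with 3 features each
target = torch.zeros(4, 1)   # made-up targets, for illustration only

loss = ((f(x) - target) ** 2).mean()
loss.backward()

# Each parameter now holds d(loss)/d(parameter), shaped like the parameter.
for name, p in f.named_parameters():
    print(name, tuple(p.shape), tuple(p.grad.shape))

# x was created without requires_grad=True, so no gradient is stored for it.
print(x.grad)
```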


If you want to freeze part of a pretrained VGG16 model in PyTorch and train the rest, set requires_grad to False on the parameters you want to freeze.

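import torchvision

# Newer torchvision releases replace pretrained=True with a weights= argument.
model = torchvision.models.vgg16(pretrained=True)
for param in model.features.parameters():
    param.requires_grad = False  # freeze the convolutional feature extractor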

With the requires_grad flags set to False, autograd saves no intermediate buffers until the computation reaches an operation where at least one of the inputs requires a gradient.
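The effect of freezing can be checked on a much smaller model than VGG16. The sketch below assumes a hypothetical two-layer setup (the names features and classifier are illustrative, not taken from torchvision) and shows where gradient tracking starts.

```python
import torch
import torch.nn as nn

# Hypothetical two-part model: "features" is frozen, "classifier" is trained.
features = nn.Linear(4, 8)
classifier = nn.Linear(8, 2)
for param in features.parameters():
    param.requires_grad = False

x = torch.randn(5, 4)
hidden = features(x)       # nothing here requires a gradient
out = classifier(hidden)   # classifier parameters do, so tracking starts here

print(hidden.requires_grad, out.requires_grad)  # False True

out.sum().backward()
print(features.weight.grad)                # None: frozen, never populated
print(classifier.weight.grad is not None)  # True: trainable, populated
```

Because hidden does not require a gradient, the backward pass stops at the classifier and never touches the frozen layer.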


PyTorch lets us switch off autograd when we don’t need it, using the torch.no_grad() context manager, which prevents the operations inside its block from being tracked for gradients. It is typically used when evaluating a model, where there is no need to call backward() to calculate gradients or to update the parameters.

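x = torch.tensor([1.], requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False

# no_grad also works as a decorator, disabling tracking inside the function
@torch.no_grad()
def doubler(x):
    return x * 2
z = doubler(x)

print(z.requires_grad)  # False

# Without the decorator, the result is tracked as usual
def doubler(x):
    return x * 2
z = doubler(x)

print(z.requires_grad)  # True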

Using the related torch.set_grad_enabled context manager, we can also make code run with autograd enabled or disabled according to a Boolean expression, typically one indicating whether we are running in training or inference mode.
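A minimal sketch of this pattern follows. The helper forward and its is_train flag are hypothetical names introduced here for illustration; only torch.set_grad_enabled itself is part of PyTorch.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)

def forward(x, is_train):
    # Hypothetical helper: autograd is active only when is_train is True.
    with torch.set_grad_enabled(is_train):
        return x * 2

train_out = forward(x, is_train=True)
eval_out = forward(x, is_train=False)

print(train_out.requires_grad, eval_out.requires_grad)  # True False
```

This keeps a single code path for training and inference instead of duplicating it under a separate torch.no_grad() block.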

Counting Trainable Parameters

When counting parameters, we may also need to check whether each one has requires_grad set to True, since we might want to distinguish the number of trainable parameters from the overall model size. Let’s take a look at what we have right now:

numel_list = [p.numel() for p in model.parameters() if p.requires_grad]
print(sum(numel_list), numel_list)  # total trainable count, then per-tensor counts
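To see the difference between trainable and total counts, here is a small sketch that assumes a hypothetical three-layer nn.Sequential model with its first layer frozen. The model is chosen only so that the numbers can be checked by hand.

```python
import torch.nn as nn

# Hypothetical small model, used so the counts are easy to verify by hand.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
for param in model[0].parameters():
    param.requires_grad = False  # freeze the first layer

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

# First layer: 10*5 + 5 = 55 (frozen). Last layer: 5*2 + 2 = 12 (trainable).
print(total, trainable)  # 67 12
```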