To compute the derivative of the loss with respect to a parameter, we can apply the chain rule: the derivative of the loss with respect to its input (which is the output of the model), times the derivative of the model's output with respect to the parameter.

In the case of neural networks, we can represent the network as f(x, θ), where f is the neural network, x is a vector representing the input, and θ is the set of parameters of f.

We compute the gradient of the loss on the output of f with respect to θ. Adjusting θ along the gradient eventually leads to a setting of θ that yields a small loss on the training data, and hopefully one that generalizes to data f hasn’t seen before.

The call to loss.backward() computes the partial derivative of the loss with respect to every tensor in the graph that requires gradients, in particular the model’s parameters.
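
To make this concrete, here is a minimal sketch with a hypothetical scalar parameter w and input x, in which autograd applies the chain rule for us: for L = f(x, w)² with f(x, w) = w·x, the derivative is dL/dw = 2·f·x.

import torch

x = torch.tensor(3.0)                       # input (no gradient needed)
w = torch.tensor(2.0, requires_grad=True)   # "parameter"

f = w * x        # model output: f(x, w) = w * x
L = f ** 2       # loss: L = f^2

L.backward()     # autograd applies the chain rule: dL/dw = 2 * f * x

print(w.grad)    # tensor(36.) == 2 * 6 * 3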

In this section, we will cover how the loss.backward(), optimizer.step(), and optimizer.zero_grad() functions are used to train a PyTorch model. For example, to create the linear layer (a weight matrix plus bias) of a feed-forward neural network, you can use the following code:

import torch
from torch import optim

in_dim, out_dim = 5, 2

inputs = torch.randn(in_dim)                          # random input vector of dimension 5
target = torch.tensor([1, 2], dtype=torch.float32)    # target used by the loss later on

model = torch.nn.Linear(in_dim, out_dim, bias=True)   # a single linear layer: y = Wx + b
out = model(inputs)                                   # forward pass through the layer

This defines a single linear layer with bias in a feed-forward neural network: a matrix of weights (plus a bias) that takes as input a vector of dimension 5 and outputs a vector of dimension 2. The last line of code shows how we can apply this layer to an input vector and store the output in a new tensor.
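
As a quick sanity check (a sketch, not required for training), the layer’s output is just the affine map W·x + b, which we can reproduce by hand from the layer’s weight and bias attributes:

# nn.Linear computes out = inputs @ W^T + b; weight has shape (out_dim, in_dim)
manual_out = inputs @ model.weight.T + model.bias
print(torch.allclose(out, manual_out))          # True
print(model.weight.shape, model.bias.shape)     # torch.Size([2, 5]) torch.Size([2])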

Loss.backward() 

In our little model, we just saw a simple setup for backpropagation. We can compute the gradient of the loss with respect to the model's parameters (the weights w and the bias b) by propagating derivatives backward through the graph using the chain rule.

In other words, any tensor that has a requires_grad=True tensor (here, the model's parameters) among its ancestors will have access to the chain of functions that were called to get from those ancestors to that tensor. In general, all PyTorch tensors have an attribute named grad; normally, it’s None.
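
A small sketch of what this ancestry looks like for our tensors (the exact grad_fn class name depends on the PyTorch version):

print(model.weight.requires_grad)   # True  -- leaf parameter
print(out.requires_grad)            # True  -- descends from the parameters
print(out.grad_fn)                  # a *Backward node recording how out was produced
print(inputs.requires_grad)         # False -- plain input tensor
print(inputs.grad_fn)               # None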

We can access all model parameters via the parameters() method provided by the PyTorch API. To inspect the grad attribute of each parameter before backpropagation, you can run the code:

loss = torch.nn.CrossEntropyLoss()     # loss function
computed_loss = loss(out, target)      # loss of the model output against the target

for p in model.parameters():
  print(p.grad)

# None
# None

All we have to do to populate grad is to start from tensors with requires_grad set to True (the model's parameters already have this), run the forward pass and compute the loss, and then call backward on the loss.

computed_loss.backward()

for p in model.parameters():
  print(p.grad)

# tensor([[ 1.5149,  0.3837,  2.2270, -0.2910, -1.7535],
#         [-1.5149, -0.3837, -2.2270,  0.2910,  1.7535]])
# tensor([ 1.5102, -1.5102])

As we can see, the layer has 5×2 = 10 weights (stored as a 2×5 matrix) and a bias vector of length 2. PyTorch creates the autograd graph with the operations as nodes. When we call loss.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients and accumulates their values in the grad attribute of those tensors (the leaf nodes of the graph).
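
If you are curious, you can peek at this graph from the loss tensor itself; grad_fn points at the node that produced computed_loss, and next_functions links to the nodes for its inputs (a sketch for inspection only, and the printed details vary by version):

print(computed_loss.grad_fn)                  # the node that produced the loss
print(computed_loss.grad_fn.next_functions)   # the nodes feeding into it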

Optimizer.zero_grad()

Calling backward causes derivatives to accumulate at the leaf nodes. If backward was called earlier (as it is in any training loop), the loss is evaluated again and the gradient at each leaf is accumulated (that is, summed) on top of the one computed at the previous iteration, which leads to an incorrect value for the gradient.

To prevent this from occurring, we need to zero the gradients explicitly at each iteration. We can do this easily by calling the optimizer's zero_grad() method (which internally zeroes each parameter's grad, for example via the in-place zero_ method):

optimizer.zero_grad()
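
Here is a small sketch of the accumulation behavior. It zeroes through the module with model.zero_grad() only because the optimizer is constructed in the next section; optimizer.zero_grad() has the same effect on these parameters.

model.zero_grad()                        # start from clean gradients
loss(model(inputs), target).backward()
first_bias_grad = model.bias.grad.clone()

loss(model(inputs), target).backward()   # second backward: gradients are summed
print(torch.allclose(model.bias.grad, 2 * first_bias_grad))   # True

model.zero_grad()                        # reset before the next iteration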

Optimizer.step()

Every optimizer constructor takes a list of parameters as the first input. All parameters passed to the optimizer are retained inside the optimizer object so the optimizer can update their values and access their grad attribute:

lr = 1e-3
optimizer = optim.SGD(model.parameters(), lr=lr)

Each optimizer has two methods: zero_grad and step. zero_grad zeroes the grad attribute of all the parameters passed to the optimizer upon construction, and step updates the value of those parameters according to the optimization strategy implemented by the specific optimizer.
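
For plain SGD (no momentum or weight decay), step amounts to the familiar update rule θ ← θ − lr·∇θ. Here is a sketch of that update written by hand, as an illustration of the idea rather than the actual torch.optim.SGD implementation:

with torch.no_grad():             # parameter updates should not be tracked by autograd
    for p in model.parameters():
        if p.grad is not None:
            p -= lr * p.grad      # theta <- theta - lr * grad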

Here’s the loop-ready code, with the zero_grad call at the correct spot (anywhere before the call to backward, here at the top of each iteration):

for _ in range(100):                      # a simple training loop
    optimizer.zero_grad()                 # reset gradients from the previous iteration
    out = model(inputs)                   # forward pass
    computed_loss = loss(out, target)     # compute the loss
    computed_loss.backward()              # backpropagate: populate p.grad for every parameter
    optimizer.step()                      # update the parameters using the gradients

During gradient descent, we need to adjust the parameters based on their gradients. PyTorch abstracts this functionality into the torch.optim module, which provides implementations of common optimization algorithms and takes care of updating the parameters of the model.
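
Because all optimizers share this interface, switching algorithms is a one-line change; for example (the hyperparameters here are illustrative), we could swap in Adam:

optimizer = optim.Adam(model.parameters(), lr=1e-3)   # same zero_grad()/step() interface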
