PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them. They can automatically provide the chain of derivatives of such operations with respect to their inputs.

Given a forward expression, no matter how nested, PyTorch will automatically provide the gradient of that expression with respect to its input parameters.

PyTorch Grad

Passing requires_grad=True when creating a tensor (here called params) tells PyTorch to track the entire family tree of tensors resulting from operations on params. Any tensor that has params as an ancestor has access to the chain of functions that were called to get from params to that tensor. If those functions are differentiable, the value of the derivative is automatically populated as the grad attribute of the params tensor. All PyTorch tensors have an attribute named grad; it is normally None until a backward pass fills it in.
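To make the role of requires_grad and grad concrete, here is a minimal sketch. The tensor values and the squared-sum loss are arbitrary choices for illustration, not part of the model used later in this post.

```python
import torch

# Hypothetical two-element parameter tensor; the values are arbitrary.
params = torch.tensor([1.0, 2.0], requires_grad=True)
grad_before = params.grad          # None: nothing has been backpropagated yet

# Any tensor computed from params carries a grad_fn linking it back to params.
loss = (params ** 2).sum()
loss.backward()                    # populates params.grad with d(loss)/d(params)

print(grad_before)                 # None
print(params.grad)                 # tensor([2., 4.]), i.e. 2 * params
```

The derivative of the sum of squares is 2 * params, which is what ends up in params.grad.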


At this point, the best way to understand how an optimizer works is to create a simple PyTorch model and train it.

import torch
import torch.nn as nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One input feature, one output, no bias: the model is y = w * x
        self.linear = nn.Linear(1, 1, bias=False)

    def forward(self, x):
        out = self.linear(x)
        return out


Optimizers save us from the boilerplate of updating each and every parameter of our model ourselves. The torch module has an optim submodule where we can find classes implementing different optimization algorithms.

Every optimizer constructor takes an iterable of parameters (typically model.parameters()) as its first argument. All parameters passed to the optimizer are retained inside the optimizer object, so the optimizer can update their values and access their grad attribute.

model = LinearRegressionModel()

criterion = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
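As a quick check that the optimizer keeps references to the parameters rather than copies, here is a small sketch. The names layer, opt and held are illustrative stand-ins, not part of the code above.

```python
import torch
import torch.nn as nn

# Stand-in for the model above: one weight, no bias.
layer = nn.Linear(1, 1, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=0.01)

# The optimizer stores its parameters in param_groups, a list of dicts.
held = opt.param_groups[0]["params"]

# Same object, not a copy: the optimizer can read .grad and update in place.
same_object = held[0] is layer.weight
print(len(held), same_object)      # 1 True
```

Because the optimizer holds the very same tensor objects, any gradient written to layer.weight.grad by a backward pass is visible to it.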


When we compute our loss, PyTorch builds the autograd graph with the operations as nodes. When we call loss.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients. At this point, the grad attribute of params contains the derivatives of the loss with respect to each element of params.

PyTorch computes the derivatives of the loss throughout the computation graph and accumulates their values in the grad attribute of the leaf nodes of the graph.
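The distinction between leaf nodes and intermediate nodes can be inspected directly. The following is a sketch with made-up values; w, x and y are hypothetical names chosen for this example.

```python
import torch

w = torch.tensor([3.0], requires_grad=True)   # leaf: created directly by the user
x = torch.tensor([2.0])                       # plain data, not tracked

y = w * x                                     # intermediate node from a multiplication
loss = (y - 1.0).pow(2).sum()                 # (3 * 2 - 1)^2 = 25

print(w.is_leaf, y.is_leaf)                   # True False
print(loss.grad_fn is not None)               # True: loss knows how it was computed

loss.backward()                               # walk the graph from loss back to w
print(w.grad)                                 # tensor([20.]) = 2 * (w*x - 1) * x
```

Only the leaf w receives a populated grad; intermediate tensors such as y take part in the backward pass but do not keep their gradients by default.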

Why do we need to call zero_grad?

Calling backward on the loss causes derivatives to accumulate at the leaf nodes. If backward is called again, as it is in any training loop, the loss is evaluated again and the gradient at each leaf is accumulated (summed) on top of the one computed at the previous iteration, which leads to an incorrect value for the gradient.

In order to prevent this from occurring, we need to zero the gradient explicitly at each iteration. We can do this easily using the zero_grad method.

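# x_train and y_train are assumed to be float32 NumPy arrays of shape (N, 1)
inputs = torch.from_numpy(x_train).requires_grad_()
labels = torch.from_numpy(y_train)

outputs = model(inputs)

loss = criterion(outputs, labels)

# Backpropagate so that the weight's grad attribute is populated
loss.backward()

print("before zero grad : ", model.linear.weight.grad)  # tensor([[-166.2116]])

# Reset the gradients of every parameter held by the optimizer
optimizer.zero_grad()

# tensor([[0.]]) on older versions; None on PyTorch 2.0+, where
# zero_grad() defaults to set_to_none=True
print("After zero grad : ", model.linear.weight.grad)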
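The accumulation itself can be seen in isolation with a single tensor. This is a minimal sketch using a toy loss; loss_fn and the variable names are hypothetical and independent of the model above.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

def loss_fn(w):
    # Toy loss with d(loss)/dw = 2 * w
    return (w ** 2).sum()

loss_fn(w).backward()
first = w.grad.clone()        # tensor([2.])

loss_fn(w).backward()         # no zeroing in between
second = w.grad.clone()       # tensor([4.]): summed, not replaced

w.grad.zero_()                # manual equivalent of zero_grad for one tensor
loss_fn(w).backward()
third = w.grad.clone()        # tensor([2.]) again
```

The second backward call doubles the stored gradient even though w has not changed, which is exactly the error that zeroing the gradient prevents.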

Let’s see what our training loop looks like, start to finish:

for epoch in range(50):
    # Convert NumPy arrays to torch tensors
    inputs = torch.from_numpy(x_train).requires_grad_()
    labels = torch.from_numpy(y_train)

    # Clear gradients w.r.t. parameters
    optimizer.zero_grad()
    # Forward to get output
    outputs = model(inputs)
    # Calculate Loss
    loss = criterion(outputs, labels)
    # Getting gradients w.r.t. parameters
    loss.backward()
    # Updating parameters
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

Each optimizer exposes two methods: zero_grad() and step(). zero_grad() resets the grad attribute of all the parameters passed to the optimizer upon construction. step() updates the values of those parameters according to the optimization strategy implemented by the specific optimizer.
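For plain SGD without momentum or weight decay, step() amounts to subtracting the learning rate times the gradient from each parameter. The sketch below checks this on a hypothetical two-element tensor; other optimizers apply different update rules.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

opt.zero_grad()
loss = (w ** 2).sum()
loss.backward()                          # w.grad = 2 * w = [2., -4.]

# What a hand-written SGD update would produce: w - lr * grad
expected = (w - 0.1 * w.grad).detach()
opt.step()                               # in-place update of w

print(torch.allclose(w.detach(), expected))   # True
```

Writing the update by hand for every parameter is the boilerplate that the optimizer removes.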