Given input data and the corresponding desired outputs, as well as initial values for the weights, the model is fed input data, and a measure of the error is evaluated by comparing the resulting outputs to the ground truth. 

To optimize the parameters of the model (its weights and biases), the weights are updated in the direction that leads to a decrease in the error. The procedure is repeated until the error, evaluated on unseen data, falls below an acceptable level.

Gradient descent

Gradient descent is a simple idea: compute the rate of change of the loss with respect to each parameter, and modify each parameter in the direction of decreasing loss. We can estimate that rate of change numerically by adding a small number to w and b and seeing how much the loss changes.
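As a rough illustration of the numerical estimate described above, here is a minimal sketch in plain Python. The squared-error loss, the target value y_true, the step size delta, and the learning rate are all assumptions made for this example; they do not come from the original text.

```python
# Sketch: estimate gradients by finite differences, then take one descent step.
# loss_fn, y_true, delta and learning_rate are illustrative assumptions.

def model(x, w, b):
    return w * x + b

def loss_fn(y_pred, y_true):
    return (y_pred - y_true) ** 2

x, y_true = 3.0, 20.0      # hypothetical input and target
w, b = 4.0, 5.0            # initial parameter values
delta = 1e-4               # small perturbation
learning_rate = 1e-2

# Central differences: nudge one parameter at a time and watch the loss.
grad_w = (loss_fn(model(x, w + delta, b), y_true)
          - loss_fn(model(x, w - delta, b), y_true)) / (2 * delta)
grad_b = (loss_fn(model(x, w, b + delta), y_true)
          - loss_fn(model(x, w, b - delta), y_true)) / (2 * delta)

loss_before = loss_fn(model(x, w, b), y_true)

# Move each parameter against its gradient, i.e. toward lower loss.
w = w - learning_rate * grad_w
b = b - learning_rate * grad_b

loss_after = loss_fn(model(x, w, b), y_true)
print(grad_w, grad_b)
print(loss_before, loss_after)
```

Perturbing each parameter separately works for two parameters, but it needs extra loss evaluations for every parameter, which is one reason to prefer the automatic gradients discussed next.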

PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can automatically provide the chain of derivatives of such operations with respect to their inputs. This means that, given an expression, PyTorch will automatically compute its gradient with respect to its input parameters.

Let’s initialize the input tensor and the two parameter tensors:

import torch

x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

We’ve now created our input and parameter tensors, so let’s write out the model as a Python expression:

y = w * x + b

Here x, w, and b are the input tensor, weight parameter, and bias parameter, respectively. In this example all three are PyTorch scalars (aka zero-dimensional tensors), so y is a scalar as well; with a higher-dimensional x, the product operation would use broadcasting to yield the returned tensor. We can now check the value of y:

print(y)  #tensor(17., grad_fn=<AddBackward0>)

Print gradient values

The requires_grad=True argument tells PyTorch to track the entire family tree of tensors resulting from operations on w and b. In other words, any tensor that has w or b as an ancestor will have access to the chain of functions that were called to get from those parameters to that tensor.

If these functions are differentiable, the value of the derivative will be automatically populated as the grad attribute of w and b once backward() is called. In general, all PyTorch tensors have an attribute named grad. Usually, it’s None:

print(w.grad) #None
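The tracking described above can be inspected directly on the tensors. The following is a small sketch that rebuilds the same x, w, b, and y and looks at a few standard tensor attributes (requires_grad, is_leaf, grad_fn); it is meant as an illustration, not as a complete tour of autograd.

```python
import torch

x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)
y = w * x + b

# x was created without requires_grad, so it is not tracked.
print(x.requires_grad)       # False
# y has w and b as ancestors, so it is tracked automatically.
print(y.requires_grad)       # True
# w was created by the user (a leaf); y is the result of operations.
print(w.is_leaf, y.is_leaf)  # True False
# grad_fn records the last operation that produced y.
print(y.grad_fn)
```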

Accumulate Gradient

All we have to do to populate it is to start with a tensor with requires_grad set to True, then call backward() on the y tensor:

y.backward()  # compute dy/dw and dy/db and store them in w.grad and b.grad
print(w.grad, b.grad)  # tensor(3.) tensor(1.)

At this point, the grad attributes of w and b contain the derivatives of y with respect to each of them: dy/dw equals x, which is 3, and dy/db equals 1.

When we compute our y while the parameters w and b require gradients, in addition to performing the actual computation, PyTorch creates the autograd graph with the operations (in circles) as nodes. When we call y.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients, as shown by the arrows in the bottom row of the figure.

PyTorch compute gradient

In this case, PyTorch would compute the derivatives of the loss throughout the chain of functions (the computation graph) and accumulate their values in the grad attribute of those tensors, the leaf nodes of the graph.

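Calling backward will lead derivatives to accumulate at leaf nodes: each call adds the new gradients to whatever is already stored in grad rather than overwriting it. We therefore need to zero the gradient explicitly after using it for parameter updates. We can do this easily using the in-place zero_ method: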

if w.grad is not None:
    w.grad.zero_()
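To make the accumulation concrete, here is a minimal sketch that first calls backward twice without zeroing, then runs a short gradient descent loop that zeroes the gradients after each update. The target y_true, the squared-error loss, the learning rate, and the number of steps are assumptions chosen for illustration only.

```python
import torch

x = torch.tensor(3.)
y_true = torch.tensor(20.)   # hypothetical target value
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

# Two backward passes without zeroing: the gradients are summed.
(w * x + b).backward()
first = w.grad.item()        # 3.0
(w * x + b).backward()
second = w.grad.item()       # 6.0, i.e. 3.0 + 3.0
w.grad.zero_()
b.grad.zero_()

learning_rate = 1e-2
losses = []
for step in range(20):
    loss = (w * x + b - y_true) ** 2   # assumed squared-error loss
    loss.backward()
    # Update the parameters without recording the update in the graph.
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    # Reset the gradients so the next backward() starts from zero.
    w.grad.zero_()
    b.grad.zero_()
    losses.append(loss.item())

print(first, second)
print(losses[0], losses[-1])
```

The model expression is recomputed before every backward call because, by default, the graph of a forward pass is freed once backward has run on it.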

Print gradient of layers

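Once you’ve called loss.backward() to calculate the gradients, you can directly print them. For example, assuming model is an nn.Sequential and loss was computed from its output, the gradient of the first layer's weights can be printed like this: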

loss.backward()
print(model[0].weight.grad)
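For a version that runs on its own, the sketch below builds a small nn.Sequential model and prints the gradient shape for every parameter. The layer sizes, the random data, and the use of mean squared error are assumptions made for the example, not details from the original snippet.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-layer model; the sizes are arbitrary.
model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))

inputs = torch.randn(8, 3)    # made-up batch of 8 samples
targets = torch.randn(8, 1)   # made-up targets

loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Gradient of the first layer's weights, same shape as the weights.
print(model[0].weight.grad.shape)  # torch.Size([4, 3])

# Walk over all layers by name instead of indexing one by one.
for name, param in model.named_parameters():
    print(name, tuple(param.grad.shape))
```

Iterating over named_parameters() avoids hard-coding layer indices, which is convenient when the model has many layers.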
