Given input data and the corresponding desired outputs, as well as initial values for the weights, the model is fed input data, and a measure of the error is evaluated by comparing the resulting outputs to the ground truth.

To optimize the parameters (weights and biases) of the model, their values are updated in the direction that leads to a decrease in the error. The procedure is repeated until the error, evaluated on unseen data, falls below an acceptable level.

## Gradient descent

Gradient descent is a simple idea: **compute the rate of change** of the loss with respect to each parameter, and modify each parameter in the direction of decreasing loss. We can estimate that rate of change numerically by adding a small number to `w` and `b` and seeing how much the loss changes.
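The numerical estimate described above can be sketched in plain Python. This is only an illustration under stated assumptions: the single data point, the `target` value, the squared-error `loss_fn`, `delta`, and `learning_rate` are all hypothetical choices, and the estimate uses central differences (perturbing in both directions) rather than a one-sided perturbation.

```python
# Hypothetical single training example and starting parameters (assumptions).
x, target = 3.0, 20.0
w, b = 4.0, 5.0


def loss_fn(w, b):
    """Squared error of the linear model w * x + b on the single example."""
    return (w * x + b - target) ** 2


delta = 1e-3          # small perturbation used to probe the loss
learning_rate = 1e-2  # step size, picked by hand for this example
initial_loss = loss_fn(w, b)

steps = 0
while loss_fn(w, b) > 1e-6 and steps < 1000:
    # Estimate the rate of change of the loss with respect to each parameter
    # by nudging that parameter up and down and measuring the change.
    grad_w = (loss_fn(w + delta, b) - loss_fn(w - delta, b)) / (2 * delta)
    grad_b = (loss_fn(w, b + delta) - loss_fn(w, b - delta)) / (2 * delta)

    # Move each parameter in the direction of decreasing loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    steps += 1

final_loss = loss_fn(w, b)
print(initial_loss, final_loss, steps)
```

Probing the loss this way needs two extra loss evaluations per parameter on every step, which is part of why automatic differentiation is preferred once a model has more than a handful of parameters.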

PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can automatically provide the chain of derivatives of those operations with respect to their inputs. Given an expression built from such tensors, PyTorch computes the gradient of that expression with respect to its input parameters automatically.

Let’s initialize an input tensor and the two parameter tensors:

```
import torch
x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)
```

With the input and parameter tensors created, we can now write out the model as a Python expression:

```
y = w * x + b
```

We’re expecting `x`, `w`, and `b` to be the input tensor, weight parameter, and bias parameter, respectively. In our model, the parameters are PyTorch scalars (aka zero-dimensional tensors); when the input has more dimensions, the product operation uses broadcasting to yield the returned tensor. We can now check the value of `y`:

```
print(y) #tensor(17., grad_fn=<AddBackward0>)
```

## Print gradient values

The `requires_grad=True` argument tells PyTorch to track the entire family tree of tensors resulting from operations on the parameters `w` and `b`. In other words, any tensor that has `w` or `b` as an ancestor has access to the chain of functions that were called to get from the parameters to that tensor.

If these functions are differentiable, the value of the derivative will be automatically populated as the `grad` attribute of each parameter tensor. In general, all PyTorch tensors have an attribute named `grad`. Usually, it’s `None`:

```
print(w.grad) #None
```

## Accumulate gradient

All we have to do to populate it is to start with a tensor with `requires_grad` set to `True`, then call `backward()` on the `y` tensor:

```
y.backward()  # populates w.grad and b.grad
print(w.grad, b.grad) #tensor(3.) tensor(1.)
```

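At this point, the `grad` attributes of `w` and `b` contain the derivatives of `y` with respect to each parameter. Since `y = w * x + b`, the derivative with respect to `w` is `x` (here 3) and the derivative with respect to `b` is 1.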

When we compute `y` while the parameters `w` and `b` require gradients, in addition to performing the actual computation, PyTorch creates the autograd graph with the operations (in circles) as nodes. When we call `y.backward()`, PyTorch traverses this graph in the reverse direction to compute the gradients, as shown by the arrows in the bottom row of the figure.

More generally, when a `loss` is computed from the model output, PyTorch computes the derivatives of the `loss` throughout the chain of functions (the computation graph) and accumulates their values in the **grad** attribute of the leaf nodes of the graph, which here are the parameter tensors.
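As a small illustration of derivatives flowing back through a chain of functions, the sketch below adds a squared-error loss on top of the linear model. The `target` value and the choice of loss are assumptions made for this example only.

```python
import torch

x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)
target = torch.tensor(20.)  # hypothetical ground truth (assumption)

y = w * x + b             # intermediate tensor: 4 * 3 + 5 = 17
loss = (y - target) ** 2  # (17 - 20)^2 = 9

# w and b are leaf nodes; y and loss were produced by operations.
print(w.is_leaf, y.is_leaf, loss.is_leaf)  # True False False

loss.backward()

# Chain rule: dloss/dw = 2 * (y - target) * x = 2 * (-3) * 3 = -18
#             dloss/db = 2 * (y - target)     = 2 * (-3)     = -6
print(w.grad, b.grad)  # tensor(-18.) tensor(-6.)
```

Only the leaf tensors `w` and `b` end up with a populated `grad`; the intermediate `y` is used to carry the derivative backward but does not keep it by default.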

Calling `backward()` makes derivatives accumulate at leaf nodes: each call adds the new gradients to whatever is already stored in `grad`. We therefore need to zero the gradient explicitly after using it for parameter updates. We can do this easily using the in-place `zero_` method:

```
if w.grad is not None:
    w.grad.zero_()
```
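To see the accumulation in action, the following sketch runs the same forward and backward pass twice without zeroing, then resets the gradients and runs it once more. It reuses the tensors from earlier in this post; the variable names `first`, `second`, and `after_reset` are introduced here for illustration.

```python
import torch

x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

# First pass: w.grad goes from None to dy/dw = x = 3.
(w * x + b).backward()
first = w.grad.clone()

# Second pass without zeroing: the new gradient is added to the stored one.
(w * x + b).backward()
second = w.grad.clone()

# Reset the stored gradients in place before the next pass.
w.grad.zero_()
b.grad.zero_()

(w * x + b).backward()
after_reset = w.grad.clone()

print(first, second, after_reset)  # tensor(3.) tensor(6.) tensor(3.)
```

The expression `w * x + b` is rebuilt before every `backward()` call because, by default, PyTorch frees the graph of a forward pass once it has been traversed.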

## Print gradient of layers

Once you’ve called `loss.backward()` to calculate the gradients, you can directly print them like this:

```
# assumes `model` is indexable (e.g. an nn.Sequential) and `loss` was computed from its output
loss.backward()
print(model[0].weight.grad)
```
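For a version that can be run on its own, here is a minimal sketch. The two-layer `nn.Sequential` model, the input, the target, and the mean-squared-error loss are all hypothetical choices made just to have something to differentiate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random initial weights reproducible

# Hypothetical model (assumption): model[0] is the first nn.Linear layer.
model = nn.Sequential(nn.Linear(1, 4), nn.Tanh(), nn.Linear(4, 1))

inputs = torch.tensor([[3.0]])   # one sample with one feature
target = torch.tensor([[17.0]])  # hypothetical ground truth

loss = nn.functional.mse_loss(model(inputs), target)

print(model[0].weight.grad)  # None: nothing has been backpropagated yet
loss.backward()
print(model[0].weight.grad)  # a 4x1 tensor of gradients

# Walk over every parameter of every layer and show its gradient's shape.
for name, param in model.named_parameters():
    print(name, tuple(param.grad.shape))
```

Iterating over `named_parameters()` is convenient when the model has many layers, since it avoids indexing each layer by hand.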