There are several optimization strategies and tricks that can assist convergence, especially when models get complicated. The `torch` module has an `optim` submodule where we can find classes implementing different optimization algorithms. This saves us from the boilerplate busywork of having to **update each and every parameter** of our model ourselves.

Every optimizer constructor takes an iterable of parameters (typically PyTorch tensors with `requires_grad` set to `True`) as its first argument. All parameters passed to the optimizer are **retained inside the optimizer object** so the optimizer can update their values and access their `grad` attribute.
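To make the "retained inside the optimizer" point concrete, here is a minimal sketch. The two-feature `Linear` model is a made-up example, not part of the original walkthrough. It checks that the tensors stored in `optimizer.param_groups` are the same objects the model owns, not copies.

```python
import torch

# Hypothetical two-feature linear model, used only for illustration.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The optimizer keeps its parameters in `param_groups`, a list of dicts.
held = optimizer.param_groups[0]["params"]

# Identity check: the stored entries are the model's own tensors (weight, bias).
same_objects = all(h is p for h, p in zip(held, model.parameters()))
print(len(held), same_objects)
```

Because the optimizer holds references rather than copies, any in-place update it makes is immediately visible through the model.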

After a loss is computed from the inputs, calling `loss.backward()` populates `.grad` on the parameters. At that point, the optimizer can read `.grad` and compute the parameter updates.
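As a small illustration of this step, the sketch below uses a hand-made tensor `w` and a toy loss (both are assumptions chosen so the gradient is easy to verify by hand). It shows `.grad` being empty before the backward pass and filled afterwards.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

grad_before = w.grad      # None: no backward pass has run yet
loss = (w * x).sum()      # loss = 1*3 + 2*4 = 11
loss.backward()           # fills w.grad with d(loss)/dw, which equals x here

print(grad_before, w.grad)
```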

Each optimizer exposes two methods, `zero_grad()` and `step()`:

1. `zero_grad()` zeroes the grad attribute of all the parameters passed to the optimizer upon construction.
2. `step()` updates the value of those parameters according to the optimization strategy implemented by the specific optimizer.
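The reason `zero_grad()` is needed is that `backward()` adds to any gradient already stored rather than overwriting it. The following sketch, built on a single made-up parameter `w`, shows the accumulation and then a clean step. Depending on the PyTorch version, `zero_grad()` either fills the gradients with zeros or resets them to `None`; the result below is the same either way.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without clearing: gradients add up.
(2 * w).sum().backward()
(2 * w).sum().backward()
accumulated = w.grad.clone()   # 2 + 2 = 4

# Clearing first gives the gradient of the latest pass only.
optimizer.zero_grad()
(2 * w).sum().backward()
fresh = w.grad.clone()         # 2

optimizer.step()               # w <- 1.0 - 0.1 * 2 = 0.8
print(accumulated, fresh, w)
```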

During gradient descent, we need to adjust the parameters based on their gradients. You can define the optimizer as follows:

```
import torch
from torch import optim

in_dim, out_dim = 5, 2
inputs = torch.randn(1, in_dim)   # a minibatch containing one sample
target = torch.tensor([1])        # class index for that sample, as CrossEntropyLoss expects
model = torch.nn.Linear(in_dim, out_dim, bias=True)
out = model(inputs)               # forward pass: logits of shape (1, out_dim)
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=0.5)
```

Notice that when we create a PyTorch optimizer, we need to pass in `model.parameters()`.

This code creates an optimizer that will update the parameters of the classifier via SGD at the end of each minibatch. To actually perform this update, you can use the following code:

```
loss = torch.nn.CrossEntropyLoss()
computed_loss = loss(out, target)
optimizer.zero_grad()     # Clears gradients left over from the previous minibatch
computed_loss.backward()  # Computes fresh gradients for every parameter
optimizer.step()          # Updates parameters via SGD
```

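The value of the parameters is updated upon calling **step()** without us having to touch it. What happens is that the optimizer looks into each parameter's `.grad` and updates the parameter, subtracting `learning_rate` times `grad` from it. When `weight_decay` is set, as in the optimizer above, SGD first adds `weight_decay` times the parameter to its gradient.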
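The update rule can be checked by hand. This sketch assumes plain SGD (no momentum and no weight decay, unlike the optimizer defined earlier) and a toy quadratic loss, and compares the result of `step()` with the manual computation `w - lr * w.grad`.

```python
import torch

lr = 0.1
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=lr)   # plain SGD: no momentum, no weight decay

loss = (w ** 2).sum()
loss.backward()                            # w.grad = 2 * w = [2, -4]

# What step() is expected to compute, written out by hand.
expected = w.detach() - lr * w.grad        # [0.8, -1.6]

optimizer.step()
print(torch.allclose(w.detach(), expected))
```

Other optimizers (Adam, RMSprop, SGD with momentum) keep extra state and apply a different rule, so this equality holds only for the plain configuration shown.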

## scheduler.step()

PyTorch provides several schedulers, in `torch.optim.lr_scheduler`, that adjust the learning rate during training. One of the simplest is `StepLR`. Briefly, you create a `StepLR` object, then call its `step()` method once per epoch; every `step_size` calls, the learning rate is reduced:

```
optimizer = optim.SGD([torch.randn(1, requires_grad=True)], lr=100)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)
for epoch in range(1, 25):
    optimizer.step()     # a real loop would compute the loss and call backward() first
    lr_scheduler.step()  # decays the LR on every 4th call
    print('Epoch {}, lr {:0.4f}'.format(epoch, optimizer.param_groups[0]['lr']))
```

The `step_size=4` parameter means “adjust the LR every 4th time `scheduler.step()` is called”. The `gamma=0.1` means “multiply the current LR by 0.1 when adjusting the LR”. Running the loop prints:

```
Epoch 1, lr 100.0000
Epoch 2, lr 100.0000
Epoch 3, lr 100.0000
Epoch 4, lr 10.0000
Epoch 5, lr 10.0000
Epoch 6, lr 10.0000
Epoch 7, lr 10.0000
Epoch 8, lr 1.0000
Epoch 9, lr 1.0000
Epoch 10, lr 1.0000
Epoch 11, lr 1.0000
Epoch 12, lr 0.1000
Epoch 13, lr 0.1000
Epoch 14, lr 0.1000
Epoch 15, lr 0.1000
Epoch 16, lr 0.0100
Epoch 17, lr 0.0100
Epoch 18, lr 0.0100
Epoch 19, lr 0.0100
Epoch 20, lr 0.0010
Epoch 21, lr 0.0010
Epoch 22, lr 0.0010
Epoch 23, lr 0.0010
Epoch 24, lr 0.0001
```
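The printed schedule follows a simple closed form: after `k` calls to `scheduler.step()`, the learning rate is `base_lr * gamma ** (k // step_size)`. The sketch below checks that formula against the scheduler, reusing the settings from the example; the variable names `base_lr`, `predicted`, and `matches` are introduced here for illustration only.

```python
import math

import torch
from torch import optim

base_lr, step_size, gamma = 100.0, 4, 0.1
optimizer = optim.SGD([torch.randn(1, requires_grad=True)], lr=base_lr)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)

matches = []
for epoch in range(1, 25):
    optimizer.step()      # no gradients exist here, so nothing is updated
    scheduler.step()
    predicted = base_lr * gamma ** (epoch // step_size)
    actual = optimizer.param_groups[0]["lr"]
    matches.append(math.isclose(actual, predicted, rel_tol=1e-9))

print(all(matches))
```

The comparison uses a relative tolerance because the scheduler multiplies step by step, which can differ from the closed form in the last floating-point digits.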

We have to call `scheduler.step()` every epoch. If you don’t call it, the learning rate won’t change and stays at its initial value. You should **call it after the optimizer.step() operation**.
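Putting the pieces together, a training loop might be ordered as in the sketch below. The data, model size, and hyperparameters are made up for illustration, and each epoch processes a single full batch to keep the example short.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
# Hypothetical toy data: 8 samples, 5 features, 2 classes.
inputs = torch.randn(8, 5)
targets = torch.randint(0, 2, (8,))

model = nn.Linear(5, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)

lrs = []
for epoch in range(8):
    optimizer.zero_grad()                    # clear old gradients
    loss = loss_fn(model(inputs), targets)   # forward pass
    loss.backward()                          # populate .grad
    optimizer.step()                         # update parameters first
    scheduler.step()                         # then advance the LR schedule
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

With several minibatches per epoch, the first four calls would sit in an inner loop over batches, and `scheduler.step()` would stay in the outer loop so it still runs once per epoch.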