There are several optimization strategies and tricks that can assist convergence, especially when models get complicated. The torch module has an optim submodule containing classes that implement different optimization algorithms. This saves us from the boilerplate busywork of updating each and every parameter of our model ourselves.

Every optimizer constructor takes a list of parameters (PyTorch tensors, typically with requires_grad set to True) as its first input. All parameters passed to the optimizer are retained inside the optimizer object, so the optimizer can update their values and access their grad attribute.

PyTorch Optimizer
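
Model parameters are the most common input, but any iterable of tensors with requires_grad=True works. A minimal sketch (the tensor names w and b are just for illustration):

import torch
from torch import optim

# Two standalone parameters, not attached to any nn.Module
w = torch.randn(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# The optimizer keeps references to w and b so it can read their .grad and update them
optimizer = optim.SGD([w, b], lr=0.01)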

After a loss is computed from the inputs, calling loss.backward() populates the .grad attribute of the parameters. At that point, the optimizer can access .grad and compute the parameter updates.

PyTorch loss.backward()
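
A small self-contained sketch of this behavior (the tensor w is just for illustration): .grad is None before backward() is called and holds the gradient afterwards.

import torch

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()

print(w.grad)      # None: no gradient has been computed yet
loss.backward()    # populates w.grad
print(w.grad)      # the gradient of the loss w.r.t. w, i.e. 2 * w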

Each optimizer exposes two methods: zero_grad() and step().

1. zero_grad() zeroes the grad attribute of all the parameters passed to the optimizer at construction.
2. step() updates the value of those parameters according to the optimization strategy implemented by the specific optimizer.

During gradient descent, we need to adjust the parameters based on their gradients. You can define the optimizer as follows:

import torch
from torch import optim

in_dim, out_dim = 5, 2

inputs = torch.randn(1, in_dim)   # a minibatch containing a single example
target = torch.tensor([1])        # class index expected by CrossEntropyLoss

model = torch.nn.Linear(in_dim, out_dim, bias=True)
out = model(inputs)

optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=0.5)

Notice that when we create a PyTorch optimizer, we need to pass in model.parameters() so the optimizer knows which tensors it is responsible for updating.
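
For the Linear layer above, model.parameters() yields the weight and bias tensors; a quick sketch to inspect them:

for p in model.parameters():
    print(p.shape, p.requires_grad)
# torch.Size([2, 5]) True   (weight)
# torch.Size([2]) True      (bias)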

The optimizer created above will update the parameters of the classifier via SGD at the end of each minibatch. To actually perform this update, you can use the following code:

loss = torch.nn.CrossEntropyLoss()
computed_loss = loss(out, target)

optimizer.zero_grad()     # Zero out gradients accumulated from the previous minibatch
computed_loss.backward()  # Populate .grad on the model parameters

optimizer.step()          # Update parameters via SGD

The values of the parameters are updated when step() is called, without us having to touch them ourselves. The optimizer looks at each parameter's .grad and updates the parameter, subtracting learning_rate times grad from it (with weight_decay set, a penalty proportional to the parameter value is added to the gradient first).
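
Conceptually, for plain SGD the update is equivalent to the following manual loop. This is only a sketch of the idea (ignoring weight decay and momentum), not what optimizer.step() literally executes:

learning_rate = 1e-3

with torch.no_grad():                      # parameter updates should not be tracked by autograd
    for param in model.parameters():
        if param.grad is not None:
            param -= learning_rate * param.grad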

scheduler.step()

The PyTorch library has many classes in torch.optim.lr_scheduler that can be used to adjust the learning rate during training. The simplest PyTorch learning rate scheduler is StepLR. Briefly, you create a StepLR object, then call its step() method to reduce the learning rate:

optimizer = optim.SGD([torch.randn(1, requires_grad=True)], lr=100)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)

for epoch in range(1, 25):
    lr_scheduler.step()
    print('Epoch {}, lr {:0.4f}'.format(epoch, optimizer.param_groups[0]['lr']))

The step_size=4 parameter means “adjust the LR every 4 times scheduler.step() is called”. The gamma=0.1 means “multiply the current LR by 0.1 when adjusting the LR”, so after every fourth call the learning rate drops by a factor of ten.

Epoch 1, lr 100.0000
Epoch 2, lr 100.0000
Epoch 3, lr 100.0000
Epoch 4, lr 10.0000
Epoch 5, lr 10.0000
Epoch 6, lr 10.0000
Epoch 7, lr 10.0000
Epoch 8, lr 1.0000
Epoch 9, lr 1.0000
Epoch 10, lr 1.0000
Epoch 11, lr 1.0000
Epoch 12, lr 0.1000
Epoch 13, lr 0.1000
Epoch 14, lr 0.1000
Epoch 15, lr 0.1000
Epoch 16, lr 0.0100
Epoch 17, lr 0.0100
Epoch 18, lr 0.0100
Epoch 19, lr 0.0100
Epoch 20, lr 0.0010
Epoch 21, lr 0.0010
Epoch 22, lr 0.0010
Epoch 23, lr 0.0010
Epoch 24, lr 0.0001
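
The printed schedule can be reproduced in closed form; a small sketch assuming the values used above (initial LR 100, gamma=0.1, step_size=4):

initial_lr, gamma, step_size = 100, 0.1, 4

for num_steps in range(1, 25):
    lr = initial_lr * gamma ** (num_steps // step_size)
    print('After {} scheduler steps, lr {:0.4f}'.format(num_steps, lr))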

We have to call scheduler.step() every epoch. If you don’t call it, the learning rate won’t change and stays at its initial value. Since PyTorch 1.1.0, you should call it after the optimizer.step() operation.
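
Putting the optimizer and scheduler together, a typical epoch loop looks like the following sketch; it assumes model, loss, optimizer, and an lr_scheduler built on that optimizer as defined earlier, while train_loader and num_epochs are hypothetical names for your DataLoader and epoch count:

for epoch in range(num_epochs):
    for inputs, target in train_loader:   # iterate over minibatches
        out = model(inputs)
        computed_loss = loss(out, target)

        optimizer.zero_grad()             # clear gradients from the previous minibatch
        computed_loss.backward()          # compute new gradients
        optimizer.step()                  # update the parameters

    lr_scheduler.step()                   # adjust the LR once per epoch, after optimizer.step()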
