You can achieve better performance and faster training of a neural network model by using a learning rate that changes during training. This approach is called a learning rate schedule. The default schedule uses a constant learning rate to update the network weights at every training epoch.

In this post, you will discover how to use different learning rate schedules for your neural network models in PyTorch and gradually decrease the learning rate based on the epoch. The torch.optim.lr_scheduler module provides several schedulers that adjust the learning rate based on the number of epochs.
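The code in this post assumes the following imports, matching the torch, nn, and data names used in the examples below:

import torch
from torch import nn
from torch.utils import data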

Create a Dataset

For simplicity’s sake, this example creates a synthetic dataset with a linear relationship between two input features and the target, plus a bias term and a small amount of noise.

def generate_data(m, c, num_examples):
    # Features drawn from a standard normal distribution
    X = torch.normal(0, 1, (num_examples, len(m)))
    # Linear relationship y = Xm + c with a small amount of Gaussian noise
    y = torch.matmul(X, m) + c
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_m = torch.tensor([2, -3.4])
true_c = 4.2

features, labels = generate_data(true_m, true_c, 1000)

Here, we use the built-in PyTorch function torch.normal to return a tensor of normally distributed random numbers. We also use torch.matmul to multiply the feature tensor X with the weight tensor m, and then add a small amount of normally distributed noise to y.
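As a quick check (not part of the original code), you can confirm the shapes of the generated tensors:

print(features.shape, labels.shape)
# torch.Size([1000, 2]) torch.Size([1000, 1])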

batch_size = 10

# Wrap the tensors in a dataset and iterate over it in shuffled mini-batches
dataset = data.TensorDataset(features, labels)
data_iter = data.DataLoader(dataset, batch_size, shuffle=True)

Create Model

For our case, a single-layer feed-forward network with two inputs and one output is sufficient.

# A single linear layer with two inputs and one output
net = nn.Linear(2, 1)
# Initialize weights from a Gaussian with mean 0 and std 0.01, and the bias to zero
net.weight.data.normal_(0, 0.01)
net.bias.data.fill_(0)

The model also requires its weights and bias to be initialized. In the code, we initialize the weights from a Gaussian (normal) distribution with a mean of 0 and a standard deviation of 0.01. The bias is simply initialized to zero.
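Equivalently (a small alternative sketch, not required by the rest of the post), the same initialization can be written with the torch.nn.init helpers:

# Same Gaussian weight initialization and zero bias, using nn.init
nn.init.normal_(net.weight, mean=0.0, std=0.01)
nn.init.zeros_(net.bias)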

Optimizer

For optimization, we use stochastic gradient descent (SGD). The lr argument stands for learning rate and determines the size of the update step during training.

loss = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

The learning rate for stochastic gradient descent has been set to a relatively high value of 0.1. The model is trained for 10 epochs, and the learning rate is decayed using a scheduler, which is a good way to get the benefits of an adaptive learning rate.
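Since each of the training loops below runs for the 10 epochs mentioned above, define the epoch count up front:

num_epochs = 10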

Exponential Learning Rate

When training a model, it is often useful to lower the learning rate as training progresses. The ExponentialLR schedule applies an exponential decay to the learning rate after every epoch, starting from the provided initial learning rate.

This can be useful for changing the learning rate across successive optimizer updates. The learning rate at epoch t is computed as initial_lr * gamma^t. The scheduler is created as follows:

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

You can use a learning rate schedule to modulate how the learning rate of your optimizer changes over time. ExponentialLR decays the learning rate of each parameter group by gamma after every epoch.

for epoch in range(num_epochs):
    for X, y in data_iter:
        optimizer.zero_grad()
        l = loss(net(X), y)
        l.backward()
        optimizer.step()
    # Decay the learning rate once per epoch, after the optimizer updates
    scheduler.step()
    print(f'epoch {epoch + 1}    lr  {scheduler.get_last_lr()[0]:.4f}')

#Output
epoch 1    lr  0.0900
epoch 2    lr  0.0810
epoch 3    lr  0.0729
epoch 4    lr  0.0656
epoch 5    lr  0.0590
epoch 6    lr  0.0531
epoch 7    lr  0.0478
epoch 8    lr  0.0430
epoch 9    lr  0.0387
epoch 10    lr  0.0349

The scheduler step should be applied after the optimizer’s update, i.e., at the end of each epoch. Here, the initial learning rate is 0.1, and gamma is the multiplicative factor applied each time the learning rate is changed (0.9 in this example), so after the first epoch the learning rate becomes 0.1 × 0.9 = 0.09, after the second 0.081, and so on.
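As a quick sanity check (a small sketch, separate from the training code), the printed schedule can be reproduced directly from the formula initial_lr * gamma^epoch:

initial_lr = 0.1
gamma = 0.9
for epoch in range(1, num_epochs + 1):
    # Learning rate after `epoch` scheduler steps
    print(f'epoch {epoch}    lr  {initial_lr * gamma ** epoch:.4f}')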

Step Learning Rate

Another popular learning rate schedule for deep learning models is to systematically drop the learning rate at specific points during training. StepLR decays the learning rate by gamma every step_size epochs.

# Re-create the optimizer so the new schedule starts from the base learning rate of 0.1
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.9)

for epoch in range(num_epochs):
    for X, y in data_iter:
        optimizer.zero_grad()
        l = loss(net(X), y)
        l.backward()
        optimizer.step()
    scheduler.step()
    print(f'epoch {epoch + 1}    lr  {scheduler.get_last_lr()[0]:.4f}')

#Output
epoch 1    lr  0.1000
epoch 2    lr  0.1000
epoch 3    lr  0.1000
epoch 4    lr  0.0900
epoch 5    lr  0.0900
epoch 6    lr  0.0900
epoch 7    lr  0.0900
epoch 8    lr  0.0810
epoch 9    lr  0.0810
epoch 10    lr  0.0810

Often this method is implemented by dropping the learning rate every fixed number of epochs. For example, we may have an initial learning rate of 0.1 and multiply it by 0.9 every four epochs. The first four epochs of training would use a value of 0.1, the next four epochs would use 0.09, and so on.
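The step schedule can likewise be reproduced from the formula initial_lr * gamma^(epoch // step_size), where floor division counts the number of completed drops (a small sketch, separate from the training code):

initial_lr = 0.1
gamma = 0.9
step_size = 4
for epoch in range(1, num_epochs + 1):
    # Learning rate after `epoch` scheduler steps
    print(f'epoch {epoch}    lr  {initial_lr * gamma ** (epoch // step_size):.4f}')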

Linear Learning Rate

LinearLR changes the learning rate of each parameter group by linearly varying a small multiplicative factor until the number of epochs reaches a pre-defined milestone, total_iters. With a start_factor smaller than 1, as used here, the learning rate actually increases toward its base value, which makes this schedule useful as a warm-up.

# Re-create the optimizer so the new schedule starts from the base learning rate of 0.1
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.5, total_iters=4)

for epoch in range(num_epochs):
    for X, y in data_iter:
        optimizer.zero_grad()
        l = loss(net(X), y)
        l.backward()
        optimizer.step()
    scheduler.step()
    print(f'epoch {epoch + 1}    lr  {scheduler.get_last_lr()[0]:.4f}')

#Output

epoch 1    lr  0.0625
epoch 2    lr  0.0750
epoch 3    lr  0.0875
epoch 4    lr  0.1000
epoch 5    lr  0.1000
epoch 6    lr  0.1000
epoch 7    lr  0.1000
epoch 8    lr  0.1000
epoch 9    lr  0.1000
epoch 10    lr  0.1000

With start_factor=0.5, training begins at a learning rate of 0.05 (0.5 × 0.1), and the factor increases linearly to 1.0 over the first four epochs: the learning rate becomes 0.0625 after the first epoch, 0.075 after the second, 0.0875 after the third, and the full 0.1 after the fourth, where it remains for any additional epochs.
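As before, the printed values can be reproduced from the schedule’s linear interpolation between start_factor and end_factor (a small sketch; end_factor defaults to 1.0 in LinearLR):

initial_lr = 0.1
start_factor, end_factor, total_iters = 0.5, 1.0, 4
for epoch in range(1, num_epochs + 1):
    # Multiplicative factor grows linearly until total_iters, then stays at end_factor
    factor = start_factor + (end_factor - start_factor) * min(epoch, total_iters) / total_iters
    print(f'epoch {epoch}    lr  {initial_lr * factor:.4f}')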
