We know that the objective of training a model is to minimize the loss between the actual output and the predicted output for our given training samples. The path toward this minimum loss is taken over several steps. The size of the steps taken to reach the minimum depends on the learning rate of the model.

After the loss is calculated for our given inputs, the gradient of that loss is calculated with respect to each of the weights in the model. The gradients are then multiplied by the learning rate. The learning rate is a small number, usually between 0.1 and 0.0001, but the actual value can vary.
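To make the update concrete, here is a minimal sketch of a single gradient descent step in plain Python. The weights and gradients are made-up illustrative values, and `sgd_step` is a hypothetical helper name, not part of Keras.

```python
def sgd_step(weights, grads, learning_rate):
    """Return new weights after one step against the gradient."""
    # Each weight moves by (learning rate * its gradient), in the downhill direction.
    return [w - learning_rate * g for w, g in zip(weights, grads)]


weights = [0.5, -1.0]   # hypothetical current weights
grads = [2.0, -4.0]     # hypothetical gradients of the loss for those weights
new_weights = sgd_step(weights, grads, learning_rate=0.01)
print(new_weights)
```

With a learning rate of 0.01, each weight moves by only one hundredth of its gradient, which is why the learning rate directly controls the step size.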

In this tutorial, we’ll discuss why and how to change the learning rate during training.

Setting the learning rate too high risks overshooting. This occurs when we take a step that’s too large in the direction of the minimum of the loss function and end up past it. To avoid this, we can set the learning rate to a number on the lower side of this range. With this option our steps will be really small, so it will take a lot longer to reach the point of minimum loss.
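The trade-off can be seen on a toy problem. The sketch below assumes a one-dimensional loss, loss(w) = w**2 (minimum at w = 0), which is chosen only for illustration; `descend` is a hypothetical helper.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w


too_high = descend(1.1)    # every step jumps past the minimum and lands farther away
too_low = descend(0.001)   # moves toward the minimum, but very slowly
moderate = descend(0.1)    # gets close to the minimum within the same 20 steps

print(too_high, too_low, moderate)
```

With the high rate the weight ends up farther from 0 than where it started, while with the very low rate it has barely moved after 20 steps.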

One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. This is called learning rate decay.

Learning Rate Decay

If you slowly reduce the learning rate, then during the initial phases of training the learning rate is still large, so learning can still be relatively fast. As the learning rate gets smaller, the steps you take become slower and smaller, so you end up oscillating in a tighter region around the minimum rather than wandering far away from it, even as training goes on and on.

The intuition behind slowly reducing the learning rate is that during the initial steps of learning you can afford to take much bigger steps, but as learning approaches convergence, a lower learning rate lets you take smaller steps.

This can be done by using pre-defined learning rate schedules or adaptive learning rate methods. In this article, we train a convolutional neural network on CIFAR-10 using different learning rate schedules and adaptive learning rate methods.

Learning Rate Schedules

Learning rate schedules adjust the learning rate during training according to a pre-defined schedule. Common learning rate schedules include exponential decay, step decay, and time-based decay. For illustrative purposes, we train the network on CIFAR-10 using the stochastic gradient descent (SGD) optimization algorithm with different learning rate schedules to compare their performance.
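As an illustration of the simplest of these, time-based decay, here is a minimal sketch. It assumes one common form of the formula, lr = lr0 / (1 + decay * epoch); the function name and the values of `initial_lrate` and `decay` are illustrative choices, not settings taken from the experiments in this article.

```python
def time_based_decay(epoch, initial_lrate=0.1, decay=0.01):
    """Assumed form: the learning rate shrinks as 1 / (1 + decay * epoch)."""
    return initial_lrate / (1.0 + decay * epoch)


# Learning rate at epochs 0, 25, 50, 75 and 100.
schedule = [time_based_decay(e) for e in range(0, 101, 25)]
print(schedule)
```

With these values the learning rate has dropped to half its initial value by epoch 100.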

A schedule requires a step value to compute the decayed learning rate. You can just pass a variable that you increment at each training step. The schedule then produces a decayed learning rate when passed the current optimizer step. This can be useful for changing the learning rate value across different invocations of optimizer functions.

Step-Based Decay

The step-based decay schedule drops the learning rate by a fixed factor every few epochs. A typical choice is to drop the learning rate by half every 5 or 10 epochs.
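As a quick worked example of the "half every 10 epochs" case, the sketch below counts epochs from 0 and uses a hypothetical helper name, `halve_every_10`. The initial rate of 0.1 is an illustrative value.

```python
import math


def halve_every_10(epoch, initial_lrate=0.1):
    """Halve the learning rate once for every 10 completed epochs."""
    return initial_lrate * math.pow(0.5, math.floor(epoch / 10))


for epoch in (0, 9, 10, 19, 20):
    print(epoch, halve_every_10(epoch))
```

The rate stays flat within each block of 10 epochs and drops only at the block boundaries, which gives the schedule its staircase shape.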

Learning Rate Decay Using Step-Based Decay

To implement this in Keras, we can define a step decay function and pass it to the LearningRateScheduler callback, which calls the function at each epoch and applies the returned learning rate to the SGD optimizer.

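import math

def step_decay(epoch):
   # Multiply the learning rate by `drop` once every `epochs_drop` epochs.
   initial_lrate = 0.1
   drop = 0.2
   epochs_drop = 2.0
   # The exponent was cut off in the source; the floor term below is the
   # usual step decay form and is an assumption.
   lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
   return lrate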


lrate = tf.keras.callbacks.LearningRateScheduler(step_decay)


history_step = model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              # Assumed completion: the rest of this call was cut off in the source.
              callbacks=[lrate])

We can use callbacks to get a view on internal states and statistics of the model during training. In our example, we create a custom callback by extending the base class keras.callbacks.Callback to record the learning rate during the training procedure.

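class LearningRate(tf.keras.callbacks.Callback):
  def on_train_begin(self, logs=None):
    # Assumed body (cut off in the source): start an empty history.
    self.lrates = []

  def on_epoch_end(self, epoch, logs=None):
    # Assumed body: record the rate step_decay gives for this epoch.
    self.lrates.append(step_decay(epoch))
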
Exponential Decay

This schedule applies an exponential decay function to an optimizer step, given a provided initial learning rate.

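initial_learning_rate = 0.1

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    # The remaining arguments were cut off in the source; the values
    # below are illustrative assumptions.
    decay_steps=100000,
    decay_rate=0.96)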

You can pass this schedule directly into a tf.keras.optimizers.Optimizer as the learning rate.
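The value such a schedule returns at a given step can be sketched in plain Python. This follows the formula documented for ExponentialDecay, initial_learning_rate * decay_rate ^ (step / decay_steps), where the staircase option uses integer division so the rate drops in discrete jumps. The function below is a stand-alone illustration, not the Keras implementation, and its default values are made up for readability.

```python
def exponential_decay(step, initial_learning_rate=0.1,
                      decay_steps=1000, decay_rate=0.5, staircase=False):
    """Decayed learning rate at a given optimizer step."""
    if staircase:
        exponent = step // decay_steps   # changes only every decay_steps steps
    else:
        exponent = step / decay_steps    # changes smoothly at every step
    return initial_learning_rate * decay_rate ** exponent


for step in (0, 500, 1000, 2000):
    print(step, exponential_decay(step), exponential_decay(step, staircase=True))
```

With these illustrative values the rate is multiplied by 0.5 every 1000 steps. In staircase mode it stays at 0.1 until step 1000 is reached.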

Adaptive Learning Rate

In Keras, we can implement adaptive learning rate algorithms easily using built-in optimizers such as Adagrad, Adadelta, RMSprop, and Adam. It is usually recommended to leave the hyperparameters of these optimizers at their default values.
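To give a feel for what "adaptive" means here, the sketch below shows a simplified, single-weight version of the Adagrad idea: the effective step size shrinks as squared gradients accumulate. It is an illustration under simplifying assumptions (a toy loss w**2, a hypothetical `adagrad_step` helper) and differs in details from the Keras optimizer.

```python
import math


def adagrad_step(w, grad, accum, learning_rate=0.1, epsilon=1e-7):
    """One simplified Adagrad update for a single weight."""
    accum = accum + grad ** 2                                  # running sum of squared gradients
    effective_lr = learning_rate / (math.sqrt(accum) + epsilon)
    return w - effective_lr * grad, accum, effective_lr


w, accum = 1.0, 0.0
effective_lrs = []
for _ in range(5):
    grad = 2 * w   # gradient of the toy loss w**2
    w, accum, eff = adagrad_step(w, grad, accum)
    effective_lrs.append(eff)

print(effective_lrs)
```

No schedule is specified anywhere, yet the effective learning rate falls from step to step, because it is derived from the gradients seen so far.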


Finally, we compare the performance of the learning rate schedules and the adaptive learning rate methods.

Comparing Learning Rate


The value we choose for the learning rate is going to require some testing. The learning rate is another one of those hyperparameters that we have to test and tune with each model before we know exactly where we want to set it. Note that the decay rate becomes another hyperparameter that you might need to tune.
