The conventional method for tuning hyper-parameters is to perform a grid search or a random search, which can be computationally expensive and time-consuming. In addition, the effects of these hyper-parameters are tightly coupled with each other, the data, and the architecture. In this tutorial, we discuss two hyper-parameters: the learning rate and weight decay. Choosing good values for them can improve the network’s performance.

Learning Rate

The model parameters are initialized randomly and then tweaked repeatedly to minimize the cost function. The size of each learning step is proportional to the slope of the cost function, so the steps gradually get smaller as the parameters approach the minimum. The constant of proportionality is set by the learning rate hyperparameter.
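As a minimal sketch of the update described above, the snippet below runs plain gradient descent on a toy one-dimensional cost. The cost function, the starting point, and the names `gradient_descent` and `grad` are illustrative assumptions, not part of any particular library.

```python
def gradient_descent(grad, x0, learning_rate, n_steps):
    """Run plain gradient descent and return every visited point."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        # The step is the learning rate times the slope at the current point.
        step = learning_rate * grad(x)
        x = x - step
        path.append(x)
    return path


# Assumed toy cost f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Its minimum is at x = 3.
def grad(x):
    return 2.0 * (x - 3.0)


path = gradient_descent(grad, x0=0.0, learning_rate=0.1, n_steps=50)

# Distance covered at each iteration; it shrinks as the slope flattens.
step_sizes = [abs(b - a) for a, b in zip(path, path[1:])]
```

Because the slope flattens near the minimum, the entries of `step_sizes` shrink from one iteration to the next even though the learning rate itself stays fixed.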


If the learning rate (LR) is too small, training takes many iterations to converge and overfitting can occur. Large learning rates help to regularize the training, but if the learning rate is too large, the training will diverge.

[Figure: a learning rate that is too small]

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. 

[Figure: a learning rate that is too large]
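The two failure modes can be reproduced on a toy problem. The sketch below assumes the cost f(x) = x^2 and three arbitrarily chosen learning rates; the function name `run` and all numeric values are illustrative only.

```python
def run(learning_rate, n_steps=20, x0=1.0):
    """Gradient descent on f(x) = x^2 (gradient 2x); returns the final |x|."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * 2.0 * x
    return abs(x)


# Distance from the minimum (x = 0) after 20 steps, starting from x = 1.
too_small = run(0.001)   # barely moves away from the starting point
reasonable = run(0.1)    # gets close to the minimum
too_large = run(1.1)     # overshoots on every step and moves further away
```

With the too-small rate the parameter has hardly moved after 20 steps, while with the too-large rate every step lands further from the minimum than the one before.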

One remedy is to gradually reduce the learning rate during training. The steps start out large, which helps make quick progress and escape local minima, then get smaller and smaller, allowing the algorithm to settle at the global minimum.

Learning Schedule

The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.
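A learning schedule is just a function from the iteration number to a learning rate. The sketch below uses exponential decay as one possible choice; the text does not prescribe a specific schedule, so the formula, the name `exponential_schedule`, and the parameter values are assumptions made for illustration.

```python
def exponential_schedule(initial_lr, decay_rate, decay_steps):
    """Build a schedule that multiplies the rate by decay_rate every decay_steps."""
    def schedule(iteration):
        return initial_lr * decay_rate ** (iteration / decay_steps)
    return schedule


# Assumed settings: start at 0.1 and divide by 10 every 100 iterations.
schedule = exponential_schedule(initial_lr=0.1, decay_rate=0.1, decay_steps=100)

# Learning rate at a few selected iterations.
learning_rates = [schedule(t) for t in (0, 50, 100, 200)]
```

A smaller `decay_steps` reduces the rate more quickly, and a larger one reduces it more slowly, which are the two cases discussed above.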

If you are not satisfied with the performance of your model, you should go back and tune the hyperparameters. The first one to check is the learning rate. If that doesn’t help, try another optimizer, and always retune the learning rate after changing any hyperparameter. If the performance is still not great, then try tuning other hyper-parameters, such as the number of layers, the number of neurons per layer, the type of activation function to use for each hidden layer, the batch size, and the weight decay.

Weight Decay

Weight decay is one form of regularization, and it plays an important role in training, so its value needs to be set properly. The important point is that practitioners must balance the various forms of regularization to obtain good performance.

Weight decay is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the norm of the weights. Weight decay can also be incorporated directly into the weight update rule, rather than only implicitly through the objective function.
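The two formulations can be compared side by side. The sketch below assumes plain SGD and a squared L2 penalty of (weight_decay / 2) * sum(w^2); the function names and numeric values are hypothetical, and the equivalence shown here is specific to plain SGD.

```python
def sgd_step_with_decay(weights, grads, learning_rate, weight_decay):
    """One SGD step with weight decay written directly into the update rule."""
    return [
        w - learning_rate * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


def l2_penalized_grads(weights, grads, weight_decay):
    """Gradient of: primary loss + (weight_decay / 2) * sum of squared weights."""
    return [g + weight_decay * w for w, g in zip(weights, grads)]


weights = [1.0, -2.0]
grads = [0.5, 0.5]          # assumed gradients of the primary loss
lr, wd = 0.1, 0.01

# Route 1: decay applied inside the update rule.
decayed = sgd_step_with_decay(weights, grads, lr, wd)

# Route 2: penalty added to the objective, then an ordinary SGD step.
penalized = l2_penalized_grads(weights, grads, wd)
via_objective = sgd_step_with_decay(weights, penalized, lr, 0.0)
```

Under these assumptions both routes give the same updated weights, and in each case every weight is pulled slightly toward zero on top of the ordinary gradient step.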

Unlike the learning rate, the best weight decay value should remain constant throughout training. Since the network’s performance depends on a proper weight decay value, a grid search is worthwhile, and differences are visible early in the training. That is, the validation loss early in the training is sufficient for determining a good value.

If you have no idea of a reasonable value for weight decay, test 10^-3, 10^-4, 10^-5, and 0. Smaller datasets and architectures seem to require larger values for weight decay while larger datasets and deeper architectures seem to require smaller values.
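A short grid search over weight decay, judged by the validation loss after only a few epochs, might look like the sketch below. The toy data, the two-weight linear model, and the helper names (`make_split`, `mse`, `early_validation_loss`) are all hypothetical stand-ins for a real network and dataset.

```python
import random


def make_split(n, rng):
    """Hypothetical toy data: y = 2*a - b plus a little Gaussian noise."""
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    ys = [2.0 * a - 1.0 * b + rng.gauss(0.0, 0.1) for a, b in xs]
    return xs, ys


def mse(w, xs, ys):
    """Mean squared error of the linear model w on (xs, ys)."""
    return sum((w[0] * a + w[1] * b - y) ** 2
               for (a, b), y in zip(xs, ys)) / len(ys)


def early_validation_loss(weight_decay, train, val, epochs=5, lr=0.1):
    """Train briefly with SGD plus weight decay; return the validation loss."""
    w = [0.0, 0.0]
    xs, ys = train
    for _ in range(epochs):
        for (a, b), y in zip(xs, ys):
            err = w[0] * a + w[1] * b - y
            # Compute both updates from the current weights before applying.
            new_w0 = w[0] - lr * (2.0 * err * a + weight_decay * w[0])
            new_w1 = w[1] - lr * (2.0 * err * b + weight_decay * w[1])
            w = [new_w0, new_w1]
    return mse(w, *val)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
train, val = make_split(20, rng), make_split(20, rng)

candidates = [1e-3, 1e-4, 1e-5, 0.0]
losses = {wd: early_validation_loss(wd, train, val) for wd in candidates}
best = min(losses, key=losses.get)  # candidate with the lowest early loss
```

On a real network the same loop would wrap a few epochs of actual training; only the candidate list and the early validation loss as the selection criterion come from the text.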

The weight decay value is a key knob for balancing weight-decay regularization against the regularization that comes from an increasing learning rate. While other forms of regularization are generally fixed (e.g., dropout ratio, stochastic depth), one can easily change the weight decay value when experimenting with maximum learning rate and stepsize values.

A general principle is: the amount of regularization must be balanced for each dataset and architecture. Recognition of this principle permits the general use of super-convergence. Reducing other forms of regularization and regularizing with very large learning rates makes training significantly more efficient.