As you train a Keras model for longer, the weights become increasingly specialized to the training data, i.e. the model overfits. The weights tend to grow in size in order to capture the specifics of the examples seen in the training data.

Large weights make the network unstable. Although the weights will be specialized to the training dataset, minor variations or statistical noise in the expected inputs will result in large differences in the output.

A neural network with large weights may therefore be unstable: small changes in the input can lead to large changes in the output. Such a network has overfit the training dataset and will likely perform poorly when making predictions on new data.

You can update the learning algorithm to encourage the network to keep the weights small by using weight regularization. It is a general technique for reducing overfitting of the training dataset and improving the generalization of the model.

Regularizers allow you to apply penalties on layer weights during optimization. These penalties are summed into the loss function that the network optimizes.
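To make this concrete, here is a minimal sketch of how a kernel penalty ends up in the loss. The layer size, the L2 coefficient of 1e-4, and the dummy data are illustrative assumptions, not values from this article:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# A Dense layer whose kernel carries an L2 penalty (coefficient chosen for illustration).
dense = layers.Dense(16, kernel_regularizer=regularizers.l2(1e-4))

x = tf.random.normal((8, 32))        # dummy batch of 8 examples with 32 features
y_true = tf.random.normal((8, 16))   # dummy targets

y_pred = dense(x)
data_loss = tf.reduce_mean(tf.square(y_true - y_pred))  # plain MSE on the data
penalty = tf.add_n(dense.losses)     # regularization terms collected by the layer
total_loss = data_loss + penalty     # the quantity the optimizer actually minimizes

When you compile and fit a model with model.compile() and model.fit(), Keras performs this summation for you.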

You can apply penalties on a per-layer basis. Keras layers (e.g. Dense, Conv1D, Conv2D, and Conv3D) have a unified API that exposes 3 keyword arguments. To see what each one penalizes, consider a layer that computes the regression equation y = Wx + b, where x is the input, W the weight matrix (the kernel), and b the bias:

1. kernel_regularizer: applies a penalty on the layer’s kernel (the weights W), excluding the bias.

2. bias_regularizer: applies a penalty only on the layer’s bias b.

We typically use a parameter norm penalty that penalizes only the weights of each layer and leaves the biases unregularized. The biases typically require less data than the weights to fit accurately: each weight specifies how two variables interact, so fitting it well requires observing both variables in a variety of conditions.

Each bias controls only a single variable, so we do not induce too much variance by leaving the biases unregularized. Moreover, regularizing the bias parameters can introduce a significant amount of underfitting.

3. activity_regularizer: applies a penalty on the layer’s output (its activations). The following code applies L1 and L2 vector norm penalties to a Conv2D layer:

from tensorflow.keras import layers
from tensorflow.keras import regularizers

layer = layers.Conv2D(
    filters=64,
    kernel_size=3,
    kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),  # penalty on the kernel (weights)
    bias_regularizer=regularizers.l2(1e-4),                   # penalty on the bias
    activity_regularizer=regularizers.l2(1e-5),               # penalty on the layer output
)

The value returned by the activity_regularizer object gets divided by the input batch size so that the relative weighting between the weight regularizers and the activity regularizers does not change with the batch size.
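As a quick check, the penalties a layer has accumulated are available in its losses attribute once it has been called on data. A small sketch, using the layer defined above and an illustrative input shape:

import tensorflow as tf

# Call the layer on a dummy batch so the regularization terms are computed.
x = tf.random.normal((2, 32, 32, 3))  # batch of 2 RGB images of size 32x32 (assumed shape)
_ = layer(x)

# One tensor per penalty: kernel, bias, and activity regularization terms.
print(layer.losses)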

Why Layer Regularization

It is sometimes desirable to use a separate penalty with a different coefficient for each layer of the network. However, because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to reduce the size of the search space.
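For example, here is a minimal sketch of a model with a different L2 coefficient per layer. The architecture, the input shape, and the coefficients 1e-3 and 1e-5 are illustrative assumptions:

from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),  # stronger penalty on the first layer
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-5)),  # weaker penalty on the second layer
    layers.Dense(1),                                         # output layer left unregularized
])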

If you have a high-variance problem, one of the first things to try is this kind of regularization. The other way to address high variance is to get more training data, provided it is reliable. We can’t always get more training data, but adding regularization will often help to prevent overfitting and reduce variance in neural networks.

If chosen carefully, these extra constraints and penalties can lead to improved performance on the test set. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.

A model with large weights is more complex than a model with smaller weights, and it is a sign of a network that may be overly specialized to the training data. In practice, we prefer the simpler model for a given problem, i.e. the model with smaller weights.
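If you want to compare weight magnitudes directly, one simple sketch (assuming a trained Keras model named model, such as the one built above) is to print the L2 norm of each layer’s kernel:

import tensorflow as tf

# Print the L2 norm of each kernel; smaller norms indicate smaller, simpler weights.
for layer in model.layers:
    if hasattr(layer, "kernel"):
        print(layer.name, float(tf.norm(layer.kernel)))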
