Very complex weight configurations often lead to overfitting: the model simply memorizes the training inputs instead of learning to abstract and generalize. Regularization is an important technique in machine learning and neural networks. It reduces the complexity of a model by limiting how large the weights in the network can grow, which in turn reduces overfitting.
In this tutorial, we’ll discuss what regularization is, and when and why it may be helpful to add it to our model. We’ll look at L1 and L2 regularization and how each is used to combat overfitting in a neural network.
In general, regularization is a technique that helps reduce overfitting, or reduce variance, in our network by penalizing complexity. The idea is that certain complexities in our model may make it unlikely to generalize well even though it fits the training data. By adding regularization, we’re essentially trading away some of the model’s ability to fit the training data in exchange for better generalization to data it hasn’t seen before.
To implement regularization, we simply add a term to our loss function that penalizes large weights. The most common techniques are L1 and L2 regularization.
The L1 regularization term is the sum of the absolute values of all weights in the model.
Here, we compute the sum of the absolute values of all the weights. The weights can be positive or negative and can span a wide range of values. We sum the absolute values and multiply the result by a coefficient called alpha, which we choose to control how strong an effect the L1 penalty has.
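Written out, the L1-regularized loss described above looks like the following. The names \mathcal{L}_{\text{data}} (the original, unpenalized loss) and w_i (the individual weights) are notation assumed here for illustration:

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \alpha \sum_{i} \lvert w_i \rvert
```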
The optimizer tries to shrink the error as much as possible. Because the sum of the absolute weight values is added to that error, minimizing the total also pushes the weights toward smaller values. We take absolute values because otherwise the optimizer could keep lowering the penalty by pushing all the weights toward large negative numbers, which would be harmful.
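A small sketch of why the absolute value matters. The weights and the value of alpha below are made up for illustration; the point is only that a signed sum rewards very negative weights while the L1 penalty does not:

```python
# Hypothetical weights and penalty strength, chosen only for illustration.
weights = [0.5, -2.0, 1.5, -3.0]
alpha = 0.01

signed_sum = sum(weights)                  # negative weights lower this value
l1_penalty = sum(abs(w) for w in weights)  # never negative

# Push every weight far into negative territory.
shifted = [w - 10.0 for w in weights]
signed_sum_shifted = sum(shifted)
l1_penalty_shifted = sum(abs(w) for w in shifted)

# The signed sum "improves" (gets smaller) while the L1 penalty gets larger.
print(alpha * signed_sum, alpha * signed_sum_shifted)
print(alpha * l1_penalty, alpha * l1_penalty_shifted)
```

With a signed sum, an optimizer would be rewarded for making the weights more and more negative; with absolute values, the only way to lower the penalty is to move the weights toward zero.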
The most popular form is L2 regularization, whose penalty term is the sum of the squares of all weights in the model.
Let’s break down L2 regularization. We start with our loss function and add to it the sum of the squared norms of our weight matrices, multiplied by a constant. This constant is denoted by lambda. Lambda is called the regularization parameter, and it is another hyperparameter that we have to choose, test, and tune to find a suitable value for our specific model.
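The description above can be written as the following formula. The symbols \mathcal{L}_{\text{data}} (the original loss) and W^{(j)} (the weight matrix of layer j) are notation assumed here, and some texts scale the penalty by an extra factor such as 1/2, which is not used in this tutorial:

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_{j} \lVert W^{(j)} \rVert_2^2
```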
If we set lambda to a relatively large value, the model is incentivized to keep the weights close to 0. The objective of SGD is to minimize the loss function, and our original loss is now added to lambda times the sum of the squared weight-matrix norms.
When lambda is large, the product of lambda and that sum can also be large, depending on the size of the weights. The model is therefore pushed to make the weights small so that the overall loss stays small, which is exactly what minimizing the loss requires.
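To make this intuition concrete, here is a minimal sketch on a toy one-weight problem (not a neural network). The function name fit_weight, the target value, and the learning rate are all assumptions made for illustration. The loss is (w - target)**2 + lam * w**2, whose minimizer is target / (1 + lam):

```python
def fit_weight(lam, target=3.0, lr=0.05, steps=500):
    """Minimize (w - target)**2 + lam * w**2 with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the data term plus gradient of the L2 penalty.
        grad = 2.0 * (w - target) + 2.0 * lam * w
        w -= lr * grad
    return w

# Larger lambda values pull the learned weight closer to zero.
for lam in (0.0, 0.1, 1.0, 10.0):
    print(lam, round(fit_weight(lam), 4))
```

As lambda grows, the learned weight moves away from the value that best fits the data (3.0 here) and toward 0, which is the trade-off between fitting and shrinking described above.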
Now that we have a general idea about regularization, let’s see how we can add it to our model. We’ll start by looking at how L1 and L2 are implemented in a simple PyTorch model.
In PyTorch, we can implement regularization by adding a term to the loss. After computing the loss, whatever the loss function is, we iterate over the parameters of the model, sum their squares (for L2) or absolute values (for L1), add the scaled sum to the loss, and backpropagate:
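from tqdm import tqdm
# Assumes model, criterion, optimizer, trainloader and epoch are defined elsewhere.
print('\nEpoch : %d' % epoch)
running_loss, correct, total = 0.0, 0, 0
for data in tqdm(trainloader):
    inputs, labels = data
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    # Replace pow(2.0) with abs() for L1 regularization
    l2_lambda = 0.001
    l2_norm = sum(p.pow(2.0).sum() for p in model.parameters())
    loss = loss + l2_lambda * l2_norm
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
    _, predicted = outputs.max(1)
    total += labels.size(0)
    correct += predicted.eq(labels).sum().item()
train_loss = running_loss / len(trainloader)
accu = 100.0 * correct / total
print('Train Loss: %.3f | Accuracy: %.3f' % (train_loss, accu))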
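The comment in the loop mentions swapping pow(2.0) for abs() to get L1. One way to make that switch explicit is a small helper, sketched below. The function name regularization_penalty, the tiny nn.Linear model, the random batch, and the value of l1_lambda are all hypothetical stand-ins chosen for illustration:

```python
import torch
from torch import nn

def regularization_penalty(model: nn.Module, kind: str = "l2") -> torch.Tensor:
    """Sum of squared (L2) or absolute (L1) parameter values over the whole model."""
    if kind == "l2":
        return sum(p.pow(2.0).sum() for p in model.parameters())
    if kind == "l1":
        return sum(p.abs().sum() for p in model.parameters())
    raise ValueError(f"unknown penalty kind: {kind!r}")

# Tiny stand-in model and random batch, used only to exercise the helper.
torch.manual_seed(0)
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
inputs, labels = torch.randn(8, 4), torch.randint(0, 3, (8,))

l1_lambda = 0.001  # assumed strength; tune for your own model
data_loss = criterion(model(inputs), labels)
loss = data_loss + l1_lambda * regularization_penalty(model, "l1")
loss.backward()  # gradients now include the penalty's contribution
```

Like the loop above, this helper iterates over model.parameters(), so biases are penalized along with the weights.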
For plain SGD, adding L2 regularization to the loss function is equivalent to decreasing each weight by an amount proportional to its current value during the optimization step. For this reason, L2 regularization is also referred to as weight decay.
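To see why, consider a single plain SGD step (no momentum) on the penalized loss. The learning-rate symbol \eta and the name \mathcal{L}_{\text{data}} for the unpenalized loss are notation assumed here:

```latex
\begin{aligned}
\nabla_w \left( \mathcal{L}_{\text{data}} + \lambda \lVert w \rVert_2^2 \right)
  &= \nabla_w \mathcal{L}_{\text{data}} + 2\lambda w \\
w &\leftarrow w - \eta \left( \nabla_w \mathcal{L}_{\text{data}} + 2\lambda w \right)
  = (1 - 2\eta\lambda)\, w - \eta \nabla_w \mathcal{L}_{\text{data}}
\end{aligned}
```

Each step scales the weight by the factor (1 - 2\eta\lambda) before applying the usual gradient update, which is where the name "weight decay" comes from.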
The SGD optimizer in PyTorch already has a weight_decay parameter, which corresponds to 2 * lambda here, and it performs the weight decay directly during the update as described above. For SGD this is equivalent to adding the squared L2 norm of the weights to the loss, without accumulating extra terms in the loss or involving autograd. Note that weight decay applies to all parameters passed to the optimizer, including biases.
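The correspondence between the explicit penalty and weight_decay can be checked numerically. The sketch below takes one SGD step both ways on two copies of the same small model. The model, the random data, and the values of l2_lambda and lr are assumptions made for the demonstration:

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
l2_lambda, lr = 0.01, 0.1  # assumed values for the demonstration

model_a = nn.Linear(4, 1)
model_b = copy.deepcopy(model_a)  # identical starting parameters
x, y = torch.randn(16, 4), torch.randn(16, 1)
criterion = nn.MSELoss()

# (a) Explicit L2 penalty added to the loss, no weight_decay.
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr)
opt_a.zero_grad()
penalty = sum(p.pow(2.0).sum() for p in model_a.parameters())
(criterion(model_a(x), y) + l2_lambda * penalty).backward()
opt_a.step()

# (b) Plain loss, with weight_decay set to 2 * lambda.
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr, weight_decay=2 * l2_lambda)
opt_b.zero_grad()
criterion(model_b(x), y).backward()
opt_b.step()

# Largest absolute difference between the two sets of updated parameters.
max_diff = max((pa.detach() - pb.detach()).abs().max().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(max_diff)
```

The two models should agree up to floating-point rounding. This check covers plain SGD only; it does not say anything about how other optimizers treat weight_decay.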