Overgeneralizing is something that we humans do all too often, and machines can fall into the same trap if we are not careful. In machine learning this is called overfitting: the model performs well on the training data, but it does not generalize well to new data.

Deep neural networks typically have thousands of parameters, sometimes even millions. This gives them an incredible amount of freedom and means they can fit a huge variety of complex datasets. But this great flexibility also makes the network prone to overfitting the training set.

The more parameters a model has, the more prone it is to overfitting the training data. We will first look at how to detect whether this is happening, using learning curves, and then at several regularization techniques that can reduce the risk of overfitting the training set.

Overfitting the Training Data

The figure shows an example of a classification model that strongly overfits the training data. It performs much better on the training data than on the validation data, so would you really trust its predictions?

[Figure: training and validation loss curves]

There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model.
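As a rough illustration, the gap can be computed directly from the per-epoch losses. The sketch below assumes the two loss histories are available as plain lists; the values are made up for illustration and the helper name `generalization_gap` is hypothetical.

```python
def generalization_gap(train_losses, val_losses):
    """Return the per-epoch gap: validation loss minus training loss."""
    if len(train_losses) != len(val_losses):
        raise ValueError("both histories must cover the same epochs")
    return [v - t for t, v in zip(train_losses, val_losses)]


# Made-up loss histories, for illustration only.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.95, 0.70, 0.58, 0.55, 0.60, 0.68]

gaps = generalization_gap(train_losses, val_losses)

# Epoch (0-based) at which the validation loss was lowest.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
```

In this made-up run the training loss keeps falling while the validation loss bottoms out and then rises, so the gap widens from epoch to epoch.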

If the validation curve is close to the training curve, there is not too much overfitting.

If you used a much larger training set, however, the two curves would continue to get closer. One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.

Overfitting also happens when the model is too complex relative to the amount and noisiness of the training data. 

Underfitting the Training Data

As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. Reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples. The main options to fix this problem are:

  • Selecting a more powerful model, with more parameters
  • Feeding better features to the learning algorithm
  • Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)

If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.

Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit). 

How can you decide how complex your model should be? How can you tell that your model is overfitting or underfitting the data?

Detect Overfitting

You can use cross-validation to estimate a model’s generalization performance. If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. If it performs poorly on both, then it is underfitting. This is one way to tell when a model is too simple or too complex.
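To make the comparison concrete, here is a minimal pure-Python sketch of k-fold cross-validation. Everything in it is an assumption chosen for illustration: the toy dataset, the two deliberately extreme models (a mean predictor that is too simple, and a 1-nearest-neighbor memorizer that is too flexible), and the helper names.

```python
def interleaved_folds(n, k):
    """Yield (train_idx, val_idx); fold f validates on indices f, f+k, f+2k, ..."""
    for fold in range(k):
        val_idx = list(range(fold, n, k))
        held_out = set(val_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        yield train_idx, val_idx


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """Too simple: always predicts the mean training target."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_nearest(xs, ys):
    """Too flexible: memorizes the training set (1-nearest neighbor)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cross_validate(fit, xs, ys, k=5):
    """Return (mean training MSE, mean validation MSE) over k folds."""
    train_scores, val_scores = [], []
    for train_idx, val_idx in interleaved_folds(len(xs), k):
        x_tr = [xs[i] for i in train_idx]
        y_tr = [ys[i] for i in train_idx]
        x_va = [xs[i] for i in val_idx]
        y_va = [ys[i] for i in val_idx]
        predict = fit(x_tr, y_tr)
        train_scores.append(mse(predict, x_tr, y_tr))
        val_scores.append(mse(predict, x_va, y_va))
    return sum(train_scores) / k, sum(val_scores) / k


# Toy data: a linear trend plus alternating +3 / -3 "noise".
xs = [float(i) for i in range(20)]
ys = [2.0 * x + (3.0 if i % 2 == 0 else -3.0) for i, x in enumerate(xs)]

mean_train, mean_val = cross_validate(fit_mean, xs, ys)
nn_train, nn_val = cross_validate(fit_nearest, xs, ys)
```

On this toy data the memorizing model has zero training error but a clearly nonzero validation error (overfitting), while the mean predictor does poorly on both (underfitting).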

Another way to tell is to look at the learning curves: these are plots of the model’s performance on the training set and the validation set as a function of the training set size (or the training iteration).
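The sketch below shows one way such curves could be computed, by training on growing subsets of the training set and recording both errors. It assumes a one-dimensional least-squares line as the model and a small hand-made dataset; the function names and the data are illustrative, not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(params, xs, ys):
    """Mean squared error of the line (a, b) on (xs, ys)."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(x_train, y_train, x_val, y_val, sizes):
    """Return a list of (m, train_mse, val_mse), one entry per training size m."""
    curve = []
    for m in sizes:
        params = fit_line(x_train[:m], y_train[:m])
        curve.append((m,
                      mse(params, x_train[:m], y_train[:m]),
                      mse(params, x_val, y_val)))
    return curve


# Hand-made data: y = 2x + 1 plus small deterministic "noise".
x_train = [float((7 * i) % 20) for i in range(20)]
y_train = [2 * x + 1 + ((3 * i) % 5 - 2) for i, x in enumerate(x_train)]
x_val = [i + 0.5 for i in range(0, 20, 2)]
y_val = [2 * x + 1 + ((2 * j) % 5 - 2) for j, x in enumerate(x_val)]

curve = learning_curve(x_train, y_train, x_val, y_val, sizes=[2, 5, 10, 20])
for m, train_err, val_err in curve:
    print(f"m={m:2d}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")
```

With only two training points the line fits them perfectly while the validation error is much larger; with the full training set the two errors end up close together.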

Prevent Overfitting Using Regularization

Regularization refers to techniques whose objective is to reduce overfitting and improve the model’s ability to generalize.

Dropout Layer

Note that Dense layers often have a lot of parameters. For example, a first hidden layer with 300 neurons fed by 784 inputs has 784 × 300 connection weights, plus 300 bias terms, which adds up to 235,500 parameters! This gives the model quite a lot of flexibility to fit the training data, but it also means that the model runs the risk of overfitting, especially when you do not have a lot of training data.
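The arithmetic is easy to check. The helper below is a hypothetical name introduced only for this illustration; it counts one weight per input per neuron, plus one bias per neuron.

```python
def dense_param_count(n_inputs, n_units):
    """Connection weights (one per input per unit) plus one bias per unit."""
    return n_inputs * n_units + n_units


first_hidden = dense_param_count(784, 300)
print(first_hidden)
```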

Dropout is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step.
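A minimal sketch of the idea, applied to one layer's outputs represented as a plain list. Note one assumption that goes beyond the description above: surviving activations are divided by the keep probability (1 - p), the common "inverted dropout" convention, so that the expected output is the same with and without dropout. The function name and the example values are illustrative.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training.

    Survivors are scaled by 1 / (1 - p) ("inverted dropout", an assumption
    of this sketch). Outside training, the input is returned unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(42)  # fixed seed so the run is repeatable
layer_output = [1.0] * 10

dropped = dropout(layer_output, p=0.5, rng=rng)
at_inference = dropout(layer_output, p=0.5, rng=rng, training=False)
```

Each call during training draws a fresh mask, so a neuron dropped in one step may be active in the next.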

Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).

Max-Norm Regularization

Another regularization technique that is popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the max-norm hyperparameter and ∥·∥2 is the ℓ2 norm. It does not add a regularization loss term to the overall loss function. Instead, it is typically implemented by computing ∥w∥2 after each training step and rescaling w if needed (w ← w · r / ∥w∥2). Reducing r increases the amount of regularization and helps reduce overfitting.
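The rescaling step can be sketched in a few lines. This assumes that w is the list of incoming weights of a single neuron, that r is the maximum allowed ℓ2 norm, and that "if needed" means "when the norm exceeds r"; the function name is illustrative.

```python
import math


def max_norm(weights, r):
    """Rescale weights so that their l2 norm is at most r."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= r:
        return list(weights)  # already within the constraint
    scale = r / norm
    return [w * scale for w in weights]


clipped = max_norm([3.0, 4.0], r=1.0)    # norm 5 exceeds r, so it is rescaled
untouched = max_norm([0.3, 0.4], r=1.0)  # norm 0.5 is within r, left as is
```

In training, this function would be applied to each neuron's incoming weights after every update step.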

Pooling Layers

The goal of pooling layers is to shrink the input image in order to reduce the computational load, memory usage, and the number of parameters, thereby limiting the risk of overfitting.
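As an illustration of the shrinking effect, here is a sketch of one common variant, 2 × 2 max pooling with stride 2. The choice of max pooling, the function name, and the tiny example image are assumptions made for this example.

```python
def max_pool_2x2(image):
    """Non-overlapping 2x2 max pooling on a 2D list.

    A trailing odd row or column is dropped.
    """
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            max(image[2 * i][2 * j], image[2 * i][2 * j + 1],
                image[2 * i + 1][2 * j], image[2 * i + 1][2 * j + 1])
            for j in range(cols)
        ]
        for i in range(rows)
    ]


image = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 8, 1, 0],
    [7, 6, 2, 3],
]
pooled = max_pool_2x2(image)  # 4x4 input becomes 2x2
```

The pooling operation has no weights of its own; the savings come from the smaller input that the following layers have to process.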

Data Augmentation

Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance. This reduces overfitting, making this a regularization technique.
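A minimal sketch of the idea, using two simple transformations on images stored as 2D lists. The transformations (horizontal flip and a one-pixel shift), the function names, and the toy dataset are assumptions for illustration; which variants count as "realistic" depends on the task (a flipped digit, for instance, is usually not a valid digit).

```python
def flip_horizontal(image):
    """Mirror each row of the image."""
    return [row[::-1] for row in image]


def shift_right(image, pixels=1, fill=0):
    """Shift the image right, padding with `fill` (assumes 0 <= pixels <= width)."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]


def augment(dataset):
    """Return the original (image, label) pairs plus two variants of each."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((flip_horizontal(image), label))
        augmented.append((shift_right(image), label))
    return augmented


dataset = [([[1, 2, 3], [4, 5, 6]], "a")]
augmented = augment(dataset)
```

The labels are unchanged, so the training set becomes three times larger without any new labeling work.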

Prevent Overfitting in Keras

If you use a validation set during training, you can set save_best_only=True when creating the ModelCheckpoint callback. In this case, it will only save your model when its performance on the validation set is the best so far. This way, you do not need to worry about training for too long and overfitting the training set:

from tensorflow import keras
ckpoint = keras.callbacks.ModelCheckpoint("model.h5", save_best_only=True)

EarlyStopping Callback

Another way to implement early stopping is to use the EarlyStopping callback. It will interrupt training when it measures no progress on the validation set for a number of epochs (defined by the patience argument), and it will optionally roll back to the best model (when restore_best_weights=True is set).
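The patience logic can be illustrated without Keras. The following is a simplified sketch of the behavior described above, not the library's actual implementation; the function name and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs. If that never happens,
    stop_epoch is simply the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is reached at epoch 2.
val_losses = [0.80, 0.62, 0.55, 0.57, 0.56, 0.58, 0.60, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

Rolling back to the best model then amounts to restoring the weights saved at best_epoch rather than keeping those from stop_epoch.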

You can combine both callbacks, both to save checkpoints of your model (in case your computer crashes) and to interrupt training early when there is no more progress (to avoid wasting time and resources).

Conclusion

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones. Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.

For more complex problems, you can ramp up the number of hidden layers until you start overfitting the training set.

Just like the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. But in practice, it’s often simpler and more efficient to pick a model with more layers and neurons than you actually need, then use early stopping and other regularization techniques to prevent it from overfitting.