Your network can never rely on any given activation being present, because any of them might be zeroed out at any given moment. It is forced to learn a redundant representation of everything, to make sure that at least some of the information survives. One activation gets wiped out, but there are always one or more others that do the same job and that don’t get dropped.
So everything remains fine in the end. Forcing your network to learn redundant representations might sound very inefficient, but in practice it makes things more robust and prevents overfitting. It also makes your network behave as if it were taking the consensus of an ensemble of networks.
Dropout is another important technique for regularization. Imagine that you have one layer that connects to another layer. The values that go from one layer to the next are often called activations.
For every example you train your network on, take those activations and randomly set half of them to zero. Completely at random, you take half of the data that’s flowing through your network and just destroy it, and then you do the same thing again, with a fresh random choice, for the next example.
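To make the idea concrete, here is a minimal sketch in plain Python. The function name drop_half and the sample activations are made up for illustration, and each activation is dropped independently with probability 0.5, so "half" holds only on average rather than exactly.

```python
import random

def drop_half(activations, rng):
    """Zero each activation independently with probability 0.5."""
    return [0.0 if rng.random() < 0.5 else a for a in activations]

rng = random.Random(0)  # seeded only so the sketch is repeatable
activations = [0.8, 1.5, 0.3, 2.1, 0.9, 1.2, 0.4, 1.7]  # made-up values

first_pass = drop_half(activations, rng)
second_pass = drop_half(activations, rng)  # a new random mask, same input

print(first_pass)
print(second_pass)
```

Every surviving value is passed through unchanged here; the rescaling that real libraries apply is left out on purpose to keep the masking step visible.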
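The Dropout layer randomly sets input units to 0 with a frequency given by its rate argument at each step during training time, which helps prevent overfitting. Inputs that are not set to 0 are scaled up by 1/(1 - rate), so that the expected sum over all inputs is unchanged. In this post, you will discover how dropout regularization works at prediction and evaluation time.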
Note that the Dropout layer only applies when training is set to True, so no values are dropped during inference. When using model.fit(), training will be appropriately set to True automatically.
Setting trainable=False for a Dropout layer, by contrast, does not affect the layer’s behavior, as Dropout does not have any variables/weights that can be frozen during training.
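The effect of the training flag can be sketched with a toy layer in plain Python. ToyDropout is a hypothetical stand-in written for this illustration, not the Keras implementation; it assumes survivors are scaled by 1/(1 - rate) during training and that the input is a flat list of floats.

```python
import random

class ToyDropout:
    """Hypothetical stand-in for a dropout layer; not the Keras code."""

    def __init__(self, rate, seed=None):
        if not 0.0 <= rate < 1.0:
            raise ValueError("rate must be in [0, 1)")
        self.rate = rate
        self._rng = random.Random(seed)

    def __call__(self, inputs, training=False):
        if not training:
            # Inference: pass every value through untouched.
            return list(inputs)
        keep = 1.0 - self.rate
        # Training: drop each unit with probability `rate` and
        # scale the survivors by 1 / (1 - rate).
        return [x / keep if self._rng.random() < keep else 0.0 for x in inputs]

layer = ToyDropout(rate=0.5, seed=0)
x = [1.0, 2.0, 3.0, 4.0]

print(layer(x, training=False))  # identical to x
print(layer(x, training=True))   # each entry is either 0.0 or doubled
```

The toy layer stores only a rate and a random number generator. It has no weights, which is why a switch like trainable would have nothing to freeze.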
Rather than rescaling the weights once training is finished to compensate for the dropped units, the rescaling can be performed at training time instead: the activations that survive are scaled up by 1/(1 - rate) on every forward pass. This is sometimes called “inverse dropout” (or “inverted dropout”) and does not require any modification of the weights at inference time. Both the Keras and PyTorch deep learning libraries implement dropout in this way.
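The difference between the two conventions can be checked numerically. The sketch below is an illustration under stated assumptions, not library code: the function names are invented, the scaling is applied to activations rather than to weights, and the averages are Monte Carlo estimates over a seeded generator.

```python
import random

def standard_dropout(x, rate, rng, training):
    keep = 1.0 - rate
    if training:
        # Mask only; no rescaling during training.
        return [v if rng.random() < keep else 0.0 for v in x]
    # Compensate at inference by scaling down to the training-time average.
    return [v * keep for v in x]

def inverted_dropout(x, rate, rng, training):
    keep = 1.0 - rate
    if training:
        # Mask and scale survivors up, so nothing is needed at inference.
        return [v / keep if rng.random() < keep else 0.0 for v in x]
    return list(x)

def mean_training_output(fn, x, rate, trials=20000, seed=0):
    """Average the training-mode output over many random masks."""
    rng = random.Random(seed)
    totals = [0.0] * len(x)
    for _ in range(trials):
        for i, v in enumerate(fn(x, rate, rng, training=True)):
            totals[i] += v
    return [t / trials for t in totals]

x = [1.0, 2.0, 4.0]  # made-up activations
rate = 0.25
rng = random.Random(1)

avg_standard = mean_training_output(standard_dropout, x, rate)
avg_inverted = mean_training_output(inverted_dropout, x, rate)
test_standard = standard_dropout(x, rate, rng, training=False)
test_inverted = inverted_dropout(x, rate, rng, training=False)

print(avg_standard, test_standard)  # averages close to the scaled-down input
print(avg_inverted, test_inverted)  # averages close to the unmodified input
```

In both variants the average training-time output matches what the layer emits at inference. The inverted form simply moves the correction into training, which leaves the inference path as a plain pass-through.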