Dropout is a simple way to prevent neural networks from overfitting. It has proven highly effective in many machine learning areas, such as image classification, speech recognition, and natural language processing.

Before Batch Normalization appeared, Dropout was a near necessity in state-of-the-art networks; despite its simplicity, it reliably reduced overfitting and boosted their performance.

Dropout can be interpreted as a way of regularizing a neural network by adding noise to its hidden units. Specifically, it involves multiplying hidden activations by Bernoulli-distributed random variables that take the value 1 with probability p (0 ≤ p ≤ 1) and 0 otherwise.
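
As a concrete illustration, here is a minimal NumPy sketch of that multiplicative Bernoulli noise (the array values are made up):

import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.2, -1.3, 0.8, 2.1])      # hidden activations
p = 0.8                                  # probability of keeping a unit

# Bernoulli noise: 1 with probability p, 0 otherwise.
mask = rng.binomial(n=1, p=p, size=h.shape)
h_dropped = h * mask                     # some activations are zeroed at random
print(mask, h_dropped)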

Where Should We Place the Dropout Layer

Importantly, the test scheme is quite different from training. During training, information flows through a random (dynamic) sub-network of the full model. In the test phase, every unit is kept and Dropout scales the neural responses by the retention ratio p, so that the expected output matches what was seen during training.
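
A minimal sketch of the two phases, assuming p is the retention probability as above (note that most frameworks implement the equivalent "inverted" form, which instead scales by 1/p during training so that the test pass is a plain identity):

import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng(0)):
    if train:
        # Training: each unit is kept (multiplied by 1) with probability p,
        # so information flows through a random sub-network.
        return h * rng.binomial(n=1, p=p, size=h.shape)
    # Test: every unit is kept and the responses are scaled by the
    # retention ratio p to match the expected activation from training.
    return h * p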

Batch Normalization

Batch Normalization is a powerful technique that not only speeds up training but also improves strong baselines by acting as a regularizer. It has therefore been adopted in nearly all recent network architectures, demonstrating great practicality and effectiveness.

Batch Normalization normalizes each neuron to zero mean and unit variance using mini-batch statistics. This dependence of the normalization on the mini-batch allows efficient training, but it is neither necessary nor desirable during inference, where the stored running statistics are used instead.
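
A small PyTorch sketch of that train/inference difference (the module and tensor shapes are illustrative):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(32, 4) * 3 + 5   # a mini-batch with non-zero mean and non-unit variance

bn.train()
y_train = bn(x)                  # normalizes with this mini-batch's statistics
                                 # and updates running_mean / running_var

bn.eval()
y_eval = bn(x)                   # normalizes with the stored running statistics,
                                 # independent of the current mini-batch

print(bn.running_mean, bn.running_var)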

Batch Normalization vs Dropout

Dropout and Batch Normalization often lead to worse performance when combined in modern neural networks, although they sometimes cooperate well, as in Wide ResNet (WRN).

Dropout shifts the variance of a specific neural unit when we transfer the state of the network from training to test. Batch Normalization, however, keeps in the test phase the statistical variance it accumulated over the entire learning procedure. This inconsistency of variances between Dropout and Batch Normalization causes unstable numerical behavior at inference, which ultimately leads to erroneous predictions.

Order of Batch Normalization and Dropout

Dropout is designed to keep the mean of the outputs unchanged, but it does change their standard deviation, which creates a large discrepancy in Batch Normalization between training and validation.

During training, Batch Normalization receives these shifted standard deviations, accumulates them, and stores them in its running statistics. During validation, dropout is turned off, so the standard deviation is no longer shifted but back to the original one. Batch Normalization, however, being in evaluation mode, does not use the batch statistics but the stored running statistics, which can differ substantially from what the batch actually looks like.
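
A toy PyTorch sketch of this mismatch, assuming an illustrative Dropout → BatchNorm stack (note that nn.Dropout(p) takes the drop probability, so the retention ratio here is 0.5):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 8)                   # activations with roughly unit variance
block = nn.Sequential(nn.Dropout(p=0.5),  # drop probability 0.5
                      nn.BatchNorm1d(8))

block.train()
for _ in range(100):
    block(x)        # BN accumulates running statistics of the dropped activations

block.eval()
print(block[1].running_var.mean())  # ~2.0: inverted dropout inflates variance by 1/0.5
print(x.var(dim=0).mean())          # ~1.0: the variance BN actually sees at test time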


The first and most important rule is: don’t place a Batch Normalization layer after a Dropout layer. As the Ioffe and Szegedy paper recommends, Batch Normalization goes directly after a Convolutional (or Fully Connected) layer and before the activation function.

Dropout for Convolutional Layers

The output of a Convolutional layer is a set of feature maps. In 2D image processing, these feature maps can be viewed as grayscale images whose activations correspond to common shapes in the image set.

Applying Dropout after Convolutional layers does not do what you would expect, because the values within a feature map are strongly correlated. Dropout is therefore generally used after Dense (fully connected) layers.

In the original paper that proposed dropout, Hinton et al. (2012), dropout (with p = 0.5) was used on each of the fully connected (dense) layers before the output; it was not used on the convolutional layers.
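
A hedged PyTorch sketch of that placement, using an illustrative small convnet for 32×32 inputs rather than the exact network from the paper:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # no dropout on the conv layers
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout only on the dense layer before the output
    nn.Linear(256, 10),
)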

Order of Batch Normalization and Activation

In Ioffe and Szegedy (2015): “We would like to ensure that for any parameter values, the network always produces activations with the desired distribution.” So the Batch Normalization layer is actually inserted right after a Conv layer or Fully Connected layer, but before feeding into the ReLU (or any other kind of) activation.
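
A minimal PyTorch sketch of that placement (the channel sizes are illustrative):

import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(128),    # normalize the conv layer's pre-activations
    nn.ReLU(inplace=True),  # activation comes after Batch Normalization
)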


Order of Dropout and Activation

With ‘relu‘, there is no difference; the results are exactly the same, since relu(0) = 0. With activations that are not zero-centered, such as ‘sigmoid‘, putting a dropout before the activation will not produce zeros at the output, but other values: for a sigmoid, a dropped (zeroed) input becomes sigmoid(0) = 0.5.
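
A quick, purely illustrative check in PyTorch:

import torch

print(torch.relu(torch.tensor(0.0)))     # tensor(0.)     -> a dropped unit stays zero after relu
print(torch.sigmoid(torch.tensor(0.0)))  # tensor(0.5000) -> a dropped unit becomes 0.5 after sigmoid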

If you add a ‘tanh‘ after a dropout, for instance, you will still get zeros (tanh(0) = 0), but the scaling that dropout applies to keep the mean unchanged will be distorted by the tanh’s nonlinearity. So, in summary, the order for using batch normalization and dropout is:

CONV/FC → Batch Normalization → Activation (ReLU, etc.) → Dropout → next CONV/FC
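
A hedged PyTorch sketch of a block arranged in this order (the layer sizes are illustrative):

import torch.nn as nn

block = nn.Sequential(
    nn.Linear(512, 256, bias=False),  # CONV/FC layer (a Linear here for brevity)
    nn.BatchNorm1d(256),              # Batch Normalization right after the linear map
    nn.ReLU(),                        # activation after Batch Normalization
    nn.Dropout(p=0.5),                # Dropout last, so no BatchNorm sees dropped activations
    nn.Linear(256, 10),               # the next CONV/FC layer
)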

Conclusion

Dropout and Batch Normalization are two powerful methods that often fail to yield an extra reward when combined in practice. In fact, a modern network can even perform worse when it is equipped with Batch Normalization and Dropout simultaneously in its bottleneck blocks.

Batch Normalization eliminates the need for Dropout in some cases; Ioffe and Szegedy conjectured that Batch Normalization intuitively provides regularization benefits similar to Dropout. More evidence comes from recent architectures such as ResNet/PreResNet, ResNeXt, and DenseNet, where the best performances are all obtained with BN and without Dropout.
