Convolutional neural networks stack one convolution after another, downsampling the image between successive convolutions. Downsampling can, in principle, occur in different ways.
Scaling an image down by half is equivalent to taking four neighboring pixels as input and producing one pixel as output. How we compute the value of the output from the values of the input is up to us.
Currently, the max pooling layer is the most commonly used approach, though it has the downside of discarding three-quarters of the data. In max pooling, we take non-overlapping 2 × 2 tiles of the input and keep the maximum of each tile as the new pixel at the reduced scale.
By keeping the highest value in the 2 × 2 neighborhood as the downsampled output, we ensure that the features that are found survive the downsampling, at the expense of the weaker responses.
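The mechanics can be made concrete with a minimal sketch; here I use PyTorch's `nn.MaxPool2d` (the library choice is mine, not the text's, and the input values are made up for illustration):

```python
import torch
import torch.nn as nn

# a single 4 x 4 "image": batch of 1, 1 channel, with made-up values
x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

# stride defaults to kernel_size, giving non-overlapping 2 x 2 tiles
pool = nn.MaxPool2d(kernel_size=2)
y = pool(x)
# each output pixel is the max of one tile: y[0, 0] == [[6., 8.], [14., 16.]]
```

The three smaller values in each tile are simply discarded, which is exactly the 75% loss described above.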
The output images from a convolution layer tend to have a high magnitude where certain features corresponding to the learned kernel are detected (such as vertical lines).
The goal of the pooling layer is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting).
Average pooling was a common approach early on but has fallen out of favor somewhat. As you might expect, it works exactly like a max pooling layer, except it computes the mean rather than the max.
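For comparison, here is the same sketch with averaging in place of the max, again using PyTorch (my library choice) via `nn.AvgPool2d`:

```python
import torch
import torch.nn as nn

# the same made-up 4 x 4 single-channel image as before
x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

avg = nn.AvgPool2d(kernel_size=2)  # non-overlapping 2 x 2 tiles
y = avg(x)
# each output pixel is the mean of one tile: y[0, 0] == [[3.5, 5.5], [11.5, 13.5]]
```

Every input pixel now contributes to the output, but strong feature responses get diluted by their weaker neighbors.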
When should you use max pooling, and when average pooling?
Max pooling has some downsides. It is very destructive: even with a tiny 2 × 2 kernel and a stride of 2, the output is two times smaller in both directions (so its area is four times smaller), simply dropping 75% of the input values. And in some applications, invariance is not desirable.
Take semantic segmentation as an example: if the input image is translated by one pixel to the right, the output should also be translated by one pixel to the right.
The goal in this case is equivariance, not invariance: a small change to the inputs should lead to a corresponding small change in the output.
Average pooling layers used to be very popular, but people mostly use max pooling layers now, as they generally perform better. This may seem surprising since computing the mean generally loses less information than computing the max.
It turns out, however, that max pooling preserves only the strongest features and gets rid of the meaningless ones, so the next layers get a cleaner signal to work with. Moreover, max pooling offers stronger translation invariance than average pooling, and it requires slightly less computation.
Note that max pooling and average pooling can be performed along the depth dimension instead of the spatial dimensions, although it’s not as common. This can allow the CNN to learn to be invariant to various features.
For example, it could learn multiple filters, each detecting a different rotation of the same pattern, and the depthwise max pooling layer would ensure that the output is the same regardless of the rotation. The CNN could similarly learn to be invariant to anything else: thickness, brightness, skew, color, and so on.
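PyTorch (my assumed library here) has no built-in depthwise pooling layer, but the idea can be sketched by grouping channels and taking the max within each group; the group size of 3 and the channel count are illustrative choices, not anything prescribed by the text:

```python
import torch

# a feature map: batch of 1, 6 channels, 4 x 4 spatial; imagine channels 0-2
# detect three rotations of one pattern, channels 3-5 rotations of another
x = torch.randn(1, 6, 4, 4)

# depthwise max pooling with pool size 3 along the channel dimension:
# reshape channels into groups of 3, then take the max within each group
b, c, h, w = x.shape
y = x.view(b, c // 3, 3, h, w).max(dim=2).values
# y has shape (1, 2, 4, 4): spatial dims untouched, channels reduced 6 -> 2
```

Whichever rotation fires most strongly within a group, the group's single output channel responds, which is the rotation invariance described above.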