Most networks use convolutional layers as feature extractors and feed the resulting feature maps into fully connected layers, followed by an output layer. The fully connected layers cause two problems:

  • They are prone to overfitting and rely on regularizers like Dropout.
  • They sit like a black box between the categorical outputs and the spatial features extracted by the convolutional layers.

Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of this approach is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories.
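As a minimal sketch of this idea, the head below produces one feature map per category with a 1×1 convolution, averages each map, and applies softmax. The channel count, the number of categories, and the input tensor are made up for illustration and do not come from any particular model.

```python
import torch
import torch.nn as nn

num_classes = 10  # hypothetical number of categories

# Hypothetical head: one feature map per category, then global average pooling.
head = nn.Sequential(
    nn.Conv2d(64, num_classes, kernel_size=1),  # bs x 10 x h x w
    nn.AdaptiveAvgPool2d((1, 1)),               # average each map: bs x 10 x 1 x 1
    nn.Flatten(),                               # bs x 10
    nn.Softmax(dim=1),                          # category probabilities
)

features = torch.randn(8, 64, 6, 6)  # made-up feature maps: bs x ch x h x w
probs = head(features)
print(probs.shape)  # torch.Size([8, 10])
```

In training code the softmax is usually dropped from the model, because nn.CrossEntropyLoss applies log-softmax itself.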

The feature maps can be easily interpreted as category confidence maps. Another advantage is that global average pooling has no parameters to optimize, so overfitting is avoided at this layer. Furthermore, it sums out the spatial information and is therefore more robust to spatial translations of the input.
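Both claims can be checked with a small experiment. The sketch below uses a circular shift (torch.roll) as an idealized translation. Real translations move content in and out of the frame, so the invariance is only approximate in practice.

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d((1, 1))

# Global average pooling has no learnable parameters.
n_params = sum(p.numel() for p in gap.parameters())
print(n_params)  # 0

# Circularly shifting the feature maps (an idealized translation)
# leaves the per-channel averages unchanged.
x = torch.randn(2, 5, 4, 4)
shifted = torch.roll(x, shifts=(1, 2), dims=(2, 3))
same = torch.allclose(gap(x), gap(shifted), atol=1e-6)
print(same)  # True
```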

Global average pooling means that you average each feature map separately. In our case, if each feature map is 4×4, you average its 16 values and obtain a single value per map. The important part here is that you do the average operation per channel. You can think of each feature map as the final feature representation for one category over which you want to do classification.
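Here is a tiny worked example of the per-channel average, using made-up values: one image with two 4×4 feature maps, filled with the numbers 0 to 31.

```python
import torch

# One image with 2 channels, each a 4 x 4 feature map (made-up values).
x = torch.arange(32, dtype=torch.float32).reshape(1, 2, 4, 4)

# Average over height and width, separately for each channel.
pooled = x.mean(dim=(2, 3))
print(pooled.shape)     # torch.Size([1, 2])
print(pooled.tolist())  # [[7.5, 23.5]]
```

The first channel holds 0 to 15 (mean 7.5) and the second holds 16 to 31 (mean 23.5), so each feature map collapses to one number.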

The approach we used was to include enough strided convolutions that the final layer would have a grid size of 1×1. Then we flattened out the unit axes to get a vector for each image. That approach causes two problems:

  • We’d need lots of strided layers to make our grid 1×1.
  • The model would not work on images of any size other than the size we originally trained on.

One approach to dealing with the first issue would be to flatten the final convolutional layer in a way that handles a grid size other than 1×1. We could simply flatten a matrix into a vector, by laying out each row after the previous row. 
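The row-by-row flattening can be sketched in PyTorch like this. The tensors are made up for illustration. For a batch, start_dim=1 keeps the batch axis and flattens everything else.

```python
import torch

# A 2 x 3 grid laid out row after row.
grid = torch.tensor([[1, 2, 3],
                     [4, 5, 6]])
vec = grid.flatten()
print(vec.tolist())  # [1, 2, 3, 4, 5, 6]

# For a batch of activations, keep the batch axis and flatten the rest.
acts = torch.randn(8, 256, 4, 4)
flat = torch.flatten(acts, start_dim=1)
print(flat.shape)  # torch.Size([8, 4096])
```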

But this architecture had another problem: not only did it fail on images of any size other than the one used in the training set, it also required a lot of memory, because flattening the convolutional layer fed many activations into the final layers. The weight matrices of those layers were therefore enormous.
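To put rough numbers on both issues, the sketch below compares a head that flattens a 256 × 4 × 4 grid with one that averages each channel first. The helper count_params and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def count_params(module):
    """Total number of learnable values in a module (illustrative helper)."""
    return sum(p.numel() for p in module.parameters())

# Head that flattens a 256 x 4 x 4 grid before the first linear layer.
flatten_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 4 * 4, 128))
# Head that averages each channel first, so the linear layer sees 256 inputs.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.Linear(256, 128)
)

print(count_params(flatten_head))  # 524416
print(count_params(gap_head))      # 32896

# The flatten head only accepts the grid size it was built for.
bigger = torch.randn(1, 256, 8, 8)
try:
    flatten_head(bigger)
    flatten_ok = True
except RuntimeError:
    flatten_ok = False
print(flatten_ok)              # False
print(gap_head(bigger).shape)  # torch.Size([1, 128])
```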

This problem was solved through the creation of fully convolutional networks. The trick in fully convolutional networks is to take the average of activations across a convolutional grid. In other words, we can simply use this function:

def avg_pool(x):
    return x.mean((2, 3))

As you can see, it takes the mean over the height and width axes (dimensions 2 and 3). This function will always convert a grid of activations into a single activation per channel for each image.

PyTorch provides a slightly more versatile module called nn.AdaptiveAvgPool2d(), which averages a grid of activations into whatever sized destination you require.
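As a quick illustration, the sketch below asks for a 2×2 output grid and feeds in inputs of several arbitrary sizes. The channel count and sizes are made up.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((2, 2))  # request a 2 x 2 output grid

for size in (4, 7, 32):  # arbitrary input grid sizes
    x = torch.randn(1, 16, size, size)
    print(size, tuple(pool(x).shape))  # always (1, 16, 2, 2)
```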

You can use nn.AdaptiveAvgPool2d() to achieve global average pooling by setting the output size to (1, 1). Here we don’t specify the kernel_size, stride, or padding. Instead, we specify the output dimension, i.e. 1×1.
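A small check, on a made-up tensor, that nn.AdaptiveAvgPool2d((1, 1)) followed by flattening gives the same result as taking the mean over the last two axes:

```python
import torch
import torch.nn as nn

def avg_pool(x):
    # Mean over the height and width axes.
    return x.mean((2, 3))

x = torch.randn(4, 256, 4, 4)
module_out = nn.AdaptiveAvgPool2d((1, 1))(x)  # 4 x 256 x 1 x 1
flat_out = module_out.flatten(start_dim=1)    # 4 x 256
print(torch.allclose(flat_out, avg_pool(x), atol=1e-6))  # True
```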

This differs from regular pooling, where a layer takes the average (average pooling) or the maximum (max pooling) over a window of a fixed size. For instance, max-pooling layers of size 2, which were very popular in older CNNs, halve the size of our image along each dimension by taking the maximum of each 2×2 window with a stride of 2.
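For comparison, here is fixed-window pooling on a single made-up 4×4 feature map. Each 2×2 window with stride 2 is reduced to one value, so the grid shrinks to 2×2.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

max_out = nn.MaxPool2d(kernel_size=2, stride=2)(x)
avg_out = nn.AvgPool2d(kernel_size=2, stride=2)(x)
print(max_out.squeeze().tolist())  # [[4.0, 8.0], [12.0, 16.0]]
print(avg_out.squeeze().tolist())  # [[2.5, 6.5], [10.5, 14.5]]
```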

import torch.nn as nn

# Shape comments assume 3 x 32 x 32 input images.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(2, 2),  # output: 64 x 16 x 16

    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(2, 2),  # output: 128 x 8 x 8

    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(2, 2),  # output: 256 x 4 x 4
    nn.AdaptiveAvgPool2d((1, 1)),  # output: 256 x 1 x 1
    nn.Flatten(),  # output: 256 (nn.Linear needs the channels on the last axis)

    nn.Linear(256, 128),
    nn.Linear(128, 56),
    nn.Linear(56, 10))

Once we are done with our convolutional layers, we get activations of size bs x ch x h x w (batch size x channels x height x width). We want to convert this to a tensor of size bs x ch, so we take the average over the last two dimensions and flatten the trailing 1×1 dimensions, as we did in our previous model.

Flattening out the unit axes that we end up with gives a vector for each image, and so a matrix of activations for each mini-batch.
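The shape changes can be traced step by step. The sketch below uses a hypothetical, much smaller convolutional body than the model above. Only the shapes matter here, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical small convolutional body; layer sizes are illustrative only.
body = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)

batch = torch.randn(8, 3, 32, 32)             # bs x 3 x 32 x 32
acts = body(batch)                            # bs x ch x h x w
pooled = nn.AdaptiveAvgPool2d((1, 1))(acts)   # bs x ch x 1 x 1
vec = pooled.flatten(start_dim=1)             # bs x ch

print(tuple(acts.shape))    # (8, 32, 16, 16)
print(tuple(pooled.shape))  # (8, 32, 1, 1)
print(tuple(vec.shape))     # (8, 32)
```

The final tensor has one row per image and one column per channel, which is the matrix of activations that the linear layers expect.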

Run this code in Google Colab.