Convolutional networks consist of stacks of convolution and max-pooling layers. The pooling layers spatially downsample the data, which is required to keep the feature maps at a reasonable size as the number of feature maps grows.

Convnets often end with either a Flatten operation or a global pooling layer, turning spatial feature maps into vectors, followed by Linear layers that perform the classification or regression.
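
For a concrete picture of those two endings, here is a minimal sketch (the channel count of 128 and the 8×8 spatial size are made up for illustration):

import torch
import torch.nn as nn

feats = torch.rand(4, 128, 8, 8)    # a toy batch of feature maps [N, C, H, W]
vec_flat = nn.Flatten()(feats)      # flatten head: [4, 128*8*8] = [4, 8192]
vec_gap = feats.mean(dim=(2, 3))    # global-average-pooling head: [4, 128]
# either vector is then fed to Linear layers for classification or regression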

One last type of pooling layer that you will often see in modern architectures is the global average pooling layer. It works quite differently from max pooling: it computes the mean of each entire feature map.

It’s like an average pooling layer using a pooling kernel with the same spatial dimensions as the inputs. This means that it just outputs a single number per feature map and per instance. 
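
To make that concrete, here is a small check (the shapes are arbitrary, chosen only for the demonstration): an average pooling layer whose kernel covers the whole feature map produces exactly the per-map mean.

import torch
import torch.nn as nn

feats = torch.rand(2, 16, 32, 32)              # 2 instances, 16 feature maps of 32×32
gap = nn.AvgPool2d(kernel_size=32)(feats)      # one number per feature map: [2, 16, 1, 1]
same = feats.mean(dim=(2, 3), keepdim=True)    # the plain per-map mean gives the same values
print(torch.allclose(gap, same))               # True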

This is extremely destructive: most of the information in each feature map is lost. However, it can be useful just before the output layer.

The global average pooling layer outputs the mean of each feature map. This drops any remaining spatial information, which is fine because not much spatial information is left at that point. 

Create Model

Now, let's create a model using the torch.nn module. The CNN receives input images of size 3×64×64 (the images have three color channels). The input goes through four convolutional layers that produce 32, 64, 128, and 256 feature maps, each using 3×3 filters with padding of 1 ("same" padding).

import torch
import torch.nn as nn

def simple_cnn():
  return nn.Sequential(
    # block 1: 3 -> 32 feature maps, 64×64 -> 32×32 after pooling
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Dropout(p=0.5),

    # block 2: 32 -> 64 feature maps, 32×32 -> 16×16 after pooling
    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Dropout(p=0.5),

    # block 3: 64 -> 128 feature maps, 16×16 -> 8×8 after pooling
    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),

    # block 4: 128 -> 256 feature maps, spatial size stays 8×8
    nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1),
    nn.ReLU()
  )

Let’s see the shape of the output feature maps using a toy batch input.

x = torch.ones((4, 3, 64, 64))
model = simple_cnn()
model(x).shape #torch.Size([4, 256, 8, 8])

There are 256 feature maps of size 8×8 (the three max-pooling layers halve the spatial size from 64 to 32, 16, and finally 8). Now we can add a fully connected layer to reach an output layer with a single unit. If we flatten the feature maps, the number of input units to this fully connected layer will be 8 × 8 × 256 = 16,384. 
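
For comparison, a flatten-based head would look something like the sketch below (this is just for illustration, not the model we build next; the 16,384 input size is the 8 × 8 × 256 from above):

flat_head = nn.Sequential(
  simple_cnn(),               # feature extractor from above: [N, 256, 8, 8]
  nn.Flatten(),               # [N, 16384]
  nn.Linear(8 * 8 * 256, 1),  # fully connected layer with 16,384 inputs
  nn.Sigmoid()
)
flat_head(torch.ones((4, 3, 64, 64))).shape #torch.Size([4, 1])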

Let’s use global average pooling instead, which computes the average of each feature map separately, thereby reducing the number of hidden units to 256. We can then add a fully connected layer on top.

To understand this, consider input feature maps of shape [batch_size × 8 × 64 × 64]. The channels are numbered k = 0, 1, …, 7. The global average-pooling operation calculates the average of each channel, so the pooled output has the shape [batch_size × 8 × 1 × 1]. After this, we squeeze (or flatten) the output of the global average-pooling layer to drop the 1×1 spatial dimensions and get [batch_size × 8]. 
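
A quick sketch of that example (the batch size of 16 is arbitrary, used only for illustration):

x_example = torch.rand(16, 8, 64, 64)             # [batch_size, 8, 64, 64]
pooled = nn.AvgPool2d(kernel_size=64)(x_example)  # average each 64×64 channel: [16, 8, 1, 1]
pooled = torch.squeeze(pooled)                    # drop the 1×1 spatial dims
pooled.shape #torch.Size([16, 8])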

model.add_module('global_avg_pool', nn.AvgPool2d(kernel_size=8))  # 8×8 kernel covers each feature map
model.add_module('flatten', nn.Flatten())
x = torch.ones((4, 3, 64, 64))
model(x).shape #torch.Size([4, 256])
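
As a side note, nn.AdaptiveAvgPool2d(1) gives the same result without hard-coding the 8×8 kernel size, which is convenient when the input resolution can vary (shown here as a separate sketch, not added to the model above):

gap = nn.AdaptiveAvgPool2d(1)            # always pools each feature map down to 1×1
gap(torch.rand(4, 256, 8, 8)).shape      # torch.Size([4, 256, 1, 1])
gap(torch.rand(4, 256, 16, 16)).shape    # same [4, 256, 1, 1] for a different spatial size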

The shape of the feature maps before this layer is [batch_size × 256 × 8 × 8], so we expect 256 units as output; that is, the shape of the output will be [batch_size × 256]. 

Finally, we can add a fully connected layer to get a single output unit. In this case, we can specify the activation function to be sigmoid: 

model.add_module('fc', nn.Linear(256, 1))
model.add_module('sigmoid', nn.Sigmoid())
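
Running the toy batch through the finished model now produces one probability per image. For training, this output pairs with nn.BCELoss; alternatively, you could drop the Sigmoid and use nn.BCEWithLogitsLoss for better numerical stability. A quick check:

probs = model(x)        # x is the toy batch defined above
probs.shape #torch.Size([4, 1])
loss_fn = nn.BCELoss()  # matches the sigmoid output for binary classification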

Related Posts

Global Average Pooling in PyTorch using AdaptiveAvgPool

PyTorch Batch Normalization

Explain Pooling layers: Max Pooling, Average Pooling, Global Average Pooling, and Global Max pooling.

Pooling Layer vs Convolution Stride for Downsampling.

Max Pooling Layer vs Average Pooling Layer Which is Better?

Calculate Output Size of Convolutional and Pooling layers in CNN.