Standardizing our input features to a mean of zero and a variance of one puts them on a similar scale. Training a neural network becomes difficult when the distribution of each layer's inputs keeps changing during training.

Network training converges faster if its inputs are linearly transformed to have zero mean and unit variance. Zero mean and unit variance is not necessarily the best distribution for the hidden-layer values, though; some other distribution might work better.
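
For the input features themselves, this standardization is usually done in the data pipeline. Here is a minimal sketch using torchvision's transforms.Normalize; the mean 0.1307 and standard deviation 0.3081 are the values commonly quoted for MNIST, so treat them as an assumption rather than something computed here.

import torch
from torchvision import datasets, transforms

# Standardize each input image to roughly zero mean and unit variance
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),   # commonly quoted MNIST mean / std
])

train_set = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)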

Batch normalization ensures that the values of the hidden units have a standardized mean and variance. It consistently accelerates the convergence of very deep networks and makes it possible to train networks with over 100 layers.

As data passes through the layers, the values begin to shift as each layer's transformation is applied. Normalizing a layer's outputs keeps their scale within a specific range as the data flows through the network from input to output.
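
To see this drift, here is a small sketch (the width, depth, and batch size are arbitrary) that prints the mean and standard deviation of the activations after each layer of an unnormalized stack. With default initialization the statistics wander away from zero mean and unit variance as depth increases.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 128)                           # made-up batch of standardized inputs

layers = [nn.Linear(128, 128) for _ in range(10)]  # plain stack, no normalization
for i, layer in enumerate(layers):
    x = torch.relu(layer(x))
    print(i, x.mean().item(), x.std().item())      # mean and scale drift layer by layer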

We add batch normalization to the network to normalize the data again after it has passed through one or more layers.

The BatchNorm layer calculates the mean and standard deviation with respect to the current batch at the time normalization is applied, as opposed to the entire dataset, which is what dataset normalization uses.
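
To make that concrete, here is a minimal sketch (the tensor shape is made up) comparing nn.BatchNorm2d in training mode against normalization done by hand with the batch's own per-channel statistics. At initialization the layer's learnable scale and shift are 1 and 0, so the two results should match.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 32, 24, 24)                  # hypothetical batch: 16 samples, 32 channels

bn = nn.BatchNorm2d(32)                          # gamma = 1, beta = 0 at initialization
bn.train()                                       # training mode: statistics come from the batch
out = bn(x)

# Manual per-batch normalization over the (N, H, W) dimensions of each channel
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-5))    # True: the statistics are per batch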

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)

To see how batch normalization works, we will build a neural network using PyTorch and test it on the MNIST dataset. Batch normalization is implemented with torch.nn.BatchNorm2d, whose num_features argument must equal the number of output channels of the layer above it.
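
Before the full network below, here is a quick sketch of that relationship (the batch size and 28x28 input are just MNIST-sized placeholders):

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 32, 3, 1)          # 32 output channels
bn = nn.BatchNorm2d(32)                # num_features matches the 32 channels above
x = torch.randn(16, 1, 28, 28)         # a made-up batch of 16 MNIST-sized images
out = bn(conv(x))
print(out.shape)                       # torch.Size([16, 32, 26, 26])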

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.conv1 = nn.Conv2d(1, 32, 3, 1)
    self.conv1_bn = nn.BatchNorm2d(32)      # num_features = 32 output channels of conv1

    self.conv2 = nn.Conv2d(32, 64, 3, 1)
    self.conv2_bn = nn.BatchNorm2d(64)      # num_features = 64 output channels of conv2

    self.dropout1 = nn.Dropout(0.25)

    self.fc1 = nn.Linear(9216, 128)
    self.fc1_bn = nn.BatchNorm1d(128)       # 1d variant for the fully connected output

    self.fc2 = nn.Linear(128, 10)

  def forward(self, x):
    x = self.conv1(x)
    x = F.relu(self.conv1_bn(x))            # normalize, then apply the activation

    x = self.conv2(x)
    x = F.relu(self.conv2_bn(x))

    x = F.max_pool2d(x, 2)
    x = self.dropout1(x)

    x = torch.flatten(x, 1)                 # 64 channels x 12 x 12 = 9216 features

    x = self.fc1(x)
    x = F.relu(self.fc1_bn(x))

    x = self.fc2(x)
    output = F.log_softmax(x, dim=1)
    return output
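
As a quick usage check of the model above (the dummy batch is made up), a forward pass should produce one log-probability per class:

model = Net()
model.train()                          # BatchNorm uses batch statistics in training mode

dummy = torch.randn(8, 1, 28, 28)      # a made-up batch of 8 MNIST-sized images
output = model(dummy)
print(output.shape)                    # torch.Size([8, 10]) -- log-probabilities per class

At evaluation time, calling model.eval() switches the BatchNorm layers to the running mean and variance accumulated during training (track_running_stats defaults to True), so the output no longer depends on which other samples happen to be in the batch.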

The BatchNorm layer redistributes the data around a mean and scale that the network finds works best, via its learnable parameters, before the activation function squashes it. Without BatchNorm, the activations could overshoot or undershoot, depending on the squashing function.

The BatchNorm layer is usually added before the ReLU, as suggested in the Batch Normalization paper. However, there is no strict standard for where to place it; you can experiment with different placements and may find different performance for each.
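
If you want to try the alternative placement, here is a small, purely illustrative sketch of the two orderings (the network above uses the first):

import torch.nn as nn

# Ordering used above, as in the Batch Normalization paper: Conv -> BatchNorm -> ReLU
pre_activation = nn.Sequential(nn.Conv2d(1, 32, 3, 1), nn.BatchNorm2d(32), nn.ReLU())

# Alternative ordering sometimes used in practice: Conv -> ReLU -> BatchNorm
post_activation = nn.Sequential(nn.Conv2d(1, 32, 3, 1), nn.ReLU(), nn.BatchNorm2d(32))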
