How does batch normalization work?

Batch normalization is a technique that appeared fairly recently. It normalizes the values flowing through a network and also has a regularizing effect. If you do any work with neural networks, it’s something you can’t ignore. It’s super powerful. Let’s see how it works.

Why do you need Batch Normalization?

Let’s work through a hypothetical example. Imagine we have the following kind of data.

We have two values, “A” and “B”, from which we want to predict something. First of all, this is very bad data: the “A” value is between 0 and 20, and the “B” value is between 0 and 2. They are not on the same scale.

This will be problematic in a neural network. When you put data in, the network multiplies it by weights, and at any given step those weights are basically fixed. An input on the order of 100 will produce much bigger activations than an input on the order of 1. The neural network will have to adapt its weights to compensate, and that costs it extra work.
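To make the scale problem concrete, here is a minimal NumPy sketch. The feature ranges and the weight values are made up for illustration. The only point is that, with weights of similar size, the larger feature dominates the weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical input features on very different scales.
a = rng.uniform(0.0, 100.0, size=1000)  # on the order of 100
b = rng.uniform(0.0, 1.0, size=1000)    # on the order of 1

# Assumed for illustration: both features go through weights of the same size.
w_a, w_b = 0.5, 0.5

contribution_a = np.abs(w_a * a).mean()
contribution_b = np.abs(w_b * b).mean()

# Feature a contributes roughly 100 times more to the weighted sum.
print(contribution_a / contribution_b)
```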

Data Cleaning

We can clean this data a little bit. First, let’s rescale and shift it so that it is well centered around 0. Then let’s decorrelate it.

Those two curves seem to represent roughly the same thing, but you can still see some signal in the difference between them. To decorrelate, use the average of the two curves as one signal and the difference between them as the other.
I now have two data sets that are different enough not to tell me the same thing, so they might be useful inputs for a neural network. What we just did is essentially principal component analysis.
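The cleaning steps above can be sketched as follows. The two signals are synthetic and the helper name standardize is my own. This is an illustration under those assumptions, not a general PCA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two correlated signals on different scales.
common = rng.normal(size=500)
a = 10.0 + 5.0 * (common + 0.3 * rng.normal(size=500))
b = 1.0 + 0.5 * (common + 0.3 * rng.normal(size=500))

def standardize(x):
    # Shift to zero mean and rescale to unit standard deviation.
    return (x - x.mean()) / x.std()

a_s, b_s = standardize(a), standardize(b)

# Decorrelate: keep the average of the two signals and their difference.
average = (a_s + b_s) / 2.0
difference = a_s - b_s

corr_before = np.corrcoef(a_s, b_s)[0, 1]
corr_after = np.corrcoef(average, difference)[0, 1]
print(corr_before, corr_after)
```

Once both signals have the same variance, their sum and their difference are uncorrelated, which is why this simple trick works for two inputs.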

Logits Layer

The logits are the raw weighted sums plus the bias. They are the input to the activation function.
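As a quick sketch, this is where the logits sit in a dense layer. The layer sizes and the choice of sigmoid are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a batch of 100 inputs with 2 features, feeding 3 neurons.
x = rng.normal(size=(100, 2))
W = rng.normal(size=(2, 3))
b = np.zeros(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = x @ W + b             # raw weighted sums plus bias
activations = sigmoid(logits)  # the activation function is applied afterwards
```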

Batch Normalization

Batch normalization normalizes the activations, but in a smart way, to make sure that the ‘N’ inputs of the next layer are properly centered and scaled. It rests on three big ideas.

It works on batches. Say we have 100 images and labels in each batch. On those batches, it is possible to compute statistics for the logits, before applying the activation function. Once we have those statistics, we can rescale and recenter the logits by subtracting the average and dividing by the standard deviation. A small epsilon is added to the denominator for numerical stability, to avoid dividing by 0.
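Here is a minimal sketch of that normalization step. The logits are random numbers and the value of epsilon is an assumption. Note that many libraries add epsilon to the variance under a square root rather than to the standard deviation. I follow the wording above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logits: a batch of 100 examples for a layer of 4 neurons,
# deliberately off-center and badly scaled.
logits = 3.0 + 7.0 * rng.normal(size=(100, 4))

epsilon = 1e-5  # assumed value, only there to avoid dividing by 0

# Statistics are computed per neuron, across the batch dimension.
mean = logits.mean(axis=0)
std = logits.std(axis=0)

normalized = (logits - mean) / (std + epsilon)

print(normalized.mean(axis=0))  # close to 0 for every neuron
print(normalized.std(axis=0))   # close to 1 for every neuron
```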

We need to do something to preserve the full expressiveness of our model while we make these statistical modifications to help it train better. The genius idea in batch normalization is that once you have rescaled your logits, you scale them back using learnable parameters, alpha and beta.

Previously, you had a bias in each neuron. Now you simply add two additional degrees of freedom per neuron, called alpha and beta: a scale factor and an offset factor. Your batch-normalized logits become the rescaled and recentered logits multiplied by alpha, plus beta.
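To see why alpha and beta preserve expressiveness, here is a small sketch. The function name batch_norm and the hand-picked parameter values are mine. In a real network alpha and beta would be learned by training.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1e-5  # assumed small constant

def batch_norm(logits, alpha, beta):
    # Normalize per neuron over the batch, then rescale and shift.
    mean = logits.mean(axis=0)
    std = logits.std(axis=0)
    normalized = (logits - mean) / (std + epsilon)
    return alpha * normalized + beta

logits = 3.0 + 7.0 * rng.normal(size=(100, 4))

# One alpha and one beta per neuron. During training these would be learned;
# here they are set by hand to show that the normalization can be undone.
alpha = logits.std(axis=0) + epsilon
beta = logits.mean(axis=0)

restored = batch_norm(logits, alpha, beta)
print(np.allclose(restored, logits))
```

If the best thing for the network is to leave the logits as they were, it can learn values of alpha and beta that do exactly that, so nothing is lost by normalizing.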

Batch normalization is simply added as a layer between your weighted sums and your activation function. Backpropagation and the gradient computation still work normally. That’s the beauty of it.
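The placement can be sketched as a single forward pass. This only shows the order of the operations, with made-up sizes and a hypothetical function name. It does not implement training or backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_bn_sigmoid(x, W, b, alpha, beta, epsilon=1e-5):
    logits = x @ W + b                              # weighted sums plus bias
    mean, std = logits.mean(axis=0), logits.std(axis=0)
    normalized = (logits - mean) / (std + epsilon)  # batch statistics
    bn_logits = alpha * normalized + beta           # learnable scale and offset
    return sigmoid(bn_logits)                       # the activation comes last

# Badly scaled hypothetical inputs: without normalization the sigmoid would saturate.
x = 50.0 * rng.normal(size=(100, 2))
W = rng.normal(size=(2, 3))
b = np.zeros(3)
alpha, beta = np.ones(3), np.zeros(3)

out = dense_bn_sigmoid(x, W, b, alpha, beta)
```

Every step is an ordinary differentiable operation, which is why a framework can compute gradients through it like through any other layer.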

Batch Normalization with Sigmoid

After adding batch normalization and training again, this is what we get.

Look at our activations: it’s a perfectly beautiful Gaussian. The bands are now evenly distributed across the full useful range of our activation function. The Gaussian of inputs is centered on the linear part of the activation function.

Batch Normalization with ReLU

This is what I get with ReLU. Here I’m actually displaying the activations.

ReLU outputs zero for all negative inputs. On a mini-batch, a lot of values will be zero, and I don’t want those in my graph because they are not part of the distribution that I want to display. What I’m displaying here is the maximum activation on one mini-batch.

That is the maximum value taken by one neuron’s output on one given mini-batch. What I get is again a beautiful distribution across the useful portion of my ReLU, instead of one that is completely skewed to one side. You can see that it’s a nice bell curve, centered around 2.5 for the dense layers, squarely in the positive range where the ReLU is useful.
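The statistic being plotted can be computed like this. The logits here are synthetic standard normal values, an assumption standing in for the output of a batch normalization layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical batch-normalized logits: 100 examples, 200 neurons.
bn_logits = rng.normal(size=(100, 200))
activations = relu(bn_logits)

# A large share of the raw activations is exactly zero...
zero_fraction = np.mean(activations == 0.0)

# ...so instead record, for each neuron, its maximum over the mini-batch.
max_per_neuron = activations.max(axis=0)  # shape (200,)
print(zero_fraction, max_per_neuron.mean())
```

In this synthetic setup about half of the activations are zero, while the per-neuron maxima form a bump well inside the positive range.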

Convolutional Batch Normalization

In a convolutional layer, one neuron has a given output per image in the batch, but it is also scanning each image using the same weights. So it produces one output per image in the batch, per X position in the scan and per Y position in the scan.

That’s the only little change. The averages and standard deviations, which you were previously computing over the whole batch, must in a convolutional layer be computed over all the images in the batch and over all the X and Y positions of your neuron. Only the computation of the statistics is different: you still have one scale factor and one offset factor per neuron, just as you had one bias per neuron previously.
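In code, the difference comes down to which axes the statistics are taken over. This sketch assumes a (batch, y, x, channels) layout, with one "neuron" per output channel, and made-up sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1e-5  # assumed small constant

# Hypothetical convolutional logits laid out as (batch, y, x, channels):
# 100 images, 28 x 28 positions, 8 channels.
logits = 3.0 + 7.0 * rng.normal(size=(100, 28, 28, 8))

# Statistics over the batch and over every Y and X position:
# one average and one standard deviation per channel.
mean = logits.mean(axis=(0, 1, 2))
std = logits.std(axis=(0, 1, 2))

# Still a single scale factor and a single offset factor per channel.
alpha = np.ones(8)
beta = np.zeros(8)

bn_logits = alpha * (logits - mean) / (std + epsilon) + beta
```

In the dense case the same code would use axis=0 only. Everything else stays the same.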