Batch Normalization is a relatively recent technique that normalizes the activations inside a neural network, and it also has a regularizing effect. If you do any work with neural networks, it’s something you can’t ignore. It’s super powerful. Let’s see how it works.

## Why you need Batch Normalization

Let’s work through a hypothetical example. We have two values, “A” and “B”, from which we want to predict something. First of all, this is very bad data: the “A” value is between 0 and 20 and the “B” value is between 0 and 2. They are not on the same scale. This is problematic in a neural network because when you put data in, the network multiplies it by weights, and those weights are basically fixed. An input on the order of 100 will produce much bigger activations than one on the order of 1. The neural network will have to adapt its weights to compensate, and it has to work to do that.

## Data Cleaning

We can clean this data a little bit. First, let’s rescale and shift it so that it is well centered around 0. Next, let’s decorrelate our data. The two curves seem to represent (roughly) the same thing, but you still see signal in the differences between those lines. To decorrelate, replace one signal with the average of the two and the other with the difference between them. Now I have two data series that are different enough not to tell me the same thing, so they might be useful inputs for a neural network. What we just did is essentially principal component analysis.
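The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample values for “A” and “B” are made up, and the helper names `standardize` and `correlation` are hypothetical, not taken from any library.

```python
import statistics

def standardize(values):
    """Shift to zero mean and rescale to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up raw data: "A" lives roughly in 0..20, "B" roughly in 0..2.
a_raw = [2.0, 6.0, 9.0, 14.0, 19.0, 12.0, 5.0, 1.0]
b_raw = [0.3, 0.5, 1.1, 1.3, 1.9, 1.4, 0.4, 0.2]

# Step 1: rescale and recenter both signals.
a = standardize(a_raw)
b = standardize(b_raw)

# Step 2: decorrelate by keeping the average and the difference.
avg = [(x + y) / 2 for x, y in zip(a, b)]
diff = [x - y for x, y in zip(a, b)]

corr_before = correlation(a, b)       # strongly correlated
corr_after = correlation(avg, diff)   # essentially zero
```

Because both standardized signals have the same variance, the average and the difference come out uncorrelated, which is why this simple trick works for two inputs.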

#### Logits Layer

The logits are the raw weighted sums plus the bias. They are the input to the activation function.

## Batch Normalization

Batch Normalization normalizes the activations, but in a smart way, to make sure that the ‘N’ inputs of the next layer are properly centered and scaled. Batch Normalization has three big ideas.

It works on batches. Say we have 100 images and labels in each batch. On those batches, it is possible to compute statistics for the logits. Before applying the activation function, we use those statistics to rescale and recenter the logits by subtracting the average and dividing by the standard deviation. A small epsilon is added in the denominator for numerical stability, to avoid dividing by 0.
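As a rough sketch of this normalization step, the function below computes the per-neuron mean and variance over a mini-batch and normalizes the logits. The function name `normalize_batch`, the sample values, and the default `epsilon=1e-5` are assumptions for illustration. Placing epsilon inside the square root follows the common formulation and is not something the text above specifies.

```python
import math

def normalize_batch(logits, epsilon=1e-5):
    """Normalize each neuron's logits across the batch.

    logits: list of rows, one row per example, one column per neuron.
    """
    batch_size = len(logits)
    num_neurons = len(logits[0])
    normalized = [[0.0] * num_neurons for _ in range(batch_size)]
    for j in range(num_neurons):
        column = [row[j] for row in logits]
        mean = sum(column) / batch_size
        variance = sum((x - mean) ** 2 for x in column) / batch_size
        # epsilon keeps the division safe when a neuron's variance is 0.
        scale = math.sqrt(variance + epsilon)
        for i in range(batch_size):
            normalized[i][j] = (logits[i][j] - mean) / scale
    return normalized

# Made-up mini-batch: 4 examples, 3 neurons; the last neuron is constant.
batch = [
    [1.0, 200.0, 5.0],
    [2.0, 180.0, 5.0],
    [3.0, 220.0, 5.0],
    [4.0, 240.0, 5.0],
]
normalized = normalize_batch(batch)
```

The constant third neuron has zero variance, so without epsilon the division would fail. With it, that column simply normalizes to zeros.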

We need to do something to preserve the full expressiveness of our model while we make these statistical modifications that help the model work better. The genius idea in batch normalization is that once you have rescaled your logits, you scale them back using two learnable parameters, alpha and beta. Previously, you had a bias in each neuron. Now you simply add two additional degrees of freedom per neuron, called alpha and beta, a scale factor and an offset factor, and your batch-normalized logits become the rescaled and recentered logits × alpha + beta.

## Batch Normalization Sigmoid

After adding batch normalization and training again, this is what we get. Look at our activations: it’s a perfectly beautiful Gaussian now, and all the bands are evenly distributed across the full useful range of our activation function. The Gaussian of inputs is centered on the linear part of the activation function.
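To make this concrete, here is a small sketch that batch-normalizes the logits of a single neuron, applies the learnable scale (alpha) and offset (beta) described earlier, and then feeds the result to a sigmoid. The logit values are made up, and `alpha=1.0`, `beta=0.0` are just a common starting point; in a real network training would adjust them.

```python
import math

def batch_norm(logits, alpha, beta, epsilon=1e-5):
    """Batch-normalize one neuron's logits, then scale by alpha and shift by beta."""
    mean = sum(logits) / len(logits)
    variance = sum((x - mean) ** 2 for x in logits) / len(logits)
    x_hat = [(x - mean) / math.sqrt(variance + epsilon) for x in logits]
    return [alpha * x + beta for x in x_hat]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits for one neuron over a mini-batch, far off-center.
raw_logits = [8.0, 9.5, 11.0, 12.5, 14.0, 10.0, 13.0, 9.0]

bn_logits = batch_norm(raw_logits, alpha=1.0, beta=0.0)

# Without batch norm every output sits in the flat, saturated part of the sigmoid.
saturated = [sigmoid(x) for x in raw_logits]
# With batch norm the outputs spread around 0.5, the linear part of the sigmoid.
centered = [sigmoid(x) for x in bn_logits]
```

Setting alpha and beta to other values moves the distribution away from zero mean and unit variance again, which is how the network keeps its full expressiveness.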

## Batch Normalization with ReLU

This is what I get with ReLU. I’m actually displaying the activations. A ReLU outputs zero over the entire negative side, so on a mini-batch a lot of values will be zero. I don’t want those in my graph because they are not part of the distribution I want to display. What I’m displaying here is the maximum activation on one mini-batch, that is, the maximum value taken by one neuron’s output on one given mini-batch. What I get is again a beautiful distribution across the useful portion of my ReLU, instead of one that is completely skewed to one side. You can see a nice bell curve that, for the dense layers, is centered around 2.5, squarely in the positive range where the ReLU is useful.
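The statistic being plotted can be sketched as follows. The function name `max_activation_per_neuron` and the logit values are hypothetical, chosen only to show why the per-neuron maximum is used instead of the raw activations.

```python
def relu(x):
    return max(0.0, x)

def max_activation_per_neuron(logits):
    """Return, for each neuron, its largest ReLU output over the mini-batch.

    logits: list of rows, one row per example, one column per neuron.
    """
    num_neurons = len(logits[0])
    return [max(relu(row[j]) for row in logits) for j in range(num_neurons)]

# Made-up batch-normalized logits: 4 examples, 3 neurons.
logits = [
    [-1.2, 0.4, 2.1],
    [0.3, -0.7, 2.8],
    [1.5, -0.2, -0.5],
    [-0.4, 1.9, 0.6],
]

# Many raw activations are exactly zero, which would dominate a histogram.
activations = [[relu(x) for x in row] for row in logits]
zeros = sum(1 for row in activations for a in row if a == 0.0)

# One value per neuron: the quantity whose distribution is displayed.
max_per_neuron = max_activation_per_neuron(logits)
```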

## Convolutional Batch Normalization

In a convolutional layer, one neuron has a given output per image in the batch, but you’re also scanning the images using the same weights. So a neuron has one output per image in the batch, per X position in the scan, and per Y position in the scan. That’s the only small change: the averages and standard deviations, which you previously computed over the whole batch, must now be computed over all the images in the batch and all the X and Y positions of your neurons. Only the computation of the statistics is different. You still have one scale factor and one offset factor per neuron, just as you previously had one bias per neuron.
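A minimal sketch of the convolutional case, assuming the logits are stored as nested lists indexed `feature_maps[n][y][x][c]` (image, Y position, X position, channel). That layout, the function names, and the sample values are assumptions for illustration. Each channel plays the role of one “neuron” in the text: its statistics pool every image and every position, but it still gets exactly one scale and one offset.

```python
import math

def channel_stats(feature_maps, channel):
    """Mean, variance and sample count for one channel over batch, Y and X."""
    values = [
        pixel[channel]
        for image in feature_maps
        for row in image
        for pixel in row
    ]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance, len(values)

def conv_batch_norm(feature_maps, scales, offsets, epsilon=1e-5):
    """Normalize every channel, then apply its single scale and offset."""
    num_channels = len(feature_maps[0][0][0])
    stats = [channel_stats(feature_maps, c) for c in range(num_channels)]

    def normalize_pixel(pixel):
        return [
            scales[c] * (pixel[c] - stats[c][0]) / math.sqrt(stats[c][1] + epsilon)
            + offsets[c]
            for c in range(num_channels)
        ]

    return [[[normalize_pixel(p) for p in row] for row in image]
            for image in feature_maps]

# Made-up logits: batch of 2 images, 2x2 positions, 2 channels.
feature_maps = [
    [[[1.0, 10.0], [2.0, 20.0]], [[3.0, 30.0], [4.0, 40.0]]],
    [[[5.0, 50.0], [6.0, 60.0]], [[7.0, 70.0], [8.0, 80.0]]],
]
mean0, var0, count0 = channel_stats(feature_maps, channel=0)
normalized = conv_batch_norm(feature_maps, scales=[1.0, 1.0], offsets=[0.0, 0.0])
```

Here each channel's statistics are computed from 8 values (2 images × 2 × 2 positions) rather than from 2, which is the only difference from the dense case.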