A convolutional filter starts at the upper-left corner of the input image and slides over every spatial position. At each position it computes a dot product between the filter weights and the pixels underneath it; the resulting values form an output called an activation map, which is then fed through an activation function.
Let’s take one filter and slide it over all of the spatial locations in the image; we then get an activation map out, which holds the response of that filter at every spatial location.
A ConvNet is a sequence of convolutional layers stacked on top of each other, interspersed with activation functions, for example the ReLU activation function.
In a convolutional layer, the data comes in and is multiplied by the layer’s weights. We then pass the result through an activation function to introduce nonlinearity.
ReLU Activation Functions
ReLU started to be used widely around 2012 with AlexNet, the first major convolutional neural network that was able to do well on ImageNet and large-scale data.
ReLU stands for Rectified Linear Unit, and is defined by the function f(x) = max(0, x).
It is interspersed as a nonlinearity between many of the convolutional layers.
In a nutshell, ReLU is used for filtering information that propagates forward through the network.
It applies an elementwise operation to the input: if a value is negative, it is set to zero; if it is positive, it passes through unchanged (the identity). ReLU is commonly used because it does not saturate in the positive region.
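The elementwise behavior is easy to see directly; a minimal sketch using NumPy:

```python
import numpy as np

def relu(x):
    # elementwise: negatives become 0, positives pass through unchanged
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
out = relu(x)   # negatives zeroed, positives kept as-is
```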
The sigmoid was not zero-centered; tanh fixed this, but ReLU has the problem again, and that is one of its issues. In addition, because ReLU does not activate for negative inputs, it is possible to end up with “dead neurons” that never fire.
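The “dead neuron” failure mode can be demonstrated with a toy example. The weight and bias values here are hypothetical, chosen only to model a neuron whose bias has been pushed far negative during training:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# hypothetical neuron with a large negative bias (illustrative values)
w, b = 0.5, -100.0
inputs = np.random.default_rng(1).normal(size=1000)
pre_activation = w * inputs + b     # always well below zero here
outputs = relu(pre_activation)      # so the output is always 0

# ReLU's gradient is 0 wherever the pre-activation is negative, so this
# neuron receives no gradient signal and cannot recover: it is "dead"
always_zero = bool(np.all(outputs == 0))
```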
Advantages of ReLU over Sigmoid
1. Vanishing Gradient Problem
With the sigmoid activation function, the gradient is typically a fraction between 0 and 1. In a multi-layer NN, these multiply and generate exponentially small gradients, so each step of gradient descent makes only a tiny change to the weights, leading to slow convergence. In contrast, the gradient of ReLU is either 0 or 1, so after many layers the gradient will often include a product of 1’s, and thus the overall gradient is neither too small nor too large.
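A back-of-the-envelope check of this argument: the derivative of sigmoid(x) is s·(1−s), which peaks at 0.25 (at x = 0), so even in the best case the backpropagated gradient through many sigmoid layers shrinks geometrically. This sketch ignores the weight matrices and compares only the activation-derivative factors:

```python
layers = 20

# sigmoid'(x) = s * (1 - s) is at most 0.25; one such factor per layer,
# so even the best case shrinks as 0.25 ** layers
best_case_sigmoid_grad = 0.25 ** layers

# ReLU's derivative is exactly 1 for any positive pre-activation,
# so the corresponding product of factors stays 1 for active units
relu_grad = 1.0 ** layers
```

After 20 layers the best-case sigmoid factor is already below 10⁻¹², while the ReLU factor is still exactly 1.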
2. Computational Efficiency
max(0, x) runs much faster than the sigmoid function 1/(1+e^(-x)), which uses an exponential and is computationally slow when evaluated often. This holds for both the forward pass and backpropagation, since the gradient of ReLU (0 if x < 0, else 1) is also very easy to compute compared to the sigmoid’s.
3. Sparsity
The other benefit of ReLU is sparsity. Sparse representations seem to be more beneficial than dense representations. Sparsity arises whenever x ≤ 0: the more such units exist in a layer, the sparser the resulting representation. Sigmoids, on the other hand, are always likely to generate some non-zero value, resulting in dense representations.
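The sparsity contrast can be measured directly. Assuming roughly zero-mean activations (modeled here with a standard normal sample, purely for illustration), ReLU zeros out about half the units while the sigmoid leaves none at exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)          # zero-mean "activations" for illustration

relu_out = np.maximum(0, x)
sigmoid_out = 1.0 / (1.0 + np.exp(-x))

relu_sparsity = float(np.mean(relu_out == 0))        # roughly half the units
sigmoid_sparsity = float(np.mean(sigmoid_out == 0))  # sigmoid is never exactly 0
```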
The sigmoid activation function was quite popular for training neural networks ten years ago, but it has the vanishing gradient problem. The general recommendation is to stick with ReLU as the default choice for most cases, because it tends to work well across a lot of different architectures.