A convolutional filter starts at the upper-left corner of the input image and slides over every spatial position. At each position it computes a dot product between the filter weights and the pixels underneath it, producing an output called an activation map, which is then passed through an activation function.

Activation Map

Each filter produces its own activation map. Take one filter and slide it over all of the spatial locations in the image; the result is an activation map that records the response of that filter at every spatial location.
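
Here is a minimal sketch of that idea, assuming PyTorch; the image contents, the filter values, and the sizes are arbitrary choices for illustration:

```python
import torch
import torch.nn.functional as F

# A single 3x3 filter applied to a 1-channel 32x32 "image".
# Shapes follow PyTorch's convention: (batch, channels, height, width).
image = torch.randn(1, 1, 32, 32)       # random input image
filt = torch.randn(1, 1, 3, 3)          # one 3x3 filter

# Sliding the filter over every spatial location gives one activation map.
activation_map = F.conv2d(image, filt)
print(activation_map.shape)             # torch.Size([1, 1, 30, 30])
```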

Conv2d layer Activation

The output of one layer becomes the input to the next convolutional layer. A ConvNet is a sequence of convolutional layers stacked on top of each other, interspersed with activation functions, for example a ReLU activation function.

In a convolutional layer, the incoming data is multiplied by the layer's weights, and the result is passed through an activation function to introduce nonlinearity.
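
A small sketch of such a stack, again assuming PyTorch; the channel sizes and kernel sizes here are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# A small stack of convolutional layers interspersed with ReLU activations.
convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # conv: multiply input by weights
    nn.ReLU(),                                    # nonlinearity
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # next conv takes the previous output
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)   # batch of one RGB 32x32 image
out = convnet(x)
print(out.shape)                # torch.Size([1, 32, 32, 32])
```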

ReLU Activation Functions

ReLU started to be used widely around 2012 with AlexNet, the first major convolutional neural network that performed well on ImageNet and large-scale data.

ReLU stands for Rectified Linear Unit and is defined by the function:

ReLU(x) = max(0, x)

It is interspersed as a nonlinearity between the convolutional layers.
In a nutshell, ReLU filters the information that propagates forward through the network.

ReLU

It is an elementwise operation on the input: negative values are set to zero, and positive values are passed through unchanged (the identity). ReLU is commonly used because it does not saturate in the positive region.
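
A tiny sketch of that elementwise behavior, assuming PyTorch (the input values are arbitrary):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Negative inputs are clamped to zero; positive inputs pass through unchanged.
relu = torch.relu(x)       # equivalent to torch.clamp(x, min=0)
print(relu)                # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
```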

The sigmoid is not zero-centered; tanh fixed this, but ReLU brings the problem back, which is one of its drawbacks. Because ReLU does not activate for negative inputs, it is also possible to end up with “dead neurons” that never fire.

Advantages of ReLU over Sigmoid

1. Vanishing Gradient Problem

In the sigmoid activation function, the gradient is a fraction between 0 and 1 (at most 0.25). In a multi-layer NN, these fractions multiply and produce exponentially small gradients, so each step of gradient descent makes only a tiny change to the weights, leading to slow convergence. With ReLU, by contrast, the gradient is either 0 or 1, so after many layers the gradient often includes a product of 1’s, and the overall gradient is neither too small nor too large.
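
A back-of-the-envelope sketch of that chain-rule effect, assuming NumPy and pre-activations of 0 (the sigmoid's best case) at each of 20 hypothetical layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-layer local gradients that get multiplied together by the chain rule.
depth = 20
z = np.zeros(depth)  # pre-activations at 0, where sigmoid's derivative peaks

sigmoid_grads = sigmoid(z) * (1.0 - sigmoid(z))  # 0.25 at every layer (its maximum)
relu_grads = np.ones(depth)                      # 1 at every layer for active ReLU units

print(np.prod(sigmoid_grads))   # 0.25**20 ≈ 9.1e-13 -- the gradient vanishes
print(np.prod(relu_grads))      # 1.0 -- the gradient neither shrinks nor explodes
```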

2. Performance

max(0, x) runs much faster than the sigmoid function 1/(1+e^(-x)), which uses an exponential that is computationally slow when evaluated often. This is true for both the feed-forward pass and backpropagation, because the gradient of ReLU (0 if x < 0, else 1) is also much easier to compute than the sigmoid's.
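
A rough way to see this, assuming NumPy; the array size and repetition count are arbitrary, and the exact timings depend on your hardware and BLAS build:

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000)

relu_time = timeit.timeit(lambda: np.maximum(0, x), number=100)
sigmoid_time = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)

# On most machines the exponential makes the sigmoid noticeably slower.
print(f"ReLU:    {relu_time:.3f} s")
print(f"sigmoid: {sigmoid_time:.3f} s")
```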

3. Sparsity

The other benefit of ReLU is sparsity. Sparse representations appear to be more beneficial than dense representations. Sparsity arises whenever x ≤ 0: the more such units exist in a layer, the sparser the resulting representation. Sigmoids, on the other hand, are always likely to generate some non-zero value, resulting in dense representations.
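
A quick sketch of that difference, assuming NumPy and zero-mean random pre-activations (the input values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)          # zero-mean pre-activations

relu_out = np.maximum(0, x)
sigmoid_out = 1.0 / (1.0 + np.exp(-x))

# Roughly half of the ReLU outputs are exactly zero (a sparse representation);
# the sigmoid outputs are essentially never exactly zero (a dense representation).
print((relu_out == 0).mean())        # ~0.5
print((sigmoid_out == 0).mean())     # 0.0
```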

Conclusion

The sigmoid activation function was quite popular for training neural networks 10 years ago, but it has a vanishing-gradient problem. The general recommendation is to stick with ReLU as the default choice for most cases, because it tends to work well across a lot of different architectures.
