The ReLU activation function has several advantages over other activation functions. For example, convergence has been reported to be about six times faster with ReLU than with a sigmoid or tanh activation function. It is also computationally very cheap.
If the input is greater than zero, ReLU passes the value through unchanged; if it is less than zero, ReLU sets it to zero.
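The rule above fits in a single line of code. The following is a minimal sketch in plain Python; the function name `relu` and the sample inputs are chosen here purely for illustration.

```python
def relu(x: float) -> float:
    """Return x unchanged if it is positive, otherwise return 0."""
    return x if x > 0 else 0.0


# Illustrative inputs: one negative, one zero, one positive.
outputs = [relu(v) for v in (-2.0, 0.0, 3.5)]
print(outputs)
```

There is no exponential or division involved, only a comparison, which is why ReLU is so cheap to evaluate.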
But ReLU has one problem, known as the dying neuron or dead neuron problem: if the input to a ReLU neuron is negative, the output is zero. Let's understand how the gradient flows in this situation.
Consider a neuron that receives inputs X1 and X2 from the previous layer, with weights W1 and W2. Let a be the pre-activation of that neuron; applying the ReLU activation to a gives the output y.
If a is negative, the output immediately becomes 0, and the gradient of ReLU at that point is also 0. No gradient flows back through the ReLU activation to the weights, which means W1 and W2 may never get updated. They stay the same, and this cycle can continue.
If W1 and W2 are negative and the inputs are non-negative (as they are when they come from a previous ReLU layer), a stays negative. In every cycle there is no gradient update, and the neuron never participates in the neural network's output. This is known as the dead neuron problem. Empirical studies have observed that in a neural network with ReLU activations in all its layers, many neurons end up suffering from the dying neuron problem.
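To make the gradient flow concrete, here is a small sketch of the backward pass through one ReLU neuron, written in plain Python. It assumes a = W1·X1 + W2·X2 + b; the bias term `b`, the helper names, the `upstream` gradient, and the numbers are all assumptions made for illustration.

```python
def relu_grad(a: float) -> float:
    """Derivative of ReLU with respect to its input: 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0


def neuron_gradients(x1, x2, w1, w2, b, upstream=1.0):
    """Return the output y and the gradients (dW1, dW2, db) for one ReLU neuron.

    `upstream` stands in for the gradient arriving from the next layer.
    """
    a = w1 * x1 + w2 * x2 + b      # pre-activation
    y = max(0.0, a)                # ReLU output
    da = upstream * relu_grad(a)   # gradient reaching the pre-activation
    return y, (da * x1, da * x2, da)


# Negative weights with positive inputs give a negative pre-activation.
y_dead, grads_dead = neuron_gradients(1.0, 2.0, w1=-0.5, w2=-0.3, b=0.0)

# Positive weights with the same inputs keep the neuron active.
y_live, grads_live = neuron_gradients(1.0, 2.0, w1=0.5, w2=0.3, b=0.0)

print(y_dead, grads_dead)
print(y_live, grads_live)
```

In the first case every gradient is zero, so a gradient descent step leaves W1 and W2 exactly where they were. In the second case the gradients equal the inputs scaled by the upstream gradient, and the weights can move.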
How to overcome this problem?
- You could initialize the bias to a positive value. If W1X1 + W2X2 turns out to be negative, a positive bias may push a towards the non-negative side. This could help the ReLU get activated so that gradients start flowing through this particular neuron.
- Another option is to use Leaky ReLU. If you read the previous tutorial, you should be able to appreciate why Leaky ReLU is useful.
The shape of Leaky ReLU is very similar to ReLU on the positive side. On the negative side, instead of setting the output to 0, it takes a small fraction of the input. That is why it is defined as max(αx, x), where α is a small value such as 0.1. The output does not collapse to zero, which would kill the gradient; instead, a small gradient can still pass through the neuron, so it does not stop contributing to the neural network and at least contributes marginally. The exponential linear unit (ELU) has a similar effect.
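Both remedies can be sketched in a few lines of plain Python. The helper names, the sample weights, and the bias value 1.5 are assumptions chosen for illustration; α = 0.1 follows the value mentioned above.

```python
def leaky_relu(x: float, alpha: float = 0.1) -> float:
    """Leaky ReLU, max(alpha * x, x), valid for 0 < alpha < 1."""
    return max(alpha * x, x)


def leaky_relu_grad(x: float, alpha: float = 0.1) -> float:
    """Derivative of Leaky ReLU: 1 on the positive side, alpha otherwise."""
    return 1.0 if x > 0 else alpha


def pre_activation(x1, x2, w1, w2, b):
    """Compute a = W1*X1 + W2*X2 + b."""
    return w1 * x1 + w2 * x2 + b


# Remedy 1: a positive bias can push a negative pre-activation above zero.
a_no_bias = pre_activation(1.0, 2.0, w1=-0.5, w2=-0.25, b=0.0)
a_with_bias = pre_activation(1.0, 2.0, w1=-0.5, w2=-0.25, b=1.5)

# Remedy 2: Leaky ReLU keeps a small output and a small gradient
# even when the pre-activation is negative.
leaky_out = leaky_relu(a_no_bias)
leaky_grad = leaky_relu_grad(a_no_bias)

print(a_no_bias, a_with_bias)
print(leaky_out, leaky_grad)
```

With the positive bias the pre-activation changes sign, so a plain ReLU would fire. With Leaky ReLU the neuron still passes a gradient of α for the negative pre-activation, so its weights keep receiving updates.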