Modern deep learning systems typically use a non-saturating activation function such as ReLU or Leaky ReLU in place of a saturating one such as Sigmoid or Tanh. This helps mitigate the “exploding/vanishing gradient” problem and speeds up convergence.
ReLU sets negative inputs to zero and passes positive inputs through unchanged. A desirable consequence is that the activations are sparse after passing through ReLU.
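As a rough illustration of the sparsity property, the sketch below applies ReLU to a handful of made-up pre-activation values and measures how many outputs are exactly zero. The input values and the `sparsity` measure are assumptions chosen for the example, not something taken from a real network.

```python
def relu(x):
    """Return max(0, x): negative inputs become 0, positive inputs pass through."""
    return x if x > 0 else 0.0

# Hypothetical pre-activation values for one layer (illustrative only).
pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.5, -1.2]
activations = [relu(v) for v in pre_activations]

# Fraction of outputs that are exactly zero, a simple measure of sparsity.
sparsity = sum(1 for a in activations if a == 0.0) / len(activations)

print(activations)
print(sparsity)
```

Four of the six values end up as hard zeros here; only the two positive inputs survive.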
A gradient-based optimization algorithm will not adjust the weights of a unit that never activates. This is ReLU's main disadvantage during optimization: the gradient is 0 whenever the unit is not active.
With ReLU, neurons that are not activated at the start of training can stay inactive, and in the worst case the network never learns. Even when training does proceed, learning is slow while many units have a constant zero gradient.
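The effect can be reproduced with a single unit. The sketch below is a toy setup under stated assumptions: one unit computing `relu(w * x + b)`, a squared-error loss, plain gradient descent, and a hypothetical initialization that makes the pre-activation negative for every training input. The helper name `sgd_step` is made up for this example.

```python
def relu(z):
    return z if z > 0 else 0.0

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    # The value at exactly 0 is a convention; 0 is used here.
    return 1.0 if z > 0 else 0.0

def sgd_step(w, b, inputs, targets, lr=0.1):
    """One gradient-descent step for y = relu(w * x + b) with squared error."""
    grad_w = 0.0
    grad_b = 0.0
    for x, t in zip(inputs, targets):
        z = w * x + b
        dloss_dy = 2.0 * (relu(z) - t)
        dloss_dz = dloss_dy * relu_grad(z)  # zero whenever the unit is inactive
        grad_w += dloss_dz * x
        grad_b += dloss_dz
    return w - lr * grad_w, b - lr * grad_b

inputs = [0.5, 1.0, 2.0]
targets = [1.0, 2.0, 4.0]

# Hypothetical initialization: w * x + b is negative for every input above.
w, b = -1.0, -0.5
for _ in range(100):
    w, b = sgd_step(w, b, inputs, targets)

print(w, b)  # the parameters have not moved
```

Because the unit is inactive for every input, each gradient is zero and a hundred update steps leave `w` and `b` exactly where they started.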
Leaky ReLU gives the negative part of ReLU a small non-zero slope, which keeps the weight updates alive throughout training.
The alpha parameter sets this slope. It was introduced as a solution to ReLU's dead neuron problem, so that the gradient is not zero at any time during training.
The Leaky ReLU function is nearly identical to the standard ReLU function, but it sacrifices hard-zero sparsity for a gradient that is potentially more robust during optimization. Alpha is a fixed parameter (float >= 0.).
The Leaky ReLU has a non-zero gradient over its entire domain, unlike the standard ReLU function.
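A minimal sketch of the function and its gradient in plain Python follows. The names `leaky_relu` and `leaky_relu_grad` are chosen for this example, and `alpha=0.2` is just an illustrative value; this is not how Keras implements the layer internally.

```python
def leaky_relu(x, alpha=0.2):
    """Pass positive inputs through; scale negative inputs by alpha."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.2):
    """Gradient is 1 on the positive side and alpha on the negative side."""
    return 1.0 if x > 0 else alpha

for x in [-2.0, -0.5, 0.5, 3.0]:
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

With `alpha=0` the function reduces to the standard ReLU and the gradient on the negative side is zero again, so the non-zero gradient depends on choosing `alpha > 0`.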
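In Keras versions where Leaky ReLU is not registered as a named activation, passing it by name (for example, activation='LeakyReLU') raises:
ValueError: Unknown activation function:LeakyReLU
Leaky ReLU is available as a layer rather than as a named activation, so add it as a separate layer directly after the layer whose output it should transform.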
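A minimal sketch of the layer-based approach, assuming TensorFlow's bundled Keras is installed. The input shape and filter count are placeholders for illustration. The `alpha` argument follows the spelling used in this article; depending on the installed Keras version, the same argument may be named `negative_slope` instead.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(tf.keras.Input(shape=(28, 28, 1)))  # hypothetical input shape
model.add(layers.Conv2D(64, (3, 3)))          # no activation argument here
model.add(layers.LeakyReLU(alpha=0.2))        # Leaky ReLU applied as its own layer

model.summary()
```

The convolution is left linear and the activation is applied by the layer that follows it.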
If you would rather not add an extra activation layer, you can pass a callable object as the activation argument instead:
model.add(layers.Conv2D(64, (3, 3), activation=tf.keras.layers.LeakyReLU(alpha=0.2)))
This works because a Keras Layer is itself a callable object, so it can be passed directly as the activation.
ReLU drops the negative part entirely, while Leaky ReLU assigns a non-zero slope to it. Leaky ReLU can therefore retain some of the negative values that flow into it. This extended output range gives the model slightly more flexibility, and incorporating a non-zero slope for the negative part can improve results.
- What is the Dying ReLU problem in Neural Networks?
- How ReLU works in convolutional neural network?
- Difference between nn.ReLU() and nn.ReLU(inplace=True).
- Advantages of ReLU vs Tanh vs Sigmoid activation function in deep neural networks.
- Activation function for Output Layer in Regression, Binary, Multi-Class, and Multi-Label Classification