In any particular layer, data comes in, we multiply it by the weights, and then we pass the result through an activation function to introduce nonlinearity. In this tutorial, we’ll talk about the differences between the Sigmoid, Tanh, and ReLU activation functions and the trade-offs between them.
Sigmoid activation function
The sigmoid function, σ(x) = 1 / (1 + e^(-x)), is applied elementwise and squashes each input into the range [0, 1].
If the input is very large, the output is close to one. If it is very negative, the output is close to zero. Near zero the function is in a linear regime, where it looks a bit like a linear function. The sigmoid has been historically popular because its output can be interpreted as a kind of saturating firing rate of a neuron.
If the output is something between zero and one, you can think of it as a firing rate. If we look at this nonlinearity more carefully, though, there are several problems with it.
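As a minimal sketch of this squashing behaviour, the snippet below evaluates the sigmoid at a few sample inputs. The function name and the sample values are illustrative choices, not something taken from a particular library.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x)), split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Large positive inputs approach 1, large negative inputs approach 0,
# and inputs near zero sit in the roughly linear middle part of the curve.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.5f}")
```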
The first problem is that saturated neurons can kill off the gradient. If we look at a sigmoid gate in our computational graph, our data X goes in as input, and the output of the sigmoid gate comes out.
What does the gradient look like when X is equal to -10 or very negative?
The gradient is essentially zero, because this very negative region of the sigmoid is essentially flat. During backpropagation, the upstream gradient is multiplied by this near-zero local gradient, so only a very small gradient flows back.
In effect, after the chain rule this kills the gradient flow, and a near-zero gradient is passed to the downstream nodes.
What happens when X is equal to zero?
This regime is fine. Near zero you get a reasonable gradient, and backprop works well.
What about X equals 10 or very positive?
The sigmoid is flat here as well, so the gradient is again close to zero. Whenever X is very negative or very positive, the sigmoid is saturated: it kills off the gradient, and almost nothing flows back.
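The three cases above can be checked numerically. This is a small sketch that uses the standard identity σ'(x) = σ(x)(1 - σ(x)); the upstream gradient of 1.0 is a made-up value used only to show the chain-rule multiplication.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Local gradient of the sigmoid gate: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

upstream = 1.0  # hypothetical gradient arriving from the layer above
for x in (-10.0, 0.0, 10.0):
    local = sigmoid_grad(x)
    # Chain rule: the gradient passed back is upstream * local.
    print(f"x = {x:+5.1f}  local = {local:.6f}  passed back = {upstream * local:.6f}")
```

The local gradient peaks at 0.25 when x is zero and is tiny at both extremes, which is the saturation described above.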
The second problem is that the sigmoid output is not zero-centered. Let’s take a look at why this is a problem. Consider what happens when the input to a neuron is always positive, which is the case when it comes from a sigmoid in the previous layer. All of our Xs are positive; they are multiplied by some weights W and then run through the activation function. What does this mean for the gradients on W? Some arbitrary upstream gradient comes down, and the local gradient we multiply it by to get the gradient on each weight is just the corresponding X. Since every X is positive, the gradients on W all have the sign of the upstream gradient: they are either all positive or all negative.
If the updates can only be all positive or all negative, then in the two-dimensional case only two quadrants are available, the one where both components are positive and the one where both are negative. These are the only directions in which we are allowed to make a gradient update, so reaching a target in any other direction takes an inefficient zig-zag path.
In general, this is why we want zero-mean data. We want our input x to be zero-mean so that it has both positive and negative values, and we don’t get into this problem of the gradient updates all moving in the same direction.
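The sign argument can be illustrated with a single sigmoid neuron. This is a sketch under stated assumptions: the helper `weight_grads`, the input vectors, the weights, and the upstream gradients of +1 and -1 are all hypothetical values chosen for the demonstration.

```python
import math

def weight_grads(x, w, b, upstream):
    """Gradient of the loss w.r.t. each weight of one sigmoid neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    s = 1.0 / (1.0 + math.exp(-z))
    dz = upstream * s * (1.0 - s)   # gradient w.r.t. the pre-activation z
    return [dz * xi for xi in x]    # dL/dw_i = dz * x_i

w = [0.3, -0.8, 0.1]
x_positive = [0.2, 0.9, 0.5]        # e.g. outputs of a previous sigmoid layer
x_centered = [0.2, -0.9, 0.5]       # zero-centered style input with mixed signs

for upstream in (1.0, -1.0):
    print("all-positive input:", weight_grads(x_positive, w, 0.0, upstream))
    print("mixed-sign input:  ", weight_grads(x_centered, w, 0.0, upstream))
```

With the all-positive input, every component of the weight gradient shares the sign of the upstream gradient. With the mixed-sign input, the components can differ in sign.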
We’ve talked about two main problems of the sigmoid. Saturated neurons can kill gradients when the input is too positive or too negative. The outputs are also not zero-centered, so we get this inefficient kind of gradient update. The third problem is the exponential function, which is a little computationally expensive. In the grand scheme of your network, this is usually not the main problem, because the convolutions and dot products are a lot more expensive, but it is a minor point worth noting.
Tanh activation function
Now we can look at a second activation function, tanh(x). It looks very similar to the sigmoid, but the difference is that it squashes its input to the range [-1, 1].
The main difference is that it is now zero-centered, so we’ve gotten rid of the second problem that we had. It still kills the gradients when it is saturated, however. You still have these regimes where the function is essentially flat, and the gradient flow dies there. Tanh is a bit better than the sigmoid, but it still has some problems.
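Both points can be seen in a short sketch: on a symmetric set of inputs the tanh outputs average to about zero while the sigmoid outputs average to about 0.5, and the tanh gradient 1 - tanh(x)^2 still vanishes for large inputs. The sample inputs are arbitrary illustrative values.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # symmetric around zero
tanh_mean = sum(math.tanh(x) for x in inputs) / len(inputs)
sigmoid_mean = sum(sigmoid(x) for x in inputs) / len(inputs)
print(f"mean tanh output:    {tanh_mean:.4f}")
print(f"mean sigmoid output: {sigmoid_mean:.4f}")

# Saturation: the gradient is 1 at x = 0 and nearly 0 at the extremes.
for x in (-10.0, 0.0, 10.0):
    print(f"tanh'({x:+.1f}) = {tanh_grad(x):.2e}")
```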
ReLU activation function
This function is f(x)=max(0,x). It operates elementwise on the input: if the input is negative, it is set to zero, and if it is positive, it is passed through unchanged. Unlike the sigmoid and tanh, it does not saturate in the positive region.
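A minimal sketch of this elementwise operation on a plain Python list follows; the names `relu` and `relu_vec` are illustrative, and real frameworks provide their own vectorized versions.

```python
def relu(x: float) -> float:
    """f(x) = max(0, x): negatives become zero, positives pass through."""
    return max(0.0, x)

def relu_vec(xs):
    """Apply ReLU to every element of a list independently."""
    return [relu(x) for x in xs]

print(relu_vec([-2.0, -0.5, 0.0, 0.5, 3.5]))
```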
ReLU is also computationally very efficient. The sigmoid has an exponential in it, whereas ReLU is just a simple max(), so it is extremely fast. In practice, networks using ReLU converge much faster than those using the sigmoid or tanh, about six times faster.
ReLU started to be used widely around 2012 with AlexNet, the first major convolutional neural network that was able to do well on ImageNet and large-scale data.
There is still a problem with ReLU: it is not zero-centered. We saw that the sigmoid was not zero-centered and that tanh fixed this, and now ReLU has this problem again.
When x equals -10, the gradient is zero. When x equals 10, we are in the linear regime and the gradient is one. When x equals zero, the gradient is undefined, but in practice it is taken to be zero. So ReLU kills the gradient in half of the regime. This leads to the phenomenon of the dead ReLU: a neuron stuck in this bad part of the regime outputs zero and receives no gradient. You can read more about dead ReLU here.
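The dead ReLU case can be sketched with a single neuron and a tiny batch. Everything here is an assumption made for illustration: the helper names, the batch values, the weights, the large negative bias of -5.0 that pushes the neuron into the zero region, and an upstream gradient of 1 for every example.

```python
def relu_grad(z: float) -> float:
    """Common convention: gradient is 1 for z > 0 and 0 otherwise (including z == 0)."""
    return 1.0 if z > 0 else 0.0

def neuron_weight_grads(batch, w, b):
    """Sum of dL/dw over a batch for one ReLU neuron, assuming upstream gradient 1."""
    grads = [0.0] * len(w)
    for x in batch:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        g = relu_grad(z)
        for i, xi in enumerate(x):
            grads[i] += g * xi
    return grads

batch = [[0.5, 1.0], [1.5, 0.2], [0.3, 0.8]]
healthy = neuron_weight_grads(batch, [0.4, 0.6], 0.0)   # pre-activations are positive
dead = neuron_weight_grads(batch, [0.4, 0.6], -5.0)     # pre-activations are all negative

print("healthy neuron gradient:", healthy)
print("dead neuron gradient:   ", dead)
```

Because the dead neuron gets a zero gradient for every example in the batch, gradient descent never changes its weights, so it stays in the zero region.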