Rounding error becomes problematic when it compounds across many operations: a model that works in theory can fail in practice if it is not designed to minimize the accumulation of rounding error.

One form of rounding error is underflow, which occurs when numbers near zero are rounded down to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number; for example, we usually want to avoid dividing by zero or taking the logarithm of zero.

Another form of numerical error is overflow, which occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic usually turns these infinite values into not-a-number values.
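As a quick illustration, here is a small NumPy sketch of both failure modes (the specific values are arbitrary):

import numpy as np

# Underflow: exp of a very negative number is rounded to exactly 0.0,
# and taking the log of that zero then gives -inf.
print(np.exp(-1000.0))                   # 0.0
print(np.log(np.exp(-1000.0)))           # -inf

# Overflow: exp of a very large number is approximated as inf,
# and further arithmetic on inf produces not-a-number.
print(np.exp(1000.0))                    # inf
print(np.exp(1000.0) - np.exp(1000.0))   # nan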

The softmax function must be stabilized against both underflow and overflow. It is often used to predict the probabilities associated with a multinoulli distribution, and it is defined as:

softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
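A naive, literal translation of this definition into NumPy might look like the sketch below (the function name is just for illustration). It behaves well on ordinary inputs but, as we will see, breaks down for extreme ones:

import numpy as np

def naive_softmax(x):
    # Direct implementation of the definition above; not numerically stable.
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

print(naive_softmax(np.array([1.0, 2.0, 3.0])))  # [0.09003057 0.24472847 0.66524096]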

The softmax function has multiple output values, and these outputs can saturate when the differences between the input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activation function.

The softmax function represents a probability distribution over a discrete variable with n possible values. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.

import torch.nn as nn

# A small classifier ending in nn.Softmax, so its outputs are probabilities.
model = nn.Sequential(
            nn.Linear(4072, 512),
            nn.Tanh(),
            nn.Linear(512, 2),
            nn.Softmax(dim=1))
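A quick shape check on this model might look like the following; the batch size of 8 and the random inputs are arbitrary and only serve to exercise the 4072-dimensional input the model expects:

import torch

x = torch.randn(8, 4072)        # dummy batch matching the model's input size
probs = model(x)
print(probs.shape)              # torch.Size([8, 2])
print(probs.sum(dim=1))         # each row sums to 1, as expected for probabilities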

Many objective functions other than the log-likelihood do not work as well with the softmax function. Objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, because the gradient vanishes. In particular, the squared error is a poor loss function for softmax units and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions.

Consider what happens when all of the x_i are equal to some constant c. Analytically, all of the outputs should be equal to 1/n. Numerically, this may not happen when c has a large magnitude. If c is very negative, then exp(c) will underflow, the denominator of the softmax becomes 0, and the final result is undefined.

When c is very large and positive, exp(c) will overflow, again leaving the expression as a whole undefined.
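Continuing with the naive_softmax sketch from above, both failure modes are easy to reproduce:

c = -1000.0
print(naive_softmax(np.array([c, c, c])))   # [nan nan nan]: every exp(c) underflows, so the denominator is 0

c = 1000.0
print(naive_softmax(np.array([c, c, c])))   # [nan nan nan]: every exp(c) overflows to inf, and inf/inf is nan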

Both of these difficulties can be resolved by the log softmax function, which computes the log of the softmax in a numerically stable way and thereby stabilizes the softmax computation.

The log softmax function is simply the logarithm of the softmax function. Using log probabilities means representing probabilities on a logarithmic scale instead of the standard [0, 1] interval.
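Here is a minimal NumPy sketch of a numerically stable log softmax, using the standard shift-by-the-maximum trick (the function name is just illustrative):

import numpy as np

def stable_log_softmax(x):
    # Subtracting the maximum does not change the result analytically,
    # but it prevents exp() from overflowing: the largest shifted value is 0.
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

print(stable_log_softmax(np.array([1000.0, 1000.0, 1000.0])))  # [-1.0986 -1.0986 -1.0986], i.e. log(1/3)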

The use of log probabilities improves numerical stability when the probabilities are very small, because of the way in which computers approximate real numbers with floating-point arithmetic. Taking the product of a large number of probabilities also becomes a simple sum in log form, which is often faster and avoids underflow.
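A small sketch of why the log-space representation helps; the probability value 0.001 and the count of 1000 are arbitrary:

import numpy as np

p = np.full(1000, 0.001)      # a thousand small probabilities
print(np.prod(p))             # 0.0 -- the product underflows
print(np.sum(np.log(p)))      # about -6907.76 -- the same quantity, computed safely as a sum of logs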

# The same classifier, but ending in nn.LogSoftmax so it outputs log probabilities.
model = nn.Sequential(
            nn.Linear(4072, 512),
            nn.Tanh(),
            nn.Linear(512, 2),
            nn.LogSoftmax(dim=1))

loss = nn.NLLLoss()   # negative log likelihood loss; expects log probabilities as input

PyTorch provides the nn.NLLLoss class. It does not take probabilities as input but rather a tensor of log probabilities; it then computes the negative log likelihood (NLL) of our model given the batch of data.
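Putting the LogSoftmax model and nn.NLLLoss together on a dummy batch (the batch size, random inputs, and random labels are only for illustration):

import torch

x = torch.randn(8, 4072)                 # dummy inputs for the model defined above
target = torch.randint(0, 2, (8,))       # dummy binary class labels

log_probs = model(x)                     # log probabilities from the LogSoftmax model
print(loss(log_probs, target))           # a scalar tensor: the mean negative log likelihood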

CrossEntropyLoss

The combination of nn.LogSoftmax and nn.NLLLoss is equivalent to using nn.CrossEntropyLoss. This terminology is a particularity of PyTorch. It is quite common to drop the last nn.LogSoftmax layer from the network and use nn.CrossEntropyLoss as the loss, as in the following model.

# No nn.LogSoftmax at the end: this model outputs raw scores (logits).
model = nn.Sequential(
            nn.Linear(1024, 512),
            nn.Tanh(),
            nn.Linear(512, 128),
            nn.Tanh(),
            nn.Linear(128, 2))


loss_fn = nn.CrossEntropyLoss()   # takes raw logits and applies log softmax internally
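The equivalence claimed above can be checked directly on a dummy batch; 1024 matches the input size of this last model, and the random data is arbitrary:

import torch
import torch.nn as nn

x = torch.randn(8, 1024)
target = torch.randint(0, 2, (8,))

logits = model(x)                                        # raw scores, no LogSoftmax at the end
a = loss_fn(logits, target)                              # nn.CrossEntropyLoss on the logits
b = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)   # LogSoftmax followed by NLLLoss
print(torch.allclose(a, b))                              # True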

Taking the logarithm of a probability is tricky when the probability gets close to zero. The workaround is to use log probabilities instead of probabilities, computed in a way that keeps the calculation numerically stable. The standard reformulation subtracts the maximum input from every entry before exponentiating; this leaves the result mathematically unchanged but allows us to evaluate softmax with only small numerical errors, even when x contains extremely large or extremely negative numbers.
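The same shift-by-the-maximum reformulation applied to the softmax itself, as a small NumPy sketch:

import numpy as np

def stable_softmax(x):
    # softmax is unchanged by subtracting a constant from every entry,
    # so shifting by the maximum avoids overflow in exp().
    exp_shifted = np.exp(x - x.max())
    return exp_shifted / exp_shifted.sum()

print(stable_softmax(np.array([1000.0, 1000.0, 1000.0])))    # [0.3333 0.3333 0.3333]
print(stable_softmax(np.array([-1000.0, -1000.0, -1000.0]))) # [0.3333 0.3333 0.3333]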
