Rounding error is problematic when it compounds across many operations; it can cause models that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

One form of rounding error is **underflow**, which occurs when numbers near zero are rounded down to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero or taking the logarithm of zero.

Another form of numerical error is **overflow**, which occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic will usually turn these infinite values into not-a-number values.

The softmax function must be stabilized against underflow and overflow. It is often used to predict the probabilities associated with a multinoulli distribution, and is defined to be:

`softmax(x)_i = exp(x_i) / sum_j exp(x_j)`

The softmax function has multiple output values, which can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on it also saturate, unless they are able to invert the saturating activation function.

The softmax function represents a probability distribution over a discrete variable with n possible values. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.

```
import torch.nn as nn

# A small classifier ending in a Softmax layer, so its
# outputs are probabilities over the 2 classes.
model = nn.Sequential(
    nn.Linear(4072, 512),
    nn.Tanh(),
    nn.Linear(512, 2),
    nn.Softmax(dim=1))
```

Many objective functions other than the **log-likelihood** do not work as well with the softmax function. Objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units: it can fail to train the model to change its output, even when the model makes highly confident incorrect predictions.
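This vanishing-gradient effect can be seen directly with autograd. The sketch below uses hypothetical logits where the true class is 0 but the model confidently predicts class 1, and compares the gradient magnitude under squared error versus the log-likelihood (cross-entropy):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits: true class is 0, model confidently predicts class 1.
logits_mse = torch.tensor([[-10.0, 10.0]], requires_grad=True)
target = torch.tensor([0])

# Squared error against the one-hot target: the softmax is saturated,
# so almost no gradient reaches the logits.
probs = torch.softmax(logits_mse, dim=1)
onehot = F.one_hot(target, num_classes=2).float()
((probs - onehot) ** 2).mean().backward()
mse_grad = logits_mse.grad.abs().max().item()

# The log-likelihood (cross-entropy) keeps a large gradient
# in the same situation.
logits_nll = torch.tensor([[-10.0, 10.0]], requires_grad=True)
F.cross_entropy(logits_nll, target).backward()
nll_grad = logits_nll.grad.abs().max().item()

print(mse_grad < 1e-6, nll_grad > 0.9)  # True True
```

The squared-error gradient is attenuated by the softmax Jacobian, whose entries are nearly zero when the softmax saturates, while the cross-entropy gradient reduces to `probs - onehot` and stays close to 1.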

Consider what happens when all of the inputs `x_i` are equal to some constant `c`. Then all of the outputs should be equal to `1/n`. Numerically, this may not occur when `c` has a large magnitude. If `c` is very negative, then `exp(c)` will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When `c` is very large and positive, `exp(c)` will overflow, again resulting in the expression as a whole being undefined.

Both of these difficulties can be resolved by the log softmax function, which calculates the logarithm of the softmax in a numerically stable way, thereby stabilizing the softmax computation.

The log softmax function is simply the logarithm of the softmax function. Using log probabilities means representing probabilities on a logarithmic scale, instead of on the standard [0, 1] interval.

The use of log probabilities improves numerical stability when the probabilities are very small, because of the way computers represent floating-point numbers. Taking the product of a large number of probabilities is also often faster and more stable when they are represented in log form, since the product becomes a sum of logarithms.
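A small illustration of this point, using a hypothetical sequence of 1000 events each with probability 0.01:

```python
import math

probs = [0.01] * 1000  # hypothetical probabilities

# Multiplying directly underflows: 0.01**1000 = 1e-2000 is far below
# the smallest positive float64, so the result collapses to 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing log probabilities stays well within floating-point range.
log_product = sum(math.log(p) for p in probs)
print(log_product)  # about -4605.17, i.e. 1000 * log(0.01)
```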

```
import torch.nn as nn

# The same classifier, but ending in LogSoftmax so its outputs
# are log probabilities, paired with the negative log-likelihood loss.
model = nn.Sequential(
    nn.Linear(4072, 512),
    nn.Tanh(),
    nn.Linear(512, 2),
    nn.LogSoftmax(dim=1))
loss = nn.NLLLoss()
```

PyTorch has an `nn.NLLLoss` class. It does not take probabilities but rather a tensor of log probabilities as input. It then computes the negative log-likelihood (NLL) of our model given the batch of data.
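Concretely, `nn.NLLLoss` picks out the log probability of the correct class for each sample and averages the negated values. A sketch with a random batch (the batch size and targets here are arbitrary):

```python
import torch
import torch.nn as nn

# A batch of 4 samples with 2 classes, as log probabilities.
log_probs = torch.log_softmax(torch.randn(4, 2), dim=1)
targets = torch.tensor([0, 1, 1, 0])

loss = nn.NLLLoss()(log_probs, targets)

# Equivalent by hand: mean of -log_probs[i, targets[i]].
manual = -log_probs[torch.arange(4), targets].mean()
print(torch.allclose(loss, manual))  # True
```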

## CrossEntropyLoss

The combination of `nn.LogSoftmax` and `nn.NLLLoss` is equivalent to using `nn.CrossEntropyLoss`. This terminology is a particularity of PyTorch. It is quite common to drop the last `nn.LogSoftmax` layer from the network and use `nn.CrossEntropyLoss` as the loss.

```
import torch.nn as nn

# With CrossEntropyLoss, the model outputs raw scores (logits);
# no final LogSoftmax layer is needed.
model = nn.Sequential(
    nn.Linear(1024, 512),
    nn.Tanh(),
    nn.Linear(512, 128),
    nn.Tanh(),
    nn.Linear(128, 2))
loss_fn = nn.CrossEntropyLoss()
```
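The equivalence between the two formulations can be checked numerically on a batch of random logits:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 2)            # raw model outputs (no LogSoftmax)
targets = torch.randint(0, 2, (8,))

ce = nn.CrossEntropyLoss()(logits, targets)
combined = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, combined))  # True
```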

Taking the logarithm of a probability is tricky when the probability gets close to zero. The workaround is to work with log probabilities directly, taking care to make the calculation numerically stable. The reformulated version allows us to evaluate the log softmax with only small numerical errors, even when the input `z` contains extremely large or extremely negative numbers.
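The difference between composing `log` with `softmax` and computing log softmax in one step shows up clearly on extreme inputs:

```python
import torch

z = torch.tensor([[-1000.0, 0.0, 1000.0]])

# Composing log with softmax: the small probabilities underflow to 0,
# and log(0) = -inf, destroying the information in those entries.
print(torch.log(torch.softmax(z, dim=1)))  # -inf for the underflowed entries

# log_softmax evaluates z - max(z) - log(sum(exp(z - max(z)))) in one
# step, without ever forming the underflowed probabilities.
print(torch.log_softmax(z, dim=1))  # tensor([[-2000., -1000., 0.]])
```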